8 Abstract We have successfully ported an arbitrary high-9 order discontinuous Galerkin method for solving the three-10 dimensional isotropic elastic wave equation on unstruc-11 tured tetrahedral meshes to multiple GPU using CUDA and 12 MPI and obtained a speedup factor of about 28.3 for the 13 single-precision version of our codes and a speedup factor 14 of about 14.9 for the double-precision version. The GPU 15 used in the comparisons is NVIDIA Tesla C2070 Fermi, 16 and the CPU used is Intel Xeon W5660. To effectively 17 overlap inter-process communication with computation, we 18 separate the elements on each subdomain into inner and 19 outer elements and complete the computation on outer 20 elements and fill the MPI buffer first. While the MPI 21 messages travel across the network, the GPU performs 22 computation on inner elements, and all other calculations 23 that do not use information of outer elements from neigh-24 boring subdomains. A significant portion of the speedup 25 also comes from a customized matrix-matrix multiplica-26 tion kernel, which is used extensively throughout our 27 program. Preliminary performance analysis on our parallel 28 GPU codes shows favorable strong and weak scalabilities. 29 30
Introduction

33
Computer simulations of seismic wavefields have been 34 playing an important role in seismology in the past few 35 decades. However, the accurate and computationally effi-36 cient numerical solution of the three-dimensional elastic 37 seismic wave equation is still a very challenging task, 38 especially when the material properties are complex, and 39 the modeling geometry, such as surface topography and 40 subsurface fault structures, is irregular. In the past, several 41 numerical schemes have been developed to solve the 42 elastic seismic wave equation. The finite-difference (FD) 43 method was introduced to simulate SH and P-SV waves on 44 regular, staggered-grid, two-dimensional meshes in Ma-45 dariaga (1976) and Virieux (1984 Virieux ( , 1986 . The FD method 46 was later extended to three spatial dimensions and to 47 account for anisotropic, visco-elastic material properties 48 (e.g., Mora 1989; Igel et al. 1995; Tessmer 1995; Graves 49 1996; Moczo et al. 2002) . The spatial accuracy of the FD 50 method is mainly controlled by the number of grid points 51 required to accurately sample the wavelength. The pseudo-52 spectral (PS) method with Chebychev or Legendre poly-53 nomials (e.g., Carcione 1994; Tessmer and Kosloff 1994; 54 Igel 1999) partially overcomes some limitations of the FD 55 method and allows for highly accurate computations of 56 spatial derivatives. However, due to the global character of 57 its derivative operators, it is relatively cumbersome to 58 account for irregular modeling geometries, and efficient 59 and scalable parallelization on distributed-memory com-60 puter clusters is not as straightforward as in the FD method. 61 Another possibility is to consider the weak (i.e., varia-62 tional) form of the seismic wave equation. The finite-ele-63 ment (FE) method (e.g., Lysmer and Drake 1972; Bao et al. 64 1998) and the spectral-element (SE) method (e.g., Ko-65 matitsch and Vilotte 1998; Komatitsch and Tromp 1999, 119 scientific computing, there are some inevitable trends, for 120 instance, the increase in the number of cores in general-121 purpose CPUs and the adoption of many-core accelerators 122 (e.g., Field Programmable Gate Array, Graphic Processing 123 Unit, Cell Broadband Engine) due to their smaller footprints 124 and lower power consumptions than general-purpose CPUs.
125 The users who want to once again experience substantial 126 performance improvements as before need to learn how to 127 exploit multiple/many cores. The GPU is becoming an 128 attractive co-processor for general-purpose scientific com-129 puting due to its high arithmetic computation power, large 130 memory bandwidth, and relatively lower costs and power 131 consumptions per FLOP, when compared with a typical 132 CPU. A typical GPU (e.g., NVIDIA GeForce 9800) can 133 reach a peak processing rate of 700 GFLOPS (1 134 GFLOPS = 10 9 floating-point-operations per second) and a 135 peak memory bandwidth of 70 GB/s. Unlike in a conven-136 tional CPU, in a GPU, many more transistors are dedicated 137 for data processing rather than data caching or flow control, 138 which makes GPUs particularly well suited to address 139 problems that can be expressed as data-parallel computa-140 tions. Recent efforts by GPU vendors, in particular, NVI-141 DIA's CUDA (Compute Unified Device Architecture) 142 programming model, the OpenCL (Open Computing Lan-143 guage) framework, and the OpenACC compiler directives 144 and APIs, have significantly increased the programmability 145 of commodity GPUs. Using these tools, a programmer can 146 directly issue and manage data-parallel computations on 147 GPUs using high-level instructions without the need to map 148 them into a set of graphic-processing instructions.
149 With the rapid development of the GPU programming 150 tools, various numerical algorithms have been successfully 151 ported to GPUs, and GPU-CPU hybrid computing plat-152 forms and substantial speedups, compared with pure-CPU 153 implementations, have been achieved for applications in 154 different disciplines. The discontinuous Galerkin (DG) 155 method for solving the 3D Maxwell's equations, which are 156 linear, hyperbolic systems of conservation laws similar to 157 the seismic wave equation, has been successfully mapped 158 to GPU using NVIDIA's CUDA framework and achieved 159 more than an order of magnitude speed-up compared with 160 the same implementation on a single current-generation 161 CPU (Klöckner et al. 2009 ). The GPU implementation was 162 done on a single NVIDIA GTX 280, which costs around 163 $400. A significant portion of the speed-up came from the 164 high-order nature of the DG method. In the area of 165 acoustic/elastic seismic wave propagation simulations, the 166 finite-difference and the spectral-element methods have 167 been successfully implemented on CPU-GPU hybrid 168 clusters using the CUDA programming model (e.g., Abd-169 elkhalek et al. 2009; Komatitsch et al. 2009; Komatitsch 170 et al. 2010; Michéa and Komatitsch 2010; Okamoto et al. 171 2010; Wang et al. 2010) . The speedup obtained varies from 172 around 209 to around 609 depending on several factors, 173 e.g., whether a particular calculation is amenable to GPU 174 acceleration, how well the reference CPU code is opti-175 mized, the particular CPU and GPU architectures used in 176 the comparisons, and the specific compilers, as well as the 177 compiler options, used for generating the binary codes.
178
In Mu et al. (2013) , we successfully ported the ADER-DG 179 method for solving three-dimensional elastic seismic wave 180 equation to a single NVIDIA Tesla C2075 GPU using 181 CUDA and obtained a speedup factor of about 24 when 182 compared with the serial CPU code running on one Intel 183 Xeon W5880 core. In this article, we explore the potential of 184 accelerating the ADER-DG method using multiple NVIDIA 185 Fermi GPUs with CUDA and the Message-Passing Interface 186 (MPI). Our reference CPU code is a community code named 187 ''SeisSol.'' The ''SeisSol'' code was written in Fortran 90 188 and parallelized using the Message-Passing Interface (MPI). 204 2 CUDA programming model 205 For readers who are not familiar with CUDA or GPU 206 programming, we give a very brief introduction about the 207 programming model in this section. The CUDA software 208 stack is composed of several layers, including a hardware 209 driver, an application programming interface (API), and its 210 runtime environment. There are also two high-level, 211 extensively optimized CUDA mathematical libraries, the 212 fast Fourier transform library (CUFFT) and the basic linear 213 algebra subprograms (CUBLAS), which are distributed 214 together with the software stack. The CUDA API com-215 prises an extension to the C programming language for a 216 minimum learning curve. The complete CUDA program-217 ming toolkit is distributed free of charge and is regularly 218 maintained and updated by NVIDIA.
219
A CUDA program is essentially a C program with mul-220 tiple subroutines (i.e., functions). Some of the subroutines 221 may run on the ''host'' (i.e., the CPU), and others may run on 222 the ''device'' (the GPU). 239 The memory on a GPU is organized in a hierarchical 240 structure. Each thread has access to its own register, which 241 is very fast, but the amount is very limited. The threads 242 within the same block have access to a small pool of low-243 latency ''shared memory.'' The total amount of registers 244 and shared memory available on a GPU restricts the 245 maximum number of active warps on a multiprocessor (i.e., 246 the ''occupancy''), depending upon the amount of registers 247 and shared memory used by each warp. To maximize 248 occupancy, one should minimize the usage of registers and 249 shared memory in the kernel. The most abundant memory 250 type on a GPU is the ''global memory''; however, accesses 251 to the global memory have much higher latencies. To hide 252 the latency, one needs to launch a large number of thread 253 blocks so that the thread scheduler can effectively overlap 254 the global memory transactions for some blocks with the 255 arithmetic calculations on other blocks. To reduce the total 256 number of global memory transactions, each access needs 257 to be ''coalesced'' (i.e., consecutive threads accessing 258 consecutive memory addresses), otherwise the access will 259 be ''serialized'' (i.e., separated into multiple transactions), 260 which may heavily impact the performance of the code.
261 In addition to data-parallelism, GPUs are also capable of 262 task parallelism which is implemented as ''streams'' in 263 CUDA. Different tasks can be placed in different streams, 264 and the tasks will proceed in parallel despite the fact that they 265 may have nothing in common. Currently task parallelism on 266 GPUs is not yet as flexible as on CPUs. Current-generation 267 NVIDIA GPUs now support simultaneous kernel executions 268 and memory copies either to or from the device. 
270
The ADER-DG method for solving the seismic wave 271 equation is both flexible and robust. It allows unstructured 272 meshes and easy control of accuracy without compromising 273 simulation stability. Like the SE method, the solution inside 274 each element is approximated using a set of orthogonal basis 275 functions, which leads to diagonal mass matrices. These 276 types of basic functions exist for a wide range of element 277 types. Unlike the SE or typical FE schemes, the solution is 278 allowed to be discontinuous across element boundaries. The 279 discontinuities are treated using well-established ideas of 280 numerical flux functions from the high-order finite-volume 281 framework. The spatial approximation accuracy can be 282 easily adjusted by changing the order of the polynomial 283 basis functions within each element (i.e., p-adaptivity). The 284 ADER time-stepping scheme is composed of three major 285 ingredients, a Taylor expansion of the degree-of-freedoms 286 (DOFs, i.e., the coefficients of the polynomial basis func-287 tions in each element) in time, the solution of the Derivative 288 Riemann Problem (DRP) (Toro and Titarev 2002) that 289 approximates the space derivatives at the element bound-290 aries and the Cauchy-Kovalewski procedure for replacing 291 the temporal derivatives in the Taylor series with spatial 292 derivatives. We summarize major equations of the ADER-293 DG method for solving the three-dimensional isotropic 294 elastic wave equation on unstructured tetrahedral meshes in 295 the following. Please refer to Dumbser and Käser (2006) for 296 details of the numerical scheme.
297
The three-dimensional elastic wave equation for an iso-298 tropic medium can be expressed using a first-order velocity-299 stress formulation and written in a compact form as
301 301 where Q is a 9-vector consisting of the six independent 302 components of the symmetric stress tensor and the velocity 303 vector Q ¼ ðr xx ; r yy ; r zz ; r xy ; r yz ; r xz ; u; v; wÞ T , and A pq , B pq 304 and C pq are space-dependent 9 9 9 sparse matrices with the 305 nonzero elements given by the space-dependent Lamé 306 parameters and the buoyancy (i.e., the inverse of the density). 307 Summation for all repeated indices is implied in all equations. 308 The seismic source and the free-surface and absorbing 309 boundary conditions can be considered separately as shown in 310 Käser and Dumbser (2006) and Dumbser and Käser (2006 
316 316 whereQ ðmÞ pl ðtÞ are time-dependent DOFs, and n, g, f are 317 coordinates in the reference element T E . Explicit 318 expressions for the orthogonal basis functions U l ðn; g; fÞ 319 on a reference tetrahedral element are given in Cockburn 320 et al. (2000) and the appendix A of Käser et al. (2006) . (3) and convert all the integrals from the global 334 xyz-system to the ngf-system in the reference element T E 335 through a coordinate transformation, we obtain the semi-336 discrete discontinuous Galerkin formulation, 
356 356 and the m-th temporal derivative can be determined 357 recursively as
359
The Taylor expansion of the DOF at time t n is,
361 361 which can be integrated from t n to t nþ1 ,
363 363 where Dt ¼ t nþ1 À t n , and o m tQpn ðt n Þ can be computed 364 recursively using Eq. (8).
365
Considering Eq. (9), the fully discretized system can 366 then be obtained by integrating the semi-discrete system, 367 Eq. (5), from t n to t nþ1 ;
369
Equation (10), together with Eqs. (8) and (9), provides 370 the mathematical foundation for our GPU implementation 371 and optimization.
372 4 Implementation and optimization on multiple GPUs 373 Prior to running our wave-equation solver, a tetrahedral 374 mesh for the entire modeling domain was generated on a 375 CPU using the commercial mesh generation software 376 ''GAMBIT.'' The mesh generation process is fully auto-377 mated, and the generated tetrahedral mesh conforms to all 378 discontinuities built into the modeling geometry, including 379 irregular surface topography and subsurface fault struc-380 tures. The entire mesh was then split into subdomains, one 381 per GPU, using the open-source software ''METIS'' which 382 is a serial CPU program for partitioning finite-element 383 meshes in a way that minimizes inter-processor commu-384 nication cost while maintaining load balancing.
385
Pre-processing
386
In Fig. 1 , we listed the major steps in the reference parallel 387 CPU code, ''SeisSol'' la Puente De et al. (2009) , and those 388 in our parallel CPU-GPU hybrid implementation. In our 389 parallel CPU-GPU hybrid implementation, we assume that 390 each MPI process has access to only one device, and each 391 device is controlled by only one MPI process. At the start 392 of the calculation, a sequence of pre-processing steps is 393 executed on the CPUs. The pre-processing sequence 394 includes:
395
(1) reading and processing a control file; 396 (2) reading and processing geometric information, which 397 include the tetrahedral mesh, the boundary condi-398 tions, the material properties (i.e., density and Lamé 399 parameters) for each element, and the mesh parti-400 tioning information generated by METIS; 401 (3) for the elements in each subdomain, creating a list of 402 all the elements that are in contact with elements in 403 other subdomains, which we call the ''outer'' ele-404 ments, and those that are not, which we call the 405 ''inner'' elements; 406 (4) reading and processing the DG matrices, which 407 include the mass, stiffness, and flux matrices, which 408 were pre-computed and stored on the disk; and 409 (5) reading and processing the files describing the seismic 410 source and seismic receivers.
411
Our CUDA program adopts the typical CUDA pro-412 gramming model. After the pre-processing sequence is 413 carried out on the host CPUs, the arrays needed by the 414 CUDA kernels are then copied to the global memory of the 415 devices using ''cudaMemcpy.'' The hosts then call a 416 sequence of CUDA kernels in every time step. The results 417 of the simulation (e.g., the synthetic seismograms) are 418 stored on the devices during the time loop and copied back 419 to the hosts after all time steps are completed. 420 In our implementation, the calculation for each tetra-421 hedral element is carried out by one thread block. Within 422 each thread block, the number of threads depends upon the Fig. 1 The flowcharts of the major steps in the reference parallel CPU codes (left) and those in our CPU-GPU hybrid implementation (right). The whole calculation can be separated into three sections: pre-processing, time-stepping, and post-processing. The pre-processing section reads and calculates all the data that the time-stepping section will use. The time-stepping section updates the DOFs of each tetrahedral element according to Eqs. (8)- (10) and has been ported to the GPU. The post-processing section is in charge of writing out the DOFs and/or the seismograms at the pre-specified locations 428 element is 35. Considering the nine components of the 429 governing PDE (i.e., six stress components and three 430 velocity components), the DOFs of each element consist of 431 a 9 9 35 matrix which is represented in memory as a one-432 dimensional array of length 315 organized in the column-433 major ordering. To obtain better memory alignment, we 434 padded five zeros behind the DOFs of each element so that 435 the length of the one-dimensional DOF array of each ele-436 ment is increased to 320, which is 10 times the number of 437 threads in a warp. For a subdomain of ''nElem'' elements, 438 the length of the one-dimensional DOF array for the whole 439 subdomain is, therefore, ''nElem 9 320.'' The amount of 440 memory that is wasted on purpose is less than 1.6 %; 441 however, the better memory alignment improved the per-442 formance of some simple operations such as summation 443 operations and scalar-product operations by around 6.3 %.
4.2 Matrix-matrix multiplications
445 The implementation and optimization details for steps (1)- (5) 446 are documented in Mu et al. (2013) . In this section, we give a 447 very brief summary. In those steps, most of the total wall-time 448 is spent on matrix-matrix multiplications. We use step (2) 449 which computes the volume contribution, as an example. A 450 flowchart of the major calculations in step (2) 455 in Fig. 2a , is computed in step (1) and has the same dimension 456 and memory layout as the DOF array. The multiplication 457 between K n kl , denoted as ''Kxi'' in Fig. 2a , and the time-458 integrated DOF is different from the normal matrix-matrix 459 product in linear algebra. This multiplication involves three 460 steps: first, transpose the time-integrated DOF matrix, second, 461 multiply with the stiffness matrix following the usual matrix-462 matrix product rule, third, transpose the matrix obtained in the 463 previous step. We call this multiplication the ''left-multipli-464 cation.'' A code segment for the baseline CUDA implemen-465 tation of the left-multiplication is shown in Fig. 2b . This left-466 multiplication operation is used extensively through the cal-467 culations in steps (1)-(4) and deserves more optimization 468 effort. First, we can reduce the number of floating-point 469 operations by exploiting the fact that some of the matrices in 470 this operation are sparse; second, the order of the floating-471 point operations can be rearranged in a way such that the 472 accesses to ''dgwork'' in the global memory are as coalesced 473 as possible. The DOF, its temporal derivatives, and the time-474 integrated DOF are generally dense. However, the stiffness 475 matrices and the Jacobians have fill-ratios ranging from 8.8 % 476 to 29.6 %. To take advantage of this sparsity, one possibility 477 is to adopt an existing sparse linear algebra library such as 478 ''CUSP'' (Bell and Garland 2009) or cuSPARSE. However, 479 the result we obtained using ''CUSP'' was not very satisfac-480 tory. The best performance gain, which was obtained using 481 the ''HYB'' matrix format, was about 36.2 % compared with 482 the baseline implementation shown in Fig. 2b . This is largely 483 due to the very irregular distribution of the non-zeros in our 484 matrices, which caused a large number of uncoalesced 485 accesses to the time-integrated DOF arrays, and the amount of 486 arithmetic calculations was not large enough to hide the 487 memory access latencies due to the low fill-ratio in the stiff-488 ness matrix. Considering the fact that the locations of the non-489 zero elements in the stiffness matrices can be determined 490 beforehand and are fixed throughout the program, the results 491 of the left-multiplications can be evaluated analytically 492 beforehand and expressed in terms of the non-zero elements 493 in those matrices using a computer algebra system. The 494 expressions of the left-multiplication results, which are linear 495 combinations of the time-integrated DOF with coefficients 496 given by the non-zero elements of the stiffness matrices, can 497 be hardwired into the CUDA kernels. This implementation 498 eliminates all redundant calculations involving zero elements 499 and by carefully arranging the order of the calculations in 500 accordance with the thread layout, we can also minimize the 501 number of uncoalesced memory accesses to the time-inte-502 grated DOF array. A code segment of the optimized left-503 multiplication is shown in Fig. 2c , which is about 4 times 504 faster than the baseline implementation shown in Fig. 2b . 505 This approach can also be applied to normal matrix-matrix 506 product and is used throughout steps (1)-(4) in our optimized 507 CUDA codes. A drawback of this approach is that the 508 resulting kernel source code is quite long, and some manual 509 editing is required to ensure coalesced memory accesses. 510 However, modern computer algebra systems (e.g., Mathem-511 atica, Maple) usually have automated procedures for trans-512 lating long mathematical expressions into the C language 513 which is usually error-proof and can be directly incorporated 514 into the CUDA kernels with minimal effort. 515 Our own matrix-matrix multiplication scheme greatly 516 contributes our CUDA codes. Before we settled for the 517 final version of our single-GPU code, we also complete two 518 other versions of CUDA implementations, one is the 519 baseline implementation, which adapts conventional 520 CUDA dense matrix-matrix multiplication, and the other is 521 the median implementation which uses sparse matrix-522 matrix multiplication scheme. As a result, compared with 523 the CPU based ADER-DG code, the baseline implemen-524 tation obtains a speedup factor of 9.79, and the median 525 implementation improved the speedup factor from 9.79 to 526 13.29. This improvement mainly gains from getting rid of 527 all zeros elements' computation; however, due to the very 528 irregular distribution of non-zeros locality, this median (3) and (4), as 552 well as any contributions from the seismic source, by 553 using Eq. (10) which also involves inverting the mass 554 matrix M kl , which is diagonal.
555
All the calculations in steps (1), (2), (3), and (5) can be 556 performed in an element-local way and require no inter-557 element information exchange, which is ideal for SIMD-558 type processors such as GPUs. The calculations in step (4) 559 need to use the time-integrated DOFs from all neighboring 560 elements which in our distributed-memory, parallel 566 In our implementation, we calculate the time-integrated 567 DOFs for all the outer elements of a subdomain first. The 568 calculation of the time-integrated DOF requires access to 569 the DOF array in the global memory. The DOFs of the 570 outer elements are usually scattered throughout the entire 571 DOF array of the subdomain. To avoid non-coalesced 572 memory accesses, which could impact performance by up 573 to 54 %, the entire DOF array is split into two sub-arrays, 574 one for DOFs of all the outer elements and the other for the 575 DOFs of all inner elements. Once we complete the calcu-576 lations of the time-integrated DOFs of the outer elements, 577 the device starts to compute the time-integrated DOFs of 578 the inner elements of the subdomain right away. At the 579 same time, the time-integrated DOFs of the outer elements 580 are assembled into a separate array which is then copied 581 into the host memory asynchronously to fill the MPI buffer 582 using a separate CUDA stream, and then the host initiates a 583 non-blocking MPI data transfer and returns. While the 584 messages are being transferred, the device completes the 585 calculations of the time-integrated DOFs of the inner ele-586 ments, combines them with the time-integrated DOFs of 587 the outer elements into a single time-integrated DOF array 588 and proceeds to calculations of the volume contributions in 589 step (2) and the first numerical flux term in step (3). On the 590 host, synchronization over all the MPI processes is per-591 formed; once the host receives the array containing the 592 time-integrated DOFs of the outer elements on the neigh-593 boring subdomains, it is copied to the device asynchro-594 nously using a separate CUDA stream. After completing 595 step (3), the device synchronizes all streams to make sure 596 that the required time-integrated DOFs from all neighbor-597 ing subdomains have arrived and proceed to calculate the 598 second numerical flux term in step (4) and then update the 599 DOFs as in step (5). The overhead for splitting the entire 600 DOF array into two sub-arrays for inner and outer elements 601 and for combining the time-integrated DOFs of the outer 602 and inner elements into a single array amounts to less than 603 0.1 % of the total computing time on the device. The entire 604 process is illustrated in Fig. 1 . 605 There are many factors can influent the speedup factor 606 contribution by overlapping communication with compu-607 tation; the overlapping benefit almost differs from com-608 puter to computer. Based on our completed experiments, 609 the best scenario which all the GPUs located on the same 610 node, the communication time takes 11.7 % of total com-611 putation time, while when each GPU locate on the different 612 node, the ratio could rise up to 25.1 %. However, since we 613 apply the multiple streams and the overlapping techniques, b Fig. 2 a The flowchart of the calculations in step (2), the volume contributions. ''dudt'' is the volume contribution, ''Kxi,'' ''Keta,'' ''Kzeta'' correspond to the stiffness matrices, K To ensure effective communication-computation over-617 lap, the ratio of the number of the outer to inner elements 618 must be sufficiently small. An upper bound of this ratio can 619 be estimated based on both the processing capability of the 620 devices and the speed of the host-device and host-host 621 inter-connections. On the NVIDIA Fermi M2070 GPUs 622 that we experimented with, we achieved nearly zero 623 communication overheads when this ratio is below 2 %. 624 We note that if the same approach is implemented using a 625 classic CPU cluster, this ratio can be much larger, since the 626 calculations for the inner elements and steps (2) and (3) are 627 over an order of magnitude slower on a CPU core.
628 5 Performance analysis 629 In this study, the speedup factor is defined as the ratio 630 between the wall-time spent on running a simulation on a 631 number of CPU cores with the wall-time spent on running 632 the same simulation on the same or less number of GPUs. 633 The CPU used as the reference is the Intel Xeon W5660 634 (2.80 GHz/12 MB L2 cache) processor, and the GPU is the 635 NVIDIA Tesla C2070 (1.15 GHz/6 GB 384bit GDDR5) 636 processor. Specifications of our CPU and GPU processors 637 can be found on Intel and NVIDIA's websites.
638
The speedup factor depends strongly upon how well the 639 reference CPU code is optimized and sometimes also on 640 the specific CPU compiler and compiling flags. The fastest 641 executable on our CPU was obtained using the Intel 642 compiler ''ifort'' with the flag ''-O3.'' The wall-time for 643 running the CPU code in double-precision mode is only 644 slightly longer than running the CPU code in single-pre-645 cision mode by around 5 %. Our GPU codes were com-646 piled using the standard NVIDIA ''nvcc'' compiler of 647 CUDA version 4.0. Throughout this article, we use the 648 double-precision version of the fastest CPU code as the 649 reference for computing the speedup factors of our single-650 and double-precision GPU codes.
651
The accuracy of our single-precision GPU codes is 652 sufficient for most seismological applications. The com-653 puted seismograms have no distinguishable differences 654 from the seismograms computed using the double-preci-655 sion version of the reference CPU code, and the energy of 656 the waveform differences is much less than 1 % of the total 657 energy of the seismogram.
658 5.1 Single-GPU performance 659 For our single-GPU performance analysis, we computed 660 the speedup factors for problems with seven different mesh 661 sizes (Fig. 3) . The number of tetrahedral elements used in 662 our experiments is 3, 799, 6,899, 12,547, 15,764, 21,121, 663 24,606, and 29,335 . The material property is constant 664 throughout the mesh with density 3,000 kg/m 3 and Lamé 665 parameters k 5.325 9 10 10 and l 3.675 9 10 10 Pascal. We 666 applied the traction-free boundary condition on the top of 667 the mesh and absorbing boundary condition on all other 668 boundaries. The seismic source is an isotropic explosive 669 source buried in the center of the mesh. The wall-time 670 measurements were obtained by running the simulations 671 for 1,000 time steps. The speedup factors were computed 672 for our single-precision GPU code with respect to the CPU 673 code running on one, two, four, and eight cores. For the 674 multi-core runs on the CPUs, the parallel version of the 675 ''SeisSol'' code is used as the reference. For the seven 676 different mesh sizes, the speedup factor ranges from 23.7 to 677 25.4 with respect to the serial CPU code running on one 678 core, from 12.2 to 14 with respect to the parallel CPU code 679 running on two cores, from 6.5 to 7.2 with respect to the 680 parallel CPU code running on four CPU cores, and from 681 3.5 to 3.8 with respect to the parallel CPU code running on 682 eight CPU cores. The speedup factor does not decrease 683 linearly with increasing number of CPU cores. For 684 instance, the speedup factor with respect to eight CPU 685 cores is about 14 % better than what we would have 686 expected considering the speedup factor with respect to one 687 CPU core if we had assumed a linear scaling. For the 688 parallel version of the CPU code, there are overheads 689 incurred by the MPI communication among different cores, 690 while for the single-GPU simulations, such communication 691 overheads do not exist. 692 Since most of the calculations in the ADER-DG method 693 are carried out in an element-local way, and there are no 694 inter-process communication overheads on a single GPU, 695 we expect the ''strong scaling'' on a single GPU, defined as 696 the total wall-time needed to run the application on one 697 GPU when the total number of elements (i.e., the amount 698 of workload) is increased (e.g., Michéa and Komatitsch 699 2010) , to be nearly linear. In Fig. 4 , we show the total wall-700 time for 100 time steps as a function of the total number of 701 tetrahedral elements. As we can see, the strong scaling of 702 our codes on a single GPU is almost linear. The calcula-703 tions in step (4) (i.e., the second term in the numerical flux) 704 involve time-integrated DOFs of direct neighbors; how-705 ever, this inter-element dependence keeps its spatially local 706 character, while the number of elements increases and does 707 not affect the scaling.
708
Multiple-GPU performance
709
To analyze the performance of the parallel version of our 710 CUDA codes, we use a simplified version of the SEG/ 711 EAGE salt model (Käser et al. 2010) as the benchmark. 712 This model is geometrically complex, as shown in Fig. 5a Journal : Large 11589
Dispatch : 10-12- Table 1 . We note 718 that a thin layer of water lies on top of the three-dimen-719 sional model. The ADER-DG method can accurately 720 handle seismic wave propagation in water simply by set-721 ting the shear modulus of the elements in the water region 722 to zero (Käser and Dumbser 2008) . 723 This salt model is discretized into tetrahedral meshes 724 with different number of elements. In Fig. 6 , we show the 725 speedup factors obtained for two different mesh sizes, one 726 with 327,886 elements, and the other with 935,870 ele-727 ments. The simulations were run on 8, 16, 32, and 48 CPU 728 cores using the parallel version of the ''SeisSol'' code. And 729 the speedup factors were obtained by running the same 730 simulations on the same number of GPUs. On average, the 731 speedup factor for our parallel GPU codes is around 28, 732 which is slightly higher than the speedup factor obtained in 733 the single-GPU-single-CPU comparison. This may due to 734 the fact that in the parallel CPU code, the outer elements of 735 a subdomain are not treated separately from the inner 736 elements, which do not allow the parallel CPU code to 737 overlap the computation on the inner elements with the 738 communication of the time-integrated DOFs of the outer 739 elements.
740 To investigate the strong scalability (i.e., the decrease in 741 wall-time with increasing GPU number, while holding the 742 total workload that is the number of elements and time Fig. 3 Single-GPU speedup factors obtained using 7 different meshes and 4 different CPU core numbers. The total number of tetrahedral elements in the 7 meshes is 3, 799, 6,899, 12,547, 15,764, 21,121, 24,606, and 29,335 , respectively. The speedup factors were obtained by running the same calculation using our CPU-GPU hybrid code with 1 GPU and using the serial/parallel ''SeisSol'' CPU code on 1/2/4/8 CPU cores on the same compute node. The black columns represent the speedup of the CPU-GPU hybrid code relative to 1 CPU core, the dark-gray columns represent the speedup relative to 2 CPU cores, the light gray column represents the speedup relative to 4 CPU cores and the lightest gray columns represent the speedup relative to 8 CPU cores 755 of the GPU and the speed of the inter-connections. In our 756 case, when the number of GPUs used in the simulation 757 starts to exceed 48, this ratio becomes larger than 2 %, 758 which we believe is the threshold for our hardware con-759 figuration. The performance of our parallel GPU codes 760 depends upon a number of factors, such as load balancing, 761 but we think the extra communication overhead that was 762 not effectively hidden by the computation was the domi-763 nant factor for causing our codes to underperform the ideal 764 case. In Fig. 8 779 We have applied our multiple-GPUs code on two well-780 defined models, one is the SEG-EAGE Salt, and the other 781 is marmousi2. All the results with these two models have 782 been compared with CPU code SeisSol for validation. 783 For the marmousi2 model, we extrude its 2D original 784 profile and make it a 3D model with dimension of 3,500 m 785 in depth, 17,000 m in length, and 7,000 m in width 786 (Fig. 9a) . There are 379,039 tetrahedral elements, and each 787 element has its own material property, also 337 receivers 788 locate at 5 m beneath the surface along the A-A 0 (yellow) 789 line, and the horizontal interval is 50 m; the explosive 790 source located at 10.0 m(depth), 7,500.0 m(length), 791 3,500.0 m(width). In this case, we used 16 M2070 Fermi 792 GPUs located on eight different nodes, and each node has 2 793 GPUs. Our CUDA code uses 16 GPUs spend 4, 812.64(s) to 794 calculate 5 s seismogram, (Fig. 9b) 
820
The parallel version of our CUDA codes runs about 28.3 821 times faster than the reference parallel CPU codes at single 822 precision and about 14.9 times at double precision. The 823 increase in speed can be directly translated into an increase 824 in the size of the problems that can be solved using the 825 ADER-DG method. Some preliminary performance ana-826 lysis shows that our parallel GPU codes have favorable 827 strong and weak scalability as long as the ratio between the 828 number of outer elements and inner elements of each sub-829 domain is smaller than a certain threshold. 830 The ADER-DG method has a number of unique charac-831 teristics that make it very suitable for acceleration using 832 GPU-type SIMD processors. The majority of the calcula-833 tions can be carried out in an element-local way with weak 834 inter-element coupling implemented using the numerical 835 flux functions. In particular, as shown in Eq. (10), the only 836 term that involves inter-element information exchange is the 837 second term in the numerical flux, and we have shown that 838 on a distributed-memory parallel system, the communica-839 tion cost can be effectively overlapped with computation. 840 This locality in the ADER-DG method makes it relatively 841 straightforward to partition the workload among different 842 thread blocks on each GPU and also among different GPUs 843 on the cluster and results in close-to-ideal scalability. The 844 ADER-DG method is also a high-order method, which 845 requires more work per DOF than low-order methods. The 846 increase in arithmetic intensity shifts the bottleneck from the 847 memory bandwidth to the compute bandwidth. The relative 848 abundance of cheap computing power on a GPU makes it 849 favorable for implementing high-order methods.
7 Discussion
851 Debates still exist in the computer sciences community 852 about how the speedup factor should be defined in a more 853 general and objective way. Some definitions are more 854 favorable to the GPUs, and some definitions are more 855 favorable to the CPUs. But in spite of the debates about the 856 definitions of the speedup factor, a common consensus 857 among both the theoreticians and the practitioners is that 858 GPUs are relatively low-cost, low-power-consumption, 859 powerful co-processors that are suitable for SIMD-type 860 calculations and studies about efficient implementations of 861 numerical algorithms for scientific computing on the GPU 862 architectures are worthwhile. In this sense, the lessons 863 learned through the implementation and optimization 864 process are more important than the exact speedup num-865 bers that we have obtained.
866 Unlike implementations on a single GPU, to port the 867 codes to multiple GPUs effectively, we need to deal with 868 an extra layer of complexity introduced by inter-process 869 communications. For the ADER-DG method, in which a 870 majority of the calculations are local to each element, we 871 can hide the inter-process communication overhead by 872 overlapping communication with computation. We sepa-873 rate the elements on a subdomain into inner and outer 874 elements. Once the computation on the outer elements is 875 completed, we can fill the MPI buffer using a separate 876 stream on the GPU and issue a non-blocking MPI call on 877 the CPU. While the MPI messages are traveling across the 878 network, the GPU proceeds to perform computations on the 879 inner elements, and all other computations that do not need 880 information from the outer elements. The technologies for 881 multiple-GPU and CPU-GPU inter-connections are rapidly 882 evolving, and GPU-Aware MPI (GAMPI) libraries for 883 CUDA-enabled devices are gradually emerging. It is likely 884 that the process for overlapping communication with 
