We evaluate the performance of Devito, a domain-specific language (DSL) for finite differences, on Arm ThunderX2 processors. Experiments with two common seismic computational kernels demonstrate that Arm processors can deliver performance competitive with Intel Xeon processors.
HPC-OPTIMISED ARM PROCESSORS
Arm processors such as the Huawei Kunpeng 920, Ampere eMAG, Fujitsu A64FX, and Marvell ThunderX2 are emerging as an alternative to traditional x86 architectures for HPC. Isambard [3] is the largest Arm-based HPC production system in Europe, and the first Cray XC50 (Scout) system to combine Arm-based processors (32-core Marvell ThunderX2) with Cray's Aries interconnect. Each of the 42 blades integrates 4 nodes, each with two 32-core Marvell ThunderX2 CPUs and 256 GB of DDR4 DRAM. The whole system has 10,752 Armv8 cores (42 blades × 4 nodes × 2 CPUs × 32 cores). Recent studies compared the single-node performance and multi-node scalability of Arm systems [3] [4] [5]. They demonstrated that, for a wide range of applications, an Arm-based supercomputer provides performance competitive with state-of-the-art HPC-optimised processors (e.g. Intel Skylake and Broadwell), with a very attractive performance-per-dollar ratio.
DEVITO - A DSL FOR FINITE DIFFERENCES
Devito (https://www.devitoproject.org/) is a DSL and a framework for the solution of PDEs based on the finite difference method (FDM). Initially designed to implement high-performance wave propagation solvers and adjoint-state methods for seismic imaging problems, Devito allows FDM and general stencil operations to be expressed concisely and symbolically. Devito uses SymPy for the generation and manipulation of stencil expressions, and a pipeline of compilers and libraries to automate code generation, applying several symbolic and loop optimisations to generate highly efficient implementations of algorithms for different hardware architectures [2]. Originally, Devito was designed to support Intel Xeon and Intel Xeon Phi, and early work investigated optimisation strategies that had not been considered by other stencil compilers. For example, most stencil compilers focus on cache reuse optimisation, while stencils like TTI [5] have very high arithmetic intensity, which results in elevated register pressure and requires specific optimisation techniques [2]. In addition, there are mathematical operators that fall outside the regular stencil programming model but need to be supported for practical applications. For example, source injection, interpolation at receivers, and complex boundary conditions rely upon computation that is both sparse and irregular. Currently, parallelism is supported through OpenMP and MPI, which are integrated into the Devito stack.
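To make the distinction between regular stencil updates and sparse, irregular operators concrete, the following pure-Python sketch writes out by hand the kind of loop nest that a stencil DSL would generate automatically. It is not Devito code and does not use the Devito API. The function names (ricker, propagate), the Ricker source wavelet, the 1D grid, the reflecting boundaries, and the simplified source scaling are all assumptions chosen for illustration.

```python
import math


def ricker(t, f0=10.0, t0=0.1):
    """Ricker wavelet with peak frequency f0 (Hz), centred at t0 (s).

    Assumed here as a typical seismic source signature.
    """
    a = (math.pi * f0 * (t - t0)) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)


def propagate(nx=201, dx=10.0, c=1500.0, nt=300, src_x=1003.0, rec_x=1507.0):
    """Solve u_tt = c^2 u_xx in 1D, second order in time and space.

    All parameter values are hypothetical. Returns the receiver trace.
    """
    dt = 0.5 * dx / c              # time step chosen for a CFL number of 0.5
    r2 = (c * dt / dx) ** 2

    # Off-grid source and receiver positions: left neighbour index and
    # linear interpolation weight towards the right neighbour.
    si, sw = int(src_x // dx), (src_x / dx) % 1.0
    ri, rw = int(rec_x // dx), (rec_x / dx) % 1.0

    u_prev = [0.0] * nx
    u_curr = [0.0] * nx
    trace = []
    for n in range(nt):
        u_next = [0.0] * nx        # end points stay at zero (reflecting)

        # Regular part: the same 3-point stencil at every interior point.
        for i in range(1, nx - 1):
            lap = u_curr[i - 1] - 2.0 * u_curr[i] + u_curr[i + 1]
            u_next[i] = 2.0 * u_curr[i] - u_prev[i] + r2 * lap

        # Sparse, irregular part 1: source injection touches two points only.
        q = (c * dt) ** 2 * ricker(n * dt)
        u_next[si] += (1.0 - sw) * q
        u_next[si + 1] += sw * q

        # Sparse, irregular part 2: interpolate the field at the receiver.
        trace.append((1.0 - rw) * u_next[ri] + rw * u_next[ri + 1])

        u_prev, u_curr = u_curr, u_next
    return trace
```

Calling propagate() returns one receiver sample per time step. The interior loop is the part that stencil compilers typically optimise, while the injection and interpolation steps index the grid at data-dependent locations and therefore fall outside the regular stencil model.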
PERFORMANCE EVALUATION
We experimented on the Isambard system, which was described in Section 1. Single-socket performance is compared against an Intel Xeon Gold 5120 and an Intel Xeon Gold 6126. See Table 1 for specifications of all three processors. Memory bandwidth was measured with the STREAM benchmark, compiled with GCC for the Intel processors and with CCE for the Arm processor, where CCE gave slightly better results. All experiments were executed 10 times and the best bandwidth was taken.
To evaluate the performance of Devito, we used two benchmarks: (i) the acoustic wave equation, which models the propagation of an isotropic acoustic wave; and (ii) the Tilted Transverse Isotropy (TTI) model [5], which is representative of the state-of-the-art wave propagators used for seismic imaging in production codes today. The full model specification, its finite difference schemes, and the implementation using Devito are presented in [1]. The first experiment measures performance as the number of threads running on a single socket increases. The ThunderX2 presents competitive execution times for both benchmarks compared to the Intel Xeon Gold 5120 and the Intel Xeon Gold 6126 (Table 2). While single-thread performance is better on the Xeons than on the ThunderX2, the ThunderX2 is competitive with the Xeons when all cores are utilised. This is because the benchmark is memory bound (low operational intensity) and the ThunderX2 has a much higher memory bandwidth than the Xeons.
The next experiment measured the performance of the code generated by Devito against the maximum performance of the Arm processor (Fig. 1). We performed a complete set of experiments covering two simulation models (acoustic isotropic and TTI), two compilers (GCC 8 and CCE), three Devito optimisation modes (basic, aggressive, DSE), three grid sizes (512³, 768³, and 1024³ points), and 20m steps.
For GCC we used the flags -O3 -g -fPIC -march=native --fast-math -shared -fopenmp, and for the CCE compiler we used the flags -O3 -g -fPIC -shared -homp. The results shown for the Arm processor were produced with GCC, which gave slightly better performance than CCE. In total, 288 simulations were executed, each replicated three times and averaged. The variance observed is negligible (< 1%).
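The argument that a memory-bound kernel benefits more from bandwidth than from peak compute can be sketched with a simple roofline-style bound, in which attainable performance is the smaller of the compute peak and the product of operational intensity and memory bandwidth. The helper names and every number below are hypothetical placeholders chosen to show the crossover; they are not the values from Table 1 and not measurements from this study.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, operational_intensity):
    """Roofline-style upper bound on performance in GFLOP/s.

    operational_intensity is in FLOPs per byte moved to or from memory.
    """
    return min(peak_gflops, operational_intensity * bandwidth_gbs)


def is_memory_bound(peak_gflops, bandwidth_gbs, operational_intensity):
    """True when the kernel sits to the left of the roofline ridge point."""
    return operational_intensity < peak_gflops / bandwidth_gbs


# Hypothetical machines (placeholder numbers, NOT taken from Table 1):
# machine A has the higher bandwidth, machine B the higher compute peak.
machine_a = {"peak_gflops": 500.0, "bandwidth_gbs": 100.0}
machine_b = {"peak_gflops": 800.0, "bandwidth_gbs": 60.0}

low_oi = 1.5    # assumed low-intensity kernel, FLOPs per byte
high_oi = 20.0  # assumed high-intensity kernel, FLOPs per byte

for name, m in (("A", machine_a), ("B", machine_b)):
    for oi in (low_oi, high_oi):
        bound = attainable_gflops(m["peak_gflops"], m["bandwidth_gbs"], oi)
        regime = "memory" if is_memory_bound(
            m["peak_gflops"], m["bandwidth_gbs"], oi) else "compute"
        print(f"machine {name}, OI={oi}: <= {bound} GFLOP/s ({regime} bound)")
```

Under these placeholder numbers, the higher-bandwidth machine A has the larger bound for the low-intensity kernel, while machine B only pulls ahead once the kernel becomes compute bound.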
FINDINGS
The results presented demonstrate that Arm-based processors are capable of delivering performance similar to state-of-the-art Intel Xeon processors for seismic inverse problems. Additionally, Devito is shown to be capable of generating efficient, high-performance code for Arm processors. All models compiled and ran successfully, and no architecture-specific code tuning was necessary to achieve high performance.
