The large processing requirements of seismic wave propagation simulations make High Performance Computing (HPC) architectures a natural choice for their execution. However, to keep both the current pace of performance improvements and the power consumption under a strict power budget, HPC systems must be more energy efficient than ever. As a response to this need, energy-efficient and low-power processors began to make their way into the market. In this paper we employ a novel low-power processor, the MPPA-256 manycore, to perform seismic wave propagation simulations. It has 256 cores connected by a NoC, no cache-coherence and only a limited amount of on-chip memory. We describe how its particular architectural characteristics influenced our solution for an energy-efficient implementation. As a counterpoint to the low-power MPPA-256 architecture, we employ Xeon Phi, a performance-centric manycore. Although both processors share some architectural similarities, the challenges to implement an efficient seismic wave propagation kernel on these platforms are very different. In this work we compare the performance and energy efficiency of our implementations for these processors to proven and optimized solutions for other hardware platforms such as general-purpose processors and a GPU. Our experimental results show that MPPA-256 has the best energy efficiency, consuming at least 77 % less energy than the other evaluated platforms, whereas the performance of our solution for the Xeon Phi is on par with a state-of-the-art solution for GPUs.
Introduction
For that, we perform regional scale modeling spanning a few hundred kilometers in each spatial direction. In the next section we present the standard sequential algorithm for seis-
For instance, the stencil applied for the computation of the velocity component in the x-direction is given by:

for y ← 1 to y_dimension
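For illustration only, the following Python sketch shows what a fourth-order staggered-grid update of the x-velocity component typically looks like. It is not the kernel used in this work: the coefficients 9/8 and 1/24 are the standard fourth-order staggered-grid weights, while the function names (`d4`, `update_vx`), the index convention, the uniform density `rho` and the use of nested lists are assumptions made for the example.

```python
# Standard fourth-order staggered-grid finite-difference weights.
C1 = 9.0 / 8.0
C2 = 1.0 / 24.0


def d4(near_plus, near_minus, far_plus, far_minus, h):
    """Fourth-order staggered first derivative from four samples spaced h apart."""
    return (C1 * (near_plus - near_minus) - C2 * (far_plus - far_minus)) / h


def update_vx(vx, sxx, sxy, sxz, rho, dt, h):
    """Advance vx by one time step in place (hypothetical index convention).

    Arrays are nested lists indexed [x][y][z]. Two boundary layers on each
    side are left untouched because the stencil reaches two cells away.
    """
    nx, ny, nz = len(vx), len(vx[0]), len(vx[0][0])
    for i in range(2, nx - 2):
        for j in range(2, ny - 2):
            for k in range(2, nz - 2):
                dsxx = d4(sxx[i + 1][j][k], sxx[i][j][k],
                          sxx[i + 2][j][k], sxx[i - 1][j][k], h)
                dsxy = d4(sxy[i][j][k], sxy[i][j - 1][k],
                          sxy[i][j + 1][k], sxy[i][j - 2][k], h)
                dsxz = d4(sxz[i][j][k], sxz[i][j][k - 1],
                          sxz[i][j][k + 1], sxz[i][j][k - 2], h)
                vx[i][j][k] += (dt / rho) * (dsxx + dsxy + dsxz)
    return vx
```

Each interior point reads 12 neighbouring stress values, which is why data reuse and memory traffic dominate the cost of this kind of kernel.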
The best parallelization strategy for the elastodynamics equations strongly depends on the characteristics of the underlying hardware architecture. In the following section, we detail the parallelization and optimization strategies we employed for the adaptation of this sequential algorithm to highly parallel manycore processors.

Elastodynamics Numerical Kernel on Manycores
In this section we present our approaches to perform seismic wave propagation simulations on MPPA-256 and Xeon Phi. We first discuss in Section 3.1 some of the intrinsic characteristics and challenges that led us to employ different parallelization strategies and optimizations in each one of these processors.
The development of a seismic wave propagation kernel for this processor can be a challenging task due to some of its intrinsic characteristics. First, the low-latency memory available in each compute cluster acts as a cache, whose goal is to store data retrieved from the DDR.

Tag Directory (TD). Additionally, every core can directly address the shared DDR memory (6 GB in our case) and is connected to the remaining cores by a high-performance bidirectional ring-shaped NoC as shown in Figure 1.

Xeon Phi comes from these units, it is essential to fully benefit from them. This can be achieved by performing a clever decomposition of the 3D input problem to maximize the use of long vectors in the unit-stride direction. The second aspect is related to the L2 caches. When a core C_src accesses its L2 cache and misses, an address request is sent to the tag directories throughout the ring.
If the requested data block is found in the cache of another core (C_dst), it is forwarded back through the ring to the L2 cache of C_src. Thus, the overhead imposed by this protocol must be avoided whenever possible to improve the application's performance. Overall, this can be achieved by organizing data memory accesses to improve data locality.
In the next sections we describe how the architectural distinctiveness of these

in which it is not uncommon to find last-level cache sizes of tens of megabytes,
MPPA-256 has only 32 MB of low-latency memory divided into 2 MB chunks spread throughout the 16 compute clusters. These chunks of memory are directly exposed to the programmer, who must explicitly control them. Indeed, the efficiency of our algorithm relies on the ability to fully exploit these low-latency memories. To that end, we implemented a data fractioning strategy that decomposes the problem into tiles small enough to fit into the memory available on the compute clusters. Figure 2 illustrates the general idea of our two-level tiling scheme.
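The sizing constraint behind such a decomposition can be sketched as follows. This is a minimal illustration, not the scheme of Figure 2: it assumes cubic tiles, a halo of two cells per side (consistent with a fourth-order stencil), nine single-precision fields per cell (three velocity and six stress components) and a memory budget equal to the full 2 MB of a cluster, whereas in practice part of that memory is needed for code and other data. All function names are hypothetical.

```python
def tile_footprint_bytes(tile, halo=2, n_fields=9, bytes_per_value=4):
    """Bytes needed to hold one tile plus its halo for all fields (assumed layout)."""
    tx, ty, tz = tile
    cells = (tx + 2 * halo) * (ty + 2 * halo) * (tz + 2 * halo)
    return cells * n_fields * bytes_per_value


def largest_cubic_tile(budget_bytes, **layout):
    """Largest cubic tile edge whose footprint fits the budget (0 if none fits)."""
    edge = 0
    while tile_footprint_bytes((edge + 1,) * 3, **layout) <= budget_bytes:
        edge += 1
    return edge


def split_axis(n_points, edge):
    """Half-open [start, stop) ranges covering one axis with tiles of at most `edge`."""
    return [(start, min(start + edge, n_points))
            for start in range(0, n_points, edge)]


budget = 2 * 1024 * 1024            # assumed budget: the whole 2 MB of one cluster
edge = largest_cubic_tile(budget)   # tile edge in grid points
ranges = split_axis(180, edge)      # tiles along one axis of a 180-point domain
```

Under these assumptions the halo is a significant fraction of each tile, so the choice of tile shape trades memory footprint against the volume of data that has to be transferred again for neighbouring tiles.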
The three-dimensional structures corresponding to the velocity and stress

for an important share of the total simulation time, especially for a 3D problem.
In particular, this limitation is more noticeable in processors that feature high core counts such as Xeon Phi due to the contention on the bidirectional data ring. To improve the cache locality we employ a cache blocking technique.
Cache blocking is an optimization technique that intends to reduce memory bandwidth bottlenecks. The main idea is to exploit the inherent data reuse available in the triple nested loop of the elastodynamics kernel by ensuring that data remains in cache across multiple uses.
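The loop restructuring involved can be illustrated with a small Python sketch. It is a generic example rather than our kernel: it uses a simple six-neighbour sum on a flat array with x as the unit-stride index, tiles the x and y dimensions and sweeps z inside each tile. The block sizes `bx` and `by` and all function names are hypothetical. Python itself will not show the cache effect; the point is only that the blocked traversal computes exactly the same result in a different order.

```python
def stencil_point(src, nx, ny, x, y, z):
    """Sum of the six face neighbours of (x, y, z) in a flat, x-fastest array."""
    c = (z * ny + y) * nx + x
    return (src[c - 1] + src[c + 1]
            + src[c - nx] + src[c + nx]
            + src[c - nx * ny] + src[c + nx * ny])


def sweep_naive(src, nx, ny, nz):
    """Plain triple nested loop over all interior points."""
    dst = [0.0] * len(src)
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                dst[(z * ny + y) * nx + x] = stencil_point(src, nx, ny, x, y, z)
    return dst


def sweep_blocked(src, nx, ny, nz, bx, by):
    """Same computation, visiting one (bx x by) column of the domain at a time."""
    dst = [0.0] * len(src)
    for y0 in range(1, ny - 1, by):              # tile loop over y
        for x0 in range(1, nx - 1, bx):          # tile loop over x
            for z in range(1, nz - 1):           # sweep the third dimension
                for y in range(y0, min(y0 + by, ny - 1)):
                    for x in range(x0, min(x0 + bx, nx - 1)):
                        dst[(z * ny + y) * nx + x] = stencil_point(
                            src, nx, ny, x, y, z)
    return dst
```

With a suitable block size, the planes touched while sweeping z within one tile are small enough to stay in cache, so each value loaded from memory is reused by several neighbouring updates.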
Among classical blocking strategies, the approach described by Rivera and Tseng [24] proposes to tile two dimensions and to perform computations by accumulating the layers in the third one. This strategy is somewhat similar to our tiling strategy proposed for the MPPA-256, since the idea is to control the data movement by prefetching the read and the write computing planes.

we consider low-level specific optimizations such as data alignment, vectorization and thread affinity. On Xeon Phi, unaligned data severely limits the overall performance. In our implementation, we align data with respect to the length of the stencil (fourth order in our case) and we shift pointers in order to fully benefit from the Xeon Phi vectorization capabilities (16 single-precision floats).
A similar optimization strategy can be found in [25]. Finally, we employ thread affinity to ensure that threads are correctly bound to the cores to reduce memory access latency and to alleviate the memory contention.

Bridge-EP processor (with the same specifications as the Xeon E5 platform) and 32 GB of DDR3 memory shared in a cc-NUMA fashion (NUMAlink6).
Overall, this platform has 192 physical cores. Compiler version and flags are identical to those of the Xeon E5 platform.

We use four metrics to compare the energy and computing performance:

such as Tesla K20 also have a similar tool called NVIDIA Management Library (NVML). We used the NVML to gather the power usage for the GPU and its associated circuitry (e.g., internal memory). According to NVML documentation, readings are accurate to within ±5 % of the actual power draw.

the experiments with dynamic and guided scheduling policies was 1, since it presented the best results¹. Thread affinity (Affinity) was only applied to static scheduling. As we will explain later on, thread affinity has almost no impact when OpenMP's dynamic or guided scheduling policies are used.
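To make the difference between the three scheduling policies concrete, the sketch below lists the chunk sizes each policy would hand out for a loop of a given length. It is a simplified model under stated assumptions: the OpenMP specification leaves the exact guided chunk sizes to the implementation, so the rule used here (remaining iterations divided by the number of threads, rounded up) is only a common choice, and the order in which threads pick up dynamic or guided chunks depends on the runtime. The function names are hypothetical.

```python
import math


def static_chunks(n_iters, n_threads):
    """Static, no chunk size: one contiguous block per thread, sizes differ by at most 1."""
    base, extra = divmod(n_iters, n_threads)
    return [base + (1 if t < extra else 0) for t in range(n_threads)]


def dynamic_chunks(n_iters, chunk=1):
    """Dynamic: fixed-size chunks handed out on demand (last one may be shorter)."""
    full, rest = divmod(n_iters, chunk)
    return [chunk] * full + ([rest] if rest else [])


def guided_chunks(n_iters, n_threads, min_chunk=1):
    """Guided (assumed rule): chunk sizes shrink with the remaining work."""
    sizes = []
    remaining = n_iters
    while remaining > 0:
        size = max(min_chunk, math.ceil(remaining / n_threads))
        size = min(size, remaining)
        sizes.append(size)
        remaining -= size
    return sizes
```

With 20 iterations and 4 threads, this model of guided scheduling starts with a chunk of 5 iterations and ends with chunks of a single iteration, whereas static scheduling fixes four blocks of 5 before the loop starts. Because dynamic and guided chunks go to whichever thread becomes idle first, the mapping between iterations and cores is not fixed in advance.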
Experimental results showed us that memory alignment is essential to make

scatter and balanced thread placement policies presented very similar results in these cases. However, the compact thread placement policy decreased the performance with dynamic and guided scheduling by 22 % on average, since it creates an important imbalance between cores, especially for low thread counts.
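The origin of this imbalance can be seen in a simplified model of the three placement policies. The sketch below is an illustration under assumptions, not a description of the runtime's exact behaviour: it takes 57 cores with 4 hardware threads each, models compact as filling one core completely before moving to the next, scatter as round-robin over cores, and balanced as an even spread that keeps consecutive thread IDs on the same core. The function names are hypothetical.

```python
def place_threads(n_threads, n_cores, hw_per_core, policy):
    """Return the core index assigned to each thread ID (simplified model)."""
    if n_threads > n_cores * hw_per_core:
        raise ValueError("more threads than hardware contexts")
    if policy == "compact":
        # Fill every hardware thread of a core before using the next core.
        return [t // hw_per_core for t in range(n_threads)]
    if policy == "scatter":
        # Round-robin over cores: neighbouring thread IDs land on different cores.
        return [t % n_cores for t in range(n_threads)]
    if policy == "balanced":
        # Even spread, but neighbouring thread IDs stay on the same core.
        base, extra = divmod(n_threads, n_cores)
        cores = []
        for c in range(n_cores):
            cores.extend([c] * (base + (1 if c < extra else 0)))
        return cores
    raise ValueError("unknown policy: " + policy)


def cores_used(n_threads, policy, n_cores=57, hw_per_core=4):
    """Number of distinct cores that receive at least one thread."""
    return len(set(place_threads(n_threads, n_cores, hw_per_core, policy)))
```

In this model, 40 threads occupy only 10 cores under compact placement but 40 cores under scatter or balanced placement, while at 224 threads the three policies use 56, 57 and 57 cores, respectively.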
The loss in performance is reduced as the number of threads increases, from 50 % for 40 threads to 0 % for thread counts beyond 222.

Figure 5 compares the time-to-solution and energy-to-solution across the processors using a problem size of 2 GB (180³ grid points) and 500 time steps.
For these experiments we used the optimal number of threads on each platform.
With the exception of Xeon Phi (for which the best results were obtained with 224 threads), the thread count was equal to the number of physical cores of each processor. As shown in Figure 4, our solution for Xeon Phi keeps scaling considerably well past the 57 physical cores.
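As a reminder of how such figures relate to raw measurements, the sketch below derives an energy-to-solution value from a series of timestamped power readings and combines it with the time-to-solution into an energy-delay product. It is a generic illustration, not our measurement procedure: the trapezoidal integration, the function names and the sample values are assumptions. With `weight=2` the second function gives the usual energy-delay-squared product (ED²P), which gives more weight to time-to-solution.

```python
def energy_to_solution(times_s, power_w):
    """Integrate sampled power (W) over time (s) with the trapezoidal rule; returns joules."""
    if len(times_s) != len(power_w):
        raise ValueError("one power reading per timestamp is required")
    return sum(0.5 * (power_w[i] + power_w[i + 1]) * (times_s[i + 1] - times_s[i])
               for i in range(len(times_s) - 1))


def energy_delay(energy_j, time_s, weight=1):
    """Energy-delay product E * T**weight (weight=1: EDP, weight=2: ED2P)."""
    return energy_j * time_s ** weight


# Hypothetical readings: a constant 100 W draw sampled once per second for 3 s.
energy = energy_to_solution([0.0, 1.0, 2.0, 3.0], [100.0, 100.0, 100.0, 100.0])
ed2p = energy_delay(energy, 3.0, weight=2)
```

A platform that is slower but draws much less power can therefore win on energy-to-solution and still lose once the time-to-solution is weighted more heavily.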
To the best of our knowledge, GPUs are among the most energy efficient

when we increase the time-to-solution weight (ED²P), Xeon Phi becomes more

Phi, there is an additional overhead created by the limited amount of fast local memory which forces the application to frequently employ the NoC during the

proposed an optimized implementation of radix sort for the Tilera TILEPro64 manycore processor. The results showed that their solution for TILEPro64 provides much better energy efficiency than a general-purpose multicore processor (Intel Xeon W5590) and comparable energy efficiency with respect to an NVIDIA Tesla C2070 GPU.
Francesquini et al. [7] evaluated three different classes of applications (CPU-bound, memory-bound and mixed) using highly-parallel platforms such as MPPA-256 and a 24-node, 192-core NUMA platform. They showed that manycore architectures can be very competitive, even if the application is irregular in nature. Their results showed that MPPA-256 may achieve better performance than a traditional general-purpose multicore processor (Intel Xeon E5-4640) on CPU-bound and mixed workloads whereas on a memory-bound workload Xeon E5 had better performance than MPPA-256. Among the evaluated platforms, MPPA-256 presented the best energy efficiency, reducing the energy consumed on CPU-bound, mixed and memory-bound applications by at least 6.9x, 6.5x and 3.8x, respectively.
