Seismic Wave Propagation Simulations on Low-power and Performance-centric Manycores by Castro, Márcio et al.
Seismic Wave Propagation Simulations on Low-power
and Performance-centric Manycores
Ma´rcio Castro, Emilio Francesquini, Fabrice Dupros, Hideo Aochi, Philippe
Navaux, Jean-Franc¸ois Mehaut
To cite this version:
Ma´rcio Castro, Emilio Francesquini, Fabrice Dupros, Hideo Aochi, Philippe Navaux, et al.. Seis-
mic Wave Propagation Simulations on Low-power and Performance-centric Manycores. Parallel
Computing, Elsevier, 2016, <10.1016/j.parco.2016.01.011>. <hal-01273153>
HAL Id: hal-01273153
https://hal.archives-ouvertes.fr/hal-01273153
Submitted on 12 Feb 2016
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of sci-
entific research documents, whether they are pub-
lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destine´e au de´poˆt et a` la diffusion de documents
scientifiques de niveau recherche, publie´s ou non,
e´manant des e´tablissements d’enseignement et de
recherche franc¸ais ou e´trangers, des laboratoires
publics ou prive´s.
Seismic Wave Propagation Simulations on Low-power
and Performance-centric Manycores
Ma´rcio Castroa, Emilio Francesquinib, Fabrice Duprosc, Hideo Aochic,
Philippe O. A. Navauxd, Jean-Franc¸ois Me´haute
aDepartment of Informatics and Statistics, Federal University of Santa Catarina (UFSC)
Campus Reitor Joa˜o David Ferreira Lima, Trindade, 88040-970, Floriano´polis, Brazil
bInstitute of Computing, University of Campinas (UNICAMP)
Av. Albert Einstein, 1251, Cidade Universita´ria, 13083-852, Campinas, Brazil
cBRGM
BP 6009, 45060 Orle´ans Cedex 2, France
dInformatics Institute, Federal University of Rio Grande do Sul (UFRGS)
Av. Bento Gonc¸alves, 9500, Campus do Vale, 91501-970, Porto Alegre, Brazil
eCEA-DRT - LIG Laboratory, University of Grenoble Alpes
110 Avenue de la Chimie, 38400 Saint-Martin d’He`res, France
Abstract
The large processing requirements of seismic wave propagation simulations
make High Performance Computing (HPC) architectures a natural choice for
their execution. However, to keep both the current pace of performance improve-
ments and the power consumption under a strict power budget, HPC systems
must be more energy e cient than ever. As a response to this need, energy-
e cient and low-power processors began to make their way into the market.
In this paper we employ a novel low-power processor, the MPPA-256 manycore,
to perform seismic wave propagation simulations. It has 256 cores connected
by a NoC, no cache-coherence and only a limited amount of on-chip memory.
We describe how its particular architectural characteristics influenced our solu-
tion for an energy-e cient implementation. As a counterpoint to the low-power
MPPA-256 architecture, we employ Xeon Phi, a performance-centric manycore.
Although both processors share some architectural similarities, the challenges
to implement an e cient seismic wave propagation kernel on these platforms are
very di↵erent. In this work we compare the performance and energy e ciency of
our implementations for these processors to proven and optimized solutions for
other hardware platforms such as general-purpose processors and a GPU. Our
experimental results show that MPPA-256 has the best energy e ciency, con-
suming at least 77% less energy than the other evaluated platforms, whereas
the performance of our solution for the Xeon Phi is on par with a state-of-the-art
solution for GPUs.
Keywords: Seismic wave propagation; MPPA-256; Xeon Phi; HPC;
performance; energy e ciency
Email addresses: marcio.castro@ufsc.br (Ma´rcio Castro),
francesquini@ic.unicamp.br (Emilio Francesquini), f.dupros@brgm.fr (Fabrice Dupros),
Preprint submitted to Parallel Computing August 11, 2015
1. Introduction1
Public policies for risk mitigation and damage assessment in hypothetical2
earthquakes scenarios as well as oil and gas exploration make an intensive use3
of simulations of large-scale seismic wave propagation. In order to provide re-4
alistic results, these simulations rely on complex models, which often demand5
considerable computational resources to reliably process vast amounts of data6
in a reasonable amount of time.7
Such processing capabilities are often provided by High Performance Com-8
puting (HPC) platforms. Up until recently, HPC platform capabilities were9
evaluated almost exclusively based on their raw processing speed. However, one10
of the aspects that hinder the quest for ever growing performance is the exces-11
sive use of energy. For that reason, the energy e ciency of HPC platforms has12
become, in some contexts, as important as their raw performance.13
Indeed, the seek for alternatives to lower current energy consumption arose14
first in the embedded systems community but lately became one of the ma-15
jor concerns of the HPC scientific community [1, 2]. Recently, manycore pro-16
cessors, a new class of highly-parallel chips, was unveiled. Tilera Tile-Gx [3],17
Kalray MPPA-256 [4], Adapteva Epiphany-IV [5], Intel Single-Chip Cloud Com-18
puter (SCC) [6] and Xeon Phi are examples of such processors, providing up to19
hundreds of autonomous cores that can be used to accomplish both data and20
task parallelism. This distinctive characteristic sets them apart from SIMD-like21
highly-parallel architectures such as Graphics Processing Units (GPUs). For22
this reason, in this paper, we classify GPUs separately, in their own category.23
While some manycore processors may present better energy e ciency than24
state-of-the-art general-purpose multicore processors [7], their particular archi-25
tectural characteristics make the development of e cient scientific parallel ap-26
plications a challenging task [5, 8]. Some of these processors are built and opti-27
mized for certain classes of embedded applications like signal processing, video28
decoding and routing. Additionally, processors such as MPPA-256 have impor-29
tant memory constraints, e.g., limited amount of directly addressable memory30
and absence of cache coherence protocols. Furthermore, communication costs31
between cores are not uniform, they depend on the location of the commu-32
nicating cores. One of the main reasons for this di↵erence in communication33
cost is the Network-on-Chip (NoC). When applications distribute work among34
the cores of the processor, they should take into consideration the placement35
of these tasks and their communication patterns to minimize communication36
costs. This problem is similar to that faced by applications running on NUMA37
platforms [9, 10], or MPI applications that try to take into account both the38
network topology and the memory hierarchy to improve performance [11, 12].39
In this paper we outline the architectural distinctiveness of the low-power40
MPPA-256 manycore processor. Considering these characteristics, we describe41
how the main kernel of a seismic wave propagation simulator was adapted to42
h.aochi@brgm.fr (Hideo Aochi), navaux@inf.ufrgs.br (Philippe O. A. Navaux),
jean-francois.mehaut@imag.fr (Jean-Franc¸ois Me´haut)
2
this platform. Due to the limited size of the local memories in MPPA-256, we43
developed a new multi-level tiling strategy and a prefetching mechanism to44
allow us to deal with real simulation scenarios and to alleviate communication45
overheads. We also describe the di culties and solutions (some of them generic46
enough to be used in di↵erent contexts) we employed during this adaptation.47
As a counterpoint to the low-power MPPA-256, we also adapted the same48
kernel to a performance-centric manycore, the Xeon Phi. Although both pro-49
cessors share some architectural characteristics such as the presence of a NoC,50
non-uniform costs of communication between cores and limited fast local mem-51
ories, they have their own peculiarities which directed us to employ di↵erent52
approaches in each case. For instance, in both architectures, each thread has53
access to approximately 128 kB of fast local memory. However, their memory54
organization is very di↵erent. On Xeon Phi each core has 512 kB of L2 shared55
by up to 4 threads. On the other hand, on MPPA-256 the processing cores have56
8 kB of L1 cache each and cores are grouped into clusters of 16. Each one of57
these clusters has 2MB, of work memory shared among the cores. Contrary to58
Xeon Phi, MPPA-256 does not have cache-coherence or automatic load/store of59
values from/to main memory, and all memory transferences must be handled60
by the programmer explicitly. Ultimately, this means that in order to be able61
to achieve good performance on MPPA-256, the application is obliged to tightly62
control communications between the cores in the same cluster, between clusters,63
and between each cluster and the main memory. This paper explains the de-64
tails that made us decide for a multi-level tiling strategy proposed for MPPA-25665
instead of a more traditional approach of cache blocking that was used for Xeon66
Phi.67
Taking as a basis optimized GPU and general-purpose processor implemen-68
tations for this kernel, we show that even if MPPA-256 presents an increased69
software development complexity, it can indeed be used as an energy-e cient70
alternative to perform seismic wave propagation simulations. Our results show71
that the solution we propose for the MPPA-256 processor achieves the best en-72
ergy e ciency, consuming 77%, 86% and 88% less energy than optimized so-73
lutions for GPU, general-purpose and Xeon Phi architectures, respectively. The74
performance achieved on Xeon Phi, on the other hand, is comparable to a state-75
of-the-art GPU. Moreover, the execution was 58% and 73% faster than that of76
multicores and MPPA, respectively.77
The remainder of this paper is organized as follows. Section 2 discusses78
the fundamentals of seismic wave propagation. Then, Section 3 discusses the79
main challenges we must overcome when dealing with parallel seismic wave80
propagations on MPPA-256 and Xeon Phi. Moreover, it presents our approaches81
to perform e cient seismic wave propagation simulations on both processors.82
Section 4 discusses performance and energy e ciency experimental results. Sec-83
tion 5 describes related work and Section 6 concludes this paper.84
2. Seismic Wave Propagation85
Quantitative earthquake hazard assessment is crucial for public policy, re-86
duction of future damages and urban planning. Recent important geological87
3
events, such as the MW7.9 (moment magnitude) earthquake in China (2008) or88
the MW8.8 earthquake in Chile (2010), have been studied using numerical tools89
with good agreement with observation. From a physical point a view, the mech-90
anisms used to describe earthquake events are complex and highly nonlinear.91
For the sake of simplicity, wave propagation simulations usually consider elastic92
domains and include some anelasticity to account for attenuations. However,93
in the case of strong motion, these simplifying assumptions are not capable of94
correctly describing the phenomenon. A fundamental challenge for earthquake95
engineering is predicting the level and variability of strong ground motion from96
future earthquakes. Enhancing this predictive ability requires understanding of97
the earthquake source, the e↵ects of the propagation path on the seismic waves,98
and basin and near-surface site e↵ects.99
In this paper, we restrict our discussion to the elastodynamics equations100
corresponding to the elastic situation using a linear model. However, our ap-101
proach could also be used in nonlinear models that rely on a linear solution102
inside the iterative procedure such as Newton-like methods. Among numeri-103
cal methods, variational methods such as finite element method [14] or spectral104
element method [15] can be employed for both regional and global scale seismol-105
ogy whereas the Discontinous Galerkin method is very e cient for earthquake106
source modeling [16]. In terms of popularity, the Finite Di↵erence Method107
(FDM) remains one of the top contenders due to its implementation simplicity108
and e ciency for a wide range of applications. Although this method was ini-109
tially proposed almost thirty years ago [17], this classic discretization technique110
has been the focus of continuous refinements since its proposal [18, 19].111
Simulations of seismic wave propagations are often constrained by the com-112
putational and storage capacity of the hardware platform. Thus seismologists113
often reduce the scope of the simulation to a small volume instead of considering114
the whole Earth. The memory footprint of each simulation depends on both115
the domain of interest and the number of grid points per wavelength. This last116
condition guarantees numerical stability and therefore the accuracy of the sim-117
ulation. In our case, for practical reasons, we limit our simulations to problem118
domains that fit into the 2GB of memory available on the MPPA-256 platform.119
For that, we perform regional scale modeling spanning a few hundred kilometers120
in each spatial direction.121
In the next section we present the standard sequential algorithm for seis-122
mic wave propagation simulations. Aochi et al. [20] provide a more detailed123
description of the equations governing these simulations in the case of an elastic124
material.125
2.1. Discretization and Standard Sequential Algorithm126
As mentioned before, the FDM is one of the most popular techniques to127
solve the elastodynamics equations and to simulate the propagation of seismic128
waves. One of the key features of this scheme is the introduction of a staggered-129
grid [17] for the discretization of the seismic wave equation, which in the case130
of an elastic material is:131
4
⇢
@vi
@t
=
@ ij
@j
+ Fi (1)
Additionally, the constitutive relation in the case of an isotropic medium is:132
@ ij
@t
=   ij
✓
@vx
@x
+
@vy
@y
+
@vz
@z
◆
+ µ
✓
@vi
@j
+
@vj
@i
◆
(2)
Where indices i, j, k represent a component of a vector or tensor field in133
cartesian coordinates (x, y, z), vi and  ij represent the velocity and stress field134
respectively, and Fi denotes an external source force. ⇢ is the material density135
and   and µ are the elastic coe cients known as Lame´ parameters. A time136
derivative is denoted by @@t and a spatial derivative with respect to the i-th137
direction is represented by @@i . The Kronecker symbol  ij is equal to 1 if i = j138
and zero otherwise.139
Indeed, all the unknowns are evaluated at the same location for classical140
collocated methods over a regular Cartesian grid whereas the staggered grid141
leads to a shift of the derivatives by half a grid cell. The equations are rewritten142
as a first-order system in time and therefore the velocity and the stress fields143
can be simultaneously evaluated at a given time step.144
Exponents i, j, k indicate the spatial direction with  ijk =  (i s, j s, k s),145
 s corresponds to the space step and t to the time step. The elastic coe cients146
⇢, µ et   are defined at location (i, j, k). Starting at time step t = ` t, the147
following unknowns corresponding to the next time step t =
⇣
` + 12
⌘
 t are148
computed:149
v
(i+ 12 )jk
x
⇣
l +
1
2
⌘
, v
i(j+ 12 )k
y
⇣
l +
1
2
⌘
, v
ij(k+ 12 )
z
⇣
l +
1
2
⌘
(3)
Then, at time t = (`+ 1) t, the following unknowns are computed:150
 ijkxx
⇣
l + 1
⌘
,  ijkyy
⇣
l + 1
⌘
,  ijkzz
⇣
l + 1
⌘
,  
(i+ 12 )(j+
1
2 )k
xy
⇣
l + 1
⌘
,
 
i(j+ 12 )(k+
1
2 )
zy
⇣
l + 1
⌘
,  
(i+ 12 )j(k+
1
2 )
zx
⇣
l + 1
⌘
(4)
For instance, the stencil applied for the computation of the velocity compo-151
nent in the x-direction is given by:152
v
(i+ 12 )jk
x
⇣
l + 12
⌘
= v
(i+ 12 )jk
x
⇣
l   12
⌘
+ a1F
(i+ 12 )jk
x
+ a2
h
 (i+1)jkxx   ijkxx
 x +
 
(i+1
2
)(j+1
2
)k
xy   (i+
1
2
)(j  1
2
)k
xy
 y +
 
(i+1
2
)j(k+1
2
)
xz   (i+
1
2
)j(k  1
2
)
xz
 z
i
  a3
h
 (i+2)jkxx   (i 1)jkxx
 x +
 
(i+1
2
)(j+3
2
)k
xy   (i+
1
2
)(j  3
2
)k
xy
 y +
 
(i+1
2
)j(k+3
2
)
xz   (i+
1
2
)j(k  3
2
)
xz
 z
i
(5)
5
The computational procedure representing this evaluation is described in153
Algorithm 2.1. Inside the time step loop, the first triple nested loop is devoted154
to the computation of the velocity components, and the second loop reuses the155
velocity results of the previous time step to update the stress field.156
Algorithm 2.1: Sequential Seismic Wave Propagation( , v)
for x 1 to x dimension
do
8<:for y  1 to y dimensiondo ⇢for z  1 to z dimensiondo {compute velocity( xx, yy, zz, xy, xz, yz)
for x 1 to x dimension
do
8<:for y  1 to y dimensiondo ⇢for z  1 to z dimensiondo {compute stress(vx, vy, vz)
The best parallelization strategy for the elastodynamics equations strongly157
depends on the characteristics of the underlying hardware architecture. In the158
following section, we detail the parallelization and optimization strategies we159
employed for the adaptation of this sequential algorithm to highly parallel many-160
core processors.161
3. Elastodynamics Numerical Kernel on Manycores162
In this section we present our approaches to perform seismic wave propaga-163
tion simulations on MPPA-256 and Xeon Phi. We first discuss in Section 3.1 some164
of the intrinsic characteristics and challenges that led us to employ di↵erent par-165
allelization strategies and optimizations in each one of these processors. Then,166
in Sections 3.2 and 3.3, we describe in details our solutions for these hardware167
architectures.168
3.1. Platforms and Challenges169
MPPA-256. The MPPA-256 is a single-chip manycore processor developed170
by Kalray that integrates 256 user cores and 32 system cores in 28 nm CMOS171
technology running at 400MHz. These cores are distributed across 16 compute172
clusters and 4 I/O subsystems that communicate through data and control173
NoCs. Each compute cluster has 16 cores called Processing Elements (PEs),174
which are dedicated to run user threads (one thread per PE) in non-interruptible175
and non-preemptible mode, and a low-latency shared memory of 2MB. This176
local memory enables a high bandwidth and throughput between PEs within the177
same compute cluster. Each PE has private 2-way associative 32 kB instruction178
and data caches. The system has no cache coherence protocol between PEs,179
even among those in the same compute cluster. The board used in our tests180
6
PE0 PE1
PE2 PE3
PE4 PE5
PE6 PE7
PE8 PE9
PE10 PE11
PE12 PE13
PE14 PE15Sh
ar
ed
 M
em
or
y 
(2
M
B)
D-NoC C-NoC
RM RM RMRM
RM
RM
RM
RM
RM RM RMRM
RM
RM
RM
RM
Compute Cluster
I/O Subsystem
I/O
 S
ub
sy
st
em
I/O Subsystem
I/O
 S
ub
sy
st
em
RM
PCIe, DDR, ...
PCIe, DDR, ...
(a) MPPA-256.
TD
...
...
...
...
TD TD TD
TD TD TD TD
PCIe
DDR
DDR DDR
DDR
L2
Core 0
L2
Core 1
L2
Core 26
L2
Core 27
L2
Core 28
L2
Core 29
L2
Core 55
L2
Core 56
(b) Xeon Phi.
Figure 1: Manycore architectures.
has one of the I/O subsystems connected to an external LP-DDR3 of 2GB.181
Figure 1(a) shows an architectural overview of the MPPA-256 processor. Some182
uses of this processor include embedded parallel signal processing and video183
decoding [21], and parallel scientific applications [7].184
The development of a seismic wave propagation kernel for this processor185
can be a challenging task due to some of its intrinsic characteristics. First, the186
low-latency memory available in each compute cluster acts as a cache, whose187
goal is to store data retrieved from the DDR. However, data transfers between188
the DDR and compute clusters’ low-latency memories must be explicitly man-189
aged by the programmer, in other words, there are no automatic fetching or190
prefetching mechanisms. Second, the amount of data needed to simulate real191
wave propagation scenarios does not fit into the local memories. In fact, about192
500 kB of memory is always in use by the operating system leaving about 1.5MB193
free to house the application’s data and code. For this reason, the application194
has to take into account the data transfers and how these data should be sliced195
in order to fit into the limited local memory.196
Xeon Phi. Xeon Phi was designed to operate in a way similar to regular197
x86 multicore processors. In this article we used Xeon Phi model 3120, which198
features 57 cores. Each one of its cores has a clock frequency of 1.10GHz,199
32 kB for L1 instruction and data caches and 512 kB for L2 cache. Coherence200
between the caches is guaranteed by hardware through a Global-distributed201
Tag Directory (TD). Additionally, every core can directly address the shared202
DDR memory (6GB in our case) and is connected to the remaining cores by a203
high-performance bidirectional ring-shaped NoC as shown in Figure 1(b). More-204
over, each core is 4-way multithreaded, i.e., it is able to execute instructions205
from four threads/processes (using time-multiplexed multithreading), helping to206
reduce the e↵ect of vector pipeline latency and memory access latencies. Di↵er-207
ently from other regular x86 multicore processors, each core has 512-bit vector208
instructions, which makes this processor able to perform 16 single precision209
operations, or 8 double precision operations, within a single instruction.210
Porting a seismic finite di↵erence numerical stencil to the Xeon Phi architec-211
ture requires a careful exploitation of two main aspects. The first one is related212
to the use of vector processing units. Since most of the performance power of213
7
Xeon Phi comes from these units, it is essential to fully benefit from them. This214
can be achieved by performing a clever decomposition of the 3D input problem215
to maximize the use of long vectors in the unit-stride direction. The second216
aspect is related to the L2 caches. When a core Csrc accesses its L2 cache and217
misses, an address request is sent to the tag directories throughout the ring.218
If the requested data block is found in the cache of another core (Cdst), it is219
forwarded back through the ring to the L2 cache of Csrc. Thus, the overhead220
imposed by this protocol must be avoided whenever possible to improve appli-221
cation’s performance. Overall, this can achieved by organizing data memory222
accesses to improve data locality.223
In the next sections we describe how the architectural distinctiveness of these224
architectures guided the development of a seismic wave propagation simulation225
kernel.226
3.2. A Two-level Tiling Approach for the MPPA-256227
The seismic wave propagation kernel has a high demand for memory band-228
width. This makes the e cient use of the low-latency memories distributed229
among compute clusters indispensable. In contrast to standard x86 processors230
in which it is not uncommon to find last-level cache sizes of tens of megabytes,231
MPPA-256 has only 32MB of low-latency memory divided into 2MB chunks232
spread throughout the 16 compute clusters. These chunks of memory are di-233
rectly exposed to the programmer that must explicitly control them. Indeed,234
the e ciency of our algorithm relies on the ability to fully exploit these low-235
latency memories. To that end, we implemented a data fractioning strategy236
that decomposes the problem into tiles small enough to fit into the memory237
available on the compute clusters. Figure 2 illustrates the general idea of our238
two-level tiling scheme.239
The three-dimensional structures corresponding to the velocity and stress240
fields are allocated on the DDR connected to the I/O subsystem to maximize241
the overall problem size that can be simulated. Next, we divide the global242
computational domain into several subdomains corresponding to the number243
of compute clusters involved in the computation (Figure 2- 1 ). This decom-244
position provides a first level of data-parallelism. To respect the width of the245
stencil (fourth-order), we maintain an overlap of two grid points in each di-246
rection. These regions, called ghost zones, are updated at each stage of the247
computation with point-to-point communications between neighboring clusters.248
Unfortunately, only this first level of decomposition is not enough because249
the three-dimensional tiles do not fit into the 2MB of low-latency memories250
available to the compute clusters. A second level of decomposition is there-251
fore required. We performed this decomposition along the vertical direction as252
we tile each three-dimensional subdomain into 2D slices (Figure 2- 2 ). This253
leads to a significant reduction of the memory consumption for each cluster but254
requires maintaining a good balance between the computation and communica-255
tion. Our solution relies on a sliding window algorithm that traverses the 3D256
domains using 2D planes and overlaps data transfers with computations. This257
mechanism can be seen as an explicit prefetching mechanism (Figure 2- 3 ) as258
8
...
Full domain:
Fist level of tiling
K
K-1
K+1
Sub-domain:
Second level of tiling
I/O Subsystem (2 GB) Compute Cluster (2 MB)
2D slices
Data 
Transfers
Prefetched
Planes
1
2
3
Parallel 2D stencil
OpenMP (4x4) - 16 PEs
PE15PE14
PE13PE12
PE11PE10
PE9PE8
PE7PE6
PE5PE4
PE3PE2
PE1PE04
Figure 2: Two-level tiling scheme to exploit the memory hierarchy of MPPA-256.
2D planes required for the computation at one step are brought to the clusters259
during the computation performed at previous steps. To alleviate communi-260
cation costs, asynchronous background data transfers are employed. Finally,261
OpenMP directives are employed inside each compute cluster to compute 2D262
problems with up to 16 PEs in parallel (Figure 2- 4 ). Additional details about263
the implementation of the seismic wave propagation simulation kernel for the264
MPPA-256 architecture are provided by Castro et al. [8].265
3.3. A Cache-aware Approach for the Xeon Phi266
Research on the e ciency of seismic wave propagation simulations using the267
FDM on multicore architectures has received a lot of attention lately [22, 23].268
The performance of these simulations are typically memory-bound due to the269
imbalance between the relatively fast point-wise computations and the intensive270
memory accesses these computations require.271
Several standard sequential implementations o↵er a poor cache reuse and272
therefore achieve only a fraction of the peak performance of the processor. Even273
if some optimizations alleviate this problem, the standard simulation algorithm274
typically scans an array spanning several times the size of the cache using each275
retrieved grid point only for a few operations. Therefore, the cost to bring the276
needed data from the main memory to the fast local cache memories account277
for an important share of the total simulation time, specially for a 3D problem.278
In particular, this limitation is more noticeable in processors that feature279
high core counts such as Xeon Phi due to the contention on the bidirectional280
data ring. To improve the cache locality we employ a cache blocking technique.281
Cache blocking is an optimization technique that intends to reduce memory282
bandwidth bottlenecks. The main idea is to exploit the inherent data reuse283
available in the triple nested loop of the elastodynamics kernel by ensuring that284
data remains in cache across multiple uses.285
Among classical blocking strategies, the approach described by Rivera and286
Tseng [24] proposes to tile two dimensions and to perform computations by287
accumulating the layers in the third one. This strategy is somehow similar to288
our tiling strategy proposed for the MPPA-256, since the idea is to control the289
data movement by prefetching the read and the write computing planes.290
Although the sliding window algorithm and vectorial instructions are a good291
combination, their use on Xeon Phi is unfeasible in practice. The limited cache292
9
size on Xeon Phi prevents the storage of three velocity and six stress components293
in 512 kB of L2 cache memory shared by up to four threads. For that reason,294
our implementation therefore relies on a 3D blocking strategy which leads to295
a important speedup when we take the na¨ıve implementation as a baseline as296
shown in Section 4. This improvement is due to the reduction of the pressure on297
the memory. We employ OpenMP to distribute the load throughout the di↵erent298
computing cores and we use the OpenMP’s loop collapse directive to improve299
loop execution scheduling across the 3D domain decomposition. Additionally,300
we consider low-level specific optimizations such as data alignment, vectorization301
and thread a nity. On Xeon Phi, unaligned data severely limits the overall302
performance. In our implementation, we align data with respect to the length303
of the stencil (fourth order in our case) and we shift pointers in order fully304
benefit from the Xeon Phi vectorization capabilities (16 single precision floats).305
A similar optimization strategy can be found in [25]. Finally, we employ thread306
a nity to ensure that threads are correctly bound to the cores to reduce memory307
access latency and to alleviate the memory contention.308
4. Experimental Results309
In this section we evaluate the performance and energy consumption of our310
solutions to perform seismic wave propagation simulations on manycores. First,311
we briefly describe our evaluation methodology. Then, we analyze the impact312
of the optimizations on MPPA-256 and Xeon Phi. Finally, we carry out an overall313
comparison of the performance and energy consumption between our solutions314
for manycores and other standard optimized solutions for other modern plat-315
forms.316
4.1. Methodology317
We compare our solutions for the MPPA-256 and Xeon Phi to reference im-318
plementations of the same elastodynamics kernel for multicores [8, 26] and319
GPUs [8, 27]. These implementations were extracted from Ondes3D, a seis-320
mic wave propagation simulator developed by the French Geological Survey321
(BRGM). In this study we used the following platforms:322
• MPPA-256: This low-power manycore processor integrates 16 compute323
clusters in a single chip running at 400MHz. Each compute cluster has324
16 cores called PEs and a low-latency shared memory of 2MB shared325
between all the PEs. Additionally, each PE has non-coherent, 8 kB, 2-way326
set associative caches, one for data and another for instructions. Compute327
clusters can communicate with each other and can indirectly access an328
external LP-DDR3 of 2GB through data and control NoCs. Compilation329
was done using Kalray Compiler 9898489 (based on GCC 4.9) with the330
flags -O3, -fopenmp, and -ffast-math.331
• Xeon Phi: The processor used in this article features 57 4-way multi-332
threaded cores. Each one of its cores has a clock frequency of 1.10GHz,333
32 kB for L1 instruction and data caches and 512 kB for L2 cache. Every334
10
core can directly address the shared DDR memory (6GB) and is connected335
to the remaining cores by a high-performance bidirectional ring-shaped336
NoC. Intel’s ICC version 14.0 was used to compile the code with the fol-337
lowing compilation flags: -O3, -openmp, -mmic (enables the cross compiler338
nedeed for a native execution on the Xeon Phi), and -fp-model fast=2.339
Additionally during development, we used the flag -vec-report2 to verify340
whether the seismic wave simulation kernel was successfully vectorized.341
• Tesla K20: This graphics board features NVIDIA Kepler architecture342
with 2496 CUDA parallel-processing cores, with working frequencies up343
to 758MHz and 5GB of GDDR5 GPU memory. Compilation was done344
using NVIDIA’s nvcc version 5.5 with the flags -O3 -arch sm 20, and345
--use fast math, using GCC 4.9 as the host compiler.346
• Xeon E5: This platform features a Xeon E5-4640 Sandy Bridge-EP pro-347
cessor, which has 8 physical cores, with working frequencies up to 2.4GHz348
and 32GB of DDR3 memory. The code was compiled with GCC 4.9 with349
the flags -O3, -fopenmp, and -ffast-math.350
• Altix UV 2000: It is a Non-Uniform Memory Access (NUMA) platform351
composed of 24 NUMA nodes. Each node has a Xeon E5-4640 Sandy352
Bridge-EP processor (with the same specifications of the Xeon E5 platform)353
and 32GB of DDR3 memory shared in a cc-NUMA fashion (NUMAlink6).354
Overall, this platform has 192 physical cores. Compiler version and flags355
are identical to that of the Xeon E5 platform.356
We use four metrics to compare the energy and computing performance:357
time-to-solution, energy-to-solution, speedup and Energy-delay Product (EDP).358
Time-to-solution is the time spent to reach a solution for a seismic wave prop-359
agation simulation. Analogously, energy-to-solution is the amount of energy360
spent to reach a solution for a seismic wave propagation simulation. Speedups361
are calculated by dividing the time-to-solution of the sequential version by time-362
to-solution of the parallel/distributed version with n threads. The EDP, ini-363
tially proposed by Horowitz et al. [28], fuses the time-to-solution and energy-to-364
solution into a single metric, allowing hardware platforms to be directly com-365
pared taking into account both metrics. Since the relative weights between per-366
formance and energy consumption are subject to pragmatic concerns, we show367
the values for both EDP and Energy-delay-squared Product (ED2P) [29]. All368
results represent averages of 30 runs to guarantee statistically relevant values.369
The energy-to-solution was obtained through each platform’s specific power370
measurement sensors. Both Xeon E5 and Altix UV 2000 are based on Intel’s371
Sandy Bridge microarchitecture. This microarchitecture has Intel’s Running372
Average Power Limit (RAPL) interface which allows us to measure the power373
consumption of CPU-level components through hardware counters. We used374
this approach to obtain the energy consumption of the whole CPU package375
including cores and cache memory.376
On Xeon Phi, we used Intel’s MIC System Management and Configuration377
(MICSMC) tool which allows us to monitor the processor’s power consumption.378
11
Power measurements obtained from RAPL and MICSMC are very accurate as379
shown in [30, 31]. Similarly, MPPA-256 features a tool called K1-POWER to380
collect energy measurements of the whole chip, including all clusters, on-chip381
memory, I/O subsystems and NoCs. According to Kalray’s reference manuals,382
the measurement precision on MPPA-256 is ±0.25W. NVIDIA Kepler GPUs383
such as Tesla K20 also have a similar tool called NVIDIA Management Library384
(NVML). We used the NVML to gather the power usage for the GPU and its as-385
sociated circuitry (e.g., internal memory). According to NVML documentation,386
readings are accurate to within ±5% of the actual power draw.387
MPPA-256, Xeon Phi and Tesla K20 architectures can all be used as accelera-388
tors, where the main application code is executed on the host and performance-389
critical sections of the seismic wave propagation kernel are o✏oaded to them.390
However, since we intend to compare the performance and energy consumption391
of our seismic wave propagation solutions for these processors, we execute the392
entire application on MPPA-256 and Xeon Phi in native mode. In this case, both393
the main application code and the seismic wave propagation kernel are cross-394
compiled for them. GPUs, however, cannot be used in native mode. For that395
reason we avoid host intervention whenever possible. During our tests the host396
was only responsible for loading the kernel and for performing the initial and397
final data exchanges between the host and GPU memories. To provide a fair398
comparison and discard influences caused by the host, all the results presented399
in the following sections reflect only the execution of the kernel.400
4.2. MPPA-256401
Figure 3(a) shows the impact of the number of prefetched planes on commu-402
nication and computation times. As we increase the number of planes available403
at the compute cluster memory level, we improve the reuse of data in the verti-404
cal direction. This allows us to overlap a considerable portion of data transfers405
with computations. However, we observed only slight improvements past six406
planes. This is due to the saturation of the NoC as this strategy increases the407
data tra c each time we increase the number of prefetched planes. Contrary to408
the other architectures, the small amount of memory available at each compute409
cluster on MPPA-256 (2MB) obliges us to perform an important number of data410
transfers from/to the DDR. Due to the limited on-chip memory, were able to411
prefetch only up to eight planes before exhausting the available local memory in412
each compute cluster. Overall, the reason why the communication is still fairly413
high compared to computation is because the ratio between the time taken by414
a compute cluster to compute a subdomain of 1.5MB and the time taken to415
transfer 1.5MB from/to the DDR is low. This limits the performance gains that416
can be achieved by overlapping communications with computations.417
Figure 3(b) presents a weak scalability analysis of the seismic wave propa-418
gation kernel on MPPA-256 when we vary the number of prefetched planes and419
the number of clusters. For this analysis the problem size assigned to each clus-420
ter remains constant as we increase the number of clusters. Therefore, linear421
scaling is achieved if the normalized execution time is kept constant at 1.00422
while the workload is increased in direct proportion to the number of clusters.423
12
0%#
20%#
40%#
60%#
80%#
100%#
1# 2# 3# 4# 5# 6# 7# 8#
Pe
rc
en
ta
ge
)o
f)t
ot
al
)
ex
ec
u/
on
)/
m
e)
Number)of)prefetched)planes)
Communica4on# Computa4on#
(a) Prefetching impact.
0.0 
0.1 
0.2 
0.3 
0.4 
0.5 
0.6 
0.7 
0.8 
0.9 
1.0 
1.1 
1 2 4 8 16 
W
ea
k 
sc
al
in
g 
Number of clusters 
Ideal Baseline Prefetch 2 
Prefetch 4 Prefetch 6 Prefetch 8 
(b) Weak scalability.
Figure 3: Seismic wave propagation simulation kernel on MPPA-256.
Although the increase on the number of prefetched planes can indeed improve424
the scalability of our solution, there is still a considerable performance degra-425
dation as we employ an additional number of clusters. Even if our prefetching426
scheme is capable of masking communication costs on MPPA-256, the latency427
and bandwidth of the NoC can still hurt the execution performance and thus428
limit the scalability. A comparative scalability evaluation between MPPA-256429
and Altix UV 2000 is presented in [8]. However, the results presented on that430
paper are slightly di↵erent from the ones we present here. This is due to the use431
of a newer toolchain in this work which has a slight performance improvement432
as well as better energy e ciency.433
4.3. Xeon Phi434
As discussed in Section 3.3, our solution to seismic wave propagation sim-435
ulations on Xeon Phi employs not only the cache blocking technique, but also436
classic memory alignment with pointer shifting. Besides these optimizations,437
we experimented with di↵erent thread a nity and scheduling strategies.438
Thread a nity can reduce variations on the execution time of an applica-439
tion and in some cases improve execution times by performing a better thread440
placement. We executed the seismic wave propagation kernel using the follow-441
ing OpenMP’s thread placement policies: compact, scatter, balanced and default442
(no thread a nity). Setting a nity to compact will place OpenMP threads443
by filling cores one by one. Balanced a nity evenly distributes threads among444
the cores. It attempts to use all the available cores while keeping the thread445
neighboring logical IDs physically close to one another. The scatter a nity also446
evenly distributes threads among the cores but it does so in a round-robin fash-447
ion. Hence, threads with the adjacent IDs are not guaranteed to be physically448
adjacent.449
Scheduling strategies dictate how the loop iterations are assigned to threads.450
We experimented with three scheduling strategies available in OpenMP: static,451
dynamic and guided. In static scheduling, the number of loop iterations is452
statically divided by the number of threads. This results in equal-sized chunks453
that are assigned to the threads (or as equal as possible in the case where the454
13
Dynamic Static
0 
20 
40 
60 
80 
100 
120 
140 
2 14 26 38 50 62 74 86 98 110 122 134 146 158 170 182 194 206 218 230 
Sp
ee
du
p 
Number of threads 
Ideal MA+CB CB MA Baseline MA+CB+Affinity Guided 
0 
20 
40 
60 
80 
100 
120 
140 
2 14 26 38 50 62 74 86 98 110 122 134 146 158 170 182 194 206 218 230 
Sp
ee
du
p 
Number of threads 
Ideal MA+CB CB MA Baseline MA+CB+Affinity Guided 
0 
20 
40 
60 
80 
100 
120 
140 
2 4 6 8 50 62 4 6 8 110 122 134 146 158 170 182 194 206 218 230 
Sp
ee
du
p 
Number of threads 
Ide l +CB CB MA Baseline MA+CB+Affinity Guided 
0 
20 
40 
60 
80 
100 
120 
140 
2 14 26 38 50 62 74 6 98 110 122 134 146 58 170 182 194 206 218 230 
Sp
ee
du
p 
Number of threads 
Ideal +CB CB MA Baseline MA+C +Affi ity Guided 
0 
20 
40 
60 
80 
100 
120 
140 
2 14 26 38 50 62 74 86 98 110 122 134 146 158 170 182 194 206 218 230 
Sp
ee
du
p 
Number of threads 
Ideal MA+CB CB MA Baseline MA+CB+Affinity Guided 
Guided
0 
20 
40 
60 
80 
100 
120 
140 
2 14 26 38 50 62 74 86 98 110 122 134 146 158 170 82 94 206 218 230
Sp
ee
du
p 
Number of threads 
Ideal MA+CB CB MA Baseline MA+CB+Affinity Guided 
0 
20 
40 
60 
80 
100 
120 
140 
2 14 26 38 50 62 74 8  98 110 122 134 146 158 170 182 194 206 8 23  
Sp
ee
du
p 
Number of threads 
Ideal MA+CB CB MA Baseline +Affinity Guided 
0 
20 
40 
60 
80 
100 
120 
140 
2 14 26 38 50 62 74 86 98 110 122 134 146 158 170 182 194 206 218 230 
Sp
ee
du
p 
Number of threads 
Ideal
0 
20 
40 
60 
80 
100 
120 
140 
2 14 26 38 50 62 74 86 98 110 122 134 146 158 170 182 194 206 218 230 
Sp
ee
du
p 
Number of threads 
Ideal 
MA+CB 
CB 
MA 
baseline 
MA+CB+Affinity 
Figure 4: Impact of optimizations for dynamic and static OpenMP scheduling policies.
number of loop iterations is not evenly divisible by the number of threads).455
In dynamic scheduling, blocks of loop iterations are put into a shared work456
queue. These blocks are then dynamically distributed to the threads. When457
a thread finishes its block, it retrieves the next block of loop iterations from458
the top of the work queue. The size of the blocks is fixed and determined by459
the OpenMP’s chunk parameter. The guided scheduling is similar to dynamic460
scheduling, but the size of the blocks starts o↵ large and decreases to better461
handle load imbalance between iterations. In this case, the OpenMP’s chunk462
parameter defines the approximate minimum size of the block.463
Figure 4 shows the speedup we obtained when the kernel was executed464
with dynamic scheduling with no optimizations (Baseline); only memory align-465
ment (MA); only cache blocking (CB); memory alignment and cache blocking466
(MA+CB). Moreover, it presents the results obtained with static and guided467
scheduling policies when all optimizations are applied. The chunk size used for468
the experiments with dynamic and guided scheduling policies was 1, since it469
presented the best results1. Thread a nity (A nity) was only applied to static470
scheduling. As we will explain later on, thread a nity has almost no impact471
when OpenMP’s dynamic or guided scheduling policies are used.472
Experimental results showed us that memory alignment is essential to make473
a good use of Xeon Phi’s 512-bit vector instructions. However, in order to do so,474
cache blocking is indispensable. In fact, memory alignment alone accounted for475
an average improvement of 24% in comparison to the baseline. The performance476
improvement of this optimization is severely limited by the poor utilization of477
the local caches. On the other hand, when compared to the baseline, a more478
1Experiments with chunks greater than 1 exhibited higher load imbalances resulting in
poorer performance.
14
e cient utilization of the caches through cache blocking alone increased the479
performance by 79% on average. If used in isolation, cache blocking does not480
allow a full use of Xeon Phi’s vectorization capabilities and creates additional481
cache faults due to unaligned memory accesses. By using these two techniques482
at once we can, at the same time, make a good utilization of Xeon Phi’s vectorial483
capabilities and its fast local caches. Indeed, when these two techniques are484
used together we see an average improvement of 92% in performance, and an485
improvement of 91% for 224 threads.486
Dynamic and guided scheduling policies had a better scaling behavior as well487
as better performance than static scheduling. On average, dynamic scheduling488
provided an increase of 14% over static scheduling. Static scheduling has the489
downside of causing imbalances during the execution of the simulation with an490
intermediate number of threads. For instance, for 58 threads, iterations would491
be divided equally between threads, but since a single core would execute two492
threads, it would actually end up executing twice as many iterations as the other493
cores. This behavior can clearly be seen as steps in the speedup graph. However,494
this assumption is not enough to e↵ectively explain why the performance using495
static scheduling is not on par with (or better than) the performance of dynamic496
scheduling when the number of threads is a multiple of the number of cores.497
The reason for that lies on the high usage levels of the NoC during the498
implicit synchronization between the calculation of the velocity and stress com-499
ponents of the stencil. For each timestep, the stress calculation loop only starts500
its execution after all the velocity components of the previous loop have finished501
(cf. Algorithm 2.1). This synchronization makes every thread to try to commu-502
nicate at the same time at the start of each parallel loop. On the static scheduler,503
the concurrency for the NoC might introduce small delays for the beginning of504
the execution of each thread, essentially making the loop’s critical path slightly505
longer. While by itself each one of these delays is not very significant, they506
accumulate at each timestep both at the stress and velocity calculation loops.507
Since the concurrency for the NoC also increases with the number of threads,508
the relative execution performance di↵erence between the static and dynamic509
approaches also increases. This imbalance is compensated on the dynamic case,510
even if the dynamic scheduler has a higher overhead, because it is able to keep511
more threads active up to the end of each loop.512
Guided scheduling also allows for a better work distribution and smaller513
imbalances than the static scheduler between threads at the end of each loop.514
However, the synchronization between velocity and stress loops and the small515
delay introduced during the initial phases of each loop still create slight imbal-516
ances at the end of both the stress and velocity loops of each timestep. Using517
the guided scheduler we were able to achieve near linear speedups up to the518
number of cores. After this point, the dynamic scheduler gradually approaches519
the guided performance to eventually beat it with the best execution time with520
224 threads.521
Finally, for static scheduling, thread a nity was able to improve the perfor-522
mance of MA+CB in 11%, on average. On the other hand, by its nature, dy-523
namic and guided scheduling reduce the importance of thread a nity. Indeed,524
15
scatter and balanced thread placement policies presented very similar results525
in these cases. However, the compact thread placement policy decreased the526
performance with dynamic and guided scheduling by 22% on average, since it527
creates an important imbalance between cores, specially for low thread-counts.528
The loss in performance is reduced as the number of threads increases, from529
50% for 40 threads to 0% for thread counts beyond 222.530
Since Xeon Phi does not o↵er Dynamic Voltage and Frequency Scaling (DVFS),531
the energy consumption on this platform is directly related to the execution time532
of the application. The best execution time was obtained with dynamic schedul-533
ing with 224 threads, providing the smallest energy-to-solution at 4.42 kJ, and534
thus resulting in an improvement of 91% when compared to the baseline (46.76 kJ).535
4.4. Overall Energy and Performance Results536
In this section we compare the performance and energy consumption of our537
seismic wave propagation simulation kernel on MPPA-256 and Xeon Phi against538
other multicore and GPU reference implementations. On Xeon E5, we used539
the solution proposed by Dupros et al. [26], which employs OpenMP for the540
parallelization. On GPU we relied on the solution proposed by Miche´a and541
Komatitsch [27], which is a state-of-the-art parallel solution for GPU-based542
seismic wave propagation simulation.543
DVFS can make a significant di↵erence in both performance and energy544
consumption. Although not available on the manycore processors we evaluated,545
it is available for the Xeon E5 and GPU platforms. Therefore, for these platforms546
we always show two measurements. The first ones, Xeon E5 (2.4 GHz) and Tesla547
K20 (758 MHz), represent the experimental results when their frequencies are548
optimized for performance, i.e., using their maximum working frequencies. The549
second ones, Xeon E5 (1.6 GHz) and Tesla K20 (705 MHz), are relative to the optimal550
energy consumption setting, which for this kernel was 1.6GHz and 705MHz on551
Xeon Phi and Tesla K20, respectively.552
Figure 5 compares the time-to-solution and energy-to-solution across the553
processors using a problem size of 2GB (1803 grid points) and 500 time steps.554
For these experiments we used the optimal number of threads on each platform.555
With the exception of Xeon Phi (in which the best results were obtained with556
224 threads), the thread count was equal to the number of physical cores of557
each processor. As shown in Figure 4, our solution for Xeon Phi keeps scaling558
considerably well past the 57 physical cores.559
To the best of our knowledge, GPUs are among the most energy e cient560
platforms currently in use for seismic wave propagation simulation. Yet, our561
proposed solution on MPPA-256 achieves the best energy-to-solution among the562
analyzed processors, consuming 78%, 77%, 88%, 87% and 86% less energy563
than Tesla K20 (758 MHz), Tesla K20 (705 MHz), Xeon Phi, Xeon E5 (2.4 GHz), and564
Xeon E5 (1.6 GHz), respectively. On the other hand, when we consider the time-565
to-solution, MPPA-256 does not have the upper hand. In fact Tesla K20 had the566
best execution times among all the evaluated architectures. This result was,567
however, to be expected since the theoretical peak performance of Tesla K20568
(3.5 TFLOPS in single precision) is at least 75% higher than that of the second569
16
0.52 
4.42 
2.31 2.23 
3.95 
3.62 
95.2 
25.8 
21.4 22.4 
61.1 
91.3 
0 
20 
40 
60 
80 
100 
0 
1 
2 
3 
4 
5 
MPPA-256 Xeon Phi Tesla K20 
(758 MHz) 
Tesla K20 
(705 MHz) 
Xeon E5 
(2.4 GHz) 
Xeon E5 
(1.6 GHz) 
Ti
m
e-
to
-s
ol
ut
io
n 
(s
) 
En
er
gy
-to
-s
ol
ut
io
n 
(k
J)
 
Energy Time 
Figure 5: Chip-to-chip comparison.
placed architecture Xeon Phi (2.0 TFLOPS in single precision). Our solution570
for Xeon Phi improved the execution performance in 58%, 72% and 73% when571
compared to Xeon E5 (2.4 GHz), Xeon E5 (1.6 GHz) and MPPA-256, respectively.572
When compared to the Tesla K20 (758 MHz), our solution for Xeon Phi was only573
20% slower, which not only is compatible to the nominal performance di↵erence574
between these architectures, but also shows that our solution was able to provide575
hardware utilization levels that are at least as e cient as a state-of-the-art576
solution for GPUs.577
A na¨ıve direct comparison between the raw performance or the average power578
consumption between Altix UV 2000, a multi-processor architecture, to the pre-579
vious single-processor architectures would make little sense. Therefore, we com-580
pare the Altix UV 2000 NUMA platform to the other processors in terms of energy581
e ciency and EDP. Table 1 shows the EDP and ED2P for all platforms (the582
lower the better). A scalability analysis of the seismic wave propagation kernel583
on Altix UV 2000 is discussed by Castro et al. [8].584
When we consider EDP, MPPA-256 and Tesla K20 are clearly superior to585
the other evaluated platforms. If energy-to-solution alone is considered, MPPA-586
256 has a clear advantage. MPPA-256 was conceived as a low-power highly-587
parallel processor. During our evaluation the MPPA-256 consumed on average588
only 6.45W, while Tesla K20, Xeon Phi and consumed 184W, 108W and Altix UV589
2000 1458W, respectively. However Tesla K20 shows a better balance between590
energy consumption and performance than the other manycore processors as591
evidenced by ED2P. The high energy consumption of Xeon Phi places it right592
between MPPA-256/Tesla K20 and Xeon E5 when EDP is considered. However593
when we increase the time-to-solution weight (ED2P), Xeon Phi becomes more594
interesting than MPPA-256.595
The Altix UV 2000 platform has the best trade-o↵ between the time-to-596
solution and energy-to-solution, for both maximum performance and optimal597
energy consumption settings. If we consider ED2P, this di↵erence becomes598
even more noticeable. The significant di↵erences highlighted in the combined599
time and energy evaluation can be explained by the poor arithmetic intensity600
17
Table 1: EDP (in thousands) for a simulation with 1803 grid points and 500 time steps.
Platform EDP ED2P
MPPA-256 49 4,695
Xeon Phi 114 2,934
Tesla K20 (758MHz) 49 1,056
Tesla K20 (705MHz) 50 1,124
Xeon E5 (2.4GHz) 242 14,755
Xeon E5 (1.6GHz) 330 30,143
Altix UV 2000 (2.4GHz) 13 40
Altix UV 2000 (1.6GHz) 17 74
(FLOP/transferred byte ratio) which is characteristic of seismic wave kernels601
based on elastodynamics stencils. These memory-bound applications can ex-602
perience important losses of performance if communications are not carefully603
overlapped with communications. On manycores such as MPPA-256 and Xeon604
Phi, there is an additional overhead created by the limited amount of fast local605
memory which forces the application to frequently employ the NoC during the606
execution. Indeed, Altix UV 2000 has approximately 20 times more fast local607
memory per thread than MPPA-256 and Xeon Phi. This reduces considerably the608
communication overhead associated to accesses to the main memory.609
5. Related Work610
Totoni et al. [6] compared the power and performance of Intel’s Single-Chip611
Cloud Computer (SCC) to other types of CPUs and GPUs. The analysis was612
based on a set of parallel applications implemented with the Charm++ pro-613
gramming model. Although they showed that there is no single solution that614
always achieves the best trade-o↵ between power and performance, the results615
suggest that manycores are an opportunity for the future. Morari et al. [3]616
proposed an optimized implementation of radix sort for the Tilera TILEPro64617
manycore processor. The results showed that the their solution for TILEPro64618
provides much better energy e ciency than an general-purpose multicore pro-619
cessor (Intel Xeon W5590) and comparable energy e ciency with respect to a620
GPU NVIDIA Tesla C2070.621
Francesquini et al. [7] evaluated three di↵erent classes of applications (CPU-622
bound, memory-bound and mixed) using highly-parallel platforms such asMPPA-623
256 and a 24-node, 192-core NUMA platform. They showed that manycore624
architectures can be very competitive, even if the application is irregular in625
nature. Their results showed that MPPA-256 may achieve better performance626
than a traditional general-purpose multicore processor (Intel Xeon E5-4640) on627
CPU-bound and mixed workloads whereas on a memory-bound workload Xeon628
E5 had better performance than MPPA-256. Among the evaluated platforms,629
MPPA-256 presented the best energy e ciency reducing the energy consumed630
on cpu-bound, mixed and memory-bound applications by at least 6.9x, 6.5x631
and 3.8x, respectively.632
18
Using the low-power Adapteva Epiphany-IV manycore, Varghese et al. [5]633
described how a stencil-based solution to the anisotropic heat equation using a634
two-dimensional grid was developed. This manycore has a low power budged635
(2W) and has 64 processing cores. Each core has a local scratchpad of 32 kB and636
no cache-memory. Cores are connected by a NoC which allows direct memory637
access to 1GB of RAM and also to other cores’ local memories. The challenges638
we faced for the development of our seismic wave propagation simulation on the639
MPPA-256 architecture are similar to the ones faced by the authors of this work.640
Similar to MPPA-256, Epiphany-IV has a very limited amount of local memory641
available to each core and no automatic prefetching mechanism exists; every642
data movement has to be explicitly controlled by the application. The authors643
demonstrated that, even if the implementation of an e cient application can644
be a challenging task, their solution was able to achieve a FLOPS/Watt ratio 3645
times better than an Intel 80-core Terascale Processor.646
Adapting seismic wave propagation numerical kernels to emerging parallel647
architectures is an active research topic. Due the versatility of the FDM used648
both for oil and gas applications and standard earthquakes modeling, several649
strategies have been proposed. For instance, Intel and NVIDIA have proposed650
optimized implementations of such stencils [25, 32].651
Dupros et al. [33] presented a review of the scalability issues for distributed652
and shared-memory platforms with a focus on mapping processes/threads on653
hierarchical clusters. Dursun et al. [34] introduced several additional strategies,654
including inter-node and intra-node optimizations. Recently, Christen et al. [35]655
described the use of the Patus framework to optimize the AWP-ODC finite656
di↵erence code. In particular, they detailed the impact of vectorization and657
cache prefetching and reported a performance improvement up to 15% when658
running the complete application with 512 MPI processes.659
Several research e↵orts have been done on the adaptation of seismic wave ker-660
nels on GPUs to overcome the poor arithmetic intensity (FLOPS/transferred661
byte ratio) o↵ered by elastodynamics stencils [27, 36]. The results obtained662
demonstrate that GPUs could be a relevant alternative to standard general-663
purpose processors. Krueger et al. [37] compared the performance and energy664
e ciency of an Intel Nehalem platform, an NVIDIA Tesla GPU and a simulated665
general-purpose manycore chip design optimized for high-order wave equations666
called “Green Wave”. They showed that Green Wave can be up to 8x and 3.5x667
more energy e cient per node when compared with the Nehalem and GPU668
platforms, respectively. Di↵erently from this work, our experiments and mea-669
surements were carried out on real manycore processors. Even if the low-power670
MPPA-256 is not optimized for high-order wave equations as Green Wave, we671
were still able to achieve energy e ciency improvements of at least 86% and672
77% when compared to the Xeon E5 and Tesla K20, respectively.673
6. Conclusion and Perspectives674
The use of highly-parallel manycore processors to meet the high demand for675
data processing capabilities needed by parallel scientific simulations has recently676
19
become one of central points of interest in the HPC community. In this paper,677
we presented our approach to the seismic wave propagation simulation using the678
MPPA-256 and Xeon Phi manycore processors. Although both processors share679
some architectural characteristics, they present di↵erent challenges that must680
be overcome in order to obtain good performance results.681
The low-power MPPA-256 manycore processor has a limited amount of dis-682
tributed low-latency memories that must be explicitly managed by the program-683
mer. Xeon Phi, on the other hand, is a performance-centric manycore processor684
that requires careful source code changes to help the compiler to e↵ectively vec-685
torize time-consuming loops and to improve cache locality. In this article we686
proposed a new multi-level tiling strategy and a software prefetching mechanism687
that allowed us to perform real-world seismic wave propagation simulations and688
at the same time to lighten the communication overhead imposed by the NoC689
on the MPPA-256. On the other hand, on the Xeon Phi architecture we employed690
a cache blocking technique along with memory alignment and thread a nity691
strategies which allowed us to considerably improve the performance and scal-692
ability of our solution.693
Our results showed that our approach to the MPPA-256 is the most energy694
e cient whereas our solution for the Xeon Phi achieves a performance compara-695
ble to the state-of-the-art solution for GPUs. MPPA-256 presented energy con-696
sumption improvements of 77%, 86% and 88% over other solutions for GPU,697
general-purpose multicore and the Xeon Phimanycore architectures, respectively.698
In terms of performance, on the other hand, our solution for Xeon Phi achieved699
performance improvements of 73%, 58% with respect to the solutions for MPPA-700
256 and general-purpose multicore processor, respectively.701
Despite encouraging results, our solutions for MPPA-256 and Xeon Phi many-702
core processors could still be improved. On MPPA-256, the communication still703
consumes a large amount of the total execution time (58%). This is due to704
the high tra c on the NoC to access a single DDR memory. Kalray recently705
announced a multi-MPPA solution that features four MPPA-256 processors on706
the same board with less than 50W of power consumption. In this new solution,707
each MPPA-256 processor can access two DDR3 channels in parallel and MPPA-708
256 processors are interconnected through NoC eXpress interfaces (NoCX), pro-709
viding an aggregate bandwidth of 40GB/s. The benefits of this new solution710
for seismic wave propagation simulations are two-fold: (i) this would allow us711
to deal with input problem sizes of 32GB or more; and (ii) distributing data712
among di↵erentMPPA-256 processors along with the parallel access to two DDR3713
memories in each processor would alleviate the current communication bottle-714
neck. Thus, we plan to work on new versions of our multi-level tiling strategy715
and prefetching scheme to exploit the full potential of multi-MPPA solutions as716
soon as they become available.717
On Xeon Phi, the performance could still be improved by using an auto-tuning718
mechanism to automatically select the best size of the tiles and specific optimiza-719
tions from the compiler based on the input problem and the number of threads720
used. Thus, we intend to study if frameworks such as Pochoir [38] or Patus [35]721
could provide us the essentials to implement the auto-tuning approach.722
20
Finally, we intend to study how our optimizations proposed in this paper723
could be made automatic. One possibility would be to extend PSkel [39], a state-724
of-the-art framework that provides a single high-level abstraction for stencil725
programming, to support MPPA-256 and Xeon Phi manycores. This would allow726
us to implement our proposed optimizations inside the framework, so applica-727
tions implemented in PSkel will automatically benefit from software prefetching728
(MPPA-256) and cache blocking, memory alignment and thread a nity (Xeon729
Phi).730
Acknowledgments731
This work was supported by FAPESP grant 2014/17925-0, PNPD/CAPES,732
the European Community’s Seventh Framework Programme (FP7/2007-2013)733
under the Mont-blanc 2 Project (www.montblanc-project.eu) grant 610402,734
and BRGM Carnot-institute. We would also like to thank Bull and the Center735
for Excellence in Parallel Programing (CEPP) for granting us access to their736
hardware infrastructure which was used in our experiments with Xeon Phi.737
References738
[1] N. Rajovic, N. Puzovic, L. Vilanova, C. Villavieja, A. Ramirez, The low-739
power architecture approach towards exascale computing, in: Workshop on740
Scalable Algorithms for Large-scale Systems (ScalA), ACM, Seattle, USA,741
2011, pp. 1–2.742
[2] D. Go¨ddeke, D. Komatitsch, M. Geveler, D. Ribbrock, N. Rajovic, N. Pu-743
zovic, A. Ramirez, Energy e ciency vs. performance of the numerical so-744
lution of PDEs: An application study on a low-power ARM-based cluster,745
Journal of Computational Physics 237 (2013) 132–150.746
[3] A. Morari, A. Tumeo, O. Villa, S. Secchi, M. Valero, E cient sorting on747
the Tilera manycore architecture, in: International Symposium on Com-748
puter Architecture and High Performance Computing (SBAC-PAD), IEEE749
Computer Society, New York, USA, 2012, pp. 171–178.750
[4] B. D. de Dinechin, P. G. de Massasa, G. Lagera et al., A distributed run-751
time environment for the Kalray MPPA-256 integrated manycore processor,752
in: International Conference on Computational Science (ICCS), Elsevier,753
Barcelona, Spain, 2013, pp. 1654–1663.754
[5] A. Varghese, B. Edwards, G. Mitra, A. P. Rendell, Programming the755
Adapteva Epiphany 64-core network-on-chip coprocessor, in: International756
Parallel Distributed Processing Symposium Workshops (IPDPSW), IEEE757
Computer Society, Phoenix, USA, 2014, pp. 984–992.758
[6] E. Totoni, B. Behzad, S. Ghike, J. Torrellas, Comparing the power and759
performance of Intel’s SCC to state-of-the-art CPUs and GPUs, in: In-760
ternational Symposium on Performance Analysis of Systems and Software761
(ISPASS), IEEE Computer Society, New Brunswick, Canada, 2012, pp.762
78–87.763
21
[7] E. Francesquini, M. Castro, , P. H. Penna, F. Dupros, H. C. de Fre-764
itas, P. O. A. Navaux, J.-F. Me´haut, On the energy e ciency and perfor-765
mance of irregular application executions on multicore, NUMA and many-766
core platforms, Journal of Parallel and Distributed Computing , In Press.767
doi:10.1016/j.jpdc.2014.11.002.768
[8] M. Castro, F. Dupros, E. Francesquini, J.-F. Me´haut, P. O. A. Navaux, En-769
ergy e cient seismic wave propagation simulation on a low-power many-770
core processor, in: International Symposium on Computer Architecture771
and High Performance Computing (SBAC-PAD), IEEE Computer Society,772
Paris, France, 2014, pp. 57–64.773
[9] E. Francesquini, A. Goldman, J.-F. Mehaut, A NUMA-Aware Runtime774
Environment for the Actor Model, in: International Conference on Parallel775
Processing (ICPP), IEEE, Lyon, France, 2013, pp. 250–259.776
[10] L. L. Pilla, C. P. Ribeiro, D. Cordeiro, C. Mei, A. Bhatele, P. O. Navaux,777
F. Broquedis, J.-F. Mehaut, L. V. Kale, A hierarchical approach for load778
balancing on parallel multi-core systems, Proceedings of the 41st Interna-779
tional Conference on Parallel Processing 0 (2012) 118–127.780
[11] M.J. Rashti et al., Multi-core and network aware MPI topology functions,781
in: Proceedings of the 18th European MPI Users’ Group conference on782
Recent advances in the message passing interface, EuroMPI’11, 2011, pp.783
50–60.784
[12] G. Mercier, J. Clet-Ortega, Towards an e cient process placement policy785
for mpi applications in multicore environments, in: Recent Advances in786
Parallel Virtual Machine and Message Passing Interface, Springer, 2009,787
pp. 104–115.788
[13] J. McCalpin, Paleoseismology, 2nd Edition, Vol. 95, Academic press, 2009,789
Appendix 1.790
[14] J. Lysmer, L. Drake, A finite element method for seismology, Methods in791
Computational Physics 11 (1972) 181–216.792
[15] D. Komatitsch, J. P. Vilotte, The spectral-element method: An e cient793
tool to simulate the seismic response of 2D and 3D geological structures,794
Bulletin of the Seismological Society of America 88 (2) (1998) 368–392.795
[16] M. Dumbser, M. Ka¨ser, An arbitrary high-order discontinuous Galerkin796
method for elastic waves on unstructured meshes II: The three-dimensional797
isotropic case, Geophys. J. Int. 167 (1) (2006) 319–336.798
[17] J. Virieux, P-SV wave propagation in heterogeneous media: Velocity-stress799
finite-di↵erence method, Geophysics 51 (1986) 889–901.800
[18] E. H. Saenger, N. Gold, S. A. Shapiro, Modeling the propagation of elastic801
waves using a modified finite-di↵erence grid, Wave Motion 31 (1999) 77–92.802
[19] W. Zhang, Z. Zhang, X. Chen, Three-dimensional elastic wave numerical803
modeling in the presence of surface topography by a collocated-grid finite-804
di↵erence method on curvilinear grids, Geophysical Journal International805
190 (1) (2012) 358–378.806
[20] H. Aochi, T. Ulrich, A. Ducellier, F. Dupros, D. Michea, Finite di↵er-807
ence simulations of seismic wave propagation for understanding earthquake808
22
physics and predicting ground motions: Advances and challenges, Journal809
of Physics: Conference Series 454 (2013) 1–9.810
[21] P. Aubry, P.-E. Beaucamps et al., Extended cyclostatic dataflow program811
compilation and execution for an integrated manycore processor, in: Proce-812
dia Computer Science, Vol. 18, Elsevier, Barcelona, Spain, 2013, Ch. 2013813
International Conference on Computational Science, pp. 1624–1633.814
[22] K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf, K. Yelick, Optimization815
and performance modeling of stencil computations on modern microproces-816
sors, SIAM Review 51 (1) (2009) 129–159.817
[23] M. Bianco, B. Cumming, A generic strategy for multi-stage stencils, in:818
International Conference on Parallel and Distributed Computing (Euro-819
Par), ACM, Porto, Portugal, 2014, pp. 584–595.820
[24] G. Rivera, C.-W. Tseng, Tiling optimizations for 3D scientific computa-821
tions, in: Supercomputing Conferece (SC), ACM, Dallas, USA, 2000, pp.822
32–38.823
[25] J. Reinders, J. Je↵ers, High Performance Parallelism Pearls: Multicore824
and Many-core Programming Approaches, 1st Edition, Morgan Kaufmann,825
2014.826
[26] F. Dupros, C. Pousa, A. Carissimi, J.-F. Me´haut, Parallel simulations of827
seismic wave propagation on NUMA architectures, in: International Con-828
ference on Parallel Computing (ParCo), IOS Press, Lyon, France, 2010, pp.829
67–74.830
[27] D. Miche´a, D. Komatitsch, Accelerating a three-dimensional finite-831
di↵erence wave propagation code using GPU graphics cards, Geophysical832
Journal International 182 (1) (2010) 389–402.833
[28] M. Horowitz, T. Indermaur, R. Gonzalez, Low-power digital design, in:834
Low Power Electronics, 1994. Digest of Technical Papers., IEEE Sympo-835
sium, 1994, pp. 8–11. doi:10.1109/LPE.1994.573184.836
[29] A. J. Martin, Towards an energy complexity of computation, Information837
Processing Letters 77 (2–4) (2001) 181 – 187. doi:http://dx.doi.org/838
10.1016/S0020-0190(00)00214-3.839
[30] M. Ha¨hnel, B. Do¨bel, M. Vo¨lp, H. Ha¨rtig, Measuring energy consump-840
tion for short code paths using RAPL, ACM SIGMETRICS Performance841
Evaluation Review 40 (3) (2012) 13–17.842
[31] G. Lawson, M. Sosonkina, Y. Shen, Energy evaluation for applications with843
di↵erent thread a nities on the Intel Xeon Phi, in: Workshop on Appli-844
cations for Multi-Core Architectures (WAMCA), IEEE Computer Society,845
2014, pp. 54–59.846
[32] P. Micikevicius, 3D finite-di↵erence computation on GPUs using CUDA, in:847
Workshop on General Purpose Processing on Graphics Processing Units,848
ACM, Washington, USA, 2009, pp. 79–84.849
[33] F. Dupros, H.-T. Do, H. Aochi, On scalability issues of the elastodynamics850
equations on multicore platforms, in: International Conference on Compu-851
tational Science (ICCS), Barcelona, Spain, 2013, pp. 1226–1234.852
[34] H. Dursun, K. ichi Nomura, L. Peng et al., A multilevel parallelization853
23
framework for high-order stencil computations, in: International Confer-854
ence on Parallel and Distributed Computing (Euro-Par), ACM, Delft, The855
Netherlands, 2009, pp. 642–653.856
[35] M. Christen, O. Schenk, Y. Cui, PATUS for convenient high-performance857
stencils: evaluation in earthquake simulations, in: Supercomputing Con-858
ferece (SC), IEEE Computer Society, 2012, pp. 11:1–11:10.859
[36] R. Abdelkhalek, H. Calandra, O. Coulaud, G. Latu, J. Roman, Fast seismic860
modeling and reverse time migration on a graphics processing unit cluster,861
Concurrency and Computation: Practice and Experience 24 (7) (2012)862
739–750.863
[37] J. Krueger, D. Donofrio, J. Shalf, M. Mohiyuddin, S. Williams, L. Oliker,864
F.-J. Pfreund, Hardware/software co-design for energy-e cient seismic865
modeling, in: Supercomputing Conferece (SC), IEEE Computer Society,866
2011, pp. 73:1–73:12.867
[38] Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, C. E. Leiser-868
son, The pochoir stencil compiler, in: ACM Symposium on Parallelism in869
Algorithms and Architectures (SPAA), ACM, San Jose, USA, 2011, pp.870
117–128.871
[39] A. D. Pereira, L. Ramos, L. F. W. Go´es, PSkel: A stencil programming872
framework for CPU-GPU systems, Concurrency and Computation: Prac-873
tice and Experience.doi:10.1002/cpe.3479.874
24
