On the Energy Efficiency and Performance of Irregular Application Executions on Multicore, NUMA and Manycore Platforms by Francesquini, Emilio et al.
Emilio Francesquini, Márcio Castro, Pedro Penna, Fabrice Dupros, Henrique
Freitas, Philippe Olivier Alexandre Navaux, Jean-François Méhaut
To cite this version:
Emilio Francesquini, Márcio Castro, Pedro Penna, Fabrice Dupros, Henrique Freitas, et al.
On the Energy Efficiency and Performance of Irregular Application Executions on Multicore,
NUMA and Manycore Platforms. Journal of Parallel and Distributed Computing, Elsevier,
2015, 76, pp. 32-48. <10.1016/j.jpdc.2014.11.002>. <hal-01092325>
HAL Id: hal-01092325
https://hal-brgm.archives-ouvertes.fr/hal-01092325
Submitted on 8 Dec 2014
Distributed under a Creative Commons Attribution - NonCommercial 4.0 International License
On the Energy Efficiency and Performance of Irregular
Application Executions on Multicore, NUMA and Manycore
Platforms
Emilio Francesquini (a,b,g), Márcio Castro (c,d), Pedro H. Penna (e), Fabrice Dupros (f), Henrique C.
Freitas (e), Philippe O. A. Navaux (c), Jean-François Méhaut (g)
(a) Institute of Computing, University of Campinas (UNICAMP)
Av. Albert Einstein, 1251 - Cidade Universitária - 13083-852 - Campinas, Brazil
(b) Institute of Mathematics and Statistics, University of São Paulo (USP)
Rua do Matão, 1010 - Cidade Universitária - 05508-090 - São Paulo - Brazil
(c) Institute of Informatics, Federal University of Rio Grande do Sul (UFRGS)
Av. Bento Gonçalves, 9500 - Campus do Vale - 91501-970 - Porto Alegre - Brazil
(d) Department of Informatics and Statistics, Federal University of Santa Catarina (UFSC)
Campus Reitor João David Ferreira Lima - Trindade - 88040-970 - Florianópolis - Brazil
(e) Department of Computer Science, Pontifical Catholic University of Minas Gerais (PUC Minas)
Avenida Dom José Gaspar, 500 - 30535-901 - Belo Horizonte - Brazil
(f) Bureau de Recherches Géologiques et Minières (BRGM)
BP 6009, 45060 Orléans Cedex 2, France
(g) CEA-DRT - LIG Laboratory, University of Grenoble
110 Avenue de la Chimie, 38400 Saint-Martin d'Hères, France
Abstract
Until the last decade, the performance of HPC architectures was quantified almost exclusively by their processing power. Recently, however, energy efficiency has come to be considered as important as raw performance and has become a critical aspect of the development of scalable systems. These strict energy constraints guided the development of a new class of so-called light-weight manycore processors. This study evaluates the computing and energy performance of two well-known irregular NP-hard problems, the Traveling-Salesman Problem (TSP) and K-Means clustering, and of a numerical seismic wave propagation simulation kernel, Ondes3D, on multicore, NUMA, and manycore platforms. First, we concentrate on the nontrivial task of adapting these applications to a manycore, specifically the novel MPPA-256 manycore processor. Then, we analyze their performance and energy consumption on those different machines. Our results show that applications able to fully use the resources of a manycore can have better performance and may consume from 3.8x to 13x less energy when compared to low-power and general-purpose multicore processors, respectively.
Keywords: Manycore, multicore, NUMA, energy efficiency, performance, TSP, seismic wave
propagation, k-means
Email addresses: francesquini@ic.unicamp.br (Emilio Francesquini), marcio.castro@inf.ufsc.br
(Márcio Castro), pedro.penna@sga.pucminas.br (Pedro H. Penna), f.dupros@brgm.fr (Fabrice Dupros),
cota@pucminas.br (Henrique C. Freitas), navaux@inf.ufrgs.br (Philippe O. A. Navaux),
jean-francois.mehaut@imag.fr (Jean-François Méhaut)
Preprint submitted to Journal of Parallel and Distributed Computing August 21, 2014
1. Introduction
Demand for higher processor performance has led chipmakers to adopt solutions that combine brute force and innovation. For the last decades, increases in cache size, instruction-level parallelism and working frequency have been their main tools to accomplish this mission. However, these approaches seem to have reached a point at which they are not, by themselves, sufficient to ensure the steep curve of performance improvement predicted by Moore's Law and expected by users [1].
An exponential increase in power consumption related to a linear increase in the clock frequency [2], together with the higher complexity of designing new processors, changed the course of processor development. Power consumption has become a critical aspect of the development of both large- and small-scale systems. This concern is now enough to warrant research on the use of embedded low-power processors to create the next generation of HPC systems. For instance, the European Mont-Blanc project [3] was created to evaluate the use of such components in an HPC environment [4]. While these low-power multicore processors usually do not offer the same performance as their regular counterparts, they normally offer better energy-to-solution results.
Current highly-parallel processors take this paradigm even further. They normally possess hundreds (sometimes thousands) of cores which execute with high energy efficiency. The execution model of these processors usually follows one of two different approaches. Light-weight manycore processors, such as Tilera Tile-Gx [5] and Kalray MPPA-256 [6], offer autonomous cores and a shared memory execution model. In this case, traditional tools such as POSIX threads are employed to accomplish both data and task parallelism. The use of these tools may ease the paradigm shift from multicores to manycores, since several parallel applications developed for multicores rely on this model. Graphics Processing Units (GPUs), in contrast, follow an approach based on a Single Program, Multiple Data (SPMD) model, relying on runtime APIs such as CUDA and OpenCL. Thus, considerable effort may be necessary to adapt parallel code originally developed for multicores to GPUs. Here we are interested in the former.
In this paper we describe three different irregular applications and the adaptations necessary to use them on four distinct hardware platforms. The applications are the Traveling Salesman Problem (TSP), the K-Means clustering (K-Means) algorithm, and a Seismic Wave Propagation kernel (Ondes3D). The TSP and K-Means problems are NP-hard and, for a large enough instance, the algorithm can be parallelized to make use of an arbitrary number of threads, assuring the complete use of the chosen platforms. Ondes3D, on the other hand, employs a numerical seismic wave propagation simulation algorithm. These applications were chosen because they represent three different behaviors: CPU-bound (TSP), memory-bound (Ondes3D), and mixed (K-Means). However, while all of them are highly parallelizable, they also reveal important issues related to imbalance and irregularity: the execution course for the same instance of the problem can drastically change depending on the order and on the number of processor cores employed.
We consider two important aspects in this study. The first aspect concerns the programming issues and challenges encountered when adapting these irregular applications for the MPPA-256 manycore processor. The use of a Network-on-Chip (NoC) for communication and the absence of cache coherence protocols are among the important factors that make the development of parallel applications on this processor nontrivial. Additionally, processors such as MPPA-256 have important memory constraints, e.g., a limited amount of directly addressable memory (2 MB). Furthermore, efficient execution on this processor requires data transfers that conform to the NoC topology to mitigate the otherwise high communication costs. The lessons learned give some insights into what can be faced when adapting parallel applications to manycores.
The second aspect concerns the performance and energy consumption of multicores and manycores. Our experiments were carried out on four different hardware platforms: Intel Xeon E5, SGI Altix UV 2000, Samsung Exynos 5, and Kalray MPPA-256. The first two are composed of general-purpose processors, while the remaining two are based on embedded low-power processors. We compare the overall performance of these platforms as well as their power efficiency. Our results show that the energy-to-solution for the same instance of the problem can vary considerably between the experimental platforms. For every application, MPPA-256 presented the best energy-to-solution, consuming from 3.8x to 6.9x less energy than the second best platform (Exynos 5). When compared to Xeon E5 and Altix UV 2000, MPPA-256 consumed respectively from 5.7x to 13.1x and from 8.5x to 12.3x less energy. The time-to-solution, on the other hand, was dominated by the Altix UV 2000 platform. MPPA-256 and Xeon E5 showed approximately equivalent performance, although with a clear advantage to Xeon E5 for memory-bound applications. Relative to the Altix UV 2000 platform, the execution of the applications was on average from 13.0x to 20.4x, 108.8x to 154.8x and 13.0x to 15.3x slower on the Xeon E5, Exynos 5 and MPPA-256 platforms, respectively. Next, we compare the Altix UV 2000 and the MPPA-256 platforms. Although very different from each other, these platforms share some similarities that give us the opportunity to evaluate important aspects of their scalability and energy efficiency. We concluded that both architectures scale well for the chosen applications. While at low core counts MPPA-256 may have a higher energy-to-solution, it can quickly close the gap, since an increase in the number of cores results in only a small increase in the average power consumption.
The remainder of this paper is organized as follows. Section 2 outlines the evaluated platforms. Section 3 gives a high-level description of TSP, K-Means and Ondes3D and details their algorithms. Next, Section 4 discusses the challenges encountered when passing from multicores to the MPPA-256 manycore processor. Then, Section 5 presents performance and energy efficiency evaluations. Finally, we discuss related work in Section 6 and conclude in Section 7.
2. Experimental Platforms
In this section we describe the experimental platforms used in this study. These platforms represent three different classes: general-purpose, embedded and manycore.
2.1. General-Purpose
Xeon E5. The Intel Xeon E5 is a 64-bit x86-64 processor. In this study we used a Xeon E5-4640 Sandy Bridge-EP, which has 8 physical cores running at 2.40 GHz. Each core has 32 KB instruction and 32 KB data L1 caches and 256 KB of L2 cache. All 8 cores share a 20 MB L3 cache and the platform has 32 GB of DDR3 memory.
Altix UV 2000. SGI Altix UV 2000 (Figure 1) is a Non-Uniform Memory Access (NUMA) platform designed by SGI. The platform is composed of 24 NUMA nodes. Each node has a Xeon E5-4640 Sandy Bridge-EP processor (with the same specifications as the Xeon E5 platform) and 32 GB of DDR3 memory. This memory is shared in a ccNUMA fashion through SGI's proprietary NUMAlink6 (bidirectional). This high-speed interconnection provides a point-to-point bandwidth of 6.7 GB/s per direction. Overall, this platform has 192 physical cores.
Figure 1: A simplified view of Altix UV 2000. (Diagram: 24 NUMA nodes, Node 0 to Node 23; each node holds 32 GB of main memory and an Intel Xeon E5 with cores C0 to C7, two hardware threads per core, a private L2 cache per core and a shared L3 cache.)
2.2. Embedded
Exynos 5. Samsung Exynos 5 is a Multiprocessor System-on-Chip (MPSoC) that implements the recent ARM big.LITTLE heterogeneous computing architecture. In this study we used the ODROID-XU+E board, which features a Samsung Exynos 5 5410 Octa processor. This processor integrates a Cortex-A15 quad-core running at 1.6 GHz with a Cortex-A7 quad-core running at 1.2 GHz on the same chip. Both CPUs are connected to 2 GB of LP-DDR3 memory. This processor uses a clustered model approach: the operating system scheduler is only aware of four out of the total of eight processing cores. If at any point in time the load on the four Cortex-A7 cores surpasses a pre-established threshold, the processor itself switches the execution to the Cortex-A15 cores. This is done in a way that is transparent to the operating system. The rationale here is that, while the Cortex-A15 cores provide better performance, they also incur higher energy consumption, so energy can be saved by switching between the two sets of cores.
2.3. Manycore
MPPA-256. Kalray MPPA-256 is a single-chip manycore processor developed by Kalray that integrates 256 user cores and 32 system cores. It uses 28 nm CMOS technology and runs at 400 MHz. These cores are distributed across 16 compute clusters and 4 I/O subsystems that communicate through data and control NoCs. This processor targets parallel applications whose programming models fall within the following classes: Kahn Process Networks (KPN), as motivated by media processing; SPMD, traditionally used for numerical kernels; and time-triggered control systems [7, 6].
Figure 2 shows an overview of the MPPA-256 architecture. It features two types of cores: Processing Elements (PE) and Resource Managers (RM). Although RMs and PEs implement the same Very Long Instruction Word (VLIW) architecture, they have different purposes: PEs are dedicated to running user threads (one thread per PE) in non-interruptible and non-preemptible mode, whereas RMs execute kernel routines and services of NoC interfaces. Operations executed by RMs vary from task and communication management to I/O data exchanges with either external buses (e.g. PCIe) or SDRAM. For this reason, RMs have privileged connections to NoC interfaces. Both PEs and RMs feature private 2-way associative instruction and data caches.
PEs and RMs are grouped within compute clusters and I/O subsystems. Each compute cluster features 16 PEs, 1 RM and a local shared memory of 2 MB, which provides a high-bandwidth, high-throughput interconnection between PEs. Each I/O subsystem relies on 4 RMs with a shared D-cache, static memory and external DDR access (2 GB). Contrary to the RMs available on compute clusters, the RMs of I/O subsystems also run user code. An important peculiarity of the MPPA-256 architecture is that it does not support cache coherence between PEs, even among those in the same compute cluster.
Figure 2: A simplified view of the MPPA-256. (Diagram: compute clusters, each with PE0 to PE15, an RM, a shared memory and D-NoC/C-NoC interfaces, surrounded by four I/O subsystems with RMs and PCIe/DDR interfaces.)
Parallel applications running on MPPA-256 usually follow the master/worker pattern. The master process runs on an RM of the I/O subsystem and is responsible for spawning worker processes. These processes are then executed on compute clusters and each process may create up to 16 POSIX threads, one for each PE. In other words, the master process running on the I/O subsystem must spawn 16 worker processes and each process must create 16 threads in order to make full use of the 256 cores.
Compute clusters and I/O subsystems are connected by two parallel NoCs, the Data NoC (D-NoC) and the Control NoC (C-NoC). Both NoCs have bi-directional links, and there is one NoC node per compute cluster, which is controlled by the RM. I/O subsystems, on the other hand, have 4 NoC nodes, each one associated with a D-NoC router and a C-NoC router. The D-NoC is dedicated to high-bandwidth data transfers whereas the C-NoC is dedicated to peripheral D-NoC flow control, power management and application messages.
2.4. Synthesis
As previously mentioned, Xeon E5, Altix UV 2000, Exynos 5 and MPPA-256 represent different platform classes. Both Xeon E5 and Altix UV 2000 belong to the class of general-purpose platforms usually found in servers. These platforms are tuned for performance rather than energy efficiency. Unlike those performance-centric platforms, Exynos 5 targets mobile devices, in which power is one of the most important concerns. Finally, MPPA-256 belongs to the light-weight manycore platform class. It presents a high density of cores in a single chip but is still more energy efficient than general-purpose processors. In the next section we describe the three applications we used in this paper to analyze the performance and the energy efficiency of these platforms. These applications can also be categorized into distinct behavioral classes, which allows us to carry out this study in a comprehensive yet simple manner.
3. Case Studies
The execution performance of an application can vary widely depending on the hardware platform on which it is executed. Applications are normally categorized by their behavior, taking into account the execution aspect that influences their performance the most. For instance, an application in which the time used for memory accesses is the performance bottleneck is said to be memory-bound. Similarly, an application in which the execution bottleneck is the computation time is said to be CPU-bound. As we show in Section 5, an application that is CPU-bound on one hardware architecture can become memory-bound on other architectures if the underlying hardware characteristics are not taken into consideration.
To highlight the impact different amounts of computation and communication can have on the execution performance and the energy efficiency of the experimental hardware platforms, we chose three applications with three distinct behaviors (CPU-bound, memory-bound and mixed). We now detail these applications further.
3.1. Traveling-Salesman Problem
The TSP consists of solving the routing problem of a hypothetical traveling salesman. Such a route must pass through n cities, only once per city, return to the city of origin and have the shortest possible length. It is a very well studied NP-hard problem. More formally, the problem can be represented as a complete undirected graph G = (V, E), |V| = n, where each edge (i, j) ∈ E has an associated cost c(i, j) ≥ 0 representing the distance from city i to city j (Figure 3a). The goal is to find a Hamiltonian cycle with minimum cost that visits each city only once and finishes at the city of departure.
Figure 3: Example of TSP with 4 cities. (a) The cost graph. (b) The search tree, with pruned branches shaded.
There are several different approaches to solve this problem [8]. These solutions normally employ brute force, simple or complex heuristics, approximation algorithms or a mix of them. We do not detail the different available approaches, since our aim is the evaluation and performance comparison of an embarrassingly parallel non-numerical application across different architectures. Therefore, we use a brute force exact algorithm based on a simple heuristic [9]. We first explain the sequential version of our algorithm, then we explain how we extended it to work with multiple threads. Finally, we present its distributed version.
3.1.1. Sequential Algorithm
The sequential version of the algorithm is based on the branch and bound method using brute force. Algorithm 3.1 outlines this solution. It takes as input the number of cities and a cost matrix, and outputs the minimum path length.
Algorithm 3.1: TSP-Sequential(n_cities, costs)
    global min_path
    procedure tsp_solve(last_city, current_cost, cities)
        if cities = ∅
            then return (current_cost)
        for each i ∈ cities do
            new_cost ← current_cost + costs[last_city, i]
            if new_cost < min_path then
                new_min ← tsp_solve(i, new_cost, cities \ {i})
                atomic_update_if_less(min_path, new_min)
    main
        min_path ← ∞
        tsp_solve(1, 0, {2, 3, ..., n_cities})
        output (min_path)
This algorithm does a depth-first search looking for the shortest path and has complexity O(n!). It does not explore paths that are already known to be longer than the best path found so far, therefore discarding fruitless branches. Figure 3b shows this behavior. The shaded edges are those that the algorithm does not follow, since a possible solution that includes them would be more costly than the one it has already identified. This simple pruning technique greatly improves the performance of the algorithm. However, it also introduces irregularities into the search space. The search depth needed to discard one of the branches depends on the order in which the branches were searched.
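The pruning rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `tsp_sequential`, the 0-based city numbering and the 4-city cost matrix `COSTS` are assumptions made for the example. As in Algorithm 3.1, the cost of a path is the sum of its hops from the starting city.

```python
import math

# Illustrative symmetric cost matrix for 4 cities (assumed values).
COSTS = [
    [0, 12, 23, 10],
    [12, 0, 25, 17],
    [23, 25, 0, 15],
    [10, 17, 15, 0],
]


def tsp_sequential(costs):
    """Depth-first branch and bound over all orderings that start at city 0."""
    best = math.inf  # plays the role of min_path

    def tsp_solve(last_city, current_cost, cities):
        nonlocal best
        if not cities:
            # Complete path: keep it only if it improves on the best so far.
            best = min(best, current_cost)
            return
        for i in sorted(cities):
            new_cost = current_cost + costs[last_city][i]
            # Prune: a partial path that is not shorter than the best
            # complete path found so far cannot lead to a better solution.
            if new_cost < best:
                tsp_solve(i, new_cost, cities - {i})

    tsp_solve(0, 0, frozenset(range(1, len(costs))))
    return best


shortest = tsp_sequential(COSTS)
```

Changing the iteration order inside `tsp_solve` changes how early a good bound is found, and therefore how much of the tree is pruned, which is the source of irregularity discussed above.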
3.1.2. Multi-threaded Algorithm
The multi-threaded version of the algorithm works by creating a queue of tasks from which each thread takes the jobs to be executed. A task is nothing more than one of the branches of the search tree. The generation of the tasks is done sequentially, since the time needed to do it is negligible. As soon as one thread runs out of work, it takes a new task from the queue. The number of tasks to be generated is a function of the number of threads and is defined by the max_hops parameter. This is the minimum number of levels of the search tree that must be descended so that there is a minimum (parameterizable) number of tasks per thread. The total number of tasks as a function of levels l and cities n can be determined by the following recurrence relation (Equation 1), which is defined for 0 ≤ l < n.
$$
t(l, n) =
\begin{cases}
1 & \text{if } l = 0 \\
t(l - 1, n) \cdot (n - l) & \text{otherwise}
\end{cases}
\qquad (1)
$$
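As a small worked example of Equation 1, the sketch below evaluates the recurrence iteratively and searches for the smallest level that yields enough tasks. The helper names `num_tasks` and `levels_needed`, and the idea of expressing the target as `n_threads * tasks_per_thread`, are assumptions made for illustration; the source only states that max_hops is chosen so that each thread gets a minimum number of tasks.

```python
def num_tasks(l, n):
    """t(l, n) from Equation 1, defined for 0 <= l < n."""
    if not 0 <= l < n:
        raise ValueError("Equation 1 is defined only for 0 <= l < n")
    total = 1
    for level in range(1, l + 1):
        total *= n - level  # t(level, n) = t(level - 1, n) * (n - level)
    return total


def levels_needed(n, n_threads, tasks_per_thread):
    """Smallest l such that t(l, n) >= n_threads * tasks_per_thread.

    Falls back to the deepest level (n - 1) if the target is unreachable.
    """
    target = n_threads * tasks_per_thread
    for l in range(n):
        if num_tasks(l, n) >= target:
            return l
    return n - 1
```

For instance, with n = 16 cities the first levels give 1, 15 and 210 tasks, so two levels are enough to provide 10 tasks to each of 16 threads.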
Algorithm 3.2 shows the pseudo-code for this approach. This algorithm also receives as a parameter the number of threads n_threads to be used.
Algorithm 3.2: TSP-Multi-threaded(n_cities, costs, n_threads, max_hops)
    global queue, min_path
    procedure generate_tasks(n_hops, last_city, current_cost, cities)
        if n_hops = max_hops then
            task ← (last_city, current_cost, cities)
            enqueue_task(queue, task)
        else
            for each i ∈ cities do
                if last_city = none
                    then last_cost ← 0
                    else last_cost ← costs[last_city, i]
                new_cost ← current_cost + last_cost
                generate_tasks(n_hops + 1, i, new_cost, cities \ {i})
    procedure do_work()
        while queue ≠ ∅ do
            (last_city, current_cost, cities) ← atomic_dequeue(queue)
            tsp_solve(last_city, current_cost, cities)
    main
        min_path ← ∞
        generate_tasks(0, none, 0, {1, 2, ..., n_cities})
        for i ← 1 to n_threads
            do spawn_thread(do_work())
        wait_every_child_thread()
        output (min_path)
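The task-queue structure of Algorithm 3.2 can be sketched with Python's standard `queue` and `threading` modules. This is a structural illustration under stated assumptions, not the authors' code: the cost matrix `COSTS` is made up, cities are numbered from 0, and the sketch fixes the starting city to 0 instead of starting from `none`, so that max_hops = l produces t(l, n) tasks as in Equation 1. In CPython the global interpreter lock prevents real speedup for this CPU-bound work, so the sketch shows the coordination pattern only.

```python
import math
import queue
import threading

# Illustrative symmetric cost matrix for 4 cities (assumed values).
COSTS = [
    [0, 12, 23, 10],
    [12, 0, 25, 17],
    [23, 25, 0, 15],
    [10, 17, 15, 0],
]


def tsp_multithreaded(costs, n_threads, max_hops):
    tasks = queue.Queue()
    best = [math.inf]             # shared min_path
    best_lock = threading.Lock()

    def generate_tasks(n_hops, last_city, current_cost, cities):
        # Stop descending at max_hops (or when the branch is already a leaf).
        if n_hops == max_hops or not cities:
            tasks.put((last_city, current_cost, cities))
            return
        for i in sorted(cities):
            new_cost = current_cost + costs[last_city][i]
            generate_tasks(n_hops + 1, i, new_cost, cities - {i})

    def tsp_solve(last_city, current_cost, cities):
        if not cities:
            with best_lock:       # atomic update-if-less
                if current_cost < best[0]:
                    best[0] = current_cost
            return
        for i in sorted(cities):
            new_cost = current_cost + costs[last_city][i]
            # Unlocked read: a stale bound only weakens pruning.
            if new_cost < best[0]:
                tsp_solve(i, new_cost, cities - {i})

    def do_work():
        while True:
            try:
                task = tasks.get_nowait()
            except queue.Empty:
                return            # queue drained: this worker is done
            tsp_solve(*task)

    # Assumption: the path starts at city 0.
    generate_tasks(0, 0, 0, frozenset(range(1, len(costs))))
    workers = [threading.Thread(target=do_work) for _ in range(n_threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return best[0]


result = tsp_multithreaded(COSTS, n_threads=4, max_hops=1)
```

All tasks are generated before any worker starts, which mirrors the sequential task generation described above; the order in which workers finish tasks still affects how quickly the shared bound tightens.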
3.1.3. Distributed Algorithm
The distributed algorithm is similar to the multi-threaded version. It receives as an additional parameter the number of distributed peers to be used. The number of peers and the number of threads define the total number of lines of execution. For each peer, n_threads threads will be created, thus totaling n_threads × n_peers threads. Inside each peer, the execution is nearly identical to that of the multi-threaded case. The only difference is that when min_path is updated, this update is broadcast to every other peer so they can also use it to optimize their execution. At the end of the execution, one of the peers (typically the 0-th) prints the solution. The final solution might have been discovered by any one of the peers, but all of them are aware of it due to the broadcasts of each discovered min_path.
To avoid two peers working on the same subproblem, each peer peer_id only works on the tasks which were assigned to it. To do so, we specify the desired number of partitions per peer. We also specify the percentage of the tasks that will be distributed at the beginning of the execution. Afterwards, as the peers run out of work, they ask a master peer for more partitions. To reduce communication, the master peer sends sets of partitions of decreasing size at each request [10]. The rationale is that, as the task sizes are irregular, distributing a smaller number of partitions towards the end of the execution might decrease the imbalance between the peers. In this case, for each request the master peer sends a set of partitions S and the peer peer_id will work on the tasks such that task_index mod n_partitions ∈ S. Since the task generation is done locally, the amount of transferred data can be minimized.
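The following Python sketch illustrates the two ideas in the paragraph above: handing out sets of partitions whose size shrinks over time, and selecting local tasks with task_index mod n_partitions ∈ S. The shrinking rule used here (each request receives the remaining partitions divided by twice the number of peers, with a minimum of one) is an assumption chosen for illustration; the source only states that the set sizes decrease and cites [10]. The names `decreasing_batches` and `tasks_for` are likewise hypothetical.

```python
def decreasing_batches(n_partitions, n_peers):
    """Yield sets of partition ids; later sets are never larger than earlier ones.

    Assumed rule: each request gets remaining // (2 * n_peers) partitions,
    but at least one.
    """
    next_id = 0
    while next_id < n_partitions:
        remaining = n_partitions - next_id
        size = max(1, remaining // (2 * n_peers))
        yield set(range(next_id, next_id + size))
        next_id += size


def tasks_for(batch, n_tasks, n_partitions):
    """Indices of the locally generated tasks that belong to the set S = batch."""
    return [t for t in range(n_tasks) if t % n_partitions in batch]


batches = list(decreasing_batches(16, 2))
sizes = [len(b) for b in batches]
```

Because every peer generates the full task list locally, only the small set S needs to travel over the network, which matches the remark above about minimizing transferred data.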
As a runtime optimization, only one thread per peer becomes responsible for asking for more partitions when the peer runs out of work. Once this thread receives a new partition from the master peer, it generates and populates the peer's task queue with new tasks. During the generation of these tasks, the remaining n_threads − 1 threads can begin to process tasks as soon as they are enqueued, without the need to wait for the end of the task generation. This behavior is further discussed in Section 5.5.
3.2. The K-Means Clustering Problem
Clustering analysis plays an important role in different fields, including data mining, pattern recognition, image analysis and bioinformatics [11]. In this context, a widely used and studied clustering approach is K-Means clustering.
Formally, the K-Means clustering problem can be defined as follows. Given a set of n points in a real d-dimensional space, the problem is to partition these n points into k partitions, so as to minimize the mean squared distance from each point to the center of the partition it belongs to. Figure 4 illustrates an instance of this problem.
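The definition above can be written compactly as an optimization problem. The notation is an assumption introduced here for illustration: $P_1, \dots, P_k$ denote the partitions, $x$ a data point in $\mathbb{R}^d$, and $\mu_j$ the center (centroid) of partition $P_j$.

```latex
\min_{P_1, \dots, P_k} \; \frac{1}{n} \sum_{j=1}^{k} \sum_{x \in P_j} \lVert x - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|P_j|} \sum_{x \in P_j} x
```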
Figure 4: An example of K-Means with 5 partitions. (Panels: original data; clustered data with centroids.)
Several distinct heuristics have been proposed to address the K-Means clustering problem [12, 13]. One of the most widely employed is Lloyd's algorithm [14], also known as the K-Means algorithm. This heuristic is based on an iterative strategy that finds a locally minimal solution for the problem. In our work we used this algorithm as a case study. In the following subsections, we first present the sequential version of the algorithm and then we introduce and explain our parallel and distributed versions.
3.2.1. Sequential Algorithm
The sequential version of K-Means is shown in Algorithm 3.3. The main idea is to use the notion of minimum Euclidean distance to iteratively partition the data points. The algorithm takes as input the set of data points, the number of partitions k, and the minimum accepted distance, mindistance, between each point and the partition's center (centroid). Upon completion, the algorithm returns the partitions themselves.
Algorithm 3.3: K-means-Sequential(points, k, mindistance)
    global partitions
    procedure populate()
        for each pnt ∈ points
            do pnt.partition ← nearest(pnt, partitions)
    procedure compute_centroids()
        for each part ∈ partitions
            do part.centroid ← compute_mean(part.population)
    main
        random_populate(partitions, points)
        compute_centroids()
        repeat
            populate()
            compute_centroids()
        until has_changed() and too_far()
        return (partitions)
The sequential K-Means algorithm works as follows. Initially, data points are evenly and randomly distributed among the k partitions, and the initial centroids are computed. Then, the data points are re-clustered into partitions taking into account the minimum Euclidean distance between them and the centroids: points are assigned to the nearest partition. Next, the centroid of each partition is recalculated by taking the mean of all points in the partition, and the whole procedure is repeated until no centroid is changed and every point is farther than the minimum accepted distance.
It is worth noting that this algorithm presents a natural irregularity: at any time during the execution, the number of points within each partition (population) may differ, implying different recalculation times for each partition's centroid.
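A minimal Python sketch of the populate/compute-centroids loop is given below. It makes several simplifying assumptions that are not in the source: the loop stops when no point changes partition (the mindistance criterion of Algorithm 3.3 is left out), an empty partition keeps its previous centroid, and the names `kmeans`, `mean` and `sq_dist` as well as the `seed` and `max_iters` parameters are illustrative. It requires k to be no larger than the number of points.

```python
import random


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iters=100):
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)
    # Even, random initial distribution of the points among k partitions.
    assign = [0] * len(points)
    for rank, idx in enumerate(order):
        assign[idx] = rank % k
    centroids = [None] * k
    for _ in range(max_iters):
        # compute_centroids: mean of the population of each partition.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = mean(members)
        # populate: move each point to the partition with the nearest centroid.
        new_assign = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_assign == assign:
            break  # simplified stopping rule: nothing changed
        assign = new_assign
    return assign, centroids


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(data, k=2)
```

The list comprehension that builds `members` makes the irregularity visible: its cost for partition j is proportional to that partition's current population.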
3.2.2. Multi-threaded Algorithm
The multi-threaded version of the K-Means algorithm is presented in Algorithm 3.4. Compared to the sequential algorithm, it takes an additional parameter, t, that specifies the total number of execution flows. The strategy adopted is to assign to each thread a unique range of points and partitions, and to split the algorithm into two phases. In the first phase, each thread re-clusters its own range of points into the k partitions. In the second phase, each thread works on its own range of partitions in order to recalculate centroids.
Algorithm 3.4: K-means-Multi-Threaded(points, k, mindistance, t)
    global partitions
    procedure do_kmeans(work)
        repeat
            populate(work.first_point, work.last_point)
            compute_centroids(work.first_partition, work.last_partition)
        until has_changed() and too_far()
    main
        random_populate(partitions, points)
        compute_centroids(partitions.first, partitions.last)
        for i ← 0 to (t − 1) do
            work.first_point ← i × num_points ÷ t
            work.last_point ← (i + 1) × num_points ÷ t
            work.first_partition ← i × k ÷ t
            work.last_partition ← (i + 1) × k ÷ t
            spawn_thread(do_kmeans(work))
        wait_every_child_thread()
        return (partitions)
The multi-threaded version of the algorithm still presents some important execution irregularities. Although the ranges of points and partitions are evenly distributed among the working threads, the amount of work for each thread may vary during each iteration, since in the second phase more populated partitions require more operations to have their centroids recalculated.
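The range arithmetic of Algorithm 3.4 and the resulting second-phase imbalance can be illustrated with a short Python sketch. Treating `last_point` and `last_partition` as exclusive upper bounds is an assumption, as are the helper names `work_ranges` and `phase2_work`; the populations used in the example are made up.

```python
def work_ranges(num_points, k, t):
    """Half-open [first, last) ranges of points and partitions for each of t threads."""
    ranges = []
    for i in range(t):
        ranges.append({
            "points": (i * num_points // t, (i + 1) * num_points // t),
            "partitions": (i * k // t, (i + 1) * k // t),
        })
    return ranges


def phase2_work(populations, ranges):
    """Points each thread must sum in phase 2: the population of its partitions."""
    return [
        sum(populations[first:last])
        for first, last in (r["partitions"] for r in ranges)
    ]


ranges = work_ranges(num_points=10, k=4, t=2)
# Hypothetical populations: partition 0 holds most of the points.
load = phase2_work([7, 1, 1, 1], ranges)
```

Both threads own two partitions and five points, yet the first thread has four times as many points to sum in the second phase, which is the irregularity described above.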
3.2.3. Distributed Algorithm
The distributed algorithm described in this section is widely used in practice [11, 15, 16] and a scalability analysis for this algorithm can be found in the work by Rodrigues et al. [17]. Compared to the multi-threaded algorithm, the distributed K-Means algorithm takes an additional parameter p that specifies the number of distributed peers to be used. Each peer spawns t working threads, so the total number of threads equals p × t.
The strategy employed in this algorithm is to first distribute the data points and replicate the data centroids among peers, and then to loop over a two-phase iteration. In the first phase, partitions are populated, as in the multi-threaded algorithm, and in the second phase, data centroids are recalculated. For this recalculation, first each peer uses its local data points to compute partial centroids, i.e., a partial sum of data points and population within a partition. Next, peers exchange partial centroids so that each peer ends up with the partial centroids of the same partitions. Finally, peers compute their local centroids and broadcast them.
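The partial-centroid step can be sketched as follows. This is an illustrative single-process Python sketch, not the authors' implementation: the exchange and broadcast between peers are replaced by a plain function call, and the names `partial_centroids` and `merge_partials` are assumptions. Each peer contributes, per partition, a sum of coordinates and a point count; the centroid is the merged sum divided by the merged count.

```python
def partial_centroids(points, assignment, k):
    """Per-partition coordinate sums and populations for one peer's local points."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point, part in zip(points, assignment):
        counts[part] += 1
        for d in range(dim):
            sums[part][d] += point[d]
    return sums, counts


def merge_partials(partials):
    """Combine the partial centroids of all peers into final centroids.

    A partition with no points in any peer yields None.
    """
    first_sums, first_counts = partials[0]
    k, dim = len(first_counts), len(first_sums[0])
    total_sums = [[0.0] * dim for _ in range(k)]
    total_counts = [0] * k
    for sums, counts in partials:
        for j in range(k):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    return [
        tuple(s / total_counts[j] for s in total_sums[j]) if total_counts[j] else None
        for j in range(k)
    ]


peer_a = partial_centroids([(0, 0), (2, 2)], [0, 0], k=2)
peer_b = partial_centroids([(4, 4), (10, 10)], [0, 1], k=2)
centroids = merge_partials([peer_a, peer_b])
```

Only k sums and k counts per peer need to be exchanged, regardless of how many data points each peer holds.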
One could argue that it would be possible to remove some of the irregularity from the multi-threaded version by using the same partial centroid calculation technique used by the distributed implementation. The multi-threaded version of the algorithm splits the computation into two independent phases: populate partitions and compute centroids of partitions. The main advantage of this approach is that it requires fewer thread synchronization structures than the distributed implementation. However, this multi-threaded implementation may introduce some irregularity into the application. If we instead adopted the technique used by the distributed approach, we could split the overall work into smaller tasks and decrease some of the irregularity. However, the decreased irregularity would be achieved at the expense of a significant increase in the cost of synchronization structures.
Nevertheless, the irregularity itself is closely related to the working data set: if partitions
are too unbalanced, irregularity is strongly present, whereas if points are evenly distributed
among partitions, irregularity is less pronounced. In both the multi-threaded and distributed
algorithms, we work with a uniformly distributed random data set, and thus irregularity is not
strongly present. Considering that, if we adopted the distributed approach in the multi-threaded
implementation, we would further decrease irregularity, but the performance gains obtained by
that strategy would be quickly outweighed by the additional synchronization procedures, causing,
in fact, performance degradation.
3.3. Seismic Wave Propagation
Understanding wave propagation with respect to the structure of the Earth lies at the
core of many analyses, both in the oil and gas industry and for quantitative seismic hazard
assessment. In this paper the earthquake process is described as elastodynamics and we use a
finite-difference scheme for solving the wave propagation problem in elastic media [18]. This
approach was first proposed in 1970 and since then it has been widely employed due to its simple
formulation and implementation. In this section we describe the governing equations and discuss
some of their standard sequential and parallel implementations.
The seismic wave equation in the case of an elastic material is:
\[
\rho \frac{\partial v_i}{\partial t} = \frac{\partial \sigma_{ij}}{\partial j} + F_i \qquad (2)
\]
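Additionally, the constitutive relation in the case of an isotropic medium is:
\[
\frac{\partial \sigma_{ij}}{\partial t} = \lambda \delta_{ij}
\left( \frac{\partial v_x}{\partial x} + \frac{\partial v_y}{\partial y} + \frac{\partial v_z}{\partial z} \right)
+ \mu \left( \frac{\partial v_i}{\partial j} + \frac{\partial v_j}{\partial i} \right) \qquad (3)
\]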
Here, the indices i, j, k represent a component of a vector or tensor field in Cartesian
coordinates (x, y, z), $v_i$ and $\sigma_{ij}$ represent the velocity and stress fields
respectively, and $F_i$ denotes an external source force. $\rho$ is the material density and
$\lambda$ and $\mu$ are the elastic coefficients known as Lamé parameters. A time derivative is
denoted by $\partial/\partial t$ and a spatial derivative with respect to the i-th direction is
represented by $\partial/\partial i$. The Kronecker symbol $\delta_{ij}$ is equal to 1 if i = j
and zero otherwise.
3.3.1. Sequential Algorithm
As mentioned before, the finite-difference method is one of the most popular techniques
to solve the elastodynamics equations and to simulate the propagation of seismic waves [18,
19]. One of the key features of this scheme is the introduction of a staggered grid [20] for the
discretization of the seismic wave equation.
Indeed, classical collocated methods evaluate all the unknowns at the same location over a
regular Cartesian grid, whereas the staggered grid shifts the derivatives by half a grid cell
(Figure 5). The equations are rewritten as a first-order system in time and therefore
the velocity and the stress fields can be simultaneously evaluated at a given time step.
[Figure 5: 3D cell with axes x, y, z; stress components σii, σxy, σxz, σyz and velocity components vx, vy, vz placed at integer and half-integer grid positions.]
Figure 5: Elementary 3D cell of the staggered grid and distribution of the stress (σ) and the velocity (v) components.
The computational procedure is described in Algorithm 3.5. Inside the time step loop, the
first triple nested loop is devoted to the computation of the velocity components, and the second
loop reuses the velocity results of the previous time step to update the stress field. For
instance, the stencil applied for the computation of the velocity component in the x-direction is
given by Equation 4. The superscripts i, j, k are the grid indices along the three spatial
directions, $\sigma^{ijk} = \sigma(i\Delta s, j\Delta s, k\Delta s)$, $\Delta s$ corresponds to
the space step, $\Delta t$ to the time step, and $a_1$, $a_2$ and $a_3$ are three constants.
Algorithm 3.5: Sequential Seismic Wave Propagation Kernel(σ, v)
for x ← 1 to x_dimension do
    for y ← 1 to y_dimension do
        for z ← 1 to z_dimension do
            compute_stress(σxx, σyy, σzz, σxy, σxz, σyz)
for x ← 1 to x_dimension do
    for y ← 1 to y_dimension do
        for z ← 1 to z_dimension do
            compute_velocity(vx, vy, vz)
One particularity of the three-dimensional simulation of seismic wave propagation is that the
computing domain is finite whereas the physical problem is unbounded. Additional numerical
conditions are therefore required to absorb the energy at the artificial boundaries.
At the lateral and bottom edges of the three-dimensional geometry, a specific set of equations is
implemented. For instance, the classical Perfectly Matched Layer (PML) relies on the
implementation of a sponge numerical function that provides exponential attenuation in the
nonphysical region [21]. A fixed size of ten grid points is chosen for the thickness of this layer
(represented in gray in Figure 6) and the CPU-cost ratio observed between a boundary grid point
and a physical domain point varies from two to four.
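\[
\begin{aligned}
v_x^{(i+\frac{1}{2})jk}\left(l+\tfrac{1}{2}\right) ={}& v_x^{(i+\frac{1}{2})jk}\left(l-\tfrac{1}{2}\right) + a_1 F_x^{(i+\frac{1}{2})jk} \\
&+ a_2 \left[ \frac{\sigma_{xx}^{(i+1)jk} - \sigma_{xx}^{ijk}}{\Delta x}
 + \frac{\sigma_{xy}^{(i+\frac{1}{2})(j+\frac{1}{2})k} - \sigma_{xy}^{(i+\frac{1}{2})(j-\frac{1}{2})k}}{\Delta y}
 + \frac{\sigma_{xz}^{(i+\frac{1}{2})j(k+\frac{1}{2})} - \sigma_{xz}^{(i+\frac{1}{2})j(k-\frac{1}{2})}}{\Delta z} \right] \\
&- a_3 \left[ \frac{\sigma_{xx}^{(i+2)jk} - \sigma_{xx}^{(i-1)jk}}{\Delta x}
 + \frac{\sigma_{xy}^{(i+\frac{1}{2})(j+\frac{3}{2})k} - \sigma_{xy}^{(i+\frac{1}{2})(j-\frac{3}{2})k}}{\Delta y}
 + \frac{\sigma_{xz}^{(i+\frac{1}{2})j(k+\frac{3}{2})} - \sigma_{xz}^{(i+\frac{1}{2})j(k-\frac{3}{2})}}{\Delta z} \right]
\end{aligned}
\qquad (4)
\]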
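The structure of Equation 4 (a weighted difference of the two nearest samples minus a weighted difference of the two next-nearest ones) can be illustrated in one dimension. The sketch below is an assumption-laden illustration: the text does not give the values of the constants, so the weights 9/8 and 1/24 used here are the usual fourth-order staggered-grid coefficients, and the function name is hypothetical.

```python
C1 = 9.0 / 8.0   # assumed weight of the nearest pair (not given in the text)
C2 = 1.0 / 24.0  # assumed weight of the next-nearest pair (not given in the text)


def staggered_dx(f, i, dx):
    """Approximate df/dx at position i + 1/2 from f[i-1], f[i], f[i+1], f[i+2]."""
    return (C1 * (f[i + 1] - f[i]) - C2 * (f[i + 2] - f[i - 1])) / dx


# Samples of f(x) = x**3 at x = -1, 0, 1, 2 with a unit grid step.
samples = [x ** 3 for x in range(-1, 3)]
estimate = staggered_dx(samples, 1, dx=1.0)  # derivative at x = 0.5
```

With these weights the estimate matches the exact derivative of the cubic at x = 0.5, which is 0.75. Each velocity update therefore reads four stress samples per direction, which is why the stencil is wide and memory-intensive.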
This numerical kernel leads to several challenges when considering its implementation on
parallel architectures. The load imbalance must be tackled with different strategies adapted
to shared or distributed architectures. This first level of irregularity can be worsened by the
memory-bound nature of this numerical stencil, and advanced strategies must therefore be used
to maximize performance on hierarchical platforms.
3.3.2. Multi-threaded Algorithm
On shared-memory architectures, a popular way to extract parallelism is to exploit the triple
nested loops coming from the three dimensions of the problem under study. This allows a very
straightforward use of OpenMP directives. However, two levels of irregularity should be
considered with this straightforward implementation. Firstly, the imbalance coming from the
absorbing boundary conditions could be partially addressed by using a dynamic schedule across the
loop iterations. This leads to significant improvements in the distribution of the load [22].
Secondly, this solution comes at the expense of introducing irregular data accesses, with higher
NUMA penalties on hierarchical platforms. In this paper we tackled the imbalance by exploiting a
static strategy along with an intelligent memory allocation policy. Basically, we guarantee that
the memory accessed by each thread is allocated close to the thread. This considerably reduces
latencies on NUMA platforms. In the same vein, advanced runtime systems also provide good results
for improving thread and memory mapping [23].
3.3.3. Distributed Algorithm
On distributed memory architectures, most standard parallel implementations of the
elastodynamics equations are based on Cartesian grid decomposition. Although our code is
different, our approach is very similar to that used by Cui et al. [24] and Furumura and Chen
[25]. Our distributed algorithm works by decomposing the computational domain into sub-domains
Di in such a way that each sub-domain is mapped to one peer. Inside each peer, the execution is
nearly identical to that of the multi-threaded case. The difference is that peers need to
communicate with their neighbors to exchange boundary data. Figure 6 shows this decomposition
with 3×3 subdomains with an equal number of grid points in each.
This strategy can be optimized by using non-blocking communications among peers and by
overlapping communications and computations. For instance, we first compute the boundary
grid points located between neighbors. Then, these values are exchanged between neighboring
peers using non-blocking communications. During this exchange phase, each peer computes its
inner points in parallel.
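The ordering of these steps can be sketched with a background task standing in for the non-blocking exchange. Everything in this sketch is hypothetical: `exchange_halo` and `update` are placeholders, and a thread pool is used only to mimic the overlap, not to reproduce the communication layer of the actual code.

```python
from concurrent.futures import ThreadPoolExecutor


def exchange_halo(boundary):
    """Hypothetical stand-in for a non-blocking exchange with a neighbor."""
    return list(boundary)  # pretend the neighbor sent back these values


def update(values):
    """Placeholder computation applied to a set of grid points."""
    return [2 * v for v in values]


def time_step(boundary, inner):
    with ThreadPoolExecutor(max_workers=1) as pool:
        new_boundary = update(boundary)                     # 1. boundary points first
        pending = pool.submit(exchange_halo, new_boundary)  # 2. start the exchange
        new_inner = update(inner)                           # 3. inner points meanwhile
        halo = pending.result()                             # 4. wait for the exchange
    return new_boundary, new_inner, halo


result = time_step([1, 2], [3, 4, 5])
```

The key design point is that step 3 does not depend on the data in flight, so the communication cost is hidden as long as the inner computation lasts at least as long as the exchange.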
[Figure 6: 3×3 grid of subdomains labeled Peer 0 to Peer 8.]
Figure 6: Distributed implementation of the seismic wave propagation kernel with a 3×3 decomposition. The gray
regions represent the irregular absorbing boundary condition layers.
This decomposition also implies an irregular load between peers, due to the absorbing boundary
condition layers. The irregularity of the distributed algorithm is investigated by Tesser et
al. [26]. In this case, quasi-static decomposition, adaptive mesh refinement and parallel mesh
partitioning are standard techniques that aim to balance computations. In this paper we applied
the first of these techniques [27].
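A rough estimate of this imbalance can be obtained by weighting boundary points more heavily than interior points. The sketch below works on a plan view only (it ignores the bottom layer) and assumes a cost ratio of 3, a value inside the two-to-four range reported in Section 3.3.1; the function name and the 90×90 grid are hypothetical.

```python
def subdomain_costs(nx, ny, px, py, layer=10, ratio=3.0):
    """Relative cost of each subdomain in a px-by-py plan-view decomposition.

    Points within `layer` cells of a lateral edge cost `ratio` times an
    interior point. The ratio is an assumed value, not a measured one.
    """
    costs = []
    for by in range(py):
        row = []
        for bx in range(px):
            x0, x1 = bx * nx // px, (bx + 1) * nx // px
            y0, y1 = by * ny // py, (by + 1) * ny // py
            cost = 0.0
            for x in range(x0, x1):
                for y in range(y0, y1):
                    on_edge = (x < layer or x >= nx - layer or
                               y < layer or y >= ny - layer)
                    cost += ratio if on_edge else 1.0
            row.append(cost)
        costs.append(row)
    return costs


costs = subdomain_costs(90, 90, 3, 3)
```

Under these assumptions a corner subdomain costs 1900 units, a side subdomain 1500 and the central one 900, even though all nine hold the same number of grid points.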
3.4. General Considerations
In this section we described the three applications we use as case studies. We presented
TSP, a CPU-bound application, in which only a small amount of data needs to be kept during the
execution. In the multi-threaded algorithm threads rarely communicate, and when they do so they
only exchange small amounts of data. Conversely, Ondes3D is a memory-bound application that
accesses vast amounts of memory during its execution. Threads in the multi-threaded version
often communicate to exchange information about the borders of the current data set on which
they are working. K-Means is halfway between these two applications: the time used for data
accesses is well balanced with computation time. In the multi-threaded version each thread has
its own set of data that is synchronized at the end of each iteration.
Both K-Means and TSP present strong irregularity, which translates to unpredictable execution
times. The irregularity itself is linked to the problem (and to the input), and the chosen
algorithm can only do so much to alleviate this issue during runtime [28]. This kind of
application has a strong need for dynamic load balancing to manage the irregularity. Weak
irregularity is also present in these applications and is normally associated with the chosen
algorithms, data structures, and load-balancing strategy. With varying degrees of effort (and
success) these weak irregularities might be lightened by distinct implementation choices.
Contrary to TSP and K-Means, Ondes3D is an application that presents only weak irregularity.
It is a predictable application in the sense that the total number of operations and
communications is known in advance. For this application, given the input, it is even possible
with some effort to statically perform the load balancing before the actual execution. However,
in practice, this kind of approach seems to be left aside in favor of a dynamically load-balanced
stencil-based approach. In this kind of solution the irregularity arises from the data and their
shape (at the border of the computation domain and inside the computation domain) and not from
the load balancer. The load balancer is, in fact, responsible for trying to ensure a more regular
execution.
By taking into consideration these three applications and the three distinct hardware
architectures used in our experimental evaluations, we can draw comprehensive (yet
straightforward) conclusions about execution performance and energy efficiency of these
applications on the distinct hardware platform classes.
Manycores have several distinctive features that must be considered so that applications can
achieve good performance. In the next section we present the adaptations we made to the
applications we just presented in order to achieve an efficient execution.
4. Adapting Irregular Applications for Manycores
The adaptation of existing distributed applications to manycores such as the MPPA-256 can
be straightforward, as in the case of K-Means. On the other hand, some applications, such as
TSP and Ondes3D, demand much more effort. In this section we describe the adaptations these
applications had to go through to efficiently execute on the MPPA-256 manycore.
4.1. TSP
Section 3.1 presented some insights into the algorithms for the resolution of the TSP. However,
efficiently passing from multicores to manycores might be a nontrivial task. There are several
reasons for that, the most evident being the natural architectural differences between these
platforms. These differences usually force us to make adaptations to the code. In this section, we
discuss these architectural aspects and adaptations as well as the rationale behind them.
POSIX threads are supported by all experimental platforms. This allowed us to execute
the application on Xeon E5, Altix UV 2000, and Exynos 5 using exactly the same code. In our
implementation, the global variable min_path, defined by Algorithm 3.1, is implemented using a
simple shared variable that is accessed by every thread. The function atomic_update_if_less() is
therefore implemented using a regular POSIX lock.
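The locking scheme described above can be sketched as follows. This is only an analogue of the POSIX-lock implementation: it uses Python's `threading.Lock`, and the class name `MinPath` and the tour lengths are hypothetical.

```python
import threading


class MinPath:
    """Shared bound on the length of the shortest tour found so far."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = float("inf")

    def atomic_update_if_less(self, candidate):
        """Store candidate if it improves the bound; return True when it did."""
        with self._lock:
            if candidate < self.value:
                self.value = candidate
                return True
            return False


# Four hypothetical worker threads report the tour lengths they found.
min_path = MinPath()
workers = [threading.Thread(target=min_path.atomic_update_if_less, args=(length,))
           for length in (42, 17, 99, 23)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
```

The comparison and the store happen under the same lock, so the final value is the smallest reported length regardless of the order in which the threads run.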
Unfortunately, this common solution is not appropriate for the MPPA-256 platform since it
does not have coherent caches. Although the update of min_path works as it should
(on the MPPA-256 platform the POSIX lock implementation invalidates the whole cache) and the
final path length is correct, each one of the worker threads might be using a stale value of the
min_path variable for a long time (in the worst case until the end of its execution) and wasting
time on fruitless branches of the search tree. This means that, although correct, the execution
might be severely slowed down. To correct this, we used platform-specific instructions that
give us direct access to the local memory of the cluster, bypassing the cache (__builtin_k1_lwu
and __builtin_k1_swu to load/store data from/to the local memory, respectively). The cost
of reading the variable in this manner is clearly higher than using the value stored in the caches
(reading from memory takes 8 cycles whereas reading from cache takes at most 2 cycles). Yet,
the performance improvement due to the better pruning of the search tree largely outweighs the
additional cost.
In order to efficiently exploit the MPPA-256 platform, we needed to use every cluster of the
chip. These clusters do not have a global memory space, hence the need for the distributed version
of the algorithm. Conversely, the Altix UV 2000 platform has a global memory space. However, as
the communications between the NUMA nodes go through the NUMAlink6 interconnection,
we can make better use of this system by keeping the memory near the threads that use it and
by using the link only for global synchronizations and min_path propagation.
The distributed algorithm fits perfectly in this scenario.
In general, the distributed algorithm used by MPPA-256 and Altix UV 2000 is the same. Peers on
the MPPA-256 platform take the form of compute clusters, while on the Altix UV 2000 platform each
peer is represented by a NUMA node. The difference lies in the implementation of the min_path
broadcast and the task distribution. On the Altix UV 2000 platform, the implementation is based
on shared memory using locks and condition variables. On the other hand, the implementation
for the MPPA-256 platform is more complex. Since there is no shared memory between clusters,
we employ asynchronous message exchanges. These message exchanges take the form of remote
memory write operations. This can be done using proprietary MPPA-256 low-level system calls
that allow a thread in a cluster to write to the memory of any other cluster on the chip. In both
cases, the local value of the min_path variable is updated atomically. However, due to the time
needed to broadcast a new value, some threads might use a stale value for a short time until the
broadcast is completed.
4.2. K-Means
Section 3.2 presented the K-Means problem as well as three different approaches to solve
it. In this section we discuss how the solution of this problem was adapted to the manycore
architecture used in this work.
Xeon E5, Altix UV 2000 and Exynos 5 are platforms in which all cores have access to a global
shared memory space. Additionally, these platforms support OpenMP. Therefore, for these
platforms we employed the multi-threaded solution presented in Algorithm 3.4 using OpenMP for
parallelization. Unfortunately, this same solution is not appropriate for the MPPA-256 platform.
Even though MPPA-256 supports OpenMP, cores in the MPPA-256 platform are grouped into 16-core
clusters. Cores in the same cluster have access to the local shared memory but have no
access to memory present on the remaining clusters. For this reason we had to adopt the
distributed version of the application in order to exploit the full computational power provided
by this platform.
Although the distributed algorithm presented in Section 3.2.3 is more appropriate for the
MPPA-256 platform than the multi-threaded version, it has some characteristics that limit its
direct use on this platform. The local memory available to each cluster (2 MB, of which 500 KB are
used by the operating system) strongly constrains the number of points that can be handled by
each cluster. Even though the 32 MB (16 × 2 MB) of memory available in the computing clusters
could store a reasonably sized workload, a static distribution of points at the initialization of
the algorithm would totally disregard the 2 GB of memory available at the I/O subsystem.
Therefore, we employed a dynamic solution for the distribution of points so as to work with a
number of points that is only limited by the amount of memory available at the I/O subsystem.
In order to do so, we implemented a variation of the distributed algorithm using a dynamic
one-level tiling strategy. In this solution the I/O subsystem keeps a copy of all the points and
partitions. At each iteration, during the populate partitions and compute centroids phases, each
computing cluster repeatedly downloads chunks of points from the I/O subsystem. These chunks
are small enough to fit into the available local memory. After these points are processed, they
are discarded to make space for the next chunk. This download/process/discard process is repeated
at each iteration until all points are processed. At this point the results for the current
iteration are uploaded to the I/O subsystem. Then, the I/O subsystem broadcasts the partial
results to every computing cluster and the next iteration begins.
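The download/process/discard loop can be sketched as a stream of chunks that updates running sums. This is a simplified, hypothetical illustration: a Python list stands in for the memory of the I/O subsystem, the two phases are fused into a single pass, and the function names are not taken from the actual code.

```python
def iterate_chunks(points, chunk_size):
    """Yield successive chunks, standing in for downloads from the I/O subsystem."""
    for start in range(0, len(points), chunk_size):
        yield points[start:start + chunk_size]


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return dists.index(min(dists))


def one_iteration(points, centroids, chunk_size):
    """Process all points chunk by chunk, keeping only running sums resident."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for chunk in iterate_chunks(points, chunk_size):
        for point in chunk:
            c = nearest(point, centroids)
            counts[c] += 1
            for d in range(dim):
                sums[c][d] += point[d]
        # The chunk is dropped here before the next one is fetched.
    return sums, counts


points = [(0, 0), (1, 0), (9, 9), (10, 10), (0, 1)]
sums, counts = one_iteration(points, [(0, 0), (10, 10)], chunk_size=2)
```

Only one chunk plus the per-partition sums and counts need to fit in local memory, so the total number of points is bounded by the storage that feeds the chunks rather than by the cluster memory.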
[Figure 7: full domain (first level of tiling) held on the I/O subsystem (2 GB); sub-domain (second level of tiling, planes K-1, K, K+1) transferred to a cluster (< 2 MB); parallel 2D stencil computed with OpenMP (4x4) on 16 PEs.]
Figure 7: Multi-level tiling strategy to exploit the memory hierarchy of MPPA-256.
4.3. Seismic Wave Propagation
Performing stencil computations on the MPPA-256 processor is a challenging task. This class
of numerical kernels has a high demand for memory bandwidth. This makes the efficient
use of the low-latency memories distributed among compute clusters indispensable. In contrast
to standard x86 processors, in which it is not uncommon to find last-level cache sizes of tens of
megabytes, the MPPA-256 has only 32 MB of low-latency memory divided into 2 MB chunks
spread throughout the 16 compute clusters.
The 3D data required for seismic wave modeling do not fit in those low-latency memories.
Therefore, we need to design efficient master-to-slave and slave-to-master communications to
make use of the 2 GB of memory available on the I/O subsystem and carefully overlap
communications with computations to mask communication costs. We implement a two-level algorithm
that decomposes the problem with respect to the memory available on both the I/O subsystem
and the compute clusters. Figure 7 shows the algorithm.
The three-dimensional structures corresponding to the velocity and stress fields are allocated
on the I/O subsystem to maximize the problem size that can be simulated. Next, we divide the
global computational domain into several subdomains corresponding to the number of compute
clusters involved in the computation. This decomposition is performed along the horizontal
direction, providing a first level of data parallelism. To respect the width of the stencil
(fourth-order), we maintain an overlap of two grid points in each direction. These regions, called
ghost zones, are updated at each stage of the computation with point-to-point communications
between neighboring clusters. This decomposition is rather similar to the description provided in
Section 3.3.3. Unfortunately, this first level of decomposition is not sufficient, as
three-dimensional tiles do not fit into the 2 MB of memory available on each compute cluster.
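The index ranges produced by this first level of decomposition can be sketched along a single horizontal axis. The two-point overlap comes from the text; the function name, the clipping at the domain edges and the 128-point example are assumptions made for illustration.

```python
GHOST = 2  # overlap in grid points on each side, for the fourth-order stencil


def subdomains_with_ghosts(n, clusters, ghost=GHOST):
    """Split n grid points along one axis into per-cluster index ranges.

    Returns a list of (owned, stored) half-open ranges. `owned` is the part a
    cluster updates; `stored` adds the ghost zones, clipped at the domain edges.
    """
    result = []
    for c in range(clusters):
        first, last = c * n // clusters, (c + 1) * n // clusters
        stored = (max(0, first - ghost), min(n, last + ghost))
        result.append(((first, last), stored))
    return result


layout = subdomains_with_ghosts(128, 4)
```

For 128 points and 4 clusters, the second cluster owns indices 32 to 63 but stores 30 to 65; the four extra columns are the values refreshed through point-to-point communication with its neighbors.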
A second level of decomposition is therefore required. This is performed along the vertical
direction as we tile each three-dimensional subdomain into 2D slices. This leads to a significant
reduction in memory consumption for each cluster but requires maintaining a good balance
between computation and communication. Indeed, the procedure relies on a sliding window
algorithm that traverses the 3D domains using 2D planes and overlaps data transfers with
computations. This can be viewed as an explicit prefetching mechanism, as the 2D planes required
for the computation at one step are brought to the clusters during the computation performed at
previous steps. Additionally, this vertical tiling strategy allows us to benefit from the
symmetry of the domain in the horizontal directions. The costly absorbing boundary condition grid
points located at the bottom of the domain are therefore evenly distributed among the computing
clusters.
The number of planes prefetched in advance is configurable and its maximum value depends
on the problem dimensions and the amount of available memory on each compute cluster.
To better exploit the NoC, we carefully select the NoC node on the I/O subsystem with which the
compute cluster will communicate. This choice is based on the NoC topology and aims at reducing
the number of hops necessary to deliver a message. Moreover, the prefetching scheme also
allows us to send fewer messages containing more data, which we found empirically to be
more efficient than sending several messages of smaller size. OpenMP directives are employed
by clusters to compute 2D problems with up to 16 PEs in parallel.
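The sliding window with a configurable prefetch depth can be sketched with a double-ended queue. This is a schematic illustration under simplifying assumptions: each step consumes a single plane (a real stencil also needs its neighboring planes), the fetch is synchronous, and all names are hypothetical.

```python
from collections import deque

_DONE = object()  # sentinel marking the end of the plane stream


def sliding_window(planes, depth):
    """Yield (current, prefetched) pairs while walking through the planes.

    `planes` stands in for a 3D subdomain stored on the I/O subsystem and
    `depth` for the configurable number of planes fetched ahead of time.
    """
    source = iter(planes)
    window = deque()
    # Initial fill: the first plane plus `depth` prefetched ones.
    for plane in source:
        window.append(plane)
        if len(window) == depth + 1:
            break
    while window:
        current = window.popleft()     # plane whose computation starts now
        prefetched = tuple(window)     # planes already resident on the cluster
        nxt = next(source, _DONE)      # transfer issued during the computation
        if nxt is not _DONE:
            window.append(nxt)
        yield current, prefetched


steps = list(sliding_window(range(5), depth=2))
```

At any time the cluster holds at most `depth + 1` planes, which is how the depth trades local memory for the ability to hide transfer latency.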
5. Experimental Results
In this section, we present performance and energy efficiency evaluations for the experimental
platforms. These evaluations were conducted by executing parallel and distributed versions
of the presented applications. We begin by introducing our energy consumption measurement
methodology along with the metrics used to analyze the results on all platforms. Then, we
compare their energy and computing performance.
5.1. Measurement Methodology
We use two important metrics to compare the energy and computing performance of different
multicore and manycore platforms: time-to-solution and energy-to-solution. Time-to-solution is
the time spent to reach a solution for a given problem. In our case, this is the overall execution
time of the parallel/distributed version of the applications. Energy-to-solution is the amount of
energy spent to reach a solution for a problem. Thus, the ratio between energy-to-solution and
time-to-solution yields the average power consumed during the application execution.
Table 1 lists the average power consumed by each one of the platforms used in our experiments
during the execution of the parallel and distributed versions of the applications. Even
though the Altix UV 2000 features 24 Xeon E5 processors, it consumes less than 24 times the
power observed on Xeon E5. This is expected because Xeon E5 runs a multi-threaded
version of the applications whereas Altix UV 2000 runs their distributed counterparts.
Distributed versions experience periods of low processor usage as, for example, those during the
task request/response cycle and those related to load imbalance. This will be further discussed in
Section 5.5.
           Xeon E5    Altix UV 2000    Exynos 5    MPPA-256
TSP        67.9 W     1,418.4 W        5.3 W       8.3 W
K-Means    61.5 W     1,420.3 W        5.2 W       9.6 W
Ondes3D    57.5 W     1,353.0 W        4.6 W       8.4 W
Table 1: Average power consumption of the 4 processors while running the applications.
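The relation between the two metrics and the power ratio between Altix UV 2000 and Xeon E5 can be checked with a few lines of arithmetic. The power values below are copied from Table 1; the helper function is a hypothetical name introduced only for this illustration.

```python
# Average power in watts, copied from Table 1.
power_w = {
    "TSP":     {"Xeon E5": 67.9, "Altix UV 2000": 1418.4},
    "K-Means": {"Xeon E5": 61.5, "Altix UV 2000": 1420.3},
    "Ondes3D": {"Xeon E5": 57.5, "Altix UV 2000": 1353.0},
}

# How many times more power the 24-processor machine draws.
ratios = {app: p["Altix UV 2000"] / p["Xeon E5"] for app, p in power_w.items()}


def energy_to_solution(average_power_w, time_to_solution_s):
    """Energy in joules, from the definition: average power = energy / time."""
    return average_power_w * time_to_solution_s
```

The ratios are roughly 20.9 for TSP, 23.1 for K-Means and 23.5 for Ondes3D, all below the factor of 24 that the processor count alone would suggest.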
The power consumed by each processor was obtained using the same approach. Both Xeon
E5 and Altix UV 2000 feature the Intel Sandy Bridge microarchitecture, which has Running Average
Power Limit (RAPL) energy sensors. This allows us to measure the power consumption of
CPU-level components through Machine-Specific Registers (MSRs). We used this approach to
obtain the energy consumption of the whole CPU package including cores and cache memory
(named the RAPL PKG domain). Similarly, MPPA-256 and Exynos 5 also possess hardware sensors
to measure the power consumption of the entire chip. Power measurements using this approach are
very accurate, as shown in [29, 30].
           Small                           Medium                          Large
TSP        16 cities                       18 cities                       20 cities
K-Means    16,384 points, 512 centroids    32,768 points, 512 centroids    131,072 points, 512 centroids
Ondes3D    16x16x16 grid points            48x64x48 grid points            128x128x128 grid points
Table 2: Problem sizes.
We also defined three input problem sizes for all applications (Table 2). These problem sizes
were chosen based on the execution time on all platforms and the amount of memory needed. For
instance, we used a small problem size when running the applications with low thread counts in
order to obtain the results in a reasonable time¹. Each experiment was repeated as many times as
needed to ensure a relative error below 2% with 95% statistical confidence using Student's
t-distribution.
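A stopping rule of this kind can be sketched as follows. The exact definition of relative error used in the experiments is not given here, so this sketch assumes it is the half-width of the 95% confidence interval divided by the sample mean; the function names, the run limits and the sample values are hypothetical.

```python
import math
import statistics

# Two-sided 95% Student's t critical values, indexed by degrees of freedom.
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
        6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def relative_error(samples):
    """Half-width of the 95% confidence interval divided by the sample mean."""
    n = len(samples)
    mean = statistics.mean(samples)
    half_width = T_95[n - 1] * statistics.stdev(samples) / math.sqrt(n)
    return half_width / mean


def run_until_stable(measure, target=0.02, min_runs=3, max_runs=11):
    """Repeat `measure` until the relative error drops to the target."""
    samples = [measure() for _ in range(min_runs)]
    while relative_error(samples) > target and len(samples) < max_runs:
        samples.append(measure())
    return samples


# Hypothetical execution times in seconds, replayed in order.
timings = iter([10.0, 10.1, 9.9, 10.0, 10.05])
samples = run_until_stable(lambda: next(timings))
```

With these hypothetical timings, three runs give a relative error of about 2.5%, so a fourth run is taken, after which the error falls to about 1.3% and the loop stops.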
5.2. Overall Results
Figure 8 compares both time-to-solution (right y-axis) and energy-to-solution (left y-axis)
metrics on all processors. Since we used every core of each processor in these experiments, we
executed the applications with large problem sizes.
[Figure 8: three panels (TSP, K-Means, Ondes3D); left y-axis: energy-to-solution (kJ); right y-axis: time-to-solution (s); platforms: Xeon E5 (8 cores), Altix UV 2000 (192 cores), Exynos 5 (4 cores), MPPA-256 (256 cores).]
Figure 8: Time and energy-to-solution comparison between multicore, NUMA and manycore processors.
¹ The large problem size along with very low thread counts takes several hours on embedded processors due to their
low clock frequency.
Time-to-Solution. As expected, applications on Exynos 5 presented the highest execution
times among all platforms, being from 6.7x (TSP) up to 8.1x (Ondes3D) slower than Xeon E5.
The reason for that is threefold: (i) it has a considerably lower clock frequency than Xeon E5;
(ii) Xeon E5 is a performance-centric processor that is tuned far more for speed than for low
power consumption; and (iii) Xeon E5 profits from its higher parallelism, since all applications
scale considerably well as we increase the number of threads. MPPA-256 presented better execution
times than Xeon E5 on TSP and K-Means, being 1.6x and 1.5x faster respectively. Even though
the clock frequency of MPPA-256 PEs is lower than that of the Xeon E5 cores, this embedded
processor achieved better performance. Once again, this is due to the inherent characteristics
of these applications. On TSP, peers only need to broadcast data when a new shortest path is
found. On K-Means, peers communicate more often but this application still performs more
computation than communication.
An optimized implementation of the seismic wave propagation algorithm was used as the
baseline for our evaluations. As detailed in Section 3.3, the shared-memory implementation
relies on efficient data and thread mapping strategies in order to reduce both the NUMA
penalty and the load imbalance. It is well known that stencil-based computations such as the
finite-difference method applied to seismic wave propagation achieve a low fraction of the peak
performance on standard processors such as x86. This is mainly due to the huge demand for memory
bandwidth typical of this class of algorithms. On average, 30% of peak performance is reported
for such implementations [31]. A detailed characterization of this behavior, taking into
consideration both the architecture and the algorithms, is given by the roofline model [32].
Nonetheless, a more detailed discussion of the peak performance on the MPPA-256 architecture
would require revisiting the roofline model, which is outside the scope of this paper. Our
analysis confirmed our expectation that an important share of the Ondes3D execution time is spent
in communications. Although the prefetching scheme considerably hides the communication costs on
MPPA-256, the latency and bandwidth of the NoC still hurt its performance, resulting in an
execution time approximately 10% worse on MPPA-256 compared to Xeon E5. Not surprisingly, the
Altix UV 2000 platform presented the best execution times, since it has 24 performance-optimized
general-purpose multicore processors. We further discuss the scalability results on Altix UV 2000
and MPPA-256 in Section 5.4.
Energy-to-Solution. Both Exynos 5 and MPPA-256 presented better energy-to-solution than
the other platforms. However, the low degree of parallelism available on the ARM processor
was a clear disadvantage for Exynos 5. Even though this processor consumes less power than
the others, it ends up executing the applications for a longer period of time. This results
in a higher energy consumption compared to MPPA-256. Overall, MPPA-256 achieved the best
energy-to-solution results, reducing the energy consumed by the other platforms on TSP, K-Means
and Ondes3D by at least 6.9x, 6.5x and 3.8x, respectively.
5.3. Energy Efficiency
In the previous section, we showed that MPPA-256 presented the best energy-to-solution results
among all platforms. The main reason is that MPPA-256 offers high parallelism and yet
has a low power consumption. In this section, we look in more detail at the energy
efficiency of all platforms when we vary the number of cores. We first compare the
energy-to-solution of all applications when varying the number of cores from 1 to the maximum
number of cores available in each processor (Figure 9a). In other words, we compare the
energy-to-solution obtained with a single processor of Altix UV 2000 (which is actually the Xeon
E5), Exynos 5 and a single compute cluster of MPPA-256 (in this case, we vary the number of PEs).
For these tests,
21
0.0 
0.5 
1.0 
1.5 
2.0 
2.5 
3.0 
3.5 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
En
er
gy
-to
-S
ol
ut
io
n 
(k
J)
 
Number of cores 
Altix UV 2000 MPPA-256 Exynos 5 
TS
P
K-
M
ea
ns
En
er
gy
-to
-s
ol
ut
io
n 
(k
J)
Number of peersNumber of cores
On
de
s3
D
0.0 
0.5 
1.0 
1.5 
2.0 
2.5 
3.0 
3.5 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
Number of cores 
0.0 
0.2 
0.4 
0.6 
0.8 
1.0 
1.2 
1.4 
2 3 4 5 6 7 8 9 10 11 12 
Number of peers 
0.0 
2.0 
4.0 
6.0 
8.0 
10.0 
12.0 
14.0 
13 14 15 16 17 18 19 20 21 22 23 24 
Number of peers 
0.0 
0.2 
0.4 
0.6 
0.8 
1.0 
1.2 
1.4 
1.6 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
Number of cores 
0.0 
0.5 
1.0 
1.5 
2.0 
2.5 
3.0 
3.5 
4.0 
2 3 4 5 6 7 8 9 10 11 12 
Number of peers 
0.0 
5.0 
10.0 
15.0 
20.0 
25.0 
30.0 
35.0 
40.0 
45.0 
13 14 15 16 17 18 19 20 21 22 23 24 
Number of peers 
(a) Small (b) Medium (c) Large
0.0 
1.0 
2.0 
3.0 
4.0 
5.0 
6.0 
7.0 
8.0 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
Number of cores 
0.0 
0.5 
1.0 
1.5 
2.0 
2.5 
3.0 
3.5 
2 3 4 5 6 7 8 9 10 11 12 
Number of peers 
0.0 
5.0 
10.0 
15.0 
20.0 
25.0 
30.0 
35.0 
40.0 
45.0 
13 14 15 16 17 18 19 20 21 22 23 24 
Number of peers 
Figure 9: Energy-to-solution comparison on all platforms with three problem sizes.
we used a small problem size due to time constraints. Then, we compare the energy-to-solution611
on Altix UV 2000 and MPPA-256 when varying the number of peers (i.e., processors on Altix UV612
2000 and compute clusters on MPPA-256), while using the maximum number of cores available613
on each Altix UV 2000 processor (8 cores) and MPPA-256 clusters (16 PEs). We used a medium614
problem size to compare the energy-to-solution from 2 to 12 peers (Figure 9b) and a large prob-615
lem size for more than 12 peers (Figure 9c). Note that the MPPA-256 architecture is limited to 16616
peers, therefore we only show the results for more than 16 peers on Altix UV 2000.617
Varying the Number of Cores. Exynos 5 achieved the best energy-to-solution for the small problem size in all applications (Figure 9a). The reasons are twofold. First, Exynos 5 is the least power-hungry processor among all the experimental platforms. Second, the problem size used in this experiment is too small to scale past 4 cores. When we compare the energy efficiency of a single Altix UV 2000 processor against a single MPPA-256 cluster, we notice that Altix UV 2000 outperformed MPPA-256 with low core counts on TSP and K-Means. For more than 8 cores, however, the MPPA-256 cluster outperformed the Altix UV 2000 processor. This comes from the fact that the power consumed by a single Altix UV 2000 processor increased considerably as we increased the number of cores in use, whereas the power consumed by a single MPPA-256 cluster remained practically unchanged. The only exception occurred on Ondes3D. In this case, MPPA-256 consumed much more energy than the Altix UV 2000 processor because communications on MPPA-256 could not be overlapped with computations using small problem sizes on Ondes3D. Moreover, the NoC bandwidth achieved in this case is poor, since we only perform 1-to-1 communications between the I/O subsystem and a single cluster.
Varying the Number of Peers. The gap between the energy consumed by Altix UV 2000 and MPPA-256 widened as we increased the number of peers. From 2 to 12 peers (Figure 9b), MPPA-256 consumed at least 2.3x less energy than Altix UV 2000. This gap was even larger from 13 to 16 peers with a large problem size (Figure 9c): in this case, MPPA-256 consumed on average ~11x less energy than Altix UV 2000. Once again, the rationale comes from the high energy cost associated with the Altix UV 2000 processors: adding one Xeon E5 processor increases the overall power consumption of Altix UV 2000 by ~60 W on average, whereas adding one MPPA-256 cluster increases the overall power consumption of MPPA-256 by ~0.3 W.
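As a rough illustration of how these per-peer increments accumulate, the sketch below applies the two approximate figures quoted above (~60 W and ~0.3 W) to a growing number of peers. It assumes the increment is constant for every added peer, which is a simplification; the function name `added_power` is hypothetical, and baseline (idle) power is deliberately left out because it is not given here.

```python
XEON_E5_DELTA_W = 60.0      # approximate increase per added Xeon E5 processor
MPPA_CLUSTER_DELTA_W = 0.3  # approximate increase per added MPPA-256 cluster


def added_power(delta_w: float, peers_from: int, peers_to: int) -> float:
    """Extra power drawn when growing from peers_from to peers_to peers,
    assuming a constant increment per peer (a simplification)."""
    return delta_w * (peers_to - peers_from)


# Growing from 2 to 12 peers, the range covered by the medium problem size.
altix_extra_w = added_power(XEON_E5_DELTA_W, 2, 12)
mppa_extra_w = added_power(MPPA_CLUSTER_DELTA_W, 2, 12)
print(f"Altix UV 2000: +{altix_extra_w:.1f} W, MPPA-256: +{mppa_extra_w:.1f} W")
```

Under this linear assumption, the incremental power on Altix UV 2000 is about two hundred times that on MPPA-256 for the same number of added peers.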
5.4. Scalability
So far, we have only compared the energy-to-solution of MPPA-256 and Altix UV 2000, showing that the former consumed far less energy than the latter to solve the same problems. Figure 10 illustrates the time-to-solution gap between them for a medium problem size when considering an equal number of resources (peers), as well as the comparative speedup between the architectures. The speedup calculation was based on the effect that an increase in the number of peers has on performance. For that reason, and to maintain consistency throughout our comparisons, we employed as the baseline the execution time of the multi-threaded algorithms using a single peer. In other words, we compared the performance of the distributed algorithm using different numbers of fully utilized peers to that of a parallel version using all the resources of a single peer, i.e., with no inter-peer communications. We measure, therefore, the scalability of the distributed version of the algorithms and not that of the multi-threaded version. A detailed scalability analysis of the multi-threaded algorithms can be found in the base works presented in Section 3.
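The speedup definition above, with the single-peer multi-threaded run as the baseline, can be sketched in Python as follows. The execution times are hypothetical placeholders, and the parallel-efficiency helper is an addition for illustration that is not used elsewhere in this paper.

```python
from typing import Dict


def speedup(times_s: Dict[int, float]) -> Dict[int, float]:
    """Speedup of each run relative to the single-peer run (key 1)."""
    baseline = times_s[1]
    return {peers: baseline / t for peers, t in sorted(times_s.items())}


def efficiency(times_s: Dict[int, float]) -> Dict[int, float]:
    """Speedup divided by the number of peers (1.0 means linear scaling)."""
    return {peers: s / peers for peers, s in speedup(times_s).items()}


# Hypothetical execution times in seconds, indexed by number of peers.
times = {1: 120.0, 2: 62.0, 4: 32.0, 8: 17.0}
eff = efficiency(times)
for peers, s in speedup(times).items():
    print(f"{peers:2d} peers: speedup {s:.2f}, efficiency {eff[peers]:.2f}")
```

Because the baseline already uses every core of one peer, a speedup of 1.0 corresponds to the multi-threaded version, and any loss with more peers reflects inter-peer communication and load distribution only.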
Overall, the distributed version of the applications scaled considerably well and execution times showed similar trends on both platforms. However, Altix UV 2000 was from 9x up to 13x faster than MPPA-256. This result was expected, since peers are processors running at full speed (2.4 GHz) on Altix UV 2000, whereas they are blocks (compute clusters) of the MPPA-256 processor running at 400 MHz. In other words, we are comparing sets of entire processors on Altix UV 2000 against subsets of a single MPPA-256 processor. We also observed similar performance gaps with other problem sizes.
The near-linear speedups of TSP and K-Means on both architectures show that, although the actual implementations of the evaluated applications were adapted to accommodate each platform’s idiosyncrasies, they in fact display good and similar scalability. The exception of Ondes3D can be explained by the amount of communication performed by this algorithm. While TSP and K-Means are CPU-bound and communicate at regular but not so frequent intervals, communication in Ondes3D is much more intensive. The weak scalability past six peers demonstrates the toll imposed by these communications on the NUMA interconnections of Altix UV 2000 and on the NoC of MPPA-256.
Figure 10: Time-to-solution and speedup comparison between Altix UV 2000 and MPPA-256. Columns: TSP, K-Means and Ondes3D. Top row: time-to-solution (s, logarithmic axis from 1 to 1000) versus number of peers (2 to 16). Bottom row: speedup (0 to 16) versus number of peers (2 to 16).
Moreover, in order to avoid NUMA effects on the Altix UV 2000 platform and ensure good execution performance, we had to employ some additional runtime optimizations. For all applications we employed thread pinning [33]. Since TSP was implemented using POSIX threads, we used Linux-specific system calls to ensure that threads would not be migrated during the execution of the application. K-Means and Ondes3D used OpenMP to implement parallelism; in these cases we therefore employed the GOMP_CPU_AFFINITY environment variable to ensure that threads were correctly bound to the available cores. Additionally, we modified the initialization phase of the applications so that the first-touch strategy (Linux’s default) could suitably place the allocated memory on the NUMA nodes. To that end, each application makes each thread initialize its own private region of memory, either with the actual value or with a dummy value when initialization had to be centralized.
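The two pinning mechanisms mentioned above can be illustrated with a short Python sketch. It is not the code used in the applications: `os.sched_setaffinity` stands in for the unspecified Linux system calls, the helper names and the CPU range are hypothetical, and first-touch memory placement is not shown because Python does not expose page placement.

```python
import os


def openmp_affinity_env(first_cpu: int, n_cpus: int) -> dict:
    """Copy of the environment with GOMP_CPU_AFFINITY set to a CPU range."""
    env = dict(os.environ)
    env["GOMP_CPU_AFFINITY"] = f"{first_cpu}-{first_cpu + n_cpus - 1}"
    return env


def pin_caller(cpu: int) -> set:
    """Restrict the caller to one CPU and return the resulting mask (Linux)."""
    os.sched_setaffinity(0, {cpu})  # pid 0 refers to the caller
    return os.sched_getaffinity(0)


# Environment for a hypothetical GNU OpenMP binary bound to cores 0 to 7.
env = openmp_affinity_env(first_cpu=0, n_cpus=8)
print(env["GOMP_CPU_AFFINITY"])

if hasattr(os, "sched_setaffinity"):    # not available on every OS
    original = os.sched_getaffinity(0)  # CPUs the caller may currently use
    target = min(original)
    pinned = pin_caller(target)
    print("pinned to", pinned)
    os.sched_setaffinity(0, original)   # restore the original mask
```

The returned environment would be passed to the process that launches the OpenMP program, for example through the `env` argument of `subprocess.run`.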
5.5. Irregularity
In the last sections, we analyzed the energy-to-solution and the scalability of the distributed versions of all applications on both Altix UV 2000 and MPPA-256. The results showed that our distributed solutions scaled well, which indicates that the inherent irregularities of each application were satisfactorily handled.
However, Figure 9b and Figure 9c reveal some points where the energy-to-solution of the TSP increases abruptly (e.g., from 6 to 7 and from 13 to 14 peers). In these cases, the addition of a peer incurred performance losses (higher execution times). In order to investigate this peculiar behavior, we traced the execution of the distributed version of TSP. Figure 11 shows the execution traces obtained on Altix UV 2000 while running the TSP with 14 peers.
Figure 11: Execution traces of TSP on Altix UV 2000. Panels: (a) Total, aggregated per peer (0 s to 52 s); (b) Start, threads of Peer 3 (0 s to 0.3 s); (c) End, threads of Peer 4 (35 s to 52 s). Legend: Task Generation, Computation, Synchronization.
Figure 11a shows a global view of the execution aggregated per peer. At the beginning (Figure 11b), one thread in each peer asks a master peer for partitions and starts populating the local pool of tasks. As tasks become available, other threads in the same peer can start the computation. Once the thread assigned to populate the local pool of tasks finishes its job, it also starts the computation. Afterwards, as the peers run out of work, they ask the master peer for more partitions. This strategy works fairly well throughout most of the execution. However, in some cases the load may become imbalanced at the end of the execution. Figure 11c shows what happens inside the peers. In this case, only Peer 4 keeps processing for a long time while the other peers are running out of tasks. This happens because the last task assigned to Thread 0 of Peer 4 takes much longer to complete. This problem can be reduced with a work-stealing strategy inside each peer, so that threads can steal segments of this big task to improve parallelism. However, we leave this optimization for future work.
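The effect of a single long task at the end of the execution can be illustrated with a small scheduling model in Python. This is a simplified sketch, not the TSP implementation: task costs are hypothetical, threads take tasks from a shared pool in order, and work stealing is approximated by splitting the long task into equal segments ahead of time.

```python
import heapq
from typing import List


def makespan(task_costs: List[float], n_threads: int) -> float:
    """Finish time when each idle thread takes the next task from the pool."""
    finish = [0.0] * n_threads  # min-heap of per-thread finish times
    heapq.heapify(finish)
    for cost in task_costs:
        earliest = heapq.heappop(finish)  # thread that becomes idle first
        heapq.heappush(finish, earliest + cost)
    return max(finish)


def split(cost: float, parts: int) -> List[float]:
    """Divide one task into equal segments that other threads could take."""
    return [cost / parts] * parts


# Hypothetical workload: 16 short tasks followed by one long task.
tasks = [1.0] * 16 + [8.0]
whole = makespan(tasks, n_threads=8)
segmented = makespan(tasks[:-1] + split(tasks[-1], 8), n_threads=8)
print(f"long task kept whole: {whole}, long task split: {segmented}")
```

In this model the thread that receives the long task determines the total time while the other threads sit idle, which mirrors the behavior visible at the end of the trace; splitting the task lets all threads share it.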
6. Related Work
Many works have focused on analyzing the performance and energy efficiency of low-power multicore processors. Padoin et al. [34] compared an ARM Cortex-A9 1.0 GHz dual-core processor from Texas Instruments to two multiprocessors: one composed of two quad-core 2.4 GHz Intel Xeon E5 processors and the other composed of four 2.0 GHz 8-core Xeon X7 processors. They analyzed different metrics such as time-to-solution, peak power and energy-to-solution using 6 benchmarks from the NAS Parallel Benchmarks (NPB). Their results showed that the ARM processor outperforms the Xeon X7, considering the energy-to-solution metric, for most of the analyzed benchmarks. However, the Xeon E5 had the best energy-to-solution among the three processors.
Göddeke et al. [3] also conducted a comparison between ARM and x86 architectures using different classes of numerical solution methods for partial differential equations. They evaluated weak and strong scalability on a cluster of 96 ARM Cortex-A9 dual-core processors and demonstrated that the ARM-based cluster can be more efficient in terms of energy-to-solution than a cluster of Intel Xeon X5550 processors. Similarly, Ou et al. [35] compared the energy efficiency of an ARM-based cluster against an Intel x86 workstation on three applications: a web server, an in-memory database and video transcoding. They concluded that the energy-efficiency ratio of the ARM cluster against the Intel workstation depends on the application and may vary from 1.21 up to 9.5.
Recent works aim to assess whether light-weight manycore processors can be used as basic blocks for future HPC architectures. Totoni et al. [36] compared the power and performance of Intel’s Single-Chip Cloud Computer (SCC) to other types of CPUs and GPUs. The analysis was based on a set of parallel applications implemented with the Charm++ programming model. They showed that there is no single best solution that always achieves the best trade-off between power and performance. However, the results obtained with the Intel SCC suggest that manycores are an opportunity for the future. Morari et al. [37] proposed an optimized implementation of radix sort for the Tilera TILEPro64 manycore processor. The results showed that their solution for TILEPro64 provides much better energy efficiency than a general-purpose multicore processor (Intel Xeon W5590) and comparable energy efficiency with respect to an NVIDIA Tesla C2070 GPU. Gharaibeh et al. [38] showed how a synergistic use of CPUs and GPUs can improve the overall energy-to-solution on large-scale graph processing. In particular, their approach is similar to our seismic wave simulation application in the sense that they map the problem (the input graph) to the interconnection topology of the underlying hardware platform.
Castro et al. [9] showed that MPPA-256 can be very competitive considering both performance and energy efficiency for a fairly parallelizable problem: the Traveling Salesman Problem (TSP). The results indicated that the MPPA-256 may achieve better performance than an Intel Xeon processor with 8 CPU cores (16 threads with Hyper-Threading) running at 2.40 GHz while consuming approximately 13 times less energy. Using a slightly different approach, Aubry et al. [7] compared the performance of an Intel Core i7-3820 processor with the MPPA-256. Their application, an H.264 video encoder, was implemented using a dataflow language that offers direct automatic mapping to MPPA-256. Their findings show that the performance of the traditional processor is on par with that of the MPPA-256 embedded processor, which provides 6.4 times better energy efficiency.
Unlike the previous works, we have focused on the passage from multicores to manycores from the perspective of three irregular applications. We pointed out some of the programming issues that must be considered when developing parallel applications for manycores. Moreover, we analyzed the performance and energy consumption of these applications on a set of state-of-the-art multicore and manycore platforms, ranging from low-power processors to general-purpose processors.
7. Conclusion and Future Work
In this work we analyzed the performance and the energy efficiency of four different hardware platforms. For that we employed applications with three different behaviors. The experimental results obtained during this research corroborated the widely accepted practice in the high-performance research domain that considers an appropriate appreciation of the underlying hardware idiosyncrasies essential to obtain good performance and energy efficiency.
Manycore processors seem to be the trend in the development of faster energy-efficient processors. The efficient use of a light-weight manycore processor demands adaptations to the application code so that it can use the whole chip efficiently. Often these modifications are not trivial. For instance, the MPPA-256 platform has a strong constraint on the amount of available local memory. For this reason we had to implement specific tiling mechanisms to be able to deal with real-world scenarios (Ondes3D) and arbitrary problem sizes (K-Means). In the case of Ondes3D, we also needed to implement a prefetching mechanism to overlap communications with computation. On TSP, on the other hand, modifications were similar to those needed to port an application to the MPI paradigm. However, the absence of a coherent cache considerably increased the implementation complexity, requiring the use of full memory barriers or proprietary system calls designed to completely bypass the cache. On the Altix UV 2000, we had to employ thread pinning and memory placement to ensure performance.
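The combination of tiling and prefetching mentioned above can be sketched generically in Python. This is an illustration of the double-buffering idea only, not the MPPA-256 implementation: the function names are hypothetical, a list slice stands in for a transfer into local memory, and a single background thread stands in for asynchronous communication.

```python
from concurrent.futures import ThreadPoolExecutor


def tiles(n_items, tile_size):
    """Yield (start, stop) index ranges no larger than the tile size."""
    for start in range(0, n_items, tile_size):
        yield start, min(start + tile_size, n_items)


def fetch(data, bounds):
    """Stand-in for a transfer from remote memory into a local buffer."""
    start, stop = bounds
    return data[start:stop]


def process_tiled(data, tile_size, compute):
    """Process data tile by tile, fetching tile i+1 while computing tile i."""
    results = []
    ranges = list(tiles(len(data), tile_size))
    if not ranges:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, data, ranges[0])
        for nxt in ranges[1:]:
            current = pending.result()
            pending = pool.submit(fetch, data, nxt)  # prefetch the next tile
            results.append(compute(current))         # overlaps with the fetch
        results.append(compute(pending.result()))    # last tile
    return results


partial_sums = process_tiled(list(range(10)), tile_size=4, compute=sum)
print(partial_sums)
```

The tile size plays the role of the local-memory budget: only the tile being computed and the tile being fetched need to be resident at any time.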
As is often the case for parallel applications, such modifications tend to introduce redundant computations and extra communications in order to improve the parallelism of the whole solution. Not every application is suited to this kind of modification and, in the worst-case scenario, a strictly serial application might be limited to the performance of a single core. For these three classes of applications (CPU-bound, memory-bound and mixed) we showed that highly parallel platforms can be very competitive, even if the application is irregular in nature. Our results showed that MPPA-256 may achieve better performance than a traditional general-purpose multicore processor (Xeon E5) on CPU-bound and mixed workloads. For a memory-bound workload (Ondes3D), Xeon E5 performed better than MPPA-256. Although Altix UV 2000 presented the best performance results among all platforms, it also presented a higher energy consumption when communication became more important (K-Means and Ondes3D); however, it still showed an energy efficiency similar to that of Xeon E5. MPPA-256 presented the best energy efficiency among all platforms, reducing the energy consumed on TSP, K-Means and Ondes3D by at least 6.9x, 6.5x and 3.8x, respectively.
This work can be extended in two directions. First, we compared the energy efficiency of state-of-the-art Intel-based platforms (Xeon E5 and Altix UV 2000) to other low-power platforms (MPPA-256 and Exynos 5). These specific Intel-based platforms are optimized for performance, not for low energy consumption. As future work, we plan to compare the performance of these low-power processors to that of platforms based on low-power Intel processors such as the Intel Atom and the mobile versions of the Sandy Bridge architecture. Next, we intend to compare the performance and energy efficiency of lightweight manycore processors such as MPPA-256 to other manycore processors such as GPUs and the Intel Xeon Phi.
Acknowledgments
The authors would like to thank CAPES for funding this research through project CAPES/Cofecub 660/10 and through a PNPD/CAPES scholarship. This work was done in the context of LICIA and the Mont-Blanc project (funded from the European Union’s Seventh Framework Programme under grant agreement #288777), being partially supported by CNPq, FAPEMIG, FAPERGS and INRIA.
References
[1] J. Larus, Spending Moore’s Dividend, Communications of the ACM 52 (2009) 62–69.
[2] D. Brooks, P. Bose, S. E. Schuster et al., Power-Aware Microarchitecture: Design and Modeling Challenges for Next-Generation Microprocessors, IEEE Micro 20 (2000) 26–44.
[3] D. Göddeke, D. Komatitsch, M. Geveler, D. Ribbrock, N. Rajovic, N. Puzovic, A. Ramirez, Energy Efficiency vs. Performance of the Numerical Solution of PDEs: An Application Study on a Low-power ARM-based Cluster, J. Comput. Physics 237 (2013) 132–150.
[4] N. Rajovic et al., The Low-Power Architecture Approach Towards Exascale Computing, in: Workshop on Scalable Algorithms for Large-Scale Systems (ScalA), ACM, New York, USA, 2011, pp. 1–2.
[5] T. Fleig, O. Mattes, W. Karl, Evaluation of Adaptive Memory Management Techniques on the Tilera TILE-Gx Platform, in: International Conference on Architecture of Computing Systems (ARCS), VDE VERLAG, Luebeck, Germany, 2014, pp. 88–96.
[6] C. L. Benoît Dupont de Dinechin, Pierre Guironnet de Massas, Guillaume Lager, B. Orgogozo, J. Reybert, T. Strudel, A Distributed Run-Time Environment for the Kalray MPPA-256 Integrated Manycore Processor, in: Intl. Conference on Computational Science (ICCS), volume 18, Elsevier, Barcelona, Spain, 2013, pp. 1654–1663.
[7] P. Aubry, P.-E. Beaucamps, F. Blanc, B. Bobin, S. Carpov, L. Cudennec, V. David, P. Dore, P. Dubrulle, B. D. de Dinechin, F. Galea, T. Goubier, M. Harrand, S. Jones, J.-D. Lesage, S. Louise, N. M. Chaisemartin, T. H. Nguyen, X. Raynaud, R. Sirdey, Extended Cyclostatic Dataflow Program Compilation and Execution for an Integrated Manycore Processor, in: International Conference on Computational Science (ICCS), volume 18, Elsevier, Barcelona, Spain, 2013, pp. 1624–1633.
[8] G. Laporte, The Traveling Salesman Problem: An Overview of Exact and Approximate Algorithms, European Journal of Operational Research 59 (1992) 231–247.
[9] M. Castro, E. Francesquini, T. M. Nguélé, J.-F. Méhaut, Analysis of Computing and Energy Performance of Multicore, NUMA, and Manycore Platforms for an Irregular Application, in: Workshop on Irregular Applications: Architectures & Algorithms (IA^3) - Supercomputing Conference (SC), ACM, Denver, USA, 2013, p. Article No. 5.
[10] H. Li, H. L. Sudarsan, M. Stumm, K. C. Sevcik, Locality and Loop Scheduling on NUMA Multiprocessors, in: International Conference on Parallel Processing (ICPP), volume 2, IEEE Computer Society, Syracuse, USA, 1993, pp. 140–147.
[11] R. Xu, D. Wunsch II, Survey of clustering algorithms, Neural Networks, IEEE Transactions on 16 (2005) 645–678.
[12] L. Kaufman, P. J. Rousseeuw, Finding groups in data: an introduction to cluster analysis, John Wiley and Sons, New York, 1990.
[13] A. K. Jain, R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988.
[14] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, A. Wu, An efficient k-means clustering algorithm: analysis and implementation, Pattern Analysis and Machine Intelligence, IEEE Transactions on 24 (2002) 881–892.
[15] I. Dhillon, D. Modha, A data-clustering algorithm on distributed memory multiprocessors, in: M. Zaki, C.-T. Ho (Eds.), Large-Scale Parallel Data Mining, volume 1759 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2000, pp. 245–260.
[16] S. Rao, E. V. Prasad, N. B. Venkateswarlu, A scalable k-means clustering algorithm on multi-core architecture, in: Methods and Models in Computer Science, 2009. ICM2CS 2009. Proceeding of International Conference on, pp. 1–9.
[17] L. Rodrigues, L. Zarate, C. Nobre, H. Freitas, Parallel and distributed kmeans to identify the translation initiation site of proteins, in: Systems, Man, and Cybernetics (SMC), 2012 IEEE International Conference on, pp. 1639–1645.
[18] P. Moczo, J. O. A. Robertsson, L. Eisner, The Finite-difference Time-domain method for Modeling of Seismic Wave Propagation, in: Advances in Wave Propagation in Heterogeneous Media, volume 48 of Advances in Geophysics, Elsevier - Academic Press, 2007, pp. 421–516.
[19] H. Aochi, T. Ulrich, A. Ducellier, F. Dupros, D. Michea, Finite difference simulations of seismic wave propagation for understanding earthquake physics and predicting ground motions: Advances and challenges, in: Journal of Physics: Conference Series, volume 454, IOP Publishing, p. 012010.
[20] R. Madariaga, Dynamics of an expanding circular fault, Bulletin of the Seismological Society of America 66 (1976) 639–666.
[21] F. Collino, Perfectly matched absorbing layers for the paraxial equations, Journal of Computational Physics 131 (1997) 164–180.
[22] F. Dupros, H.-T. Do, H. Aochi, On scalability issues of the elastodynamics equations on multicore platforms, in: International Conference on Computational Science (ICCS), volume 18 of Procedia Computer Science, Elsevier, Barcelona, Spain, 2013, pp. 1226–1234.
[23] F. Dupros, C. Pousa, A. Carissimi, J.-F. Méhaut, Parallel Simulations of Seismic Wave Propagation on NUMA Architectures, in: International Parallel Computing conference (ParCo), volume 19 of Advances in Parallel Computing, IOS Press, Lyon, France, 2010, pp. 67–74.
[24] Y. Cui, K. Olsen, T. Jordan, K. Lee, J. Zhou, P. Small, D. Roten, G. Ely, D. K. Panda, A. Chourasia, J. Levesque, S. M. Day, P. Maechling, Scalable earthquake simulation on petascale supercomputers, in: High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for, pp. 1–20.
[25] T. Furumura, L. Chen, Parallel simulation of strong ground motions during recent and historical damaging earthquakes in Tokyo, Japan, Parallel Computing 31 (2005) 149–165. Parallel Graphics and Visualization.
[26] R. K. Tesser, L. L. Pilla, F. Dupros, P. O. A. Navaux, J.-F. Méhaut, C. Mendes, Improving the performance of seismic wave simulations with dynamic load balancing, in: Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), IEEE Computer Society, Turin, Italy, 2014, pp. 196–203.
[27] F. Dupros, H. Aochi, A. Ducellier, D. Komatitsch, J. Roman, Exploiting Intensive Multithreading for the Efficient Simulation of 3D Seismic Wave Propagation, in: International Conference on Computational Science and Engineering, São Paulo, Brazil, pp. 253–260.
[28] A. Gursoy, Data decomposition for parallel k-means clustering, in: R. Wyrzykowski, J. Dongarra, M. Paprzycki, J. Waśniewski (Eds.), Parallel Processing and Applied Mathematics, volume 3019 of Lecture Notes in Computer Science, Springer Berlin Heidelberg, 2004, pp. 241–248.
[29] E. Rotem, A. Naveh, A. Ananthakrishnan, E. Weissmann, D. Rajwan, Power-Management Architecture of the Intel Microarchitecture Code-Named Sandy Bridge, IEEE Micro 32 (2012) 20–27.
[30] M. Hähnel, B. Döbel, M. Völp, H. Härtig, Measuring Energy Consumption for Short Code Paths Using RAPL, ACM Sigmetrics Performance Evaluation Review 40 (2012) 13–17.
[31] C. Andreolli, P. Thierry, L. Borges, C. Yount, G. Skinner, Genetic Algorithm Based Auto-Tuning of Seismic Applications on Multi and Manycore Computers, in: EAGE Workshop on High Performance Computing for Upstream, Amsterdam, Netherlands, September 2014. (To Appear).
[32] K. Datta, S. Kamil, S. Williams, L. Oliker, J. Shalf, K. Yelick, Optimization and performance modeling of stencil computations on modern microprocessors, SIAM Review 51 (2009) 129–159.
[33] R. Love, K. Korner, CPU Affinity, Linux Journal, (111), 2003.
[34] E. L. Padoin, D. A. G. de Oliveira, P. Velho, P. Navaux, Time-to-Solution and Energy-to-Solution: A Comparison between ARM and Xeon, in: Workshop on Applications for Multi-Core Architectures (WAMCA), IEEE Computer Society, New York, USA, 2012, pp. 48–53.
[35] Z. Ou, B. Pang, Y. Deng, J. Nurminen, A. Ylä-Jääski, P. Hui, Energy and Cost-Efficiency Analysis of ARM-Based Clusters, in: IEEE/ACM Intl. Symposium on Cluster, Cloud and Grid Computing (CCGrid), IEEE Computer Society, Ottawa, Canada, 2012, pp. 115–123.
[36] E. Totoni, B. Behzad et al., Comparing the Power and Performance of Intel’s SCC to State-of-the-Art CPUs and GPUs, in: IEEE Intl. Symposium on Performance Analysis of Systems and Software (ISPASS), IEEE Computer Society, New Brunswick, Canada, 2012, pp. 78–87.
[37] A. Morari, A. Tumeo, O. Villa, S. Secchi, M. Valero, Efficient sorting on the Tilera manycore architecture, in: IEEE International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), IEEE Computer Society, New York, USA, 2012, pp. 171–178.
[38] A. Gharaibeh, E. Santos-Neto, L. B. Costa, M. Ripeanu, The energy case for graph processing on hybrid CPU and GPU systems, in: Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms, IA3 ’13, ACM, New York, NY, USA, 2013, pp. 2:1–2:8.
