Mapping Matters: Application Process Mapping on 3-D Processor Topologies by Korndörfer, Jonas H. Müller et al.
Mapping Matters: Application Process Mapping
on 3-D Processor Topologies
Jonas H. Mu¨ller Korndo¨rfer1, Mario Bielert2, Lae´rcio L. Pilla3, and
Florina M. Ciorba1
1 Department of Mathematics and Computer Science,
University of Basel, Basel, Switzerland,
Email: jonas.korndorfer@unibas.ch and florina.ciorba@unibas.ch
2 Center for Information Services and High Performance
Computing (ZIH),
Technische Universita¨t Dresden, Dresden, Germany,
Email: mario.bielert@tu-dresden.de
3 Laboratoire de Recherche en Informatique, CNRS &
Univ. Paris-Saclay, Orsay, France,
Email: pilla@lri.fr
May 22, 2020
Abstract
Applications’ performance is influenced by the mapping of processes
to computing nodes, the frequency and volume of exchanges among pro-
cessing elements, the network capacity, and the routing protocol. A poor
mapping of application processes degrades performance and wastes re-
sources. Process mapping is frequently ignored as an explicit optimization
step since the system typically offers a default mapping, users may lack
awareness of their applications’ communication behavior, and the oppor-
tunities for improving performance through mapping are often unclear.
This work studies the impact of application process mapping on several
processor topologies. We propose a workflow that renders mapping as an
explicit optimization step for parallel applications. We apply the workflow
to a set of four applications (NAS CG and BT-MZ, CORAL-2 AMG, and
CORAL LULESH), twelve mapping algorithms (communication & topol-
ogy oblivious/aware), and three direct network topologies (3-D mesh, 3-D
torus, and a novel highly adaptive energy-efficient 3-D topology, called
the HAEC Box). We assess the mappings’ quality in terms of volume,
frequency, and distance of exchanges using metrics such as dilation (mea-
sured in hop·Byte). A parallel trace-based simulator predicts the appli-
cations’ execution on the three topologies using the twelve mappings. We
1
ar
X
iv
:2
00
5.
10
41
3v
1 
 [c
s.D
C]
  2
1 M
ay
 20
20
evaluate the impact of process mapping on the applications’ simulated per-
formance in terms of execution and communication times and identify the
mapping that achieves the highest performance. To ensure correctness of
the simulations, we compare the resulting volume, frequency, and distance
of exchanges against their pre-simulation values. This work emphasizes
the importance of process mapping as an explicit optimization step and
offers a solution for parallel applications to exploit the full potential of
the allocated resources on a given system.
1 Introduction
The growing amount of computing elements in HPC systems inherently presents
a new bottleneck in terms of the necessary communication for data distribution
as well as controlling the application processing elements, e.g., tasks, threads,
and processes. Therefore, the performance of parallel applications highly de-
pends on their communication behavior.
Nowadays, parallel applications execute on a broad range of parallel comput-
ing architectures, from large supercomputers to embedded low-power architec-
tures. When these applications execute on parallel systems, their communica-
tion time is affected by how intensely their processing elements exchange data,
by the capacity and performance of the network links, and by the placement of
processing elements on the computing resources.
Application placement is typically the result of a mapping algorithm. Effi-
cient application placement on modern hardware architectures is of paramount
importance for performance [25] [21] [41] [33] [9]. A poor choice of the mapping
algorithm may lead to larger communication latencies and, therefore, to signif-
icant performance loss and energy waste. A plethora of communication and/or
topology-aware mapping algorithms emerged over the years in the literature to
improve application process placement (see [25] for a recent overview).
Process mapping1 is an active research field with a vast history that in-
cludes algorithms whose performance benefits have been recorded in multiple
situations. For instance, communication and/or topology-aware mapping tech-
niques combine information about the target application (its communication
pattern or virtual topology) and the target system (its physical topology) to take
mapping decisions following a performance objective, e.g., minimizing conges-
tion, dilation, distance, or volume·distance [25]. Nevertheless, there are many
other approaches, such as space-filling curves (SFCs), that are simply based on
common communication patterns [10]. In practice, the operating system (at the
node level) or the batch system (across nodes) determines the process placement
with only the knowledge about the physical topology of the system and without
consideration for the application’s virtual topology.
Given the need to achieve high performance and mitigate resource waste,
many application developers and users face the following questions:
1Also known as topology mapping or application process placement.
2
Q1. Is the application suffering from poor communication performance?
Q2. Does mapping impact application’s performance on this system?
Q3. Which mapping algorithm is the highest performing for the given
application–system pair?
The absence of simple answers to these questions has severe consequences.
While a na¨ıve process mapping may lead to performance loss, an inapt mapping
algorithm may cause longer execution times in addition to the overhead associ-
ated with generating the mapping itself. Over time, such performance loss trans-
lates into congested resources and increased energy consumption. Moreover,
performing repeated experiments to identify the highest performing mapping
algorithm is neither a sustainable nor a scalable solution. Therefore, process
mapping is frequently ignored as an explicit optimization step.
In this work, we study the impact of application process mapping on sev-
eral processor topologies. We propose a workflow that renders mapping as an
explicit optimization step for communication-intensive applications. We apply
the workflow to a set of four applications, twelve mapping algorithms,and three
direct network topologies.
This works makes the following contributions:
(i) Proposes a generic workflow to support process mapping as an explicit
optimization step.
(ii) Conducts an analysis on the predicted mapping benefit for a given application–
system pair.
(iii) Contributes a Python-based library of well-known topology mapping al-
gorithms from the literature.
The remainder of this work is organized as follows. The work related to ap-
plication optimization through careful placement is reviewed in Section 2. The
generic workflow for mapping applications onto processor topologies is intro-
duced and described in Section 3. The application characteristics and evaluation
metrics are presented in Section 4, while the processor topologies and network
models are discussed in Section 5. Section 6 describes the mapping algorithms
and their characteristics. The design of performance experiments, their analysis
and evaluation are presented in Section 7, while Section 8 concludes the work
and outlines future work directions.
2 Related Work
Process placement on modern hardware architectures has been studied from
various dimensions and in various contexts, leading to a significant body of work
emerging in the literature over the years. One may refer to recent surveys on the
topic for an overview of existing solutions and open problems [25] [33] [9] [39].
3
The approach taken in this work involves: parallel MPI applications, al-
gorithmic strategies for topology mapping, three-dimensional direct network
topologies, communication models, application tracing, and trace-driven simu-
lation.
Most existing work considers parallel MPI applications when studying pro-
cess mapping. In a recent study [32], MPI point-to-point calls have been found
to be more prominently used that either persistent point-to-point or one-sided
MPI calls. In this work, we concentrate on the effect of process mapping on the
performance of MPI point-to-point calls.
Hoefler et al. [25] classify the algorithmic strategies for topology mapping
into four categories. The algorithms considered in this work fall into three of
these categories: greedy includes the Peano, Hilbert, Gray, and Sweep SFCs,
greedy, FGgreedy, greedyALLC, and topo-aware, graph partitioning includes
bipartition and PaCMap, while Bokhari is an isomorphism-based algorithm.
Process mapping is influenced by the underlying network topology. There-
fore, mapping has been studied in the context of modern network topologies
(torus, fat tree, and Dragonfly) and technologies (InfiniBand, Ethernet, Blue-
Gene, Cray and others) [25]. LibTopoMap [26] is a generic library of graph map-
ping heuristics (recursive bisection, k-way partitioning, simple greedy strategies,
and Cuthill-McKee). Mapping with LibTopoMap is based on similarity metrics
(e.g., bandwidth of the adjacency matrices), employs rank reordering for MPI
applications, and can be used on various network topologies and technologies.
While the library is generic and versatile, it is not directly usable in a simulated
environment where certain practical effects in the software stack are abstracted
and the process mapping study can concentrate on specific aspects such as the
communication cost. Simulation is also important for the co-design of applica-
tions and future systems, which is the approach we take in the present work.
MPIPP [16] is a framework dedicated to MPI applications with arbitrary
virtual topologies executing on SMP clusters and multi-clusters. Similar to our
work, the framework employs application tracing to obtain the communication
behavior of the application. In contrast to this work, the communication pattern
is stored as a communication graph which is placed on the topology graph
(determined on-the-fly via a parallel ping-pong mechanism). Also different from
our work, they only consider graph partitioning algorithms and conduct direct
experiments on SMP clusters.
Rodrigues et al. [40] use a purely quantitative approach to apply resource
binding for MPI applications and reduce communication costs in multicore
nodes. Mercier and Clet-Ortega [34] use a similar but qualitative approach to
the same problem. Both works [40] [34] employ graph partitioning to compute
the mapping.
Automated mapping has also been studied in the context of regular appli-
cation communication graphs on 2-D and 3-D mesh and torus networks [12].
There, the virtual topology is also represented as a graph and the mapping
heuristics are chosen to optimize the hop·Byte metric. In our work, the virtual
topology is represented as an adjacency matrix, while the mapping heuristics
also optimize for the dilation (measured as hop·Byte) on 3-D mesh and torus
4
topologies, complemented by the 3-D HAEC Box topology.
EagerMap [17] employs a greedy topology mapping algorithm for hierarchi-
cal machine topologies (trees). Its algorithm groups together the application’s
processes that show the highest affinity based on the communication matrix. Al-
though the algorithm has been adapted to handle arbitrary network topologies,
it requires hierarchical multicore nodes.
Rico-Gallego et al. [39] surveyed prominent communication performance
models in high-performance computing, which are often tested in simulation.
They state that future models will need to take into account accurate perfor-
mance and energy modeling. In this work, we employ a contention-oblivious
communication model called network coding dynamic resilient (NCDr) [37] that
transmits messages efficiently, reliably, and with minimal energy costs. NCDr
is implemented in HAEC-SIM, the trace-based simulator used in this work.
In addition to careful process mapping, Sensi et al. [41] show that application-
aware routing outperforms application-agnostic routing. Similarly, in this work
we employ shortest path routing and argue that, in addition, communication-
aware mapping is needed and show that, in certain cases, it outperforms
communication-oblivious mapping.
Trace-driven simulation has also been recently used by Tsuji et al. [42] to
study the application behavior on future systems. Their workflow is very similar
to ours with the difference that their focus is to support the analysis of applica-
tions larger than the real MPI traces from existing systems. While we are also
concerned with scalability (planned for future work), in the present work we
concentrate on answering the three research questions (Section 1) for modern
application-system pairs.
Kenny et al. [30] consider the influence of the network and its parameters on
the performance of MPI-based parallel applications. They decompose the appli-
cation time spent in MPI into: communication, synchronization, and software
stack components. Through the combination of Bayesian inference and trace
replay, they found that synchronization and MPI software stack overheads are
at least as important as the network itself in determining time spent in com-
munication routines. In this work, we explicitly study the impact of network
topology and parameters under various process mapping strategies. Measuring
the time spent in synchronization and in various components of the software
stack is highly complex and will be incorporated into our workflow in the fu-
ture.
The above review shows partial similarities and differences between this work
and related work. This clarifies why it is not possible to directly compare our
approach with any of the above efforts.
3 Proposed Workflow
In this work, we propose a workflow to explicitly render mapping as an op-
timization step for communication-intensive applications. The proposed work-
flow, illustrated in Fig. 1, contains some steps that can be performed in parallel.
5
Rectangles represent actions, while light-colored parallelograms represent infor-
mation used as input or output. Each color refers to a different kind of step:
red steps relate to applications; blue steps are associated to mapping activities;
orange steps concern machine topologies; and green refers to the performance
evaluation and analysis phase.
We follow the proposed workflow along the course of this paper as a use case
and proof of concept. We explain and exemplify each task group in separated
sections.
The steps in red are detailed and exemplified in Section 4. They cover the
extraction and analysis of performance metrics that belong only to the target
application itself (i.e., are independent from topology or mapping).
Input / output
Process
Application tracing
Application trace
Extraction of 
communication 
profile
Communication 
matrices
Pre-simulation
analysis
Communication
volume, frequency, 
distance
Pre- and 
post-simulation 
comparison
Communication time, 
execution time
Start
End
Application 
(code & data)
Topologies 
(processor 
topologies, 
communication 
model, 
path selection)
Mapping algorithms
Generate process 
mapping
Mapping files
Trace-based 
simulation
Post-simulation
analysis
Simulation results 
(trace)
Communication
volume, frequency, 
distance
Figure 1: Workflow to explicitly render mapping as an optimization step for
communication-intensive applications.
6
The topology information in orange is discussed in Section 5. It includes the
definition of the target topologies, communication model, and path selection
that are used later in the workflow.
The mapping steps in blue are presented in Section 6. They combine the
information gathered from applications and topologies to generate mappings by
different process mapping algorithms.
The performance evaluation steps in green are exemplified in Section 7. They
handle the results from previous steps and provide an analysis process to eval-
uate the performance gains of alternative mappings for an application–system
pair. These steps are organized in two main phases. The first phase introduces a
pre-simulation analysis of mappings based on metrics that do not require native
or simulated experiments. The second and final phase contemplates a post-
simulation performance analysis that implies the use of a trace-based simulator
to check the performance gains of the studied applications, and later validate
the findings from the pre-simulation phase.
4 Parallel Communication-Intensive Applications
This section describes the four tasks which consider the applications tracing
and evaluation (red tasks in Figure 1) of the proposed workflow exposing how
we followed them. These consider the collection of information regarding the
target applications and an initial analysis over the applications’ communication
matrices which aims to predict which of them may benefit from improved process
mapping (subsection 4.2).
To follow the proposed workflow entirely, the target applications must be
traced. For this step, we used Score-P [31] measurement infrastructure version
4.1. We target four applications from different benchmark suites. From the
NAS parallel benchmarks [11] [7], we investigate CG and BT-MZ. Both appli-
cations were traced considering the input size class C [1]. NAS CG computes an
approximation of the smallest eigenvalue for a large, sparse, and symmetric ma-
trix. It stresses several irregular long distance communications. NAS BT-MZ
is a newer version of the Block Tri-diagonal solver (BT) which was re-design
to exploit multiple levels of parallelism (this case, MPI and OpenMP). From
the CORAL2 [6] benchmark suite, we investigate AMG [2]. This application
was traced with the default input (problem 1) [2]. AMG is a multigrid solver,
it generates many small messages and stresses memory and network latency.
Lastly, from the CORAL [5] benchmark suite, we approach LULESH [3] [28].
This application was traced considering 1000 iterations. LULESH is a proxy
application that simulates a variety of problems which describe the motion of
materials relative to each other when subject to forces.
The applications tracing process was performed on a system composed by
Intel Broadwell E5-2640 v4 [4] organized in 2 sockets with 10 cores each. The
nodes are connected by Intel Omni-Path network with 100 Gbit/s speed and
the interconnection topology is a two-level fat-tree. The applications were exe-
cuted on 16 nodes of this system, each executing 4 ranks (2 on each socket of
7
the node) totalizing 64 MPI ranks. Table 1 summarizes the total communica-
tion versus computation cost of the applications highlighting the most common
MPI operations. Theses results were obtained from the traces produced by the
applications’ execution on the aforementioned system.
Table 1: Application time spent in computation versus communication
NAS CG NAS BT-MZ CORAL2 AMG CORAL LULESH
Computation total 140.45 s, 2.8% 860.88 s, 84.4% 711.32 s, 75.8% 14,231.36 s, 83.2%
MPI Send 3,628.63 s, 71.3% — 0.17 s, 0.0% —
MPI Receive — — 1.64 s, 0.2% —
MPI Isend — 4.06 s, 0.4% 3.13 s, 0.3% 19.74 s, 0.1%
MPI Irecv 1.71 s, 0.0% 0.53 s, 0.1% 0.41 s, 0.0% 2.29 s, 0.0%
MPI Wait 1,301.72 s, 25.6% — — 729.47 s, 4.3%
MPI Waitall — 126.97 s, 12.4% 90.39 s, 9.6% 4.04 s, 0.0%
MPI total 4,945.84 s, 97.2% 159.47 s, 15.6% 226.62 s, 24% 2,870.59 s, 16.8%
4.1 Process Logical Communication Matrices
The next step in the red group of tasks of the workflow considers the extraction
of the communication behavior of the applications as a process logical communi-
cation matrix. The communication matrix of the applications must be extracted
from the traces or collected with another approach in a CSV text file format.
Communication matrices can be collected considering different data from the
applications (e. g., average message transfer time, volume of data exchanged,
among others). Figure 2 shows a graphical representation of the two types of
communication matrices required in this work. The top row of Figure 2 shows
the communication matrices in terms of point-to-point messages (commMatrix
count), and the bottom row shows the communication matrices in terms of vol-
ume (commMatrix size). The commMatrix count presents the total number of
messages exchanges between the processes. The commMatrix size presents the
volume of data in Bytes exchanged between the processes. The X and Y axis
represent the processes of the applications. It is important to highlight that BT-
MZ, LULESH and AMG present very irregular communication patterns which
can already indicate potential room for performance improvement.
4.2 Application-related Communication Metrics
Communication matrices provide a first view of an application’s communica-
tion behavior. Yet, the structure of a communication matrix alone may not be
enough to predict if or how much an application will benefit from carefully map-
ping its processes [15]. To provide better predictions, communication metrics
(or matrix statistics) have been previously proposed and tested in the context
of thread mapping on shared-memory machines [19,20] and process mapping on
hierarchical machines (nodes with multiple cores organized in fat-tree topolo-
gies) [15]. These metrics are useful in situations where we want to predict gains,
but also to filter applications (for instance, in a situation where multiple appli-
8
Receiver Process
Se
nd
er
 P
ro
ce
ss
2000
3000
4000
5000
6000
Nu
m
be
r o
f m
es
sa
ge
s
(a) CG (exchanges)
Receiver Process
Se
nd
er
 P
ro
ce
ss
190
200
210
220
Nu
m
be
r o
f m
es
sa
ge
s
(b) BT-MZ (exchanges)
Receiver Process
Se
nd
er
 P
ro
ce
ss
1500
2000
2500
3000
Nu
m
be
r o
f m
es
sa
ge
s
(c) LULESH (exchanges)
Receiver Process
Se
nd
er
 P
ro
ce
ss
250
500
750
1000
Nu
m
be
r o
f m
es
sa
ge
s
(d) AMG (exchanges)
Receiver Process
Se
nd
er
 P
ro
ce
ss
0
10000
20000
30000
Vo
lu
m
e 
ex
ch
an
ge
d 
[B
]
+2.964e8
(e) CG (volume)
Receiver Process
Se
nd
er
 P
ro
ce
ss
0.5
1.0
1.5
Vo
lu
m
e 
ex
ch
an
ge
d 
[B
]
1e7
(f) BT-MZ (volume)
Receiver Process
Se
nd
er
 P
ro
ce
ss
2
4
6
8
Vo
lu
m
e 
ex
ch
an
ge
d 
[B
]
1e7
(g) LULESH (volume)
Receiver Process
Se
nd
er
 P
ro
ce
ss
0.5
1.0
1.5
Vo
lu
m
e 
ex
ch
an
ge
d 
[B
]
1e7
(h) AMG (volume)
Figure 2: Communication behavior of the four used applications recorded on
MiniHPC with 64 MPI processes. The top row shows the communication matri-
ces in terms of point-to-point messages (heat bar: number) exchanged between
the processes (X and Y axes). The bottom row shows the communication ma-
trices in terms of volume (heat bar: Byte) exchanged between the processes
(X and Y axes). In all figures, white denotes absence of exchanges and of com-
munication volume between a pair of sender-receiver processes.
9
cations can be tested but access to resources is limited). Still, it is important
to emphasize that these metrics have not been tested in 3-D topologies before.
We list the set of communication metrics that we consider in our work below.
For all of these metrics, higher values are supposed to indicate a higher potential
benefit from a careful mapping.
• Communication Heterogeneity (CH): measures the average communica-
tion variance of each process [20].
• Communication Amount (CA): measures how much processes communi-
cate on average [20].
• Communication Balance (CB): measures how much the process that com-
municates the most diverges from other processes [19].
• Communication Centrality (CC): measures how communication is dis-
persed from the diagonal of the communication matrix [15].
• Neighbor Communication Fraction (NBC): measures the fraction of com-
munication that happens between processes with close identifiers (ranking
neighbors) [15].
• Split Fraction (SP(k)): measures the fraction of communication that hap-
pens among processes in k2 blocks [15].
We computed the values of these six metrics for our different applications
based on the formulas presented in [15] except for CA, which is not mentioned
in that work2. Additionally, we computed the Split Fraction of applications for
values of k = 4 and 16 because they represent a portion and a full plane in
our 3-D topologies (more details in Section 5), which is the closest that we can
get to the original idea of k (i.e., size of nodes with multiple cores). Tables 2
and 3 present the values of these metrics for our four applications based on their
commMatrix count and commMatrix size, respectively. All values are rounded
to three decimal places. Bold values represent the largest value for each metric.
Table 2: Communication metrics based on commMatrix count .
AMG BT-MZ CG Lulesh
CH 0.119 0.171 0.046 0.073
CA 306.948 44.656 312.313 413.314
CB 0.418 0.289 0.000 0.413
CC 0.369 0.339 0.061 0.325
NBC 0.910 0.963 0.700 0.858
SP(4) 0.898 0.941 0.387 0.858
SP(16) 0.637 0.651 0.074 0.589
2The communication metrics scripts are available online at https://github.com/llpilla/
communication-statistics.
10
Table 3: Communication metrics based on commMatrix size.
AMG BT-MZ CG Lulesh
CH 0.063 0.025 0.059 0.041
CA 1,326,101.373 1,168,398.867 18,526,539.000 4,922,160.809
CB 0.273 0.205 0.000 0.258
CC 0.163 0.292 0.061 0.157
NBC 0.686 0.954 0.750 0.677
SP(4) 0.685 0.926 0.469 0.677
SP(16) 0.354 0.584 0.187 0.344
We can notice multiple points from Tables 3 and 2. BT-MZ shows the largest
values for the most metrics among applications, meaning that it probably has
the largest potential to benefit from topology mapping. AMG is the second
application based on the number of largest values with four largest values. Nev-
ertheless, we can notice that these values appear for metrics CH, CB, and CC.
As these metrics are the ones that show the smallest ranges, they may not help
differentiate the applications enough, which makes it difficult to estimate the
impact of mapping for AMG compared to others. The Communication Amount
metric shows that CG exchanges the most data and Lulesh exchanges the most
messages among the applications, which could indicate that they are the most
sensitive to changes in bandwidth and latency, respectively. Finally, CG has a
Communication Balance of zero for both communication matrices. This means
that all processes exchange exactly the same volume of data and number of
messages.
Although these metrics provide us with predictions of which applications
will benefit from precise mapping algorithms, how much the application actu-
ally benefit is strongly related to the specific network topologies being used
(for instance, a fully connected topology would see no changes with different
mappings). The next section explores the characteristics of network topologies.
5 Interconnection Network Topologies
On of the three external inputs to our workflow is the target topology repre-
sented by the orange box. The mapping problem might only be apparent when
thinking of parallel applications running on different nodes of an HPC system.
However, taking a more abstract approach, a self-similar fractal-like observation
enfolds.
Most contemporary networks can be seen as connected to an encompassing
interconnection network or as an interconnection network to its subnetworks.
For instance, from an HPC system comprising several isles, over a node that
contains multiple NUMA domains, down to modern SoC processor architectures,
each one is a network topology containing another subnetwork. Hence, we see
numerous opportunities to analyze the influence of different mapping strategies.
11
5.1 Direct 3-D Topologies
Nowadays, HPC systems are usually built with sophisticated network topologies,
e.g., butterfly, fat-tree, Dragonfly, while other networks still employ simpler
topologies. For instance, the Intel SkyLake SP architecture uses a 2D mesh
topology to connect all cores of one chip.
Influenced by these observations, we only consider topologies that arrange
64 nodes in a 4 × 4 × 4 fashion as primary building blocks for more complex
interconnection topologies. This node topology already allows for a rich selection
of exciting network topologies, i.e., 3-D mesh, 3-D torus, and the HAEC box
topology.
According to [25], the torus topology is used in IBMs Blue Gene series
(BG/L, BG/P and BG/Q ), Crays Gemini network, and Fujitsus K computer.
The mesh topology is a simpler version of a torus [12].
The HAEC project envisions a novel network topology [22], which arranges
the nodes on four boards with each board containing 4× 4 nodes. The nodes of
one board are connected using fast optical links in a 2D torus. A fully-connected
wireless connection array facilitates inter-board communication.
Table 4: The network link characteristics used in the simulation.
Link type Bandwith Latency Bit error rate
Wireless 100 Gbit s−1 100 ps 1e−8
Optical 250 Gbit s−1 10 ps 1e−12
In this work, we assume that all topologies employ future-generation network
links. Table 4 shows the characteristics of the used links. While the mesh and
torus topologies only contain the optical links, the HAEC Box is a heterogeneous
network topology by employing both wireless and optical links. Therefore, we
can analyze the fitness of the selected mapping algorithms in that case.
5.2 Path Selection
In our pre- and post-simulation analysis (subsections 7.1 and 7.3), we exclusively
use static path selection, i.e., the path of individual packets through the network
is only dependent on the source and destination positions and does not change
over time. The algorithm used in the Mesh and Torus topology to predeter-
mine the course of packets is the shortest-path routing in the XY Z dimension
order. This algorithm firstly sends packets along the X dimension until the
X-coordinate of the current hop destination is equivalent to the X-coordinate
of the packet destination. The algorithm then repeats this hop selection for the
Y -dimension and Z-dimension.
On the HAEC Box topology, the simulation uses the same routing algorithm
when the message source and destination are within the same board. When
the message has to jump between boards, the first hop is to the node, which
has the same X- and Y -coordinate as the destination. After the first hop,
12
every subsequent jump is along the Z-dimension, until the packet reaches the
destination.
5.3 Communication Model
For the modeling of the communications, we use a contention-oblivious imple-
mentation of the NCDr network model [38]. This model describes the separa-
tion of message bodies into packets and the duration of transmissions over the
network based on future-generation wireless and optical links.
6 Mapping Applications onto Parallel Machines
This section describes the blue task group of the proposed workflow (Figure 1).
It clarifies how we generate the mappings and briefly summarizes the available
algorithms.
6.1 Library of Mapping Algorithms
We implemented twelve mapping algorithms from the literature. The map-
ping algorithms were developed as a library in Python named MapLib3. Cur-
rently, the algorithms support the three topologies studied here, 3-D mesh,
3-D torus, and HAEC. The support for other topologies is considered as fu-
ture work. The algorithms input is the communication matrices of the ap-
plications (commMatrix count or commMatrix size) in CSV text file format.
The following line provides an example of the command that can be used
to generate the mappings: map -i $comm matrix -m $mapping -t $topology
-d $dimensions save -o "$outfile" stat. The parameter -i expects the
path to the CSV communication matrix of an application. The following, -m
specify which of the supported mapping algorithms will be calculated. The pa-
rameter -t selects the target topology and -d defines size of the three dimensions
of the topology, starting from X and followed by Y and Z. More specific details
regarding the installation and particular configurations can be found together
with the library itself. An example of mapping file generated by the Python
library can be found in [13].
The following subsections summarize each of the mapping algorithms imple-
ment in MapLib. These are divided into two categories. Communication- and
topology-oblivious mapping strategies (subsection 6.2) consider algorithms that
do not take into account the communication matrices of the applications, neither
the topologies. These algorithms follow a predetermined path/order to map all
processes to the available nodes. Thus, the resulting mappings will always be
the same independently of the communication matrix or topology considered.
The following category contains communication- and topology-aware mapping
strategies (subsection 6.3). These algorithms consider both the target topology
3The library is available at: https://github.com/unibas-dmi-hpc/MapLib
13
and communication matrices of the applications. Hence, they produce different
mappings for each topology and types of communication matrices.
6.2 Communication- and Topology-Oblivious Mapping
Strategies
The communication-oblivious mapping strategies implemented were five space
filling curves (SFCs) which have been extensively used as a mapping scheme from
the discrete multi-dimensional space into the one-dimensional space [35]. SFCs
were discovered in the nineteen century by Peano [36] followed by Hilbert [24]
and since then numerous variations have been studied. The five implemented
SFCs are depicted in Figure 3. The sweep mapping algorithm is the default
naive mapping which follows the topology in a XY Z fashion. It is also used as
reference to consider quality and performance improvements generated by other
mappings which are evaluated in the following Section 7.
x
y
z
(a) Peano
x
y
z
(b) Hilbert
x
y
z
(c) Gray
x
y
z
(d) sweep
x
y
z
(e) scan
Figure 3: Three-dimensional SFCs used in this work. The curves start on the
bottom left corner of the topology and proceed along the lines in the red-orange-
yellow-olive-green-blue order.
6.3 Communication- and Topology-Aware Mapping
Strategies
Bokhari [14] is an algorithm which originates from graph theory and was ini-
tially proposed for the assignment of parallel applications solving structural
problems on the finite element machine (FEM). This algorithm takes an initial
14
mapping and proceeds in two steps. In the first step, it searches for the pairwise
task interchanges that maximize the cardinality (matching of edges of the ap-
plication graph to edges in the machine topology) to generate a new mapping.
In the second step, it checks if the new mapping has a higher cardinality than
the previous one. If this is the case, it stores the best mapping, generates a new
mapping through random task swaps, and returns to first step. If not, it stops
and returns the best mapping found previously.
The bipartition algorithm implemented in this work was proposed by Wu,
Xiong and Lan [46] as a way to improve inter-node mapping on torus and
mesh topologies through a recursive bipartitioning algorithm. Mapping is done
by recursively dividing the communication graph using the multilevel k-way
partitioning algorithm of Karypis and Kumar (k = 2) [29], while the machine
topology is simply recursively split in the middle of its largest dimension.
PaCMap [43] is a graph-based mapping algorithm that simultaneously con-
ducts job allocation and process mapping to reduce communication overhead.
In this work we implemented the process mapping step of the algorithm. PaCMap
starts by partitioning the application communication matrix into process groups
(PGs) of highly-communicating processes. After that, the algorithm selects a
center PG (in our case a single process) and maps it to a center node in the
cluster topology. Then, it expands the allocation by picking a node and map-
ping a process to it based on the network topology and on the communication
graph until all tasks are mapped. Therefore, mapping highly-communicating
PGs close to each other.
The topo-aware algorithm was proposed by Agarwal et al. [8]. The main
idea is to divide the mapping challenge in two separated phases. First, a parti-
tioning phase groups heavily communicating processes in the same task. Next,
the mapping phase maps tasks onto the processors such that more heavily com-
municating tasks are placed on nearby processors. The algorithm uses an es-
timation function that calculates the cost of placing an unallocated task on
an available processor in each cycle. Their main focus is to reduce the overall
number of hops aiming at highly communicating processes.
The Greedy algorithm assessed herein was proposed by Hoefler et al. [26] and
supports heterogeneous networks by considering edge weights. Greedy starts at
some vertex of the graph mapping the most communicative process to it. Next,
the algorithm maps recursively the following heaviest communicating processes
in the neighboring vertices until all processes are mapped.
The fast and high quality greedy (FHgreedy) that we implemented was pro-
posed by Deveci et al. [18]. The goal is also to reduce the number of hops between
most communicating processes. The algorithm starts by randomly mapping the
most communicating process. Next, the neighbors nodes are mapped regarding
the amount of communication that they have with the previous mapped process
until all are mapped.
The greedyALLC [23] is rather similar to FHgreedy and was proposed by
Glantz et al. [23]. The algorithm starts by mapping the most communicating
process to the most connected node in certain topology. Then, the following
processes are mapped considering the previously mapped processes trying to
15
place highly communicating pairs close to each other. greedyALLC focuses on
improving dilation and congestion.
7 Performance Evaluation
The performance evaluation phase of the proposed workflow (green in Figure 1)
comprises three steps. The pre-simulation performance analysis step evaluates
the mappings’ quality using metrics that can be derived without requiring an
execution or simulation. The post-simulation performance analysis step evalu-
ates the performance of the mappings using the conducted simulations. And
finally, the pre- and post-simulation performance comparison shows the impact
of mapping and allows the assertion of constant properties after the simulation.
In table 5, we present all factors, which we considered for the organization of
the experiments. All legends of the following Figures 4, 5, and 6 list the process
mapping algorithms in chronological order of their appearance, starting from
the oldest and ending with the newest proposed algorithm.
Table 5: Design of Factorial Experiments
Parameter Count Values
Application 4
NAS CG, NAS BT-MZ, CORAL-2 AMG,
and CORAL LULESH
Mapping
Communication- &
topology-oblivious
5 Peano, Hilbert, Gray, sweep, scan
algorithms
Communication- &
topology-aware
7
Bokhari, topo-aware, greedy, FHGreedy,
greedyALLC, bipartition, PaCMap
Mapping input 2 commMatrix count, commMatrix size
Topology 3 3-D mesh, 3-D torus, HAEC Box
Network model 1 NCDr
Total number of experiments 288
7.1 Pre-simulation Mapping Performance Analysis
The pre-simulation analysis step of the proposed workflow (Figure 1) suggests
assessing the quality of the mappings considering metrics that do not require an
experimental or simulation process. This step requires four pieces of informa-
tion: the application communication behavior denoted by both communication
matrices types commMatrix count and commMatrix size, the generated process
mappings from the algorithms, the target network topology, and the used rout-
ing algorithm. With this information, we could consider numerous metrics in
this step, such as dilation4, average and the total number of hops, and volume
of data traveled trough the network links [47].
In this work, we selected dilation as our pre-simulation metric to evaluate
the mappings. Dilation as a performance and quality metric for the evaluation
4Dilation is often also referred to as ‘hop-Bytes’.
16
of process mapping is a tendency in the literature [47] [8] [27] [15]. It is easily
deliverable and tends to correlate well with the applications’ performance [12].
The dilation D can be calculated using Equation 1, for the list of processes P ,
the mapping function δ, and the weight function w, which usually denotes the
number of bytes communicated. Therefore, a lower dilation value indicates a
higher performance and thus lower energy consumption.
D =
∑
i∈P
∑
j∈P
d(δ(i), δ(j)) · w(i, j) (1)
Figure 4 presents the dilation caused by each mapping for every application
and topology. The X axis presents the topologies while the Y axis presents
the dilation. The colors identify each mapping. The green line highlights the
dilation achieved by sweep, which we consider as the default mapping. To
compare the efficancy of the mappings resulting from different input types of
communication matrices, we plot as circles the mappings that were calculated
using commMatrix count , while the triangles indicate the mappings considering
commMatrix size.
The total amount of dilation vary significantly between different mappings
and topologies. One can observe that numerous mappings achieved better per-
formance than sweep (default) for CG and BT-MZ. CG achieved lowest dilation
with Peano, topo-aware, and pacmap for 3-D mesh topology, Peano and PaCMap
for 3-D torus, and Peano for HAEC. BT-MZ achieved lowest dilation with
topo-aware for all topologies with the exception of 3-D torus, for which the
resulting mapping from PaCMap with commMatrix size produced less dilation.
Therefore, CG and BT-MZ present potential for performance improvement with
the aforementioned mappings. In its turn, both LULESH and AMG achieved
lowest dilation with sweep (default) which may indicate that those applications
will not benefit from any of the other mappings considering these topologies.
Although commMatrix count and commMatrix size produced similar mappings,
the ones generated considering commMatrix size presented slightly lower dila-
tion in most cases. This indicate that commMatrix size is more appropriated
for the mappings calculation. Finally, in a direct comparison between the same
mapping algorithm across the different topologies, HAEC always present the low-
est dilation since its larger amount of connections create shorter paths which
allow the messages to travel less hops.
17
3-D Mesh 3-D Torus HAEC
Topology
0.0
0.5
1.0
1.5
2.0
2.5
W
ei
gh
te
d 
av
er
ag
e 
di
la
tio
n 
[B
]
1e11
Peano
Hilbert
gray
sweep
scan
Bokhari
topo_aware
greedy
FHGreedy
greedyALLC
bipartition
pacmap
(a) CG
3-D Mesh 3-D Torus HAEC
Topology
0.0
0.5
1.0
1.5
2.0
2.5
W
ei
gh
te
d 
av
er
ag
e 
di
la
tio
n 
[B
]
1e10
Peano
Hilbert
gray
sweep
scan
Bokhari
topo_aware
greedy
FHGreedy
greedyALLC
bipartition
pacmap
(b) BT-MZ
3-D Mesh 3-D Torus HAEC
Topology
0
1
2
3
4
5
6
7
8
W
ei
gh
te
d 
av
er
ag
e 
di
la
tio
n 
[B
]
1e10
Peano
Hilbert
gray
sweep
scan
Bokhari
topo_aware
greedy
FHGreedy
greedyALLC
bipartition
pacmap
(c) LULESH
3-D Mesh 3-D Torus HAEC
Topology
0.0
0.5
1.0
1.5
2.0
2.5
W
ei
gh
te
d 
av
er
ag
e 
di
la
tio
n 
[B
]
1e10
Peano
Hilbert
gray
sweep
scan
Bokhari
topo_aware
greedy
FHGreedy
greedyALLC
bipartition
pacmap
(d) AMG
Figure 4: Total dilation for all mappings, applications, and topologies. The X
axis presents the topologies while the Y axis presents the total amount of dila-
tion. The colors identify each mapping. The green line highlights the dilation
achieved by sweep, which we consider as the default mapping. The circles iden-
tify mappings that were calculated using commMatrix count , while the trian-
gles indicate mappings calculated considering commMatrix size. LULESH and
AMG do not present improvements in relation to sweep. For CG and BT-MZ
several mappings present lower dilation which indicates potential performance
improvements.
7.2 Application Performance Prediction Using Simulation
A vital part of the proposed workflow is the evaluation of the impact of the
process mappings on the performance using simulation. In this work, we will
use the HAEC Simulator [13] to simulate a set of selected well-known HPC
benchmarks. The HAEC Simulator uses the recorded traces of the applications
to model their behavior during the simulation process. With the hardware
and software models for the target machine, the simulator generates new traces
representing the applications’ behavior when executed on the target system.
The simulator works deterministically, i.e., different simulations with the same
input will generate the same output.
While the duration of computations are fixed, we use a contention-oblivious
implementation of the NCDr network model [38] to model the point-to-point
communications. Pfennig et al. [37] verified this implementation. While the
18
HAEC Simulator uses sophisticated modeling of point-to-point communications,
the model for collective communications adds a fixed minimum delay and tem-
porally synchronizes all involved processes.
To predict the performance impact of the different mapping algorithms, we
simulate the execution for each combination of mapping, topology, and appli-
cation. After the simulation, we analyses the resulting application traces.
7.3 Post-simulation Performance Evaluation
Given the simulation results (Figure 5), we calculate several performance met-
rics, e.g., the point-to-point communication time, the MPI point-to-point costs,
and the parallel costs, as well as communication metrics, e. g., volume, dilation,
and distance.
3-D Mesh 3-D Torus HAEC
Topology
0
25
50
75
100
125
150
175
200
M
PI
 c
os
ts
 v
s. 
pa
ra
lle
l c
os
ts
 [s
]
Peano
Hilbert
gray
sweep
scan
Bokhari
topo_aware
greedy
FHGreedy
greedyALLC
bipartition
pacmap
(a) CG
3-D Mesh 3-D Torus HAEC
Topology
0
200
400
600
800
1000
1200
M
PI
 c
os
ts
 v
s. 
pa
ra
lle
l c
os
ts
 [s
]
Peano
Hilbert
gray
sweep
scan
Bokhari
topo_aware
greedy
FHGreedy
greedyALLC
bipartition
pacmap
(b) BT-MZ
3-D Mesh 3-D Torus HAEC
Topology
0
2000
4000
6000
8000
10000
12000
14000
16000
18000
M
PI
 c
os
ts
 v
s. 
pa
ra
lle
l c
os
ts
 [s
]
Peano
Hilbert
gray
sweep
scan
Bokhari
topo_aware
greedy
FHGreedy
greedyALLC
bipartition
pacmap
(c) LULESH
3-D Mesh 3-D Torus HAEC
Topology
0
200
400
600
800
1000
1200
M
PI
 c
os
ts
 v
s. 
pa
ra
lle
l c
os
ts
 [s
]
Peano
Hilbert
gray
sweep
scan
Bokhari
topo_aware
greedy
FHGreedy
greedyALLC
bipartition
pacmap
(d) AMG
Figure 5: Simulated AGG. MPI point to point (MPI P2P) execution time
versus simulated AGG. parallel EXEC. time for all mappings, applications, and
topologies. The X axis identifies the topologies while the Y axis presents both
AGG. MPI P2P time in lighter colors and AGG. parallel EXEC. time in darker
colors. The colors identify each mapping and metric being displayed. The green
lines highlight the performance achieved by sweep (default) for both metrics.
The circles identify mappings based on commMatrix count and the triangles in-
dicate the mappings calculated considering commMatrix size. LULESH, AMG,
and BT-MZ do not present improvements. CG was the only application sensible
to different mappings when looking at AGG. MPI P2P and AGG. EXEC. time
achieving the highest performance with Peano, topo aware, and pacmap.
19
The point-to-point communication time is the aggregated duration of all
point-to-point message transfers. The MPI point-to-point cost metric denotes
the accumulated time of all processes of a parallel application spent in MPI
point-to-point communication functions, such as MPI Send, MPI recv, and
MPI Wait.
Figure 5 shows the impact of process mapping on application performance.
For that matter, the figure presents the value of the parallel costs and the
part attributed to the MPI point-to-point costs. The portion of point-to-point
communication to the total execution time can be derived to compare with the
values in Table 1.
Except for the application CG, neither the topology nor the mapping algo-
rithm has an impact on the shown metrics. In the case of CG, the 3-D torus
topology is the best performing one. Also, the mappings generated by the
PacMap and Peano algorithm have the lowest MPI and parallel costs. These
results so far show that the mapping has barely any influence on the application
runtime.
3-D Mesh 3-D Torus HAEC
Topology
0
2
4
6
8
10
12
14
16
Ag
gr
eg
at
ed
 c
om
m
un
ica
tio
n 
tim
e 
[s
]
Peano
Hilbert
gray
sweep
scan
Bokhari
topo_aware
greedy
FHGreedy
greedyALLC
bipartition
pacmap
(a) CG
3-D Mesh 3-D Torus HAEC
Topology
0.00
0.25
0.50
0.75
1.00
1.25
1.50
1.75
2.00
Ag
gr
eg
at
ed
 c
om
m
un
ica
tio
n 
tim
e 
[s
]
Peano
Hilbert
gray
sweep
scan
Bokhari
topo_aware
greedy
FHGreedy
greedyALLC
bipartition
pacmap
(b) BT-MZ
3-D Mesh 3-D Torus HAEC
Topology
0
1
2
3
4
5
6
Ag
gr
eg
at
ed
 c
om
m
un
ica
tio
n 
tim
e 
[s
]
Peano
Hilbert
gray
sweep
scan
Bokhari
topo_aware
greedy
FHGreedy
greedyALLC
bipartition
pacmap
(c) LULESH
3-D Mesh 3-D Torus HAEC
Topology
0.0
0.5
1.0
1.5
2.0
2.5
3.0
Ag
gr
eg
at
ed
 c
om
m
un
ica
tio
n 
tim
e 
[s
]
Peano
Hilbert
gray
sweep
scan
Bokhari
topo_aware
greedy
FHGreedy
greedyALLC
bipartition
pacmap
(d) AMG
Figure 6: Simulated aggregated communication time for all mappings, appli-
cations, and topologies. The X axis presents the topologies while the Y axis
presents the simulated aggregated communication time. The different colors
identify each mapping technique. The circles identify mappings calculated using
commMatrix count , while the triangles indicate mappings calculated considering
commMatrix size. The green line highlights the results for sweep (default).
However, a different conclusion becomes visible when we look at the point-
to-point communication time in Figure 6. This figure shows the costs of the
20
communication from the network perspective and reveals significant differences
in the communication time.
Again, for the HAEC Box topology, we observe a different behavior. On this
topology, for all applications, sweep causes the smallest communication time.
This fact highlights the need for process mapping algorithms tailored towards
heterogeneous network topologies.
While the application runtime may not be influenced much by the used map-
ping, the actual communication time still is. Given that we simulated the per-
formance with a contention-oblivious communication model, we expect a much
higher negative impact on performance with mappings that expose a higher
communication time. With these findings, we postulate that mapping matters.
7.4 Comparison of Pre- and Post-Simulation Performance
Predictions
Using our a priori knowledge, we can infer certain truths about the predicted
performance of the applications, which we then use to verify the simulation
results. Given the definition of the dilation, the pre-simulation value cannot
change through simulation. As part of the automated simulation process, we
calculate this metric before and after every simulation and assert that it remains
the same.
As described in Section 4, we use commMatrix count and commMatrix size
as input to all mapping algorithms resulting in two mappings for each algorithm.
However, for communication-oblivious process mapping algorithms, we generate
the same mapping twice. Hence, the simulated performance should be equal for
these twins of mappings. This verification test also passes, as the Figure 5
shows.
These observations give us confidence in our results; however, we also need
to address other points.
In the pre-simulation performance evaluation, we show the dilation for the
different combinations of topology, mapping, and application. Given the sig-
nificant differences in the values, this metric hints at great performance opti-
mization opportunities. However, except for the CG application, we can barely
see any differences in the simulation parallel costs. This behavior stems from
several intricacies.
As shown in Table 1, CG is the only application that significantly uses
blocking MPI point-to-point operations. In combination with the fast future-
generation network links, the other applications successfully hide their com-
munication costs using non-blocking MPI point-to-point operations. Addition-
ally, in the case of CG, the communication volume is an order of magnitude
higher, compared to the other applications. And finally, as mentioned before,
the communication model is contention-oblivious, which means that concurrent
communications do not impede each other.
When we compare the pre-simulation dilation with the point-to-point com-
munication time, we can gather new insights. The differences in the communica-
tion time displayed in Figure 6, match with the prediction for the performance
21
derived from the dilation in the case of the mesh and torus topology. For the
HAEC Box topology, there seems to be no such correlation. Apparently, the cal-
culation of the dilation misses a weight to represent the different types of hops
in a heterogeneous network.
In summary, we can confirm that the dilation can be used as a prediction for
the communication time on homogeneous topologies. However, we emphasize
the need for a novel metric in the case of heterogeneous applications.
8 Conclusion and Future Work
Given the decisive results, it is clear that process mapping is an important topic
that should become part of every application execution. In a real setting, our
work can be practiced by first binding the processing elements to their dedicated
hardware processing units and reordering the MPI ranks of the application
according to the relevant mapping algorithm.
We proposed and exercised a generic workflow to render mapping as an ex-
plicit optimization step for communication-intensive applications. We predicted
the communication performance for four applications, three topologies, and
twelve mapping algorithms that we implemented as a Python library. Using a
trace-based simulator and pre-simulation metrics, we showed that communication-
aware mappings frequently outperform communication-oblivious ones.
We observed that the dilation does not correlate well with the simulated
application runtime. However, when looking exclusively at the aggregated com-
munication times, the performance changes significantly between the mappings.
This observation means that even if the application runtime does not change,
good mapping reduces the network load. Thus, mapping matters.
In the future, we want to improve the proposed workflow, the set of mapping
algorithms, and the used simulator. We will expand our mapping library with
other algorithms, such as those from Bhatele et al. [12] and Wu et al. [45]. Fur-
thermore, we envision the integration of the HAEC-Simulator with TopGen [44]
to simulate additional interconnection networks.
Inter-application interference and the impact of process mapping on other
metrics (such as congestion) are also part of future work. Finally, a scalability
study considering topologies with higher dimensions and a greater number of
processes is also an important aspect to be considered in future work.
Acknowledgments
This work is in part supported by the Swiss National Science Foundation in the
context of the “Multi-level Scheduling in Large Scale High Performance Com-
puters” (MLS) grant, number 169123, and by the German Research Foundation
(DFG) within the CRC912 - HAEC.
22
References
[1] Application 352.nab from the SPEC OpenMP 2012 Benchmarks. https://
www.nas.nasa.gov/publications/npb_problem_sizes.html. Accessed:
May 10, 2020.
[2] Application AMG from CORAL2 Benchmarks. https://asc.llnl.gov/
coral-2-benchmarks/downloads/AMG_Summary_v1_7.pdf. Accessed:
May 09, 2020.
[3] Application LULESH from CORAL benchmarks. https://computing.
llnl.gov/projects/co-design/lulesh. Accessed: May 09, 2020.
[4] Intel Broadwell E5-2640 v4. https://ark.
intel.com/content/\www/us/en/ark/products/92984/
intel-xeon-processor-e5-2640-v4-25m-cache-2-40-ghz.html.
Accessed: April 18, 2020.
[5] The CORAL Benchmarks Suite. https://asc.llnl.gov/
CORAL-benchmarks/. Accessed: May 09, 2020.
[6] The CORAL2 Benchmarks Suite. https://asc.llnl.gov/
coral-2-benchmarks/. Accessed: May 09, 2020.
[7] The NAS Parallel Benchmarks Suite. https://www.nas.nasa.gov/
publications/npb.html. Accessed: May 09, 2020.
[8] Agarwal, T., Sharma, A., Laxmikant, A., and Kale, L. V.
Topology-aware task mapping for reducing communication contention on
large parallel machines. In Proceedings 20th IEEE International Parallel
Distributed Processing Symposium (April 2006).
[9] Ait Salaht, F., Desprez, F., and Lebre, A. An overview of service
placement problem in fog and edge computing. ACM Comput. Surv. 0, ja.
[10] Bader, M. Space-Filling Curves: An Introduction with Applications in
Scientific Computing. Springer Heidelberg New York Dordrecht London,
2013.
[11] Bailey, D. H., Barszcz, E., Barton, J. T., Browning, D. S.,
Carter, R. L., Dagum, L., Fatoohi, R. A., Frederickson, P. O.,
Lasinski, T. A., Schreiber, R. S., et al. The nas parallel benchmarks.
The International Journal of Supercomputing Applications 5, 3 (1991), 63–
73.
[12] Bhatele´, A., Gupta, G. R., Kale´, L. V., and Chung, I. Automated
mapping of regular communication graphs on mesh interconnects. In 2010
International Conference on High Performance Computing (2010), pp. 1–
10.
23
[13] Bielert, M., Ciorba, F. M., Feldhoff, K., Ilsche, T., and Nagel,
W. E. HAEC-SIM: A simulation framework for highly adaptive energy-
efficient computing platforms. In Eighth EIA International Conference on
Simulation Tools and Techniques (Aug. 2015), G. Theodoropoulos, G. S. H.
Tan, and C. Szabo, Eds., ACM.
[14] Bokhari, S. H. On the mapping problem. IEEE Transactions on Com-
puters, 3 (1981), 207–214.
[15] Bordage, C., and Jeannot, E. Process affinity, metrics and impact on
performance: An empirical study. In Proceedings of the 18th IEEE/ACM
International Symposium on Cluster, Cloud and Grid Computing (2018),
CCGrid ’18, IEEE Press, pp. 523–532.
[16] Chen, H., Chen, W., Huang, J., Robert, B., and Kuhn, H. Mpipp:
An automatic profile-guided parallel process placement toolset for smp clus-
ters and multiclusters. In Proceedings of the 20th Annual International
Conference on Supercomputing (New York, NY, USA, 2006), ICS ’06, As-
sociation for Computing Machinery, pp. 353–360.
[17] Cruz, E. H. M., Diener, M., Pilla, L. L., and Navaux, P. O. A.
Eagermap: A task mapping algorithm to improve communication and load
balancing in clusters of multicore systems. ACM Trans. Parallel Comput.
5, 4 (Mar. 2019).
[18] Deveci, M., Kaya, K., Uc¸ar, B., and C¸atalyu¨rek, U¨. V. Fast and
high quality topology-aware task mapping. In 2015 IEEE International
Parallel and Distributed Processing Symposium (2015), IEEE, pp. 197–206.
[19] Diener, M., Cruz, E. H., Alves, M. A., Alhakeem, M. S., Navaux,
P. O., and Heiß, H.-U. Locality and balance for communication-aware
thread mapping in multicore systems. In European Conference on Parallel
Processing (2015), Springer, pp. 196–208.
[20] Diener, M., Cruz, E. H., Pilla, L. L., Dupros, F., and Navaux,
P. O. Characterizing communication and page usage of parallel applica-
tions for thread and data mapping. Performance Evaluation 88 (2015),
18–36.
[21] European Technology Platform for High Performance Com-
puting (ETP4HPC). SRA4: Strategic Research Agenda for High-
Performance Computing in Europe.
[22] Fettweis, G., Nagel, W., and Lehner, W. Pathways to servers of the
future. In 2012 Design, Automation Test in Europe Conference Exhibition
(DATE) (March 2012), pp. 1161–1166.
[23] Glantz, R., Meyerhenke, H., and Noe, A. Algorithms for mapping
parallel processes onto grid and torus architectures. In 2015 23rd Euromi-
cro International Conference on Parallel, Distributed, and Network-Based
Processing (March 2015), pp. 236–243.
24
[24] Hilbert, D. V. U¨ber die stetige Abbildung einer Linie auf ein
Fla¨chenstu¨ck. Mathematische Annalen (1891), 459–460.
[25] Hoefler, T., Jeannot, E., and Mercier, G. An Overview of Process
Mapping Techniques and Algorithms in High-Performance Computing. In
High Performance Computing on Complex Environments, E. Jeannot and
J. Zilinskas, Eds. Wiley, June 2014, pp. 75–94.
[26] Hoefler, T., and Snir, M. Generic topology mapping strategies for
large-scale parallel architectures. In Proceedings of the International Con-
ference on Supercomputing (New York, NY, USA, 2011), ICS ’11, ACM,
pp. 75–84.
[27] Jain, N., Bhatele, A., Robson, M. P., Gamblin, T., and Kale,
L. V. Predicting application performance using supervised learning on
communication features. In Proceedings of the International Conference
on High Performance Computing, Networking, Storage and Analysis (New
York, NY, USA, 2013), SC ’13, Association for Computing Machinery.
[28] Karlin, I., Bhatele, A., Keasler, J., Chamberlain, B. L., Co-
hen, J., DeVito, Z., Haque, R., Laney, D., Luke, E., Wang, F.,
Richards, D., Schulz, M., and Still, C. Exploring traditional and
emerging parallel programming models using a proxy application. In 27th
IEEE International Parallel & Distributed Processing Symposium (IEEE
IPDPS 2013) (Boston, USA, may 2013).
[29] Karypis, G., and Kumar, V. Multilevel k-way partitioning scheme for ir-
regular graphs. Journal of Parallel and Distributed computing 48, 1 (1998),
96–129.
[30] Kenny, J. P., Sargsyan, K., Knight, S., Michelogiannakis, G.,
and Wilke, J. J. The pitfalls of provisioning exascale networks: A trace
replay analysis for understanding communication performance. In High
Performance Computing (Cham, 2018), R. Yokota, M. Weiland, D. Keyes,
and C. Trinitis, Eds., Springer International Publishing, pp. 269–288.
[31] Knu¨pfer, A., Ro¨ssel, C., Mey, D. a., Biersdorff, S., Diethelm,
K., Eschweiler, D., Geimer, M., Gerndt, M., Lorenz, D., Mal-
ony, A., Nagel, W. E., Oleynik, Y., Philippen, P., Saviankou,
P., Schmidl, D., Shende, S., Tschu¨ter, R., Wagner, M., Wesarg,
B., and Wolf, F. Score-P: A joint performance measurement run-time
infrastructure for Periscope, Scalasca, TAU, and Vampir. In Tools for High
Performance Computing 2011 (Berlin, Heidelberg, 2012), H. Brunst, M. S.
Mu¨ller, W. E. Nagel, and M. M. Resch, Eds., Springer Berlin Heidelberg,
pp. 79–91.
[32] Laguna, I., Marshall, R., Mohror, K., Ruefenacht, M., Skjel-
lum, A., and Sultana, N. A large-scale study of mpi usage in open-
source hpc applications. In Proceedings of the International Conference
25
for High Performance Computing, Networking, Storage and Analysis (New
York, NY, USA, 2019), SC ’19, Association for Computing Machinery.
[33] Mahmud, R., Srirama, S. N., Ramamohanarao, K., and Buyya,
R. Profit-aware application placement for integrated fog–cloud computing
environments. Journal of Parallel and Distributed Computing 135 (2020),
177 – 190.
[34] Mercier, G., and Clet-Ortega, J. Towards an efficient process
placement policy for mpi applications in multicore environments. In Re-
cent Advances in Parallel Virtual Machine and Message Passing Interface
(Berlin, Heidelberg, 2009), M. Ropo, J. Westerholm, and J. Dongarra, Eds.,
Springer Berlin Heidelberg, pp. 104–115.
[35] Mokbel, M. F., Aref, W. G., and Kamel, I. Analysis of multi-
dimensional space-filling curves. GeoInformatica 7, 3 (2003), 179–209.
[36] Peano, G. Sur une courbe, qui remplit toute une aire plane. Mathema-
tische Annalen (1890), 157–160.
[37] Pfennig, S., Feldhoff, K., Ciorba, F. M., Bielert, M., Franz, E.,
Ilsche, T., Reiher, T., and Nagel, W. E. Simulation models veri-
fication for resilient communication on a highly adaptive energy-efficient
computer. In Proceedings of the 24th High Performance Computing Sym-
posium (HPC 2016), part of the 2016 Spring Simulation Multi-Conference,
SpringSim’16, CA, USA, April 3-6, 2016 (San Diego, CA, USA, Apr.
2016), HPC ’16, Society for Computer Simulation International, pp. 6:1–
6:8. April 03 - 06, 2016.
[38] Pfennig, S., and Franz, E. Adjustable redundancy for secure network
coding in a unicast scenario. In 2014 International Symposium on Network
Coding (NetCod) (2014), pp. 1–6.
[39] Rico-Gallego, J. A., D´ıaz-Mart´ın, J. C., Manumachu, R. R., and
Lastovetsky, A. L. A survey of communication performance models for
high-performance computing. ACM Comput. Surv. 51, 6 (Jan. 2019).
[40] Rodrigues, E. R., Madruga, F. L., Navaux, P. O. A., and Panetta,
J. Multi-core aware process mapping and its impact on communication
overhead of parallel applications. In 2009 IEEE Symposium on Computers
and Communications (2009), pp. 811–817.
[41] Sensi, D. D., Girolamo, S. D., and Hoefler, T. Mitigating net-
work noise on dragonfly networks through application-aware routing. In
Proceedings of the International Conference for High Performance Com-
puting, Networking, Storage and Analysis, SC 2019, Denver, Colorado,
USA, November 17-19, 2019 (2019), M. Taufer, P. Balaji, and A. J. Pen˜a,
Eds., ACM, pp. 16:1–16:32.
26
[42] Tsuji, M., Boku, T., and Sato, M. Scalable communication perfor-
mance prediction using auto-generated pseudo mpi event trace. In Pro-
ceedings of the International Conference on High Performance Computing
in Asia-Pacific Region (New York, NY, USA, 2019), HPC Asia 2019, As-
sociation for Computing Machinery, pp. 53–62.
[43] Tuncer, O., Leung, V. J., and Coskun, A. K. Pacmap: Topology
mapping of unstructured communication patterns onto non-contiguous al-
locations. In Proceedings of the 29th ACM on International Conference on
Supercomputing (2015), ACM, pp. 37–46.
[44] Villar, J. A., Mathey, G. M., Escudero-Sahuquillo, J., Garcia,
P. J., Alfaro, F. J., Sanchez, J. L., and Quiles, F. J. Topgen:
A library to provide simulation tools with the modeling of interconnection
network topologies. In 2018 International Conference on High Performance
Computing Simulation (HPCS) (2018), pp. 452–459.
[45] Wu, J., Xiong, X., Berrocal, E., Wang, J., and Lan, Z. Topology
mapping of irregular parallel applications on torus-connected supercomput-
ers. The Journal of Supercomputing 73, 4 (2017), 1691–1714.
[46] Wu, J., Xiong, X., and Lan, Z. Hierarchical task mapping for parallel
applications on supercomputers. The Journal of Supercomputing 71, 5 (May
2015), 1776–1802.
[47] Yu, H., Chung, I.-H., and Moreira, J. Topology mapping for blue
gene/l supercomputer. In Proceedings of the 2006 ACM/IEEE Conference
on Supercomputing (New York, NY, USA, 2006), SC ’06, Association for
Computing Machinery, pp. 116–es.
27
