Topology-induced Enhancement of Mappings by Glantz, Roland et al.
Topology-induced Enhancement of Mappings*
Roland Glantz
Karlsruhe Institute of Technology
Karlsruhe, Germany
rolandglantz@gmail.com
Maria Predari
University of Cologne
Cologne, Germany
mpredari@uni-koeln.de
Henning Meyerhenke
University of Cologne
Cologne, Germany
h.meyerhenke@uni-koeln.de
ABSTRACT
In this paper we propose a new method to enhance a mapping µ(·)
of a parallel application’s computational tasks to the processing
elements (PEs) of a parallel computer. The idea behind our method
TIMER is to enhance such a mapping by drawing on the observation
that many topologies take the form of a partial cube. This class of
graphs includes all rectangular and cubic meshes, any such torus
with even extensions in each dimension, all hypercubes, and all trees.
Following previous work, we represent the parallel application
and the parallel computer by graphs Ga = (Va ,Ea ) and Gp =
(Vp ,Ep ). Gp being a partial cube allows us to label its vertices,
the PEs, by bitvectors such that the cost of exchanging one unit of
information between two vertices up and vp of Gp amounts to the
Hamming distance between the labels of up and vp .
By transferring these bitvectors from Vp to Va via µ−1(·) and
extending them to be unique on Va , we can enhance µ(·) by swapping
labels of Va in a new way. Pairs of swapped labels are local w. r. t.
the PEs, but not w. r. t. Ga . Moreover, permutations of the bitvectors’
entries give rise to a plethora of hierarchies on the PEs. Through
these hierarchies we turn TIMER into a hierarchical method for
improving µ(·) that is complementary to state-of-the-art methods for
computing µ(·) in the first place.
In our experiments we use TIMER to enhance mappings of com-
plex networks onto rectangular meshes and tori with 256 and 512
nodes, as well as hypercubes with 256 nodes. It turns out that com-
mon quality measures of mappings derived from state-of-the-art
algorithms can be improved considerably.
KEYWORDS
multi-hierarchical mapping; parallel communication optimization;
partial cube; Hamming distance
1 INTRODUCTION
Large-scale matrix- or graph-based applications such as numerical
simulations [30] or massive network analytics [25] often run on
parallel systems with distributed memory. The iterative nature of the
underlying algorithms typically requires recurring communication
between the processing elements (PEs). To optimize the running
time of such an application, the computational load should be evenly
distributed onto the PEs, while at the same time, the communication
volume between them should be low. When mapping processes to
PEs on non-uniform memory access (NUMA) systems, it should
also be taken into account that the cost for communication operations
depends on the locations of the PEs involved. In particular, one wants
to map heavily communicating processes “close to each other”.
*This work is partially supported by German Research Foundation (DFG) grant ME
3619/2-1.
More formally, let the application be modeled by an application
graph Ga = (Va, Ea, ωa) that needs to be distributed over the PEs. A
vertex in Va represents a computational task of the application, while
an edge ea = {ua, va} indicates data exchanges between tasks ua
and va, with ωa(ea) specifying the amount of data to be exchanged.
The topology of the parallel system is modeled by a processor graph
Gp = (Vp, Ep, ωp), where the edge weight ωp({up, vp}) indicates the
cost of exchanging one data unit between the PEs up and vp [8, 11].
A balanced distribution of processes onto PEs thus corresponds to a
mapping µ : Va → Vp such that, for some small ε ≥ 0,
$$|\mu^{-1}(v_p)| \le (1 + \varepsilon) \cdot \lceil |V_a| / |\mu(V_a)| \rceil \tag{1}$$
for all vp ∈ µ(Va). Therefore, µ(·) induces a balanced partition of Ga
with blocks µ−1(vp), vp ∈ Vp. Conversely, one can find a mapping
µ : Va → Vp that fulfills Eq. (1) by first partitioning Va into |Vp|
balanced parts [2] and then specifying a one-to-one mapping from
the blocks of the partition onto the vertices of Gp . For an accurate
modeling of communication costs, one would also have to specify
the path(s) over which two vertices communicate in Gp . This is,
however, system-specific. For the purpose of generic algorithms, it
is thus often abstracted away by assuming routing on shortest paths
in Gp [12]. In this work, we make the same assumption.
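As a minimal illustration of the balance constraint in Eq. (1), the following Python sketch checks whether a given mapping is balanced. The function name `is_balanced` and the dictionary representation of µ(·) are assumptions made for this example only.

```python
from collections import Counter
from math import ceil


def is_balanced(mu, eps=0.03):
    """Check Eq. (1) for a mapping mu given as a dict task -> PE.

    Every PE that is actually used may host at most
    (1 + eps) * ceil(|Va| / |mu(Va)|) tasks.
    """
    block_sizes = Counter(mu.values())              # |mu^{-1}(vp)| per used PE
    bound = (1 + eps) * ceil(len(mu) / len(block_sizes))
    return all(size <= bound for size in block_sizes.values())
```

For instance, four tasks spread two-and-two over two PEs pass the check for any ε ≥ 0, whereas a three-and-one split fails for ε = 0.03.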
In order to steer an optimization process for obtaining good map-
pings, different objective functions have been proposed [8, 12, 17].
One of the most commonly used [24] objective functions is Coco(·)
(also referred to as hop-byte in [33]), cf. Eq. (3):
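$$\mu^* := \operatorname*{argmin}_{\substack{\mu : V_a \to V_p \\ \mu \text{ balanced}}} \mathrm{Coco}(\mu), \quad \text{with} \tag{2}$$
$$\mathrm{Coco}(\mu) := \sum_{\substack{e_a \in E_a \\ e_a = \{u_a, v_a\}}} \omega_a(e_a)\, d_{G_p}(\mu(u_a), \mu(v_a)), \tag{3}$$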
where dGp (µ(ua ), µ(va )) denotes the distance between µ(ua ) and
µ(va ) in Gp , i. e. the number of edges on shortest paths. Broadly
speaking, Coco(·) is minimised when pairs of highly communicating
processes are placed in nearby processors. Finding µ∗(·) is NP-
hard. Indeed, finding µ∗(·) for a complete graph Gp amounts to
graph partitioning, which is an NP-hard problem [10].
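To make Eq. (3) concrete, the following Python sketch evaluates Coco(µ) under the shortest-path routing assumption, using breadth-first search on an unweighted processor graph. The names `bfs_distances` and `coco` and the input formats (edge triples, adjacency dictionary) are assumptions of this example, not part of the method itself.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source in an unweighted graph (adjacency dict)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def coco(app_edges, mu, proc_adj):
    """Evaluate Eq. (3).

    app_edges: iterable of (ua, va, weight) triples of Ga
    mu:        dict task -> PE
    proc_adj:  dict PE -> list of neighbouring PEs in Gp
    """
    dist_from = {}                      # cache of BFS results per source PE
    total = 0
    for ua, va, weight in app_edges:
        src = mu[ua]
        if src not in dist_from:
            dist_from[src] = bfs_distances(proc_adj, src)
        total += weight * dist_from[src][mu[va]]
    return total
```

Edges whose end vertices are mapped to the same PE contribute zero, so the sum effectively runs over inter-PE communication only.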
Related work and motivation. One way of looking at previous
algorithmic work from a high-level perspective (more details can be
found in the overview articles by Pellegrini [24], Buluç et al. [4],
and Aubanel [1]) is to group it into two categories. One line has tried
to couple mapping with graph partitioning. To this end, the objective
function for partitioning, usually the edge cut, is replaced by an
objective function like Coco(·) that considers distances dGp (·) [32].
To avoid recomputing these values, Walshaw and Cross [32] store
them in a network cost matrix (NCM). When our proposed method
TIMER (Topology-induced Mapping Enhancer) is used to enhance
a mapping, it does so without an NCM, thus avoiding its quadratic
space complexity.
[Figure 1: panels (a) Ga, (b) Gc, (c) Gp]
Figure 1: (a) Application graph partitioned into four blocks with
unitary edge weights. (b) Resulting communication graph; numbers
indicate edge weights. (c) The processor graph is a partial cube, and the
bijection ν(·) between the vertices of Gc and those of Gp is indicated by
the colors. Communication between the red and the green vertex has to
be routed via an intermediary node (the blue one or the black one).
arXiv:1804.07131v1 [cs.DC] 19 Apr 2018
The second line of research decouples partitioning and mapping.
First, Ga is partitioned into |Gp | blocks without taking Gp into ac-
count. Typically, this step involves multilevel approaches whose
hierarchies on Ga are built with edge contraction [15, 20, 27] or
weighted aggregation [6, 19]. Contraction of the blocks of Ga into
single vertices yields the communication graph Gc = (Vc ,Ec ,ωc ),
where ωc (ec ), ec ∈ Ec , aggregates the weight of edges in Ga with
end vertices in different blocks. For an example see Figures 1a,1b.
Finally, one computes a bijection ν : Vc → Vp that minimizes
Coco(·) or a related function, using (for example) greedy meth-
ods [3, 8, 11, 12] or metaheuristics [3, 31], see Figure 1c. When
TIMER is used to enhance such a mapping, it modifies not only ν (·),
but also affects the partition of Va (and thus possibly Gc ). Hence,
deficiencies due to the decoupling of partitioning and mapping can
be compensated. Note that, since TIMER is proposed as an improve-
ment on mappings derived from state-of-the-art methods, we assume
that an initial mapping is provided. This is no limitation: a mapping
can be easily computed from the solution of a graph partitioner using
the identity mapping from Gc to Gp .
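The contraction of blocks into the communication graph Gc can be sketched in a few lines of Python. The function name `communication_graph` and the representation of the partition as a dictionary `block_of` are assumptions made for illustration.

```python
from collections import defaultdict


def communication_graph(app_edges, block_of):
    """Aggregate the weights of Ga's edges that run between different blocks.

    app_edges: iterable of (ua, va, weight) triples of Ga
    block_of:  dict task -> block id
    Returns a dict mapping an unordered block pair to its weight in Gc.
    """
    weights = defaultdict(int)
    for ua, va, weight in app_edges:
        bu, bv = block_of[ua], block_of[va]
        if bu != bv:                        # intra-block edges vanish in Gc
            weights[frozenset((bu, bv))] += weight
    return dict(weights)
```

With the identity mapping from Gc to Gp mentioned above, the block ids can directly serve as PE ids.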
Being man-made and cost-effective, the topologies of parallel
computers exhibit special properties that can be used to find good
mappings. A widely used property is that PEs are often “hierarchi-
cally organized into, e. g., islands, racks, nodes, processors, cores
with corresponding communication links of similar quality” [29].
Thus, dual recursive bisection (DRB) [22], the recent embedded sec-
tioning [18], and what can be called recursive multisection [5, 14, 29]
suggest themselves for mapping: DRB and embedded sectioning cut
both Ga (or Gc ) and Gp into two blocks recursively and assign the
respective blocks to each other in a top-down manner. Recursive
multisection as performed by [29] models the hierarchy inherent
in Gp as a tree. The tree’s fan-out then determines into how many
blocks Gc needs to be partitioned. Again, TIMER is complementary
in that we do not need an actual hierarchy of the topology.
Overview of our method. To the best of our knowledge, TIMER is
the first method to exploit the fact that many parallel topologies are
partial cubes. A partial cube is an isometric subgraph of a hypercube,
see Section 2 for a formal definition. This graph class includes all
rectangular and cubic meshes, any such torus with even extensions
in each dimension, all hypercubes, and all trees.
Gp being a partial cube allows us to label its vertices with bitvectors
such that the distance between two vertices in Gp amounts to the
Hamming distance (number of different bits) between the vertices’
labels (see Sections 2 and 3). This will allow us to compute the con-
tribution of an edge ea ∈ Ea to Coco(·) quickly. The labels are then
extended to labels for vertices in Ga (Section 4). More specifically,
a label of a vertex va ∈ Va consists of a left and a right part, where
the left part encodes the mapping µ(·) : Va → Vp and the right part
makes the labels unique. Note that it can be instructive to view the
vertex labels of Va , la (·), as a recursive bipartitioning of Va : the i-th
leftmost bit of la (va ) defines whether va belongs to one or the other
side of the corresponding cut in the i-th recursion level. Vertices
ua ,va ∈ Va connected by an edge with high weight should then be
assigned labels that share many of the bits in the left part.
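The following Python sketch illustrates why the labels make the contribution of a single edge to Coco(·) cheap to compute: with labels stored as integers, it reduces to an XOR followed by a bit count on the left part. The names `hamming` and `edge_cost` and the parameter `dim_ext` (the number of bits in the right part) are assumptions of this example.

```python
def hamming(x, y):
    """Number of differing bits between two non-negative integers."""
    return bin(x ^ y).count("1")


def edge_cost(label_u, label_v, weight, dim_ext):
    """Contribution of one edge of Ga to Coco.

    Labels are integers whose dim_ext least significant bits form the
    right part; only the left part (the PE label) determines the distance.
    """
    return weight * hamming(label_u >> dim_ext, label_v >> dim_ext)
```

Two vertices whose labels share the entire left part are mapped to the same PE, so their edge costs nothing.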
We extend the objective function Coco(·) by a factor that accounts
for the labels’ right part and thus for the uniqueness of the label. We
do so without limiting the original objective that improves the labels’
left part (and thus µ(·)) (Section 5). Finally, we use the labeling
of Va to devise a multi-hierarchical search method in which label
swaps improve µ(·). One can view this as a solution method for
finding a Coco(·)-optimal numbering of the vertices in Va , where the
numbering is determined by the labels.
Compared to simple hill-climbing methods, we increase the local
search space by employing very diverse hierarchies. These hier-
archies result from random permutations of the label entries; they
are imposed on the vertices of Gp and built by label contractions
performed digit by digit. This approach provides a fine-to-coarse
view during the optimization process (Section 6) that is orthogonal
to typical multilevel graph partitioning methods.
Experimental results. As the underlying application for our ex-
periments, we assume parallel complex network analysis on systems
with distributed memory. Thus, in Section 7, we apply TIMER to
further improve mappings of complex networks onto partial cubes
with 256 and 512 nodes. The initial mappings are computed using
KAHIP [28], SCOTCH [23] and LIBTOPOMAP [12]. To evaluate
the quality of TIMER's results, we use Coco(·), for which we observe
a relative improvement from 6% up to 34%, depending on the choice
of the initial mapping. TIMER is on average 42% faster than
partitioning the graphs with KAHIP.
2 PARTIAL CUBES AND THEIR
HIERARCHIES
The graph-theoretical concept of a partial cube is central for this
paper. Its definition is based on the hypercube.
Definition 2.1 (Hypercube). A dimH-dimensional hypercube H is
the graph H(VH, EH) with VH := {0, 1}^dimH and EH := {{uH, vH} |
uH, vH ∈ VH and dh(uH, vH) = 1}, where dh(uH, vH) denotes the
Hamming distance (number of different bits) between uH and vH.
More generally, dH(uH, vH) = dh(uH, vH), i. e., the (unweighted)
shortest path length equals the Hamming distance in a hypercube.
[Figure 2: two tree diagrams over the labels 0000 to 1111, each with levels 0 to 4]
Figure 2: Two opposite hierarchies of the 4D hypercube. “x” means
“0” or “1”. Top: hierarchy Hπ with π = (1, 2, 3, 4). Bottom: hierarchy
Hπ with π = (4, 3, 2, 1).
Partial cubes are isometric subgraphs of hypercubes, i. e., the
distance between any two nodes in a partial cube is the same as their
distance in the hypercube. Put differently:
Definition 2.2 (Partial cube). A graph Gp with vertex set Vp
is a partial cube of dimension dimGp if (i) there exists a labeling
lp : Vp → {0, 1}^dimGp such that dGp(up, vp) = dh(lp(up), lp(vp))
for all up, vp ∈ Vp and (ii) dimGp is as small as possible.
The labeling lp : Vp → {0, 1}^dimGp gives rise to hierarchies on
Vp as follows. For any permutation π of {1, . . . , dimGp}, one can
group the vertices in Vp by means of the equivalence relations ∼π,i,
where dimGp ≥ i ≥ 1:
$$u_p \sim_{\pi,i} v_p \;:\Leftrightarrow\; \pi(l_p(u_p))[j] = \pi(l_p(v_p))[j] \tag{4}$$
for all 1 ≤ j ≤ i (where l[i] refers to the i-th character in string
l). As an example, ∼id,i, where id is the identity on {0, 1}^dimGp,
gives rise to the partition in which up and vp belong to the same
part if and only if their labels agree at the first i positions. More
generally, for each permutation π(·) from the set Π_dimGp of all
permutations on {1, . . . , dimGp}, the equivalence relations ∼π,i,
dimGp ≥ i ≥ 1, give rise to a hierarchy of increasingly coarse
partitions (P_dimGp, . . . , P_1). As an example, the hierarchies defined
by the permutations id and π(j) := dimGp + 1 − j, 1 ≤ j ≤ dimGp,
respectively, are opposite hierarchies, see Figure 2.
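The hierarchy induced by a permutation can be sketched as follows in Python. The function name `hierarchy`, the use of bit strings as labels, and the convention that the j-th character of the permuted label is the π(j)-th character of the original label are assumptions made for this illustration.

```python
from collections import defaultdict


def hierarchy(labels, perm):
    """Partitions P_dim, ..., P_1 induced by Eq. (4).

    labels: dict vertex -> bit string l_p
    perm:   tuple of 1-based positions, e.g. (1, 2, 3, 4) or (4, 3, 2, 1)
    Returns a dict level i -> sorted list of parts (sorted vertex lists).
    """
    levels = {}
    for i in range(len(perm), 0, -1):
        groups = defaultdict(list)
        for vertex, label in labels.items():
            permuted = "".join(label[p - 1] for p in perm)
            groups[permuted[:i]].append(vertex)   # agree on first i entries
        levels[i] = sorted(sorted(part) for part in groups.values())
    return levels
```

Under this convention the identity groups vertices by label prefixes, while the reversed permutation groups them by label suffixes, which corresponds to the two opposite hierarchies of Figure 2.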
3 VERTEX LABELS OF PROCESSOR GRAPH
In this section we provide a way to recognize if a graph Gp is a
partial cube, and if so, we describe how a labeling on Vp can be
obtained in O(|Ep|^2) time. To this end, we characterize partial cubes
in terms of (cut-sets of) convex cuts.
Definition 3.1 (Convex cut). Let Gp = (Vp, Ep) be a graph, and
let (V_p^0, V_p^1) be a cut of Gp. The cut is called convex if no shortest
path from a vertex in V_p^0 [V_p^1] to another vertex in V_p^0 [V_p^1] contains
a vertex of V_p^1 [V_p^0].
[Figure 3: panels (a) Processor graph Gp, (b) Distances in Gp, (c) Application graph Ga]
Figure 3: Gp is a partial cube with two convex cuts: the dotted vertical
cut and the dashed horizontal cut (3a). The first [second] digit of the vertex
labels lp(·) indicates the position w. r. t. the vertical [horizontal] cut. The distance
between up and vp in Gp equals the Hamming distance between lp(up)
and lp(vp) (3b). In 3c, a mapping µ(·) from Va to Vp is indicated by
the colors. Communication across solid [dashed] edges requires 1 hop
[2 hops] in Gp.
The cut-set of a cut (V_p^0, V_p^1) of Gp consists of all edges in Gp
with one end vertex in V_p^0 and the other one in V_p^1. Given the above
definitions, Gp is a partial cube if and only if (i) Gp is bipartite and
(ii) the cut-sets of Gp ’s convex cuts partition Ep [21]. In this case,
the equivalence relation behind the partition is the Djoković relation
θ [21]. Let ep = {xp, yp} ∈ Ep. An edge fp is Djoković related to
ep if one of fp's end vertices is closer to xp than to yp, while the
other end vertex of fp is closer to yp than to xp. Formally,
$$e_p \,\theta\, f_p \;:\Leftrightarrow\; |f_p \cap W_{x_p,y_p}| = |f_p \cap W_{y_p,x_p}| = 1, \quad \text{where}$$
$$W_{x_p,y_p} = \{ w_p \in V_p \mid d_{G_p}(w_p, x_p) < d_{G_p}(w_p, y_p) \}.$$
Consequently, no pair of edges on any shortest path of Gp is Djoković
related. In the following, we use the above definitions to find
out whether Gp is a partial cube and, if so, to compute a labeling
lp : Vp → {0, 1}^dimGp for Vp according to Definition 2.2:
(1) Test whether Gp is bipartite (in asymptotic time O(|Ep|)).
If Gp is not bipartite, it is not a partial cube.
(2) Pick an arbitrary edge e_p^1 and compute the edge set Ep(e_p^1, θ) :=
{fp ∈ Ep | fp θ e_p^1}.
(3) Keep picking edges e_p^2, e_p^3, . . . that are not contained in
an edge set computed so far and compute the edge sets
Ep(e_p^2, θ), Ep(e_p^3, θ), . . . . If there is an overlap with a
previous edge set, Gp is not a partial cube.
(4) While calculating Ep(e_p^j, θ), where 1 ≤ j ≤ dimGp, set
$$l_p[j](u_p) := \begin{cases} 0, & \text{if } u_p \in W_{x_p^j, y_p^j}, \\ 1, & \text{otherwise.} \end{cases} \tag{5}$$
For an example of such vertex labels see Figure 3a.
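The four steps above can be sketched in Python as follows. This is a minimal, untuned illustration that assumes a connected, simple, undirected graph given as an adjacency dictionary; the function names `all_pairs_distances` and `partial_cube_labels` are chosen for this example. Since the graph is known to be bipartite after step 1, every vertex is strictly closer to one end vertex of an edge than to the other, so an edge is Djoković related to {x, y} exactly if its end vertices lie on different sides.

```python
from collections import deque


def all_pairs_distances(adj):
    """BFS from every vertex; adj maps a vertex to a list of neighbours."""
    dist = {}
    for source in adj:
        d = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[source] = d
    return dist


def partial_cube_labels(adj):
    """Return dict vertex -> bit string, or None if a test of steps 1-3 fails."""
    dist = all_pairs_distances(adj)
    edges, seen = [], set()
    for u in adj:                                  # collect each edge once
        for v in adj[u]:
            if frozenset((u, v)) not in seen:
                seen.add(frozenset((u, v)))
                edges.append((u, v))

    # Step 1: an edge inside one BFS layer closes an odd cycle.
    root = next(iter(adj))
    if any(dist[root][u] == dist[root][v] for u, v in edges):
        return None

    classified = set()                             # edges already in a class
    labels = {v: "" for v in adj}
    for x, y in edges:                             # steps 2 and 3
        if frozenset((x, y)) in classified:
            continue
        closer_to_x = {w for w in adj if dist[w][x] < dist[w][y]}   # W_{x,y}
        for a, b in edges:
            if (a in closer_to_x) != (b in closer_to_x):
                if frozenset((a, b)) in classified:
                    return None                    # overlap with earlier class
                classified.add(frozenset((a, b)))
        for v in adj:                              # step 4, Eq. (5)
            labels[v] += "0" if v in closer_to_x else "1"
    return labels
```

On a cycle with six vertices the sketch produces three-digit labels whose Hamming distances equal the hop distances, while a triangle or the complete bipartite graph K2,3 is rejected.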
Assuming that all distances between node pairs are computed
beforehand, calculating Ep(e_p^j, θ) for 1 ≤ j ≤ dimGp and setting
the labeling lp(·) (in steps 2, 3 and 4) take O(|Ep|) time. Detecting
overlaps with already calculated edge sets takes O(|Ep|^2) time (in
step 3), which determines the time complexity of the proposed
method.
Note that the problem has to be solved only once for a paral-
lel computer. Since |Ep | = O(|Vp |) for 2D/3D grids and tori –
the processor graphs of highest interest in this paper – and |Ep | =
O(|Vp | log |Vp |) for all partial cubes, our simple method is (almost)
as fast, asymptotically, as the fastest and rather involved methods
that solve the problem. Indeed, the fastest method to solve the prob-
lem takes time O(|Vp | |Ep |) [13]. Assuming that integers of at least
log2(|Vp |) bits can be stored in a single machine word and that addi-
tion, bitwise Boolean operations, comparisons and table look-ups
can be performed on these words in constant time, the asymptotic
time is reduced to O(|Vp|^2) [9].
4 VERTEX LABELS OF APPLICATION
GRAPH
Given an application graph Ga = (Va ,Ea ,ωa (·)), a processor graph
Gp = (Vp ,Ep ) that is a partial cube and a mapping µ : Va → Vp , the
aim of this section is to define a labeling la (·) of Va . This labeling
is then used later to improve µ(·) w. r. t. Eq. (3) by swapping labels
between vertices of Ga . In particular, for any ua ,va ∈ Va , the effect
of their label swap on µ(·) should be a function of their labels and
those of their neighbors in Ga . It turns out that being able to access
the vertices of Ga by their (unique) labels is crucial for the efficiency
of our method. The following requirements on la (·) meet our needs.
(1) la (·) encodes µ(·).
(2) For any two vertices ua ,va ∈ Ga , we can derive the dis-
tance between µ(ua ) and µ(va ) from la (ua ) and la (va ).
Thus, for any edge ea of Ga , we can find out how many
hops it takes in Gp for its end vertices to exchange informa-
tion, see Figure 3c.
(3) The labels are unique on Va .
To compute such la(·), we first transport the labeling lp(·) from Vp to
Va through lp(va) := lp(µ(va)) for all va ∈ Va. This new labeling
lp : Va → {0, 1}^dimGp already fulfills items 1) and 2). Indeed,
item 1) holds, since labels are unique on Vp; item 2) holds, because
Gp is a partial cube. To fulfill item 3), we extend the labeling
lp : Va → {0, 1}^dimGp to a labeling la : Va → {0, 1}^dimGa, where
the yet undefined dimGa should exceed dimGp only by the smallest
amount necessary to ensure that la(ua) ≠ la(va) whenever ua ≠ va.
The gap dimGa − dimGp depends on the size of the largest part in
the partition induced by µ(·).
Definition 4.1 (dimGa). Let µ : Va → Vp be a mapping. We set:
$$\dim_{G_a} = \dim_{G_a}(\mu) = \dim_{G_p} + \max_{v_p \in V_p} \lceil \log_2 |\mu^{-1}(v_p)| \rceil. \tag{6}$$
For any va ∈ Va , la (va ) is a bitvector of length dimGa ; its first
dimGp entries coincide with lp (va ), see above, and its last dimGa −
dimGp entries serve to make the labeling unique. We denote the
bitvector formed by the last dimGa − dimGp entries by le(va). Here,
the subscript e in le(·) stands for “extension”. To summarize,
$$l_a(v_a) = l_p(v_a) \circ l_e(v_a) \quad \text{for all } v_a \in V_a, \tag{7}$$
where ◦ stands for concatenation. Except for temporary permutations of
the labels’ entries, the set L := la(Va) of labels will remain the
same. A label swap between ua and va alters µ(·) if and only if
lp(ua) ≠ lp(va). The balance of the partition of Va, as induced by
µ(·), is preserved by swapping the labels of Va.
Computing le (·) is straightforward. First, the vertices in each
µ−1(vp ), vp ∈ Vp , are numbered from 0 to |µ−1(vp )| − 1. Second,
these decimal numbers are then interpreted as bitvectors/binary num-
bers. Finally, the entries of the corresponding bitvectors are shuffled,
so as to provide a good random starting point for the improvement
of µ, see Lemma 5.1 in Section 5.
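A possible way to compute the extended labels is sketched below in Python. The function name `extend_labels`, the use of bit strings, and the `seed` parameter are assumptions of this example. The text leaves open how exactly the entries are shuffled; the sketch assumes one random permutation of the extension positions per block, which keeps the labels unique within each block.

```python
import random
from collections import defaultdict


def extend_labels(mu, proc_labels, seed=0):
    """Build l_a = l_p ◦ l_e as bit strings.

    mu:          dict task -> PE
    proc_labels: dict PE -> bit string l_p (all of equal length)
    """
    blocks = defaultdict(list)
    for task, pe in mu.items():
        blocks[pe].append(task)
    largest = max(len(tasks) for tasks in blocks.values())
    ext_bits = (largest - 1).bit_length()     # ceil(log2(largest)), Eq. (6)

    rng = random.Random(seed)
    labels = {}
    for pe, tasks in blocks.items():
        positions = list(range(ext_bits))
        rng.shuffle(positions)                # assumed: one shuffle per block
        for number, task in enumerate(tasks):
            bits = format(number, "b").zfill(ext_bits) if ext_bits else ""
            shuffled = "".join(bits[p] for p in positions)
            labels[task] = proc_labels[pe] + shuffled
    return labels
```

The left part of each resulting label is the label of the PE the task is mapped to, so µ(·) can be read off the first dimGp entries.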
5 EXTENSION OF THE OBJECTIVE
FUNCTION
Given the labeling of vertices in Va, i. e., la(·) = lp(·) ◦ le(·), it is easy
to see that solely lp(·) determines the value of Coco(·). (This fact
results from lp(·) encoding the distances between vertices in Gp.)
On the other hand, due to the uniqueness of the labels la(·), le(·)
restricts lp(·): le(ua) = le(va) implies lp(ua) ≠ lp(va), i. e., that ua
and va are mapped to different PEs.
The plan of the following is to ease this restriction by avoiding
as many cases of le (ua ) = le (va ) as possible. To this end, we
incorporate le (·) into the objective function, thus replacing Coco(·)
by a modified one. Observe that lp (·) and le (·) give rise to two
disjoint subsets of Ea , i. e. subsets of edges, the two end vertices of
which agree on lp (·) and le (·), respectively:
$$E_a^p = E_a^p(l_a) := \{ e_a = \{u_a, v_a\} \in E_a \mid l_p(u_a) = l_p(v_a) \},$$
$$E_a^e = E_a^e(l_a) := \{ e_a = \{u_a, v_a\} \in E_a \mid l_e(u_a) = l_e(v_a) \}.$$
In general these two sets do not form a partition of Ea , since there
can be edges whose end vertices disagree both on lp (·) and on le (·).
With h(·, ·) denoting the Hamming distance, optimizing Coco(·) in
Eq. (3) can be rewritten as follows. Find
$$l_a^* := \operatorname*{argmin}_{\substack{l_a : V_a \to L \\ l_a \text{ bijective}}} \mathrm{Coco}(l_a), \quad \text{where} \tag{8}$$
$$\mathrm{Coco}(l_a) := \sum_{\substack{e_a \in E_a \setminus E_a^p(l_a) \\ e_a = \{u_a, v_a\}}} \omega_a(e_a)\, h(l_p(u_a), l_p(v_a)). \tag{9}$$
For all ea = {ua, va} ∈ E_a^e it follows that lp(ua) ≠ lp(va), implying
that ua and va are mapped to different PEs. Thus, any edge
ea ∈ E_a^e(la) increases the value of Coco(·) with a damage of
$$\omega_a(e_a)\, h(l_p(u_a), l_p(v_a)) > 0. \tag{10}$$
This suggests that reducing E_a^e may be good for minimizing
Coco(·). The crucial question here is whether reducing E_a^e can
obstruct our primary goal, i. e., growing E_a^p (see Eq. (9)). Lemma 5.1
below shows that this is not the case, at least for perfectly balanced
µ(·). A less technical version of Lemma 5.1 is the following: If µ(·)
is perfectly balanced, then two mappings, where one has bigger E_a^p
and one has smaller E_a^e (both w. r. t. set inclusion), can be combined
to a third mapping that has the bigger E_a^p and the smaller E_a^e. The
third mapping provides better prospects for minimizing Coco(·) than
the mapping from which it inherited big E_a^p, as, due to smaller E_a^e,
the third mapping is less constrained by the uniqueness requirement:
LEMMA 5.1 (REDUCING E_a^e COMPATIBLE WITH GROWING E_a^p).
Let µ : Va → Vp be such that |µ−1(up)| = |µ−1(vp)| for all up, vp ∈
Vp (perfect balance). Furthermore, let la(·) and l′a(·) be bijective
labelings Va → L that correspond to µ(·) as specified in Section 4.
Then, E_a^p(la) ⊇ E_a^p(l′a) and E_a^e(l′a) ⊆ E_a^e(la) implies that there exists
l∗a that also corresponds to µ(·) with (i) E_a^p(l∗a) = E_a^p(la) and (ii)
E_a^e(l∗a) = E_a^e(l′a).
PROOF. Set l∗a (·) := lpa (·) ◦ lea (·). Then, l∗a(·) fulfills (i) and (ii) in
the claim, and the first dimGp entries of l∗a(·) specify µ(·). It remains
to show that the labeling l∗a(·) is unique on Va. This is a consequence
of (a) la(·) being unique and (b) E_a^e(l∗a) = E_a^e(l′a) ⊆ E_a^e(la). □
In practice we usually do not have perfect balance. Yet, the
imbalance is typically low, e. g. ε = 0.03. Thus, we still expect that
having small E_a^e(la) is beneficial for minimizing Coco(·).
Minimization of the damage to Coco(·) from edges in E_a^e(la), see
Eq. (10), amounts to maximizing the diversity of the label extensions
in Ga. Formally, in order to diversify, we want to find
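$$l_a^* := \operatorname*{argmax}_{\substack{l_a : V_a \to L \\ l_a \text{ bijective}}} \mathrm{Div}(l_a), \quad \text{where} \tag{11}$$
$$\mathrm{Div}(l_a) := \sum_{\substack{e_a \in E_a \setminus E_a^e(l_a) \\ e_a = \{u_a, v_a\}}} \omega_a(e_a)\, h(l_e(u_a), l_e(v_a)). \tag{12}$$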
We combine our two objectives, i. e., minimization of Coco(·) and
maximization of Div(·), with the objective function Coco+(·):
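$$l_a^* := \operatorname*{argmin}_{\substack{l_a : V_a \to L \\ l_a \text{ bijective}}} \mathrm{Coco}^+(l_a), \quad \text{where} \tag{13}$$
$$\mathrm{Coco}^+(l_a) := \mathrm{Coco}(l_a) - \mathrm{Div}(l_a). \tag{14}$$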
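The combined objective of Eqs. (9), (12) and (14) can be evaluated directly from the labels, as the following Python sketch shows. The function names and the representation of labels as bit strings with `dim_p` leading processor digits are assumptions of this example.

```python
def hamming(x, y):
    """Hamming distance h(., .) between two bit strings of equal length."""
    return sum(a != b for a, b in zip(x, y))


def coco_plus(app_edges, labels, dim_p):
    """Return (Coco+, Coco, Div) for labels l_a = l_p ◦ l_e.

    app_edges: iterable of (ua, va, weight) triples of Ga
    labels:    dict task -> bit string of length dimGa
    dim_p:     number of leading digits that form l_p (i.e. dimGp)
    """
    coco = 0
    div = 0
    for ua, va, weight in app_edges:
        lu, lv = labels[ua], labels[va]
        coco += weight * hamming(lu[:dim_p], lv[:dim_p])   # Eq. (9)
        div += weight * hamming(lu[dim_p:], lv[dim_p:])    # Eq. (12)
    return coco - div, coco, div
```

Edges in E_a^p and E_a^e need no special treatment here, since their Hamming distance on the respective label part is zero and they thus drop out of the corresponding sum.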
6 MULTI-HIERARCHICAL LABEL
SWAPPING
After formulating the mapping problem as finding an optimal label-
ing for the vertices in Va , we can now turn our attention to how to
find such a labeling – or at least a very good one. Our algorithm
is meant to improve mappings and resembles a key ingredient of
the classical graph partitioning algorithm by Kernighan and Lin
(KL) [16]: vertices of Ga swap their membership to certain subsets
of vertices. Our strategy differs from KL in that we rely on a rather
simple local search which is, however, performed on multiple (and
very diverse) random hierarchies on Ga . These hierarchies are obliv-
ious to Ga ’s edges and correspond to recursive bipartitions of Ga ,
which, in turn, are extensions of natural recursive bipartitions of Gp .
6.1 Algorithm TIMER
Our algorithm, see procedure TIMER in Algorithm 1, takes as
input (i) an application graph Ga , (ii) a processor graph Gp with the
partial cube property, (iii) an initial mapping µ : Va → Vp and (iv)
the number of hierarchies, NH . NH controls the tradeoff between
running time and the quality of the results. The output of TIMER
consists of a bijective labeling la : Va → L such that Coco+(la )
is low (but not necessarily optimal). Recall from Section 1 that
requiring µ(·) as input is no major limitation. An initial bijection
la (·) representing this µ(·) is found in lines 1, 2 of Algorithm 1.
In lines 3 through 21 we take NH attempts to improve la(·), where
each attempt uses another hierarchy on {0, 1}^dimGa. Before the new
hierarchy is built, the old labeling is saved in case the label swapping
in the new hierarchy turns out to be a setback w. r. t. Coco+(·) (line
4). This may occur, since the gain w. r. t. Coco+(·) on a coarser
level of a hierarchy is only an estimate of the gain on the finest level
(see below). The hierarchy and the current mapping are encoded by
(i) a sequence of graphs G_a^i = (V_a^i, E_a^i, ω_a^i), 1 ≤ i ≤ dimGa, with
G_a^1 = Ga, (ii) a sequence of labelings l_a^i : V_a^i → {0, 1}^dimG_a^i and
(iii) a vector, called parent, that provides the hierarchical relation
between the vertices. From now on, we interpret vertex labels as
integers whenever convenient. More precisely, an integer arises from
a label, i. e., a bitvector, by interpreting the latter as a binary number.
[Figure 4: graph G_a^1 with four-digit vertex labels (left) and graph G_a^2 with three-digit vertex labels (right)]
Figure 4: Graphs G_a^1 on level 1 of the hierarchy and G_a^2 on level 2
(G_a^2 arises from G_a^1 through contractions controlled by the labels) are
shown on the left and right, respectively. The first [last] two digits of
the labels on G_a^1's vertices indicate lp(·) [le(·)], respectively. Swapping
labels 000 and 001 on G_a^2 yields a gain of 1 in diversity (see Eq. (12)).
The corresponding swaps in G_a^1 are indicated by the dashed red lines
on the left.
In lines 6 and 7, the entries of the vertex labels of Ga are permuted
according to a random permutation π(·). The construction of the
hierarchy (lines 9 through 14) goes hand in hand with label swapping
(lines 10-12) and label coarsening (in line 13). The function
contract(·, ·, ·) contracts any pair of vertices of G_a^{i−1} whose labels
agree on all but the last digit, thus generating G_a^i. The same function
also cuts off the last digit of l_a^{i−1}(v) for all v ∈ V_a^{i−1}, and creates
the labeling l_a^i(·) for the next coarser level. Finally, contract(·, ·, ·)
builds the vector parent (for encoding the hierarchy of the vertices).
In line 15, the call to assemble() derives a new labeling la(·) from
the labelings l_a^i(·) on the levels 1 ≤ i ≤ dimGa − 1 of a hierarchy
(read more in Section 6.2). The permutation of la's entries is undone
in line 16, and la(·) is kept only if better than lold(·), see lines 17 to
19. Figure 4 depicts a snapshot of TIMER on a small instance.
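To illustrate the label swapping in lines 10-12 of Algorithm 1, the following Python sketch performs one pass over a single level: two vertices whose integer labels differ only in the least significant digit swap labels if this lowers Coco+(·). It is a simplified illustration only; contraction, assemble() and the surrounding loop over hierarchies are omitted. The names `edge_term`, `local_cost` and `swap_pass` as well as the bit mask `p_mask`, which marks the positions of the (possibly permuted) processor digits, are assumptions of this example.

```python
def edge_term(lu, lv, weight, p_mask):
    """Contribution of one edge to Coco+ = Coco - Div for integer labels."""
    diff = lu ^ lv
    coco = bin(diff & p_mask).count("1")      # differing processor digits
    div = bin(diff & ~p_mask).count("1")      # differing extension digits
    return weight * (coco - div)


def local_cost(adj, labels, u, v, p_mask):
    """Coco+ restricted to the edges incident to u or v (edge {u, v} once)."""
    cost = sum(edge_term(labels[u], labels[n], w, p_mask)
               for n, w in adj[u].items())
    cost += sum(edge_term(labels[v], labels[n], w, p_mask)
                for n, w in adj[v].items() if n != u)
    return cost


def swap_pass(adj, labels, p_mask):
    """One pass of sibling swaps; labels (vertex -> int) is modified in place.

    adj: dict vertex -> dict neighbour -> edge weight (symmetric)
    Returns the total change in Coco+, which is never positive.
    """
    by_label = {lab: vertex for vertex, lab in labels.items()}
    total_change = 0
    for lab in sorted(by_label):
        if lab & 1:                           # handle each pair from its even label
            continue
        u, v = by_label[lab], by_label.get(lab | 1)
        if v is None:                         # sibling label is unused
            continue
        before = local_cost(adj, labels, u, v, p_mask)
        labels[u], labels[v] = labels[v], labels[u]
        after = local_cost(adj, labels, u, v, p_mask)
        if after < before:
            by_label[labels[u]], by_label[labels[v]] = u, v
            total_change += after - before
        else:
            labels[u], labels[v] = labels[v], labels[u]   # undo the swap
    return total_change
```

Only the neighborhoods of the two swapped vertices are inspected, which mirrors the amortized cost argument: each edge is touched a constant number of times per pass.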
6.2 Function assemble()
Function assemble() (Algorithm 2) in line 15 of TIMER turns the
hierarchy of graphs G_a^i into a new labeling la(·) for Ga, digit by
digit, using labels l_a^i(·) (in Algorithm 2, “≪ i” denotes a left shift by
i digits). The least and the most significant digit of any l_a^1(v1) are
inherited from la(·) (line 7 of Algorithm 1) and do not change (lines
2, 17 and 18). The remaining digits are set in the loop from line 5
to 16. Whenever possible, digit i of l_a^1(v1) is set to the last digit of
the parent of v1 (= preferred digit) on level i, see lines 9, 11. This
might, however, lead to a label that is not in l_a^1(V_a^1) any more, which
would change the set of labels and may violate the balance constraint
coming from µ(·). To avoid such balance problems, we take the last
digit of l_a^i(va) if possible (in lines 9-11) or, if not, we switch to the
(old) inverted digit, see line 13. Since G_a^1 = Ga, new l_a^1(·) on G_a^1 can
be taken as new la(·) on Ga, see line 18 in Algorithm 1.

Algorithm 1 Procedure TIMER(Ga, Gp, µ(·), NH) returns a bijection la : Va → {0, 1}^dimGa with a low value
of Coco(la).
1: Find a labeling lp(·) of Vp's vertices, as described in Section 2
2: Using µ(·), extend lp(·) to a labeling la(·) of Va's vertices, as described in Section 4
3: for N′ = 1, . . . , NH do
4:     lold(·) ← la(·)    ▷ just in case la(·) gets worse w. r. t. Coco+(·)
5:     parent ← []    ▷ parent will encode the hierarchy of the vertices
6:     Pick a random permutation π : {1, . . . , dimGa} → {1, . . . , dimGa}
7:     la(·) ← π(la(·))
8:     G_a^1 ← Ga, l_a^1(·) ← la(·)
9:     for i = 2, . . . , dimGa − 1 do
10:        for all u, v ∈ G_a^{i−1} with l_a^{i−1}(u)/2 = l_a^{i−1}(v)/2 do    ▷ only least sig. digit differs
11:            Swap labels l_a^{i−1}(u) and l_a^{i−1}(v) if this decreases Coco+(l_a^{i−1}) on G_a^{i−1}.
12:        end for
13:        (G_a^i, l_a^i, parent) ← contract(G_a^{i−1}, l_a^{i−1}, parent)
14:    end for
15:    la(·) ← assemble(G_a^1, . . . , G_a^{dimGa−1}, l_a^1, . . . , l_a^{dimGa−1}, parent)
16:    la(·) ← π^{−1}(la(·))
17:    if Coco+(la) > Coco+(lold) then
18:        la(·) ← lold(·)
19:    end if
20: end for
21: return la(·)

Algorithm 2 Function assemble(G_a^1, . . . , G_a^{dimGa−1}, l_a^1, . . . , l_a^{dimGa−1}, parent) returns a new labeling l_a^1(·) of
the vertices of G_a^1 = Ga and thus a new labeling la(·) of Ga's vertices.
1: for all v1 ∈ V_a^1 do
2:     l_a^1(v1) ← l_a^1(v1) mod 2    ▷ Write least significant digit
3:     oldParent ← v1
4:     i ← 1
5:     while i < dimG_a^1 do    ▷ Write digits 2, . . . , dimG_a^1 − 1
6:         newParent ← parent(oldParent)
7:         i ← i + 1
8:         newParentLabel ← l_a^i(newParent)
9:         prefLabel ← l_a^1(v1) + (newParentLabel ≪ (i − 1))    ▷ Preferred i least sig. digits
10:        if ∃ w ∈ V_a^1 with l_a^1(w) mod 2^i = prefLabel then    ▷ Part of existing label?
11:            l_a^1(v1) ← l_a^1(v1) + ((newParentLabel mod 2) ≪ (i − 1))    ▷ Write preferred digit
12:        else
13:            l_a^1(v1) ← l_a^1(v1) + ((1 − (newParentLabel mod 2)) ≪ (i − 1))    ▷ Write other digit
14:        end if
15:        oldParent ← newParent
16:    end while
17:    if l_a^1(v1) ≥ 1 ≪ (dimG_a^1 − 1) then
18:        l_a^1(v1) ← l_a^1(v1) + (1 ≪ (dimG_a^1 − 1))    ▷ Write most significant digit
19:    end if
20: end for
21: return l_a^1(·)
6.3 Running time analysis
The expected asymptotic running time of function assemble() is
O(|Va | · dimVa ). Here, “expected” is due to the condition in line 10
that is checked in expected constant time. (We use a hashing-based
C++ std::unordered map to find a vertex with a certain label. A
plain array would be too large for large grids and tori, especially if
the blocks are large, too.) For Algorithm 1, the expected running
time is dominated by the loop between lines 9 and 14. The loop
between lines 10 and 12 takes amortized expected time O(|Ea |)
(“expected”, because we have to go from the labels to the vertices
and “amortized”, because we have to check the neighborhoods of all
u,v). The contraction in line 13 takes time O(|Ea |), too. Thus, the
loop between lines 9 and 14 takes time O(|Ea |dimGa ). In total, the
expected running time of Algorithm 1 is O(NH |Ea | dimGa ).
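The expected constant-time check in line 10 of Algorithm 2 can be pictured with hash-based lookups. The following Python sketch is an illustration under assumptions, not the authors' C++ code: the helper names `build_suffix_sets` and `suffix_exists` are hypothetical, and Python sets stand in for std::unordered_map. Building the sets takes expected time proportional to the number of labels times the label length, which matches the order of the bound stated for assemble().

```python
def build_suffix_sets(labels, dim):
    """For every i in 1..dim, store all values (label mod 2**i) in a hash set.

    Hypothetical helper: it precomputes which i least significant digits
    occur in at least one existing label.
    """
    suffix_sets = {}
    for i in range(1, dim + 1):
        mask = (1 << i) - 1          # keeps the i least significant bits
        suffix_sets[i] = {label & mask for label in labels}
    return suffix_sets


def suffix_exists(suffix_sets, i, pref_label):
    """Expected O(1) test: is pref_label the i-digit suffix of some label?"""
    return pref_label in suffix_sets[i]


# Toy labels with 3 digits each (illustrative values only).
labels = [0b000, 0b011, 0b101, 0b110]
suffixes = build_suffix_sets(labels, dim=3)

print(suffix_exists(suffixes, 2, 0b11))   # suffix "11" occurs in 0b011
print(suffix_exists(suffixes, 3, 0b111))  # no label equals 0b111
```

A plain array indexed by label would need 2^dim entries per level, which is the memory concern raised above for large grids and tori.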
An effective first step toward a parallel version of our algorithm
would be simple loop parallelization in lines 10-12 of Algorithm 1.
To avoid stale data, label accesses need to be coordinated.
7 EXPERIMENTS
7.1 Description of experiments
In this section we specify our test instances, our experimental setup
and the way we evaluate the computed mappings.
The application graphs are the 15 complex networks used by
Safro et al. [26] for partitioning experiments and in [11] for mapping
experiments, see Table 1. Regarding the processor graphs, we loosely
follow current architectural trends. Several leading supercomputers
have a torus architecture [8], and grids (= meshes) are gaining
importance in emerging multicore chips [7]. As processor graphs
Gp = (Vp, Ep) we therefore use a 2DGrid(16 × 16), a 3DGrid(8 × 8 × 8),
a 2DTorus(16 × 16), a 3DTorus(8 × 8 × 8) and, for more theoretical
reasons, an 8-dimensional hypercube. In our experiments, we set
the number of hierarchies (NH) for TIMER to 50, and whenever
needed for partitioning/mapping with state-of-the-art tools, the load
imbalance is set to 3%. All computations are based on sequential
C++ code. Each experiment is executed on a node with two Intel
Xeon E5-2680 processors (Sandy Bridge) at 2.7 GHz, equipped with
32 GB RAM and 8 cores per processor.
Baselines. For the evaluation, we use four different experimental
cases (c1 to c4), each of which assumes a different initial mapping
µ1(·) as an input to TIMER (Algorithm 1). The different cases
shall measure the improvement by TIMER compared to different
standalone mapping algorithms. In the following, we describe how
we obtain the initial mappings µ1(·) for each case separately.
In c1 we measure the improvement achieved by TIMER over an initial
mapping produced by SCOTCH. For that, we use the generic mapping
routine of SCOTCH with default parameters. It returns a mapping
µ1(·) of a given graph using a dual recursive bipartitioning algorithm.
In c2 we use the IDENTITY mapping that maps block i of the
application graph (or vertex i of the communication graph Gc ) to
node i of the processor graph Gp , 1 ≤ i ≤ |Gc | = |Gp |. IDENTITY
receives its solution from the initial partition computed with KAHIP.
This approach benefits from spatial locality in the partitions, so that
IDENTITY often yields surprisingly good solutions [11].
In c3 we use a mapping algorithm named GREEDYALLC that has
been previously proposed by a subset of the authors (implemented
on top of KAHIP). GREEDYALLC is an improvement of a previous
greedy algorithm [3] and is the best performing algorithm in [11]. It
builds on the idea of growing a mapping by successively adding
assignments vc → vp such that (a) vc ∈ Gc has maximal communication
volume with one or all of the already mapped vertices of
Gc and (b) vp ∈ Gp has minimal distance to one or all of the already
mapped vertices of Gp.
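To make the greedy construction idea concrete, here is a simplified Python sketch. It is not GREEDYALLC itself: the function name `greedy_map`, the choice of the starting pair, the tie-breaking, and the use of the "all already mapped vertices" variant for both (a) and (b) are assumptions made for illustration.

```python
def greedy_map(comm, dist):
    """Grow a mapping task -> PE one assignment at a time.

    comm[u][v]: communication volume between tasks u and v (symmetric).
    dist[p][q]: distance between PEs p and q in the processor graph.
    Both matrices are assumed to have the same size n.
    """
    n = len(comm)
    # Assumed start: heaviest task goes to the most central PE.
    first_task = max(range(n), key=lambda u: sum(comm[u]))
    first_pe = min(range(n), key=lambda p: sum(dist[p]))
    mapping = {first_task: first_pe}
    used = {first_pe}
    while len(mapping) < n:
        # (a) unmapped task with maximal volume to all mapped tasks
        task = max((u for u in range(n) if u not in mapping),
                   key=lambda u: sum(comm[u][v] for v in mapping))
        # (b) free PE with minimal total distance to all used PEs
        pe = min((p for p in range(n) if p not in used),
                 key=lambda p: sum(dist[p][q] for q in used))
        mapping[task] = pe
        used.add(pe)
    return mapping


# Toy instance: 4 tasks communicating along a path, 4 PEs on a path.
comm = [[0, 5, 0, 0],
        [5, 0, 1, 0],
        [0, 1, 0, 5],
        [0, 0, 5, 0]]
dist = [[abs(p - q) for q in range(4)] for p in range(4)]
mapping = greedy_map(comm, dist)
print(mapping)
```

On this toy instance the sketch places heavily communicating tasks on adjacent PEs; on a grid, a construction of this kind can end up in a corner of the processor graph, which is the effect discussed for GREEDYALLC in Section 7.2.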
Finally, we compare against LIBTOPOMAP [12], a state-of-the-art
mapping tool that includes multiple mapping algorithms. More precisely,
we use the algorithm whose general idea follows the construct
method in [3]. A subset of the authors has previously implemented
this algorithm on top of the KAHIP tool (named GREEDYMIN).
As a result, and in order to facilitate comparisons with c2 and c3,
GREEDYMIN is used as the mapping algorithm for the experimental
case c4.
Labeling. Once the initial mappings µ1(·) are calculated, we need
to perform two more steps in order to get an initial labeling la (·):
(1) We compute a labeling lp : Vp → {0, 1}^{dimGp}, where lp(·)
and dimGp fulfill the conditions in Definition 2.2. In particular,
dGp(up, vp) = dh(lp(up), lp(vp)) for all up, vp ∈ Vp,
where dh(·, ·) denotes the Hamming distance. Due to the
sparsity of our processor graphs Gp (grids, tori, hypercubes),
we use the method outlined in Section 3.
(2) We extend the labels of Gp ’s vertices to labels of Ga ’s
vertices as described in Section 4.
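The distance property in step (1) can be illustrated for grids with a standard construction that is not necessarily the method of Section 3: each axis of extent n is encoded in unary with n − 1 bits, and the per-axis codes are concatenated. The function names below are hypothetical. For a 16 × 16 grid this gives labels of length 30, and for an 8 × 8 × 8 grid labels of length 21.

```python
from itertools import product


def path_label(i, n):
    """Unary code of position i on a path with n vertices (n - 1 bits)."""
    return (1,) * i + (0,) * (n - 1 - i)


def grid_labels(extents):
    """Label every grid coordinate by concatenating its per-axis unary codes."""
    labels = {}
    for coord in product(*(range(n) for n in extents)):
        label = ()
        for i, n in zip(coord, extents):
            label += path_label(i, n)
        labels[coord] = label
    return labels


def hamming(a, b):
    """Number of positions in which two equally long bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


labels = grid_labels((16, 16))
u, v = (3, 7), (5, 2)
grid_distance = sum(abs(a - b) for a, b in zip(u, v))  # hop distance in a grid
print(len(labels[u]), hamming(labels[u], labels[v]), grid_distance)
```

Two positions on the same axis differ in exactly as many unary bits as they are hops apart, so the Hamming distance of the concatenated labels equals the hop distance in the grid.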
Then, for each experimental case, TIMER is given the initial map-
ping µ1(·) and it generates a new mapping µ2(·). Here, we compare
the quality of mapping µ2(·) to µ1(·) in terms of our main objective
function Coco(·), but we also provide results for the edgecut metric
and for the running times.
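As a small worked example of the comparison between µ1(·) and µ2(·), the sketch below evaluates a communication cost from PE labels. It assumes that Coco(·) sums, over all edges of Ga, the edge weight times the distance between the PEs of the two end vertices, with that distance given by the Hamming distance of the PE labels. The graph, weights, and mappings are hypothetical toy data.

```python
def hamming(a, b):
    """Hamming distance of two labels stored as non-negative integers."""
    return bin(a ^ b).count("1")


def coco(edges, mapping, pe_label):
    """Sum of edge weight times Hamming distance of the mapped PEs' labels."""
    return sum(w * hamming(pe_label[mapping[u]], pe_label[mapping[v]])
               for u, v, w in edges)


# 2-dimensional hypercube: PE p carries the 2-bit label p.
pe_label = {0: 0b00, 1: 0b01, 2: 0b10, 3: 0b11}
# Weighted toy application graph on tasks 0..3: (u, v, weight).
edges = [(0, 1, 3), (1, 2, 1), (2, 3, 3), (0, 3, 1)]

mu1 = {0: 0, 1: 3, 2: 1, 3: 2}   # initial mapping
mu2 = {0: 0, 1: 1, 2: 3, 3: 2}   # tasks 1 and 2 exchanged their PEs

cost1, cost2 = coco(edges, mu1, pe_label), coco(edges, mu2, pe_label)
print(cost1, cost2, cost2 / cost1)
```

A quotient below one, as obtained here, corresponds to an improvement in the sense used in the evaluation that follows.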
Metrics and parameters. Since SCOTCH, KAHIP and TIMER
have randomized components, we run each experiment 5 times.
Over such a set of 5 repeated experiments we compute the minimum,
the arithmetic mean and the maximum of TIMER's running time
(T), edge cut (Cut) and communication costs Coco(·) (Co). Thus we
arrive at the values T_min, T_mean, ..., Co_max (9 values for each
combination of Ga and Gp, for each experimental case c1 to c4). Each of
these values is then divided by the corresponding min, mean, or max value
obtained before the improvements by TIMER, except for the running time of
TIMER, which is divided by the partitioning time of KAHIP for c2, c3, c4
and by the mapping time of SCOTCH for c1. Thus, we arrive at 9
quotients q_{Tmin}, ..., q_{Comax}. Except for the terms involving running
times, a quotient smaller than one means that TIMER was
successful. Next we form the geometric means of the 9 quotients
over the application graphs of Table 1. Thus we arrive at 9 values
q^{gm}_{Tmin}, ..., q^{gm}_{Comax} for each processor graph Gp and each
experimental case. Additionally, we calculate the geometric standard deviation
as an indicator of the variance over the normalized results of the
application graphs.
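The evaluation pipeline described above can be sketched as follows. The Coco values are hypothetical (not measured data), the function names are illustrative, and the geometric standard deviation uses the population form, which is an assumption since the text does not specify it.

```python
import math


def geometric_mean(values):
    """exp of the arithmetic mean of the logarithms."""
    logs = [math.log(v) for v in values]
    return math.exp(sum(logs) / len(logs))


def geometric_std(values):
    """exp of the (population) standard deviation of the logarithms."""
    logs = [math.log(v) for v in values]
    mu = sum(logs) / len(logs)
    return math.exp(math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs)))


def quotients(before, after):
    """(q_min, q_mean, q_max): each statistic after TIMER over the same statistic before."""
    mean = lambda xs: sum(xs) / len(xs)
    return (min(after) / min(before),
            mean(after) / mean(before),
            max(after) / max(before))


# Hypothetical Coco values of 5 runs for two application graphs A and B.
coco_before = {"A": [100, 104, 108, 112, 116], "B": [50, 52, 54, 56, 58]}
coco_after = {"A": [80, 85, 90, 95, 100], "B": [45, 46, 47, 48, 49]}

q = {g: quotients(coco_before[g], coco_after[g]) for g in coco_before}
# Geometric mean and geometric standard deviation over the application graphs.
q_gm = [geometric_mean([q[g][k] for g in q]) for k in range(3)]
q_gsd = [geometric_std([q[g][k] for g in q]) for k in range(3)]
print(q_gm, q_gsd)
```

Because the minimum, mean, and maximum are normalized separately, nothing forces the min quotient to be the smallest of the three; for graph B in this toy data it is the largest.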
7.2 Experimental Results
The detailed experimental results regarding quality metrics are dis-
played in Figures 5a through 5d (one for each experimental case),
while the running time results are given in Table 2. Here is our
summary and interpretation:
• When looking at the running times for the experimental
cases c2 to c4 (in Table 2), we see that the running times
of TIMER are on the same order of magnitude as
partitioning with KAHIP; more precisely, TIMER is on
average 42% faster. Thus, inserting TIMER into the
partitioning/mapping pipeline of KAHIP would not increase
the overall running time significantly.
The comparison in case c1 needs to be different. Here
the initial mapping is produced by SCOTCH's mapping
routine (using partitioning internally), so that the
running times of TIMER are divided by SCOTCH's mapping
time. SCOTCH is much faster (on average 19x), but
its solution quality is also not as good (see the Co metric in
Figure 5a). In informal experiments, we observed that only
ten hierarchies (parameter NH) are sufficient for TIMER to
improve the communication costs significantly compared
to SCOTCH, with a much lower running time penalty than
the one reported here. Finally, recall that parallelization
could reduce the running time of TIMER.
• Processor graphs do have an influence on running times.
The processor graphs, from top to bottom in Table 2, have
30, 21, 32, 24 and 8 convex cuts, respectively. Thus, if we
keep grids and tori apart, we can say that the time quotients
increase with the number of convex cuts. Recall that the
number of convex cuts equals the length of the labels of
Vp . Moreover, the length of the extension of Vp ’s labels to
those of Va depends also on the processor graph, because
a higher number of PEs (number of blocks) yields fewer
elements per block. However, this influence on the length
of the labels is small (increase of 1 in case of |Vp | = 256
compared to |Vp | = 512). Thus it is basically the length
of Va ’s labels that determines the time quotients. For the
experimental cases c2 to c4, this observation is in line with
the fact that KAHIP takes longer for higher values of |Vp |
(see Table 3 in Appendix A.1).
• TIMER successfully reduces communication costs in a
range from 6% to 34% over the different experimental cases
(see minCo, Co and maxCo values in Figure 5). It does
so at the expense of the edge cut metric, with an average
increase between 2% and 11% depending on the experimental
case. Note that for case c1 the edge cut increase is
minimal (Figure 5a). Moreover, for cases c2 to c4 this
increase is not surprising due to the different objectives of
the graph partitioner (KAHIP) and TIMER. On grids and
tori, the reduction of communication cost, as measured by
Coco(·), is 18% and 13%, respectively (on average, over all
experimental cases).
The better the connectivity of Gp, the harder it gets to improve
Coco(·) (results are poorest on the hypercube). (Note
that q_min values can be larger than q_mean and q_max values
due to the evaluation process described in Section 7.1.)
We observed before [11] that GREEDYALLC performs
better on tori than on grids; this is probably due to GREEDYALLC
“invading” the communication graph and the processor
graph. The resulting problem is that it may paint itself
into a corner of the processor graph (if it has corners, like
a grid). Thus, it is not surprising that for c3 the improvement
w. r. t. Coco(·) obtained by TIMER is greater for grids
than for tori. Likewise, we observe that TIMER is able
to decrease the communication costs significantly for c1
(even more than in the other cases). Apparently, the generic
Table 1: Complex networks used for benchmarking.
Name                  #vertices  #edges      Type
p2p-Gnutella          6 405      29 215      file-sharing network
PGPgiantcompo         10 680     24 316      largest connected component in network of PGP users
email-EuAll           16 805     60 260      network of connections via email
as-22july06           22 963     48 436      network of internet routers
soc-Slashdot0902      28 550     379 445     news network
loc-brightkite_edges  56 739     212 945     location-based friendship network
loc-gowalla_edges     196 591    950 327     location-based friendship network
citationCiteseer      268 495    1 156 647   citation network
coAuthorsCiteseer     227 320    814 134     citation network
wiki-Talk             232 314    1 458 806   network of user interactions through edits
coAuthorsDBLP         299 067    977 676     citation network
web-Google            356 648    2 093 324   hyperlink network of web pages
coPapersCiteseer      434 102    16 036 720  citation network
coPapersDBLP          540 486    15 245 729  citation network
as-skitter            554 930    5 797 663   network of internet service providers
Table 2: Running time results of each experimental case. For c1, results are relative to SCOTCH's mapping time, while for c2, c3, and c4, results are
relative to the partitioning time with KAHIP (original values in Appendix A.1, Table 3). For each case, the columns min, mean, and max give
q^{gm}_{Tmin}, q^{gm}_{Tmean}, and q^{gm}_{Tmax}, respectively.
                SCOTCH (c1)                IDENTITY (c2)              GREEDYALLC (c3)            GREEDYMIN (c4)
Processor graph min      mean     max      min      mean     max      min      mean     max      min      mean     max
16 × 16 grid    30.2780  29.8388  31.8387  0.95310  1.00480  1.05286  0.97916  1.01791  1.05075  0.95448  1.00681  1.05500
8 × 8 × 8 grid  18.0226  18.0484  19.5701  0.47495  0.49364  0.51333  0.47525  0.49427  0.51606  0.48712  0.50654  0.52698
16 × 16 torus   21.1373  21.2507  22.5322  0.61627  0.64089  0.66334  0.61743  0.63765  0.66042  0.64834  0.66524  0.68270
8 × 8 × 8 torus 13.8136  14.0924  14.3866  0.33254  0.34167  0.35008  0.32952  0.33885  0.34855  0.33412  0.34493  0.35744
8-dim HQ        11.2948  11.4237  11.5842  0.36821  0.37977  0.38916  0.36631  0.37246  0.38005  0.37254  0.38093  0.39196
[Figure 5, panels (a) to (d): plots of relative results (y-axis, tick marks from 0.6 to 1.2) per topology (x-axis: 8-dim HQ, grid 16x16, grid 8x8x8, torus 16x16, torus 8x8x8) for the metrics minCut, maxCut, Cut, minCo, maxCo, and Co.]
Figure 5: Quality results (Co and Cut) for experimental case (a) c1 (initial mapping with SCOTCH), (b) c2 (initial mapping with
IDENTITY), (c) c3 (initial mapping with GREEDYALLC), and (d) c4 (initial mapping with GREEDYMIN).
nature of SCOTCH’s mapping approach leaves room for
such an improvement.
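The two averages quoted for the running times ("42% faster" for c2 to c4 and "19x" for c1) can be re-derived from the mean-quotient columns of Table 2. The sketch below assumes that both figures are arithmetic means over the five processor graphs (and, for c2 to c4, over the three cases); the text does not state which kind of average was used.

```python
# Mean-quotient columns of Table 2, in row order:
# 16x16 grid, 8x8x8 grid, 16x16 torus, 8x8x8 torus, 8-dim HQ.
q_mean = {
    "c1": [29.8388, 18.0484, 21.2507, 14.0924, 11.4237],
    "c2": [1.00480, 0.49364, 0.64089, 0.34167, 0.37977],
    "c3": [1.01791, 0.49427, 0.63765, 0.33885, 0.37246],
    "c4": [1.00681, 0.50654, 0.66524, 0.34493, 0.38093],
}

# c1: TIMER's time relative to SCOTCH's mapping time.
avg_c1 = sum(q_mean["c1"]) / len(q_mean["c1"])

# c2 to c4: TIMER's time relative to KAHIP's partitioning time.
kahip_cases = [x for case in ("c2", "c3", "c4") for x in q_mean[case]]
avg_c2_c4 = sum(kahip_cases) / len(kahip_cases)

print(f"c1: TIMER takes about {avg_c1:.1f} times SCOTCH's mapping time")
print(f"c2-c4: TIMER takes about {100 * (1 - avg_c2_c4):.1f}% less time than KAHIP")
```

Under this assumption, the values land close to the figures quoted in the first bullet of the summary.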
8 CONCLUSIONS
We have presented a new method, TIMER, to enhance mappings of
computational tasks to PEs. TIMER can be applied whenever the
processor graph Gp is a partial cube. Exploiting this property, we
supply the vertices of the application graph with labels that encode
the current mapping and facilitate a straightforward assessment of
any gains/losses of local search moves. By doing so, we are able to
improve initial mappings using a multi-hierarchical search method.
Permuting the entries of the vertex labels in the application graph
gives rise to a plethora of very diverse hierarchies. These hierarchies
do not reflect the connectivity of the application graph Ga,
but correspond to recursive bipartitions of Ga, which, in turn, are
extensions of “natural” recursive bipartitions of Gp. This property
suggests using TIMER as a complementary tool to enhance
state-of-the-art methods for partitioning and mapping.
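The link between label permutations and hierarchies can be illustrated with a toy sketch. It assumes that one hierarchy level is obtained by merging vertices whose labels agree once the i least significant digits are dropped; the function names and the 2-bit labels are hypothetical and only serve to show that different permutations induce different recursive bipartitions.

```python
def permute_bits(label, perm):
    """Return the label whose bit j is bit perm[j] of the input label."""
    return sum(((label >> src) & 1) << j for j, src in enumerate(perm))


def hierarchy(labels, perm):
    """Level i groups the vertices whose permuted labels agree after
    dropping the i least significant bits (level 0: singletons)."""
    permuted = [permute_bits(label, perm) for label in labels]
    levels = []
    for i in range(len(perm) + 1):
        groups = {}
        for vertex, label in enumerate(permuted):
            groups.setdefault(label >> i, []).append(vertex)
        levels.append(sorted(groups.values()))
    return levels


labels = [0b00, 0b01, 0b10, 0b11]      # labels of vertices 0..3
print(hierarchy(labels, (0, 1))[1])    # bipartition under the identity permutation
print(hierarchy(labels, (1, 0))[1])    # bipartition after swapping the two bits
```

With d label entries there are d! permutations, which is one way to see why the number of available hierarchies grows so quickly.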
In our experiments we were able to improve state-of-the-art map-
pings of complex networks to different architectures by about 6%
to 34% in terms of Coco. More precisely, for grids we obtained, on
average, an improvement of 18% and for tori an improvement of
13% over the communication costs of the initial mappings.
The novelty of TIMER consists in the way it harnesses the fact
that many processor graphs are partial cubes: the local search method
itself is standard and simple. We expect that further improvements
over state-of-the-art mappings can be achieved by replacing the
simple local search by a more sophisticated method.
ACKNOWLEDGMENTS
This work is partially supported by German Research Foundation
(DFG) grant ME 3619/2-1. Large parts of this work were carried out
while H.M. was affiliated with Karlsruhe Institute of Technology.
REFERENCES
[1] Eric Aubanel. 2009. Resource-Aware Load Balancing of Parallel Applications.
In Handbook of Research on Grid Technologies and Utility Computing: Concepts
for Managing Large-Scale Applications, Emmanuel Udoh and Frank Zhigang
Wang (Eds.). Information Science Reference - Imprint of: IGI Publishing, 12–21.
[2] C. Bichot and P. Siarry (Eds.). 2011. Graph Partitioning. Wiley.
[3] B. Brandfass, T. Alrutz, and T. Gerhold. 2013. Rank Reordering for MPI
Communication Optimization. Computers & Fluids 80, 0 (2013), 372 – 380.
https://doi.org/10.1016/j.compfluid.2012.01.019
[4] Aydin Buluç, Henning Meyerhenke, Ilya Safro, Peter Sanders, and Christian
Schulz. 2016. Recent Advances in Graph Partitioning. In Algorithm Engineering
- Selected Results and Surveys, Lasse Kliemann and Peter Sanders (Eds.). Lecture
Notes in Computer Science, Vol. 9220. 117–158.
[5] Siew Yin Chan, Teck Chaw Ling, and Eric Aubanel. 2012. The Impact of
Heterogeneous Multi-Core Clusters on Graph Partitioning: An Empirical Study.
Cluster Computing 15, 3 (2012), 281–302.
[6] Cédric Chevalier and Ilya Safro. 2009. Comparison of Coarsening Schemes for
Multilevel Graph Partitioning. In Learning and Intelligent Optimization, Third
International Conference, LION 3, Trento, Italy, January 14-18, 2009. Selected
Papers (Lecture Notes in Computer Science), Thomas Stützle (Ed.), Vol. 5851.
Springer, 191–205.
[7] C. Clauss, S. Lankes, P. Reble, and T. Bemmerl. 2011. Evaluation and improve-
ments of programming models for the Intel SCC many-core processor. In 2011
International Conference on High Performance Computing Simulation. 525–532.
[8] M. Deveci, K. Kaya, B. Uçar, and U. V. Catalyurek. 2015. Fast and High
Quality Topology-Aware Task Mapping. In 2015 IEEE International Parallel and
Distributed Processing Symposium. 197–206.
[9] D. Eppstein. 2011. Recognizing Partial Cubes in Quadratic Time. J. Graph
Algorithms Appl. 15, 2 (2011), 269–293.
[10] Michael R. Garey and David S. Johnson. 1979. Computers and Intractability: A
Guide to the Theory of NP-Completeness. W. H. Freeman & Co.
[11] R. Glantz, H. Meyerhenke, and A. Noe. 2015. Algorithms for Mapping Parallel
Processes onto Grid and Torus Architectures. In 23rd Euromicro International
Conference on Parallel, Distributed and Network-Based Processing, PDP 2015,
Turku, Finland. 236–243.
[12] T. Hoefler and M. Snir. 2011. Generic Topology Mapping Strategies for Large-
scale Parallel Architectures. In ACM International Conference on Supercomputing
(ICS’11). ACM, 75–85.
[13] W. Imrich. 1993. A simple O(mn) algorithm for recognizing Hamming graphs.
Bull. Inst. Comb. Appl (1993), 45–56.
[14] E. Jeannot, G. Mercier, and F. Tessier. 2013. Process Placement in Multicore
Clusters: Algorithmic Issues and Practical Techniques. IEEE Transactions on
Parallel and Distributed Systems PP, 99 (2013), 1–1.
https://doi.org/10.1109/TPDS.2013.104
[15] G. Karypis and V. Kumar. 1998. A Fast and High Quality Multilevel Scheme for
Partitioning Irregular Graphs. SIAM J. Sci. Comput. 20, 1 (1998), 359–392.
[16] B. W. Kernighan and S. Lin. 1970. An efficient heuristic procedure for partitioning
graphs. Bell Systems Technical Journal 49 (1970), 291–307.
[17] Young Man Kim and Ten-Hwang Lai. 1991. The Complexity of Congestion-1
Embedding in a Hypercube. Journal of Algorithms 12, 2 (1991), 246 – 280.
https://doi.org/10.1016/0196-6774(91)90004-I
[18] Shad Kirmani, JeongHyung Park, and Padma Raghavan. 2017. An embedded sec-
tioning scheme for multiprocessor topology-aware mapping of irregular applica-
tions. IJHPCA 31, 1 (2017), 91–103. https://doi.org/10.1177/1094342015597082
[19] Henning Meyerhenke, Burkhard Monien, and Thomas Sauerwald. 2009. A
New Diffusion-based Multilevel Algorithm for Computing Graph Partitions. J.
Parallel and Distributed Computing 69, 9 (2009), 750–761.
https://doi.org/10.1016/j.jpdc.2009.04.005
[20] Vitaly Osipov and Peter Sanders. 2010. n-Level Graph Partitioning. In Algorithms
- ESA 2010, 18th Annual European Symposium, Liverpool, UK, September 6-8,
2010. Proceedings, Part I (Lecture Notes in Computer Science), Mark de Berg
and Ulrich Meyer (Eds.), Vol. 6346. Springer, 278–289.
[21] S. Ovchinnikov. 2008. Partial cubes: Structures, characterizations, and construc-
tions. Discrete Mathematics 308 (2008), 5597–5621.
[22] François Pellegrini. 1994. Static Mapping by Dual Recursive Bipartitioning
of Process and Architecture Graphs. In Scalable High-Performance Computing
Conference (SHPCC). IEEE, 486–493.
[23] François Pellegrini. 2007. Scotch and libScotch 5.0 User’s Guide. Technical
Report. LaBRI, Université Bordeaux I.
[24] François Pellegrini. 2011. Static Mapping of Process Graphs. In Graph Parti-
tioning, Charles-Edmond Bichot and Patrick Siarry (Eds.). John Wiley & Sons,
Chapter 5, 115–136.
[25] Xinyu Que, Fabio Checconi, Fabrizio Petrini, and John A. Gunnels. 2015. Scalable
Community Detection with the Louvain Algorithm. In 2015 IEEE International
Parallel and Distributed Processing Symposium, IPDPS 2015, Hyderabad, India,
May 25-29, 2015. IEEE Computer Society, 28–37.
[26] I. Safro, P. Sanders, and Ch. Schulz. 2012. Advanced Coarsening Schemes for
Graph Partitioning. In Proc. 11th Int. Symp. on Experimental Algorithms. Springer,
369–380.
[27] P. Sanders and C. Schulz. 2013. High Quality Graph Partitioning. In Proc. of
the 10th DIMACS Impl. Challenge Workshop: Graph Partitioning and Graph
Clustering. AMS, 1–17.
[28] Peter Sanders and Christian Schulz. 2013. KaHIP v0.53 - Karlsruhe High Quality
Partitioning - User Guide. CoRR abs/1311.1714 (2013).
[29] C. Schulz and J. L. Träff. 2017. Better Process Mapping and Sparse Quadratic
Assignment. CoRR abs/1702.04164 (2017).
[30] Roman Trobec and Gregor Kosec. 2015. Parallel Scientific Computing. Theory,
Algorithms, and Applications of Mesh Based and Meshless Methods. Springer
Intl. Publ. https://doi.org/10.1007/978-3-319-17073-2
[31] Bora Ucar, Cevdet Aykanat, Kamer Kaya, and Murat Ikinci. 2006. Task Assign-
ment in Heterogeneous Computing Systems. J. Parallel and Distrib. Comput. 66,
1 (2006), 32 – 46.
[32] Chris Walshaw and Mark Cross. 2001. Multilevel Mesh Partitioning for Hetero-
geneous Communication Networks. Future Generation Comp. Syst. 17, 5 (2001),
601–623.
[33] Hao Yu, I-Hsin Chung, and Jose Moreira. 2006. Topology Mapping for Blue
Gene/L Supercomputer. In Proceedings of the 2006 ACM/IEEE Conference on
Supercomputing (SC ’06). ACM, New York, NY, USA.
A APPENDIX
A.1 Additional experimental results
Table 3: Running times in seconds for KAHIP to partition the
complex networks in Table 1 into |Vp| = 256 and |Vp| = 512
parts, respectively. These partitions are used to construct the
starting solutions for the mapping algorithms for cases c2 to c4.
Name                  |Vp| = 256   |Vp| = 512
PGPgiantcompo         1.457        2.297
as-22july06           11.179       13.559
as-skitter            1439.316     2557.827
citationCiteseer      217.951      367.716
coAuthorsCiteseer     58.120       69.162
coAuthorsDBLP         157.871      233.000
coPapersCiteseer      780.491      841.656
coPapersDBLP          1517.283     2377.680
email-EuAll           22.919       17.459
loc-brightkite_edges  113.720      155.384
loc-gowalla_edges     461.583      1174.742
p2p-Gnutella04        16.377       17.400
soc-Slashdot0902      887.896      1671.585
web-Google            128.843      130.986
wiki-Talk             1657.273     4044.640
Arithmetic mean       498.152      911.673
Geometric mean        142.714      204.697
