Automated Tiling of Unstructured Mesh Computations with Application to
  Seismological Modelling by Luporini, Fabio et al.
1Automated Tiling of Unstructured Mesh Computations with
Application to Seismological Modelling
FABIO LUPORINI, Imperial College London
MICHAEL LANGE, Imperial College London
CHRISTIAN T. JACOBS, University of Southampton
GERARD J. GORMAN, Imperial College London
J. RAMANUJAM, Louisiana State University
PAUL H. J. KELLY, Imperial College London
Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike
traditional loop fusion or blocking, the loops may have dierent iteration spaces and access shared datasets
through indirect memory accesses, such as A[map[i]] – hence the name “sparse”. One notable example of
such loops arises in discontinuous-Galerkin nite element methods, because of the computation of numerical
integrals over dierent domains (e.g., cells, facets). The major challenge with sparse tiling is implementation –
not only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it
requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach
to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence
of compiler passes, the mathematical specication of a problem is progressively lowered, and eventually
sparse-tiled C for-loops are generated. Besides automation, we advance the state-of-the-art by introducing:
a revisited, more ecient sparse tiling algorithm; support for distributed-memory parallelism; a range of
ne-grained optimizations for increased run-time performance; implementation in a publicly-available library,
SLOPE; and an in-depth study of the performance impact in Seigen, a real-world elastic wave equation solver
for seismological problems, which shows speed-ups up to 1.28× on a platform consisting of 896 Intel Broadwell
cores.
CCS Concepts: • Computing methodologies → Parallel algorithms; • Mathematics of computing →
Mathematical software performance; • Software and its engineering→Compilers;Domain specic
languages;
Additional Key Words and Phrases: Finite element method, unstructured mesh, compiler, performance opti-
mization, loop fusion, loop tiling, sparse tiling
ACM Reference format:
Fabio Luporini, Michael Lange, Christian T. Jacobs, Gerard J. Gorman, J. Ramanujam, and Paul H. J. Kelly. 2019.
Automated Tiling of Unstructured Mesh Computations with Application to Seismological Modelling. 1, 1,
Article 1 (June 2019), 30 pages.
This work was supported by the Engineering and Physical Sciences Research Council through grants EP/I00677X/1,
EP/L000407/1, EP/I012036/1], by the Imperial College London Department of Computing, and by the Imperial College
London Intel Parallel Computing Centre (IPCC). The work of J. Ramanujam is supported by the US National Science
Foundation award CCF-1619303, the Louisiana Board of Regents contract LEQSF(2016-19)-RD-B-03 and by Louisiana State
University. The authors would like to thank the HPC Service Support team at Imperial College London for their help with
the Helen cluster. The authors would also like to thank Gheorghe-Teodor Bercea, Lawrence Mitchell, and David Ham for
their suggestions during the development of this project..
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for prot or commercial advantage and that copies bear this notice and the
full citation on the rst page. Copyrights for components of this work owned by others than the author(s) must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specic permission and/or a fee. Request permissions from permissions@acm.org.
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. XXXX-XXXX/2019/6-ART1 $15.00
DOI: 10.1145/nnnnnnn.nnnnnnn
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
ar
X
iv
:1
70
8.
03
18
3v
2 
 [c
s.C
E]
  1
9 J
un
 20
19
1:2 F. Luporini et al.
DOI: 10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
In many unstructured mesh applications, for example those approximating the solution of partial
dierential equations (PDEs) using the nite volume or the nite element method, sequences of
numerical operators accessing common elds need to be evaluated. Usually, these operators are
implemented by iterating over sets of mesh elements and computing a kernel in each element.
In languages such as C or Fortran, the resulting sequence of loops is typically characterized
by heterogeneous iteration spaces and accesses to shared datasets (reads, writes, increments)
through indirect pointers, like A[map[i]]. One notable example of such operators/loops arises
in discontinuous-Galerkin nite element methods, in which numerical integrals over dierent
domains (e.g., cells, facets) are evaluated; here, A could represent a discrete function, whereas map
could store connectivity information (e.g., from mesh elements to degrees of freedom). In this
article, we devise compiler theory and technology to automate a sophisticated version of sparse
tiling, a technique to maximize data locality when accessing shared elds (like the A and map arrays
in the earlier example), which consists of fusing a sequence of loops by grouping iterations such
that all data dependencies are honored. The goal is to improve the overall application performance
with minimal disruption (none, if possible) to the source code.
Three motivating real-world applications for this work are Hydra, Volna and Seigen. Hydra [30]
is a nite-volume computational uid dynamics application used at Rolls Royce for the simulation
of next-generation components of jet engines. Volna [11] is a nite-volume computational uid
dynamics application for the modelling of tsunami waves. Seigen [17] aims to solve the elastic wave
equation using the discontinuous Galerkin nite element method for seismic exploration purposes.
All these applications are characterized by the presence of a time-stepping loop, in which several
loops over the mesh (thirty-three in Hydra, ten in Volna, twenty-ve in Seigen) are repeatedly
executed. These loops are characterized by the irregular dependence structure mentioned earlier,
with for example indirect increments in one loop (e.g., A[m[i]] += f(...)) followed by indirect
reads in one of the subsequent loops (e.g., b = g(A[n[j]])). The performance achievable by Seigen
through sparse tiling will extensively be studied in Section 7.
Although our work is general in nature, we are particularly interested in supporting increasingly
sophisticated seismological problems that will be developed on top of Seigen. This has led to the
following strategic decisions:
Automation, but no interest in legacy codes Sparse tiling is an “extreme optimization”.
An implementation in a low level language requires a great deal of eort, as a thought-
ful restructuring of the application is necessary. In common with many other low level
transformations, it also makes the source code impenetrable, aecting maintenance and
extensibility. We therefore aim for a fully automated system based on domain-specic
languages (DSLs), which abstracts sparse tiling through a simple interface (i.e., a single
construct to dene a scope of fusible loops) and a tiny set of parameters for performance
tuning (e.g., the tile size). We are not interested in automating sparse tiling in legacy codes,
in which the key computational aspects (e.g., mesh iteration, distributed-memory paral-
lelism) are usually hidden for software modularity, thus making such a transformation
almost impossible.
Unstructured meshes require mixed static/dynamic analysis Unstructured meshes are
often used to discretize the computational domain, since they allow for an accurate repre-
sentation of complex geometries. Their connectivity is stored by means of adjacency lists
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
Automated Tiling of Unstructured Mesh Computations 1:3
(or equivalent data structure), which leads to indirect memory accesses within the loop
nests. Indirections break static analysis, thus making purely compiler-based approaches
insucient. Runtime data dependence analysis is essential for sparse tiling, so integration
of compiler and run-time tracking algorithms becomes necessary.
Realistic datasets not tting in a single node Real-world simulations often operate on
terabytes of data, hence execution on multi-node systems is often required. We have
extended the original sparse tiling algorithm to enable distributed-memory parallelism.
Sparse tiling does not change the semantics of a numerical method – only the order in which
some iterations are executed. Therefore, if most sections of a PDE solver suer from computational
boundedness and standard optimizations such as vectorization have already been applied, then
sparse tiling, which targets memory-boundedness, will only provide marginal benets (if any).
Likewise, if a global reduction is present in between two loops, then there is no way for sparse
tiling to be applied, unless the numerical method itself is rethought. This is regardless of whether
the reduction is explicit (e.g., the rst loop updates a global variable that is read by the second loop)
or implicit (i.e., within an external function, as occurs for example in most implicit nite element
solvers). These are probably the two greatest limitations of the technique; otherwise, sparse tiling
may provide substantial performance benets.
The rest of the article is structured as follows: in Section 2 we present the abstraction on
which sparse tiling relies. We then show, in Section 3, examples of how the algorithm works on
shared- and distributed-memory systems. This is followed by the formalization of the algorithms
(Sections 4, 5) and the implementation of the compiler that automates sparse tiling (Section 6). The
experimentation is described in Section 7. A discussion on the limitations of the algorithms and the
future work that we expect to carry out in the years to come conclude the article.
2 THE LOOP CHAIN ABSTRACTION FOR UNSTRUCTURED MESH APPLICATIONS
The loop chain is an abstraction introduced in [21]. Informally, a loop chain is a sequence of loops
with no global synchronization points, enriched with information to enable run-time data depen-
dence analysis – necessary since indirect memory accesses inhibit common static approaches to
loop optimization. The idea is to replace static with dynamic analysis, exploiting the information
carried by a loop chain. Loop chains must somehow be added to or automatically derived (e.g.,
exploiting a DSL) from the input code. A loop chain will then be used by an inspector/executor
scheme [32]. The inspector is an algorithm performing data dependence analysis using the informa-
tion carried by the loop chain, which eventually produces a sparse tiling schedule. This schedule is
used by the executor, a piece of code semantically equivalent to the original sequence of loops (i.e.,
computing the same result) executing the various loop iterations in a dierent order.
Before diving into the description of the loop chain abstraction, it is worth observing two aspects.
• The inspection phase introduces an overhead. In many scientic computations, the data
dependence pattern is static – or, more informally, “the topology does not change over
time”. This means that the inspection cost may be amortized over multiple iterations of the
executor. If instead the mesh changes over time (e.g., in case of adaptive mesh renement),
a new inspection must be performed.
• To adopt sparse tiling in a code there are two options. One possibility is to provide a library
and leave the application specialists with the burden of carrying out the implementation
(re-implementation in case of legacy code). A more promising alternative consists of raising
the level of abstraction: programs can be written using a DSL; loop chain, inspector, and
executor can then be automatically derived at the level of the intermediate representation.
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
1:4 F. Luporini et al.
Process 0 Process 1
core ow
ne
d
ex
ec
no
n-
ex
ec
coreow
ne
d
ex
ec
no
n-
ex
ec
Fig. 1. Two partitions of a mesh distributed to two neighboring processes, P0 and P1. The core region includes
all iterations that can be processed without reading halo data. The owned iterations can be processed only by
reading halo data. Exec is the set of iterations that must be executed because they indirectly write (increment)
the owned iterations. The union of the owned and exec regions is referred to as the boundary. The non-exec
region includes halo data which is indirectly read during the exec computation. The iterations in P0’s (P1’s)
exec region are, logically, the same iterations in P1’s (P0’s) owned region; thus, we say that these iterations are
“redundantly executed”. Matching colors across the two processes represent identical subsets of iterations in
the non-partitioned mesh. The image was inspired by an example in [27].
As we shall see in Section 6, the tools developed in this article enable both approaches,
though our primary interest is in the automated approach (i.e., via DSLs).
These points will be further elaborated in later sections.
The loop chain abstraction was originally dened as follows:
• A loop chain L = [L0,L1, ...,Ln−1] is an ordered sequence of n loops. There are no global
synchronization points in between the loops. Although there may be dependencies between
successive loops in the chain, the execution order of a loop’s iterations does not inuence
the result.
• D = {D0,D1, ...,Dm−1} is a collection ofm disjoint data spaces. Each loop accesses (reads
from, writes to) a subset of these data spaces. An access can be either direct (e.g., A[i]) or
indirect (e.g., A[map(i)]).
• RLl→Dd (i) andWLl→Dd (i) are access relations for a loop Ll over a data space Dd ∈ D. They
indicate which locations in the data space Dd an iteration i ∈ Ll reads from and writes to,
respectively. A loop chain must provide all access relations for all loops.
We here rene this denition, and specialize it for unstructured mesh applications. This allows the
introduction of new concepts, necessary to extend the sparse tiling algorithm presented in [34]. Some
terminology and ideas are inspired by the programming model of OP2, a library for unstructured
mesh applications [14] used to implement the already mentioned Hydra code.
• A loop chain L = [L0,L1, ...,Ln−1] is an ordered sequence of n loops. There are no global
synchronization points in between the loops. Although there may be dependencies between
successive loops in the chain, the execution order of a loop’s iterations does not inuence
the result.
• S = {S0, S1, ..., Sm−1} is a collection ofm disjoint iteration spaces. Possible iteration spaces
are the topological entities of the mesh (e.g., cells, vertices) or the degrees of freedom
associated with a function.
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
Automated Tiling of Unstructured Mesh Computations 1:5
When using distributed-memory parallelism, an iteration space S is logically split into
three contiguous regions: core, boundary, and non-exec (see also Figure 1). Given a generic
process P executing a loop over S , these regions represent:
core the subset of iterations computed by P that does not depend on halo exchanges. In
other words, these are P ’s local iterations.
boundary the union of two sub-regions, owned and exec, which are dened next. The
boundary region requires up-to-date halo data. Like core, owned contains iterations
owned by P ; the data produced by owned are sent out through a halo exchange. The
exec iterations, instead, are executed because they indirectly write (increment) data in
P ’s owned sub-region.
non-exec the subset of iterations not computed by P mapping read-only data sent over
to P during a halo exchange.
An iteration space is uniquely identied by a name and the sizes of its three regions.
• The depth is an integer indicating the extent of the boundary region. This is constant across
all iteration spaces in S.
• M = {M0,M1, ...,Mo−1} is a set of o maps. A map of arity a is a vector-valued function
M : Si → Saj connecting elements in dierent iteration spaces. For example, we can express
the mapping of a triangular cell c to three vertices v0,v1,v2 as M(c) = [v0, v1, v2]; here
cells and vertices are iteration spaces, while c,v0,v1,v2 are iteration identiers (i.e., natural
numbers).
• A loopLi over the iteration space S is associated with one or more descriptors. A descriptor is
a 2-tuple <M, mode>. M is either a map from S to some other iteration spaces or the special
placeholder ⊥. In the former case, Li is accessing data associated with M(S) indirectly; in
the latter case, the data accesses are direct. mode is one of [r , w, i], indicating whether a
memory access is a read, write or increment.
There are a few crucial dierences in this rened denition for the unstructured mesh case.
One of them is the presence of iteration spaces in place of data spaces. In unstructured mesh
applications, loops tend to access multiple data spaces associated with the same iteration space. A
key observation is that if a loop is writing to some data spaces, then it is extremely likely that at
least a subset of them will be accessed by the subsequent loop in the chain. The idea, therefore,
is to rely on iteration spaces, rather than data spaces, to perform dependence analysis. This can
substantially reduce the inspection cost, since typically |S| << |D|. Obviously, this relaxation might
also create “false dependences”, thus potentially aecting data communication. This would be the
case if, for example, two consecutive, independent loops accessed dierent data elds associated
with the same iteration space (e.g., pressure and velocity dened over the same set of degrees of
freedom). In our experience, however, this rarely happens in practice (never in the case of the
already mentioned Volna, Hydra and Seigen).
Another fundamental addition is the characterization of iteration spaces into the three regions
core, boundary and non-exec. As we shall see, this separation is essential to enable distributed-
memory parallelism. The extent of the boundary regions is captured by the depth of the loop chain.
Informally, the depth tells how many extra “strips” of elements are provided by the neighboring
processes. This allows some redundant computation along the partition boundary and also limits
the depth of the loop chain (i.e., how many loops can be fused). The role of the parameter depth
will be clear by the end of Section 5.
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
1:6 F. Luporini et al.
for t = 0 to T {
// L0: loop over edges , increment vertices
for e = 0 to E {
x = X + e;
tmp_0 = edges2vertices[e + 0];
tmp_1 = edges2vertices[e + 1];
kernel1(x, tmp_0, tmp_1);
}
// L1: loop over cells , increment vertices
for c = 0 to C {
res = R + c;
tmp_0 = cells2vertices[c + 0];
tmp_1 = cells2vertices[c + 1];
tmp_2 = cells2vertices[c + 2];
kernel2(res , tmp_0 , tmp_1 , tmp_2);
}
// L2: loop over edges , read vertices
for e = 0 to E {
tmp_0 = edges2vertices[e + 0];
tmp_1 = edges2vertices[e + 1];
kernel3(tmp_0, tmp_1);
}
}
(a) Example sequence of sparse-tilable loops.
inspector = init_inspector (...);
// Three sets , edges , cells , and vertices
E = set(inspector , "edges", core_edges ,
boundary_edges , nonexec_edges , ...);
C = set(inspector , "cells", core_cells ,
boundary_cells , nonexec_cells , ...);
V = set(inspector , "verts", core_verts ,
boundary_verts , nonexec_verts , ...);
// Two maps , from edges to vertices and from cells
to vertices
e2vMap = map(inspector , E, V, edges2verts , ...);
c2vMap = map(inspector , C, V, cells2verts , ...);
// The loop chain comprises three loops
// Each loop has some descriptors
loop(inspector , E, { ⊥, "r"}, {e2vMap , "i"});
loop(inspector , C, { ⊥, "r"}, {c2vMap , "i"});
loop(inspector , E, { ⊥, "w"}, {e2vMap , "r"});
// Now can run the inspector
return inspection(mode , inspector , tile_size , ...);
(b) Loop chain for the example program.
Fig. 2. On the le, a “toy” program used as running example in Section 3 to illustrate the loop chain abstraction
and show how the sparse tiling algorithms (inspection, execution) work. Note that all parameters passed to the
kernels are pointers. On the right, a code snippet showing the loop chain corresponding to the program on the
le. The syntax is very close to the actual API of SLOPE, the sparse tiling library that we have implemented,
described in Section 6.
3 LOOP CHAIN, INSPECTION AND EXECUTION EXAMPLES
Using the example in Figure 2a, we describe the actions performed by our sparse tiling inspector.
The inspector takes as input the loop chain illustrated in Figure 2b. We show two variants, for
shared-memory and distributed-memory parallelism. The value of the variable mode in Figure 2b
determines the variant to be executed.
3.1 Overview
The inspector starts with partitioning the iteration space of a seed loop, for example L0. Partitions
are used to initialize tiles: the iterations of L0 falling in Pi – or, in other words, the edges in partition
Pi – are assigned to the tile Ti . Figure 3 displays the situation after the initial partitioning of L0 for
a given input mesh. There are four partitions, two of which (P0 and P3) are not connected through
any edge or cell. These four partitions correspond to four tiles, [T0, T1, T2, T3], with Pi = Ti .
As detailed in the next two sections, the inspection proceeds by populating Ti with iterations
from L1 and L2. The challenge of this task is guaranteeing that all data dependencies – read after
write, write after read, write after write – are honored. The output of the inspector is eventually
passed to the executor. The inspection carries sucient information for computing sets of tiles in
parallel.Ti is always executed by a single thread/process and the execution is atomic; that is, it does
not require communication with other threads/processes. When executing Ti , rst all iterations
from L0 are executed, then all iterations from L1 and nally those from L2.
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
Automated Tiling of Unstructured Mesh Computations 1:7
P0
P1
P2
P3
Fig. 3. Partitioning of the seed loop over edges. The vertices are illustrated to make the connectivity of the
mesh clear, although they do not belong to any partition yet.
(a) A snapshot of the mesh aer tiling L0.
V0C0
C1 C2
V1
V2
(b) The vertices are wrien byL0, so a projection must
be computed before tiling L1. Here, the projection is
represented as colored vertices.
B
B
B
B
B
B
B B
B B
B
B B
B B
B
B
B
B
B
B
G
G
G
G
R
R
R
R R
R
R
R
RR
R
R
R
R
R
R
R
R
R
R
RR
R
R
R
R
R
R
(c) A snapshot of the mesh aer tiling L1. (d) A snapshot of the mesh aer tiling L2.
Fig. 4. Four passes of the inspection algorithm for shared-memory parallelism.
3.2 Inspection for Shared-Memory Parallelism
Similarly to OP2, to achieve shared-memory parallelism we use coloring. Two tiles that are given
the same color can be executed in parallel by dierent threads. Two tiles can have the same color
if they are not connected, because this ensures the absence of race conditions through indirect
memory accesses during parallel execution. In the example we can use three colors: red (R), green
(G), and blue (B). T0 and T3 are not connected, so they are assigned the same color. The colored
tiles are shown in Figure 4a. In the following, with the notationT ci we indicate that the i-th tile has
color c .
To populate [TG0 , T B1 , T R2 , TG3 ] with iterations from L1 and L2, we rst have to establish a total
ordering for the execution of partitions with dierent colors. Here, we assume the following order:
green (G), blue (B), and red (R). This implies, for instance, that all iterations assigned to T B1 must
be executed before all iterations assigned to T R2 . By “all iterations” we mean the iterations from L0
(determined by the seed partitioning) as well as the iterations that will later be assigned from tiling
L1 and L2. We assign integer positive numbers to colors to reect their ordering, where a smaller
number means higher execution priority. We can assign, for example, 0 to green, 1 to blue, and 2 to
red.
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
1:8 F. Luporini et al.
To schedule the iterations of L1 to [TG0 , T B1 , T R2 , TG3 ], we need to compute a projection for any
write or local reduction performed by L0. The projection required by L0 is a function ϕ : V → T
mapping the vertices in V – as indirectly incremented during the execution of L0, see Figure 2a –
to a tile T ci ∈ T. Consider the vertex v0 in Figure 4b. v0 has 7 incident edges, 2 of which belong to
TG0 , while the remaining 5 to T B1 . Since we established that G ≺ B, v0 can only be read after T B1 has
nished executing the iterations from L0 (i.e., the 5 incident blue edges). We express this condition
by setting ϕ(v0) = T B1 . Observe that we can compute ϕ by iterating over V and, for each vertex,
applying the maximum function (MAX) to the color of the adjacent edges.
We now use ϕ to schedule L1, a loop over cells, to the tiles. Consider again v0 and the ad-
jacent cells [c0, c1, c2] in Figure 4b. These three cells have in common the fact that they are
adjacent to both green and blue vertices. For c1, and similarly for the other cells, we compute
MAX(ϕ(v0), ϕ(v1), ϕ(v2)) = MAX(B,G,G) = B = 1. This establishes that c1 must be assigned
to T B1 , because otherwise (c1 assigned instead to TG0 ) a read to v0 would occur before the last
increment from T B1 took place. Indeed, we recall that the execution order, for correctness, must
be “all iterations from [L0,L1,L2] in the green tiles before all iterations from [L0,L1,L2] in the blue
tiles”. The scheduling of L1 to tiles is displayed in Figure 4c.
To schedule L2 to [TG0 , T B1 , T R2 , TG3 ] we employ a similar process. Vertices are again written by
L1, so a new projection over V will be necessary. Figure 4d shows the output of this last phase.
(a) Aer tiling L0
B
B
B
B
G G
B
G
B R
B
B
G
G
B
R
G
G
G
G
G
G
G
G G
G
G
R
RR
R
G
G
G
G
G
G
R
R
R
RG
R
R
R
R
R
R
G G
G
G
G
(b) Aer tiling L1 (c) Aer tiling L2
Fig. 5. Tiling the program in Figure 2a for shared-memory parallelism can lead to conflicts. Here, the two
green tiles eventually become adjacent, creating race conditions.
Conicting Colors. It is worth noting how T R2 “consumed” the frontier elements of all other tiles
every time a new loop was scheduled. Tiling a loop chain consisting of k loops has the eect of
expanding the frontier of a tile of at most k vertices. With this in mind, we re-inspect the loop
chain of the running example, although this time employing a dierent execution order – blue (B),
red (R), and green (G) – and a dierent seed partitioning. Figure 5 shows that, by applying the same
procedure described in this section, TG0 and TG3 will eventually become adjacent. This violates the
precondition that tiles can be given the same color, and thus run in parallel, as long as they are not
adjacent. Race conditions during the execution of iterations belonging to L2 are now possible. This
problem will be solved in Section 5.
3.3 Inspection for Distributed-Memory Parallelism
In the case of distributed-memory parallelism, the mesh is partitioned and distributed to a set of
processes. Neighboring processes typically exchange (MPI) messages before executing a loop Lj .
A message includes all “dirty” dataset values required by Lj modied by any Lk , with Lk ≺ Lj . In
the running example, L0 writes to vertices, so a subset of values associated with border vertices
must be communicated prior to the execution of L1. To apply sparse tiling, the idea is to push all
communications to the beginning of the loop chain: as we shall see, this increases the amount
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
Automated Tiling of Unstructured Mesh Computations 1:9
of data to be communicated at a time, but also reduces the number of synchronizations (only 1
synchronization between each pair of neighboring processes per loop chain execution).
From Section 2 it is known that, in a loop chain, a set is logically split into three regions, core,
boundary, and non-exec. The boundary tiles, which originate from the seed partitioning of the
boundary region, will include all iterations that cannot be executed until the communications have
terminated. The procedure described for shared-memory parallelism – now performed individually
by each process on a partition of the input mesh – is modied as follows:
(1) The core region of the seed loop L0 is partitioned into tiles. Unless aiming for a mixed
distributed/shared-memory scheme, there is no need to assign identical colors to un-
connected tiles, as a process will execute its own tiles sequentially. Colors are assigned
increasingly, with Ti given color i . As long as tiles with contiguous ID are also physically
contiguous in the mesh, this assignment retains memory access locality when “jumping”
from executing Ti to Ti+1.
(2) The same process is applied to the boundary region. Thus, a situation in which a tile includes
iterations from both the core and the boundary regions is prevented by construction. Further,
all tiles within the boundary region are assigned colors higher than those used for the core
tiles. This constrains the execution order: no boundary tiles will be executed until all core
tiles are computed.
(3) We map the whole non-exec region of L0 to a single special tile,Tne . This tile has the highest
color and will actually never be executed.
core non-execexecowned
boundary
Process 0
Process 1core
boundarynon-exec
ownedexec
Fig. 6. A snapshot of the two mesh partitions on Process 0 and Process 1 aer inspecting the seed loop
L0 for distributed-memory parallelism. On each process, there are five tiles in total: two in the core region
(green and violet), two in the boundary region (red and light blue), and Tne . The boundary tiles can safely
cross the owned and exec sub-regions (i.e., the local iterations and the iterations to be redundantly computed,
respectively). However, no tile can include iterations from both the core and the boundary regions.
From this point on, the inspection proceeds as in the case of shared-memory parallelism. The
application of the MAX function when scheduling L1 and L2 makes higher color tiles (i.e., those
having lower priority) “expand over” lower color ones.
In Figure 6, a mesh is partitioned over two processes and a possible seed partitioning and tiling
of L0 illustrated. We observe that the two boundary tiles (the red and light blue ones) will expand
over the core tiles as L1 and L2 are tiled, which eventually results in the scheduling illustrated in
Figure 7. Roughly speaking, if a loop chain consists of n loops and, on each process, n − 1 extra
layers of iterations are provided (the exec regions in Figure 6), then all boundary tiles are correctly
computed.
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
1:10 F. Luporini et al.
Process 1core
boundarynon-exec
ownedexeccore non-execexecowned
boundary
Process 0
Fig. 7. A snapshot of the two mesh partitions on Process 0 and Process 1 at the end of the inspection
for distributed-memory parallelism. Tne expands over the boundary region, which minimizes the amount of
redundant computation to be performed. At the end of the execution phase, the orange edges will contain
“dirty values”, but correctness is not aected as the exec region only includes o-process data. The boundary
tiles expand over the core region: this is essential for correctness since none of the red and blue entities from
[L0, L1, L2] can be executed until the MPI communications have terminated.
The schedule produced by the inspector is subsequently used by the executor. On each process, the
executor starts with triggering the MPI communications required for the computation of boundary
tiles. All core tiles are then computed, since no data from the boundary region is necessary. Hence,
computation is overlapped with communication. As all core tiles are computed and the MPI
communications terminated, the boundary tiles can nally be computed.
Eciency considerations. The underlying hypothesis is that the increase in data locality will
outweigh the overhead induced by the redundant computation and by the bigger volume of data
exchanged. This is motivated by several facts: (i) the loops being memory-bound; (ii) the core region
being much larger than the boundary region; (iii) the amount of redundant computation being
minimized through the special tile Tne , which progressively expands over the boundary region,
thus avoiding unnecessary calculations.
4 DATA DEPENDENCY ANALYSIS
The loop chain abstraction, described in Section 2, provides the information to construct an
inspector capable of analyzing data dependencies and thus build a legal sparse tiling schedule.
The dependencies between two dierent loops may be of type ow (read after write), anti (write
after read), or output (write after write). Further, there may be “reduction dependencies” between
iterations of the same loop.
4.1 Cross-Loop Dependences
Assume that loop Lx , having iteration space Sx , precedes loop Ly , having iteration space Sy , in the
loop chain. Let ex be a generic iteration in Sx . Let M be a map of arity a between two iteration
spaces. Let mode ∈ {w, r , i} indicate whether an iteration is written, read, or incremented. We
represent direct and indirect accesses as follows.
Direct access ⊥modeSx (ex ) → {{ex }, ∅}. In particular, if ⊥modeSx (ex ) = ∅, then no direct
write/read/increment is performed by ex when computing Lx .
Indirect access MmodeSx→Sy (ex ) → {{ey0 , ..., eya−1 }, ∅}. As per direct accesses, MmodeSx→Sy (ex ) = ∅
means that ex does not indirectly write/read/increment Sy when computing Lx .
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
Automated Tiling of Unstructured Mesh Computations 1:11
A direct access is a special case of indirect access when y = x and MSx→Sy is the identity
mapping. However, we here keep the distinction between the two types of access explicit due to
their relevance in the sparse tiling algorithms, as explained in Section 5.
By considering pairs of points (ex , ey ) in the iteration spaces of the two loops Lx and Ly , namely
ex ∈ Sx and ey ∈ Sy , we can enumerate all possible dependences. For brevity, we do not distinguish
between increments and writes; we also assume that at least one of the two loops accesses data
indirectly. Let Sz be a generic iteration space in the loop chain. Hence, the ow dependences are:
{ex → ey | (⊥wSx (ex ) ∩MrSy→Sx (ey ))︸                          ︷︷                          ︸
À direct w, indirect r
∪ (MwSx→Sy (ex )∩ ⊥rSy (ey ))︸                          ︷︷                          ︸
Á indirect w, direct r
∪ (MwSx→Sz (ex ) ∩MrSy→Sz (ey ))︸                               ︷︷                               ︸
Â indirect w, indirect r
, ∅}.
In essence, there is a ow dependence between two iterations from dierent loops when one of
those iterations writes to an element and the other iteration reads from the same element, directly
or indirectly. To capture all these ow dependences, the inspection algorithm builds projections from
one loop to another. We saw an example in Section 3.2: the loop over cells (Sx ) performed an indirect
increment to a dataset associated with vertices (Sz ), which was then read by the subsequent loop
over edges (Sy ). Such ow dependence was of typeÂ (see denition above). For each vertex ez ∈ Sz ,
the projection (illustrated in Figure 4b) captured the last tile indirectly writing to (incrementing)
ez , exploiting the color (i.e., the scheduling priority) of the source iterations. A ow dependence
of type À would be even simpler to deal with, as it would not require the use of the indirect map
MSx→Sz to update the color of the iterations.
Likewise, we can enumerate the anti and output dependences:
{ex → ey |(⊥rSx (ex ) ∩MwSy→Sx (ey )) ∪ (MrSx→Sy (ex )∩ ⊥wSy (ey )) ∪ (MrSx→Sz (ex ) ∩MwSy→Sz (ey )) , ∅}.
{ex → ey |(⊥wSx (ex ) ∩MwSy→Sx (ey )) ∪ (MwSx→Sy (ex )∩ ⊥wSy (ey )) ∪ (MwSx→Sz (ex ) ∩MwSy→Sz (ey )) , ∅}.
Projections for this type of dependences are built analogously to that described above.
The inspection algorithm building projections for all ow, anti, and output dependences is
provided in Algorithm 3 and discussed in Section 5.1.4. How the inspector leverages data dependence
analysis (i.e., projections) to schedule iterations to tiles (i.e., the tiling function) is formalized in
Algorithm 4 and commented in Section 5.1.5.
4.2 Intra-Loop Dependences
There also are local reductions, or “reduction dependencies”, between two or more iterations of the
same loop when those iterations increment the same location(s); that is, when they read, modify
with a commutative and associative operator, and write to the same location(s). The reduction
dependencies in Lx are:
{ex1 → ex2 |M iSx→Sz (ex1 ) ∩M iSx→Sz (ex2 ) , ∅}.
A reduction dependency between two iterations within the same loop indicates that those two
iterations must be executed atomically with respect to each other. As we explained in Section 3.2,
the inspection algorithm uses coloring to ensure atomic increments.
5 ALGORITHMS
The pseudo-code for the sparse tiling inspector is shown in Algorithm 1. Given a loop chain and a
seed tile size, the algorithm produces a schedule suitable for mixed distributed/shared-memory
parallelism. This schedule – in essence, a set of populated tiles – is used by the executor to perform
the sparse-tiled computation. The executor pseudo-code is displayed in Algorithm 2. The next two
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
1:12 F. Luporini et al.
Symbol Meaning
L The loop chain
Lj The j-th loop in L
seed The index of the seed loop
Sj The iteration space of Lj
Scj , S
b
j , S
ne
j The core, boundary, and non-exec regions of Sj
D A descriptor of a loop
T The collection of tiles
T[i] Accessing the i-th tile, or Ti
T ci , T
b
i , T
ne
i The core, boundary, and non-exec regions of Ti
ϕS A projection ϕS : S → T
Φ The collection of projections
σj A tiling function σj : Sj → T for Lj
ts The seed tile size
C The matrix of conicting colors
Ax:y Algorithm x, line y
Table 1. Summary of the notation used throughout Section 5.
sections elaborate on the main steps of these algorithms. The notation is summarized in Table 1; the
syntax Ax:y is a shortcut for “Algorithm x, line y”. The implementation is discussed in Section 6.
ALGORITHM 1: The inspection algorithm
Input: The loop chain L = [L0, L1, ..., Ln−1], a seed tile size ts
Output: A collection of tiles T, populated with iterations from L
1 seed← 0, Φ← ∅, C←⊥;
2 σseed , T← partition(Sseed , ts);
3 seed_map← find_map(Sseed , L);
4 do
5 conflicts← false;
6 color(T, seed_map);
7 for j = 1 to n − 1 do
8 project(Lj−1 , σj−1 , Φ, C);
9 σj ← tile(Lj , Φ);
10 assign(σj , T);
11 end
12 if has_conflicts(C) then
13 conflicts← true;
14 add_fake_connection(seed_map, C);
15 end
16 while conflicts;
17 compute_local_maps(T);
18 return T
ALGORITHM 2: The executor algorithm
Input: A collection of tiles T
Result: Execute the loop chain
1 Tc , Tb ← group_tiles_by_region(T);
2 start_MPI_comm();
3 foreach color do
4 foreach T ∈ Tc s.t. T .color == color do
5 execute_tile(T );
6 end
7 end
8 end_MPI_comm();
9 foreach color do
10 foreach T ∈ Tb s.t. T .color == color do
11 execute_tile(T );
12 end
13 end
5.1 Inspector
5.1.1 Choice of the seed loop. The seed loop Lseed is used to initialize the tiles. Theoretically,
any loop in the chain can be chosen as seed. Supporting distributed-memory parallelism, however,
is cumbersome if Lseed , L0. This is because more general schemes for partitioning and coloring
would be needed to ensure that no iterations in any Sbj are assigned to a core tile. A limitation of
our inspector algorithm in the case of distributed-memory parallelism is that it must be Lseed = L0.
In the special case in which there is no need to distinguish between core and boundary tiles
(because a program is executed on a single shared-memory system), Lseed can be chosen arbitrarily.
If we however pick Lseed in the middle of the loop chain, that is L0 ≺ ... ≺ Lseed ≺ ... ≺ Ln−1, a
mechanism for constructing tiles in the reverse direction (“backwards”), from Lseed towards L0, is
necessary. In [34], we propose two “symmetric” algorithms to solve this problem, forward tiling
and backward tiling, with the latter using the MIN function in place of MAX when computing
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
Automated Tiling of Unstructured Mesh Computations 1:13
projections. For ease of exposition, and since in the fundamental case of distributed-memory
parallelism we are imposing Lseed = L0, we here neglect this distinction1.
5.1.2 Tiles initialization. Let ts be the user-specied seed tile size. The algorithm starts with
partitioning Scseed into m subsets {P0, P1, ..., Pm−1} such that |Pi | = ts (except possibly for Pm−1),
Pi ∩ Pj = ∅, and ∪m−1i=0 Pi = Scseed. Among all possible legal partitionings, we choose the one that
splits Scseed into blocks of ts contiguous iterations, with P0 = {0, ..., ts − 1}, P1 = {ts, ..., 2ts − 1}, and
so on. We analogously partition Sbseed into k subsets. We createm + k + 1 tiles, one for each of these
partitions and one extra tile for Sneseed, namely T = [T c0 , ...,T cm−1,T bm , ...,T bm+k−1,T nem+k ]. At this point
we have an assignment of iterations to tiles for Lseed; that is, a tiling function σseed : Sseed → T.
This initial partitioning phase occurs at A1:2.
A tile Ti has four elds, as summarized in Table 2.
• The region is used by the executor to schedule tiles in a given order. This eld is set right
after the partitioning of Lseed, as a tile (by construction) exclusively belongs to Scseed, S
b
seed,
or Sneseed.• The iteration lists contain the iterations in L that Ti will have to execute. There is one
iteration list for each Lj ∈ L, indicated as [Ti ]j . At this stage of the inspection we have
[Ti ]seed = [Ti ]0 = Pi , whereas still [Ti ]j = ∅ for j = 1, ...,n − 1.
• Local maps may be used for performance optimization by the executor in place of the global
maps provided through the loop chain; this will be discussed in more detail in Section 7
and in the Supplementary Materials.
• The color provides a tile with a scheduling priority. If shared-memory parallelism is re-
quested, adjacent tiles are given dierent colors (the adjacency relation is determined
through the maps available in L). Otherwise, colors are assigned in increasing order (i.e.,Ti
is given color i). The boundary tiles are always given colors higher than that of core tiles;
the non-exec tile has the highest color. The assignment of colors is carried out by A1:6.
Field Possible values
region core, boundary, non-exec
iterations lists one list of iterations [Ti ]j for each Lj ∈ L
local maps one list of local maps for each Lj ∈ L; one local map for each map used in Lj
color an integer representing the execution priority
Table 2. The tile data structure.
5.1.3 The inspection loop. The inspection loop, starting at A1:7, schedules the remaining Lj ∈ L
by alternating dependence analysis and construction of tiling functions. The input is σseed. As seen
in the previous sections, a projection is a function ϕS : S → T that captures data dependences
across loops. Initially, the projections set Φ is empty (A1:1). Once a new loop is tiled, Φ is updated
by adding new projections or changing existing ones (see Section 5.1.4). Using Φ, a new tiling
function σj for Lj is derived (see Section 5.1.5).
5.1.4 Deriving a projection from a tiling function. Algorithm 3 takes as input (the descriptors
of) an Lj and its tiling function σj : S j → T to update Φ. The algorithm also updates the conicts
matrix C ∈ Nm×m , which indicates whether two tiles having the same color will become adjacent.
A new projection ϕS is needed if S is written by Lj . As explained in Section 4, ϕS carries the
necessary information to tile a subsequent loop accessing S . Let us consider the non-trivial case in
1The algorithm implemented in SLOPE, the library presented in Section 6.1, supports backwards tiling for shared-memory
parallelism.
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
1:14 F. Luporini et al.
which Lj writes indirectly to S through a map M : S j → Sa . To compute ϕS , we rst determine the
inverse map M−1 (A3:5; an example is shown in Figure 8). Then, we iterate over all elements e ∈ S
and we set ϕS [e] to the last tile writing to e . This is accomplished by applying the MAX function
over the color of the tiles accessing e (see A3:11), obtained through M−1. This procedure was used,
for example, to compute the projection in Figure 4b.
ALGORITHM 3: Projection of a tiled loop
Input: A loop Lj , a tiling function σj , the projections set Φ, the
conicts matrix C
Result: Update Φ and C
1 foreach D ∈ Lj .descriptors do
2 if D.map == ⊥ then
3 Φ[Sj ] ← σj ;
4 else
5 M−1 ← map_invert(D.map);
6 S , Sj , values, oset←M−1 .unpack ();
7 ϕS ←⊥;
8 for e in S do
9 for k = oset[e] to oset[e + 1] do
10 Tlast = T[values[k ]];
11 max←MAX(ϕS [e].color, Tlast .color);
12 if max , ϕS [e].color then
13 ϕS [e] ← Tlast ;
14 end
15 end
16 end
17 update_color_conflicts(C, T, ϕS );
18 Φ[S ] = ϕS ;
19 end
20 end
ALGORITHM 4: Building a tiling function
Input: A loop Lj , the projections set Φ
Output: The tiling function σj
1 σj ←⊥;
2 foreach D ∈ Lj .descriptors do
3 if D.map == ⊥ then
4 ϕSj ← Φ[Sj ];
5 for e in Sj do
6 max←MAX(σj [e].color, ϕSj [e].color);
7 if max , σj [e].color then
8 σj [e] ← ϕSj [e];
9 end
10 end
11 else
12 arity← D.map.arity;
13 ϕS ← Φ[D.map.S ];
14 for e in Sj do
15 σj [e] ← T⊥ ;
16 for k = 0 to arity do
17 T ← ϕS [D.map.values[e ∗ arity + k ]];
18 max←MAX(σj [e].color, T .color);
19 if max , σj [e].color then
20 σj [e] ← T ;
21 end
22 end
23 end
24 end
25 end
26 return σj
3
9
7
1
map
inverse
map
3 7 9
cell 1
offset
3 4
cells incident
 to vertex 3
1… …
2 6 75 8 9
cells incident
 to vertex 7
…1
10
cells incident
 to vertex 9
…… 1
Fig. 8. Representation of an inverse map. The original map shows that the triangular cell 1 is adjacent to
three vertices, namely 3, 7, and 9. The inverse map associates vertices to cells. Since the mesh is unstructured,
dierent vertices can be incident to a dierent number of cells. The array offset provides the distance
between two consecutive vertices in the inverse map. For instance, all entries in the inverse map between
offset[3] and offset[4] are cells incident to vertex 3.
5.1.5 Deriving a tiling function from the available projections. Using Φ, we compute σj as de-
scribed in Algorithm 4. The algorithm is similar to the projection of a tiled loop (e.g., maps are used
to access the neighborhood of a target iteration). The main dierence is that now the projections
in Φ are used, rather than computed, to schedule iterations to tiles so that data dependences are
honored. Finally, σj is used to populate the iteration lists [Ti ]j , for all Ti ∈ T (see A1:10).
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
Automated Tiling of Unstructured Mesh Computations 1:15
5.1.6 Detection of conflicts. If C indicates the presence of at least one conict, say between Ti1
and Ti2 , we add a “fake connection” between these two tiles and loop back to the coloring stage. Ti1
and Ti2 are now connected, so they will be assigned dierent colors.
5.1.7 On the history of the algorithm. The rst algorithm for generalized sparse tiling inspection
was introduced in [34]. In this section, a new, enhanced version of that algorithm has been presented.
In essence, the major dierences are: (i) support for distributed-memory parallelism; (ii) use of
coloring instead of a task graph for tile scheduling; (iii) speculative inspection with backtracking if
a coloring conict is detected; (iv) use of sets, instead of datasets, for data dependency analysis; (v)
use of inverse maps for parallelization of the projection and tiling routines; (vi) computation of
local maps. Most of these changes contributed to reduce the inspection cost, as discussed later.
5.2 Executor
The sparse tiling executor is illustrated in Algorithm 2. It consists of four main phases: (i) exchange of
halo regions amongst neighboring processes through non-blocking communications; (ii) execution
of core tiles (in overlap with communication); (iii) wait for the termination of the communications;
(iv) execution of boundary tiles.
As explained in Sections 3.3, a suciently deep halo region enables correct computation of
the boundary tiles. Further, tiles are executed atomically, meaning that all iterations in a tile are
computed without ever synchronizing with other processes. The depth of the boundary region,
which aects the amount of o-process data to be redundantly computed, increases with the number
n of loops to be fused. In the example in Figure 6, there are n = 3 loops, and three “strips” of extra
vertices are necessary for correctly computing the fused loops without tile-to-tile synchronizations.
We recall from Section 2 that the depth of the loop chain indicates the extent of the boundary
region. This parameter imposes a limit to the number of fusible loops. If L includes more loops
than the available boundary region – that is, if n > depth – then L will have to be split into
shorter loop chains, to be fused individually. As we shall see (Section 6), in our inspector/executor
implementation the depth is controlled by the Firedrake’s DMPlex module.
6 IMPLEMENTATION: SLOPE, PYOP2, AND FIREDRAKE
The implementation of automated sparse tiling is distributed over three open-source software
modules.
Firedrake An established framework for the automated solution of PDEs through the nite
element method [28].
PyOP2 A module used by Firedrake to apply numerical kernels over sets of mesh components.
Parallelism is handled at this level.
SLOPE A library for writing inspector/executor schemes, with primary focus on sparse tiling.
PyOP2 uses SLOPE to apply sparse tiling to loop chains.
The SLOPE library is an open source embodiment of the algorithms presented in this article.
The interplay amongst Firedrake, PyOP2 and SLOPE is outlined in Figure 9 and discussed in more
detail in the following sections.
6.1 SLOPE: a Library for Sparse Tiling Irregular Computations
SLOPE is an open source software that provides an interface to build loop chains and to express
inspector/executor schemes for sparse tiling2.
2SLOPE is available at https://github.com/coneoproject/SLOPE
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
1:16 F. Luporini et al.
1B
Slope
p
y
t
h
o
n
3
4
1C C programPyOP2 program
Firedrake
1A UFL specification
loop _chain
PyOP2
loop _chain
2
Fig. 9. Sparse tiling in the Firedrake-PyOP2-SLOPE framework. There are three ways of sparse tiling a loop
chain: decorating a Firedrake program (1A), decorating a sequence of loops in a PyOP2 program (1B), writing
both the loop chain and the inspector/executor codes explicitly in C through calls to SLOPE (1C). Both (1A)
and (1B) use the loop_chain interface (details in Section 6.2). PyOP2 derives the loop chain (2), essentially sets,
maps and loops (see Section 2 and example in Figure 2b), from the kernels produced within the loop_chain
context. The loop chain is provided to SLOPE through its Python interface (3). SLOPE performs the inspection
and returns its output, a tiles schedule, to PyOP2 (4). Eventually, the executor is generated and run by PyOP2.
The loop chain abstraction implemented by SLOPE has been formalized in Section 2. In essence,
a loop chain comprises some sets (including the separation into core, boundary, and non-exec
regions), maps between sets, and a sequence of loops. Each loop has one or more descriptors
specifying what and how dierent sets are accessed. The example in Figure 2b illustrates the
interface exposed by the library. SLOPE implements Algorithms 1, 3 and 4 from Section 5. Further,
it provides additional features to estimate the eectiveness and to verify the correctness of sparse
tiling, including generation of VTK les (suitable for Paraview [2]), to visualize the partitioning
of the mesh into colored tiles, as well as insightful inspection summaries (showing, for example,
number and average size of tiles, total number of colors used, time spent in critical inspection
phases).
In the case of shared-memory parallelism, the following sections of code are parallelized through
OpenMP:
Inspection The projection and tiling algorithms; in particular, the loops over set elements in
Algorithm 3 and Algorithm 4).
Execution The computation of same colored tiles; that is, the two loops over tiles in Algo-
rithm 2.
6.2 PyOP2: Lazy Evaluation and Interfaces
PyOP2 is a Python library oering abstractions to model an unstructured mesh – in terms of sets
(e.g. vertices, edges), maps between sets (e.g., a map from edges to vertices to express the mesh
topology), and datasets associating data to each set element (e.g. 3D coordinates to each vertex)
– and applying numerical kernels to sets of entities [28]. In this section, we focus on the three
relevant contributions to PyOP2 made through our work: (i) the interface to identify loop chains;
(ii) the lazy evaluation mechanism that allows loop chains to be built; (iii) the interaction with
SLOPE to automatically build and execute inspector/executor schemes.
To apply sparse tiling to a sequence of loops, the loop_chain interface was added to PyOP2. This
interface, exemplied in Figure 10, is also exposed to the higher layers, for example in Firedrake. In
the listing, the name uniquely identies a loop chain. Other parameters (most of them optional) are
useful for performance evaluation and performance tuning. Amongst them, the most important are
the tile_size and the fusion_scheme. The tile_size species the initial average size for the
seed partitions. The fusion_scheme allows to specify how to break a long sequence of loops into
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
Automated Tiling of Unstructured Mesh Computations 1:17
smaller loop chains, which makes it possible to experiment with a full set of sparse tiling strategies
without having to modify the source code.
with loop_chain(name , tile_size , fusion_scheme , ...):
<some PyOP2 parallel loops are expressed/generated here >
Fig. 10. The loop_chain interface in PyOP2.
PyOP2 exploits lazy evaluation of parallel loops to generate an inspector/executor scheme.
The parallel loops encountered during the program execution – or, analogously, those generated
through Firedrake – are pushed into a queue, instead of being executed immediately. The sequence
of parallel loops in the queue is called the trace. If a dataset f needs to be read, for example because
a user wants to inspect its values or a global linear algebra operation is performed, then the trace is
traversed – from the most recent parallel loop to the oldest one – and a new sub-trace produced.
The sub-trace includes all parallel loops that must be executed to evaluate f correctly. The sub-trace
can then be executed or further pre-processed.
All loops in a trace that were created within a loop_chain scope are sparse tiling candidates. In
detail, the interaction between PyOP2 and SLOPE is as follows:
(1) Figure 10 shows that a loop_chain denes a new scope. As this scope is entered, a stamp
s1 of the trace is generated. This happens “behind the scenes”, because the loop_chain is
a Python context manager, which can execute pre-specied routines prior and after the
execution of the body. As the loop_chain’s scope is exited, a new stamp s2 of the trace is
computed. All parallel loops in the trace generated between s1 and s2 are placed into a
sub-trace for pre-processing.
(2) The pre-processing consists of two steps: (i) “simple” fusion – consecutive parallel loops
iterating over the same iteration space that do not present indirect data dependencies are
merged; (ii) generation of a loop chain representation for SLOPE.
(3) In (ii), PyOP2 inspects the sequence of parallel loops and translates their metadata (sets,
maps, loops) into a format suitable for the SLOPE’s Python interface. SLOPE performs an
inspection for the received loop chain and returns a tiles schedule to PyOP2 (i.e., it runs
Algorithm 1).
(4) A “software cache” mapping loop_chains to inspections is used. This whole process needs
therefore be executed only once for a given loop_chain.
(5) The executor is generated, compiled and run directly by PyOP2, with the help of an API
provided by SLOPE. To run the executor, the tiles schedule produced in (3) is used.
6.3 Firedrake/DMPlex: the S-depth Mechanism for Extended Halo Regions
Firedrake uses DMPlex [23] to handle meshes. DMPlex is responsible for partitioning, distributing
over multiple processes, and locally reordering a mesh. The MPI parallelization is therefore managed
through Firedrake/DMPlex.
During the start-up phase, each MPI process receives a contiguous partition of the original
mesh from DMPlex. The required PyOP2 sets, which can represent either topological components
(e.g., cells, vertices) or function spaces, are created. As intuitively shown in Figure 1, these sets
distinguish between multiple regions: core, owned, exec, and non-exec. Firedrake initializes the
four regions exploiting the information provided by DMPlex.
To support the loop chain abstraction, Firedrake must be able to allocate arbitrarily deep halo
regions. Both Firedrake and DMPlex have been extended to support this feature [19]. A parameter
called S-depth (the name has historical origins, see for instance [7]) regulates the extent of the halo
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
1:18 F. Luporini et al.
regions. A value S-depth = n indicates the presence of n strips of o-process data elements in each
set. The default value is S-depth = 1, which enables computation-communication overlap when
executing a single loop at the price of a small amount of redundant computation along partition
boundaries. This is the default execution model in Firedrake.
7 PERFORMANCE EVALUATION
7.1 The Seigen Computational Framework
Seigen is a seismological modelling framework capable of solving the elastic wave equation on
unstructured meshes. Exploiting the well-known velocity-stress formulation [39], the seismic model
is expressible as two rst-order linear PDEs, which we refer to as velocity and stress. These
governing equations are discretized in space through the discontinuous-Galerkin nite element
method. The evolution over time is obtained by using a fourth-order explicit leapfrog scheme based
on a truncated Taylor series expansion of the velocity and stress elds. The particular choice of
spatial and temporal discretizations has been shown to be non-dissipative [8]. More details can
be found in [17]. Seigen, which is built on top of Firedrake, is part of OPESCI, an ecosystem of
software for seismic imaging based on automated code generation [15].
Seigen has a set of test cases, which dier in various aspects, such as the initial conditions of the
system and the propagation of waves. However, they are all based upon the same seismological
model; from a computational viewpoint, this means that, in a time step, the same sequence of loops
is executed. In the following, we focus on the explosive_source test case (see the work by [12]
for background details).
7.2 Implementation and Validation
In a time loop iteration, eight linear systems need to be solved, four from velocity and four from
stress. Each solve consists of three macro-steps: assembling a global matrix A; assembling a global
vector b; computing x in the systemAx = b. There are two global “mass” matrices, one for velocity
and one for stress. Both matrices are time invariant, so they are assembled before entering the
time loop, and block-diagonal, as a consequence of the spatial discretization employed (a block
belongs to an element in the mesh). The inverse of a block-diagonal matrix is again block-diagonal
and is determined by computing the inverse of each block. The solution of the linear systemAx = b,
expressible as x = bA−1, can therefore be evaluated by looping over the mesh and computing a
“small” matrix-vector product in each element, where the matrix is a block in A−1. Assembling
the global vectors boils down to executing a set of loops over mesh entities, particularly over
cells, interior facets, and exterior facets. Overall, twenty-ve loops are executed in a time loop
iteration. Thanks to the hierarchy of “software caches” employed by Firedrake, the translation from
mathematical syntax into loops is only performed once.
Introducing sparse tiling into Seigen was relatively straightforward. Three steps were required:
(i) embedding the time stepping loop in a loop_chain context (see Section 6.2), (ii) propagating
user input relevant for performance tuning, (iii) creating a set of fusion schemes. A fusion scheme
establishes which sub-sequences of loops within a loop_chain will be fused and the respective seed
tile sizes. If no fusion schemes were specied, all of the twenty-ve loops would be fused using a
default tile size. As we shall see, operating with a set of small loop chains and heterogeneous tile
sizes is often more eective than fusing long sequences of loops.
Seigen has several mechanisms to validate the correctness of the seismological model and the
test cases. The numerical results of all code versions (with and without tiling) were checked and
compared. Paraview was also used to verify the simulation output. Snapshots of the simulation
output are displayed in Figure 11.
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
Automated Tiling of Unstructured Mesh Computations 1:19
(a) Right aer the explosion. (b) Wave propagation. (c) Final snapshot.
Fig. 11. Three phases of the explosive_source test case in Seigen, following an explosion at a point source.
7.3 Computational Analysis of the Loops
We here discuss computational aspects of the twenty-ve fusible loops. The following considerations
derive from an analytical study of the data movement in the loop chain, extensive proling through
the Intel VTune Amplier tool [16], and rooine models (available in [17]).
Our initial hypothesis was that Seigen would have beneted from sparse tiling. Not only does
data reuse arise within single loops (e.g., by accessing vertex coordinates from adjacent cells), but
also across consecutive loops, through indirect data dependencies. This seemed to make Seigen a
natural t for sparse tiling. The eight “solver” loops perform matrix-vector multiplications in each
mesh element. It is well established that linear algebra operations of this kind are memory-bound.
Four of these loops arise from velocity, the others from stress. There is signicant data reuse
amongst the four velocity loops and amongst the four stress loops, since the same blocks in the
global inverse matrices are accessed. We therefore hypothesized performance gains if these loops
were fused through sparse tiling.
We also observed that because of the particular mesh employed, the exterior facet loops, which
implement the boundary conditions of the variational problems, have negligible cost. However, the
cells and facets loops do have signicant cost, and data reuse across them arises. Six loops iterate
over the interior mesh facets to evaluate facet integrals, which ensure the propagation of information
between adjacent cells in discontinuous-Galerkin methods. The operational intensity of these loops
is much lower than that of cell loops, and memory-boundedness is generally expected. Consecutive
facet and cell integral loops share elds, which creates cross-loop data reuse opportunities, thus
strengthening the hypothesis about the potential of sparse tiling in Seigen.
All computational kernels generated in Seigen are optimized through COFFEE [24], a system
used in Firedrake that, in essence, (i) minimizes the operation count by restructuring expressions
and loop nests, and (ii) maximizes auto-vectorization opportunities by applying transformations
such as array padding and by enforcing data alignment.
7.4 Setup and Reproducibility
There are two parameters that we can vary in explosive_source: the polynomial order of the
method, q, and the input mesh. We test the spectrum q ∈ {1, 2, 3, 4}. To test higher polynomial
orders, changes to both the spatial and temporal discretizations would be necessary. For the spatial
discretization, the most obvious choice would be tensor product function spaces, which at the
moment of writing is still unavailable in Firedrake. We use as input a two-dimensional rectangular
domain of xed size 300×150 tessellated with unstructured triangles (explosive_source only
supports two-dimension domains); to increase the number of elements in the domain, thus allowing
to weak scale, we vary the mesh spacing h.
The generality of the sparse tiling algorithms, the exibility of the loop_chain interface, and
the S-depth mechanism made it possible to experiment with a variety of fusion schemes without
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
1:20 F. Luporini et al.
changes to the source code. Five fusion schemes were devised, based on the following criteria: (i)
amount of data reuse, (ii) amount of redundant computation over the boundary region, (iii) memory
footprint (the larger, the smaller the tile size to t in cache). The fusion schemes are summarized
in Table 4. The full specication, along with the seed tile size for each sub loop chain, is available
at [46].
The experimentation was conducted on two platforms, whose specication is reported in Table 3.
The two platforms, Erebus (the “development machine”) and Helen (a Broadwell-based architecture
in the Helen cluster [33]) were idle and exclusive access had been obtained when the runs were
performed. Support for shared-memory parallelism is discontinued in Firedrake, so only distributed-
memory parallelism with 1 MPI process per physical core was tested. The MPI processes were
pinned to cores. The hyperthreading technology was tried, but found to be generally non-protable.
The code used for running the experiments was archived with the Zenodo data repository service:
Firedrake [42], PETSc [43], PETSc4py [44], FIAT [41], UFL [49], TSFC [48], PyOP2 [45], COFFEE [40],
SLOPE [47], and Seigen [46].
System Erebus Helen
Node
1x4-core
Intel I7-2600 3.4GHz
2x14-core
Intel Xeon E5-2680 v4 2.40GHz
DRAM 16 GB 128 GB (node)
Cache hierarchy L1=32KB, L2=256KB, L3=8MB L1=32KB, L2=256KB,L3/socket=35MB
Compilers Intel icc 16.0.2 Intel icc 16.0.3
Compiler ags -O3 -xHost -ip -O3 -xHost -ip
MPI version Open MPI 1.6.5 SGI MPT 2.14
Table 3. Systems specification.
Fusion
scheme
Number of
loop chains Criterion S-depth
fs1 3 Fuse costly cellsand facets loops 2
fs2 8 More aggressivethan fs1 2
fs3 6 fs2, include allsolver loops 2
fs4 3 More aggressivethan fs3 3
fs5 2 velocity, stress 4
Table 4. Fusion schemes summary.
In the following, for each experiment, we collect three measurements.
Overall completion time – OT Used to compute the maximum application speed-up when
sparse tiling is applied.
Average compute time – ACT Sparse tiling impacts the kernel execution time by increasing
data locality. Communication is also inuenced, especially in aggressive fusion schemes:
the rounds of communication decrease, while the data volume exchanged may increase.
ACT isolates the gain due to increased data locality from (i) the communication cost and (ii)
any other action performed in Firedrake (executing Python code) between the invocation
of kernels. This value is averaged across the processes.
Average compute and communication time – ACCT As opposed to ACT, the communi-
cation cost is also included in ACCT. By comparing ACCT and ACT, the communication
overhead can be derived.
As we shall see, all of these metrics will be essential for a complete understanding of the sparse
tiling performance.
To collect OT, ACT and ACCT, the following conguration was adopted. All experiments were
executed with “warm cache”; that is, with all kernels retrieved directly from the Firedrake’s software
cache of compiled kernels, so code generation and compilation times are not counted. All of the
non-tiled explosive_source tests were repeated three times. The minimum times are reported
(negligible variance). The cost of global matrix assembly – an operation that takes place before
entering the time loop – is not included in OT. Firedrake needs to be extended to assemble block-
diagonal matrices directly into vectors (an entry in the vector would represent a matrix block).
Currently, this is instead obtained in two steps: rst, by assembling into a CSR matrix; then, by
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
Automated Tiling of Unstructured Mesh Computations 1:21
h q OT original (s) OT tiled (s) ΩOT ΩACT ΩACCT
1.0
1 687 596 1.15 1.17 1.16
2 1453 1200 1.21 1.25 1.25
3 3570 2847 1.25 1.28 1.27
4 7057 5827 1.21 1.22 1.22
1.2
1 419 365 1.15 1.17 1.16
2 870 715 1.22 1.26 1.26
3 1937 1549 1.25 1.28 1.27
4 4110 3390 1.21 1.23 1.22
Table 5. OT, ACT and ACCT on Erebus, with 1 MPI process per core.
explicitly copying the diagonal into a vector (a Python operation). The assembly per se never
takes more than 3 seconds, so it was reasonable to exclude this temporary overhead from our
timing. The inspection cost due to sparse tiling is included in OT, and its overhead will be discussed
appropriately in a later section. Extra costs were minimized: no check-pointing, only two I/O
sessions (at the beginning and at the end of the computation), and minimal logging. The time loop
has a xed duration, while the time step depends on the mesh spacing h to satisfy the Courant-
Friedrichs-Lewy necessary condition (i.e., CFL limit) for the numerical convergence. In essence,
ner meshes require proportionately smaller time steps to ensure convergence.
We now proceed with discussing the achieved results. Below, a generic instance of explosive_source
optimized with sparse tiling will be referred to as a “tiled version”, otherwise the term “original
version” will be used.
7.5 Single-node Experimentation
Hundreds of single-node experiments were executed on Erebus, which was easily accessible in
exclusive mode and much quicker to use for VTune proling. The rationale of these experiments
was to assess how sparse tiling impacts the application performance by improving data locality.
We only show parallel runs at maximum capacity (1 MPI process per physical core); the benets
of sparse tiling in sequential runs tend to be negligible (if any), because (i) the memory access
latency is only marginally aected when a large proportion of bandwidth is available to a single
process, (ii) hardware prefetching is impaired by the small iteration space of tiles, (iii) translation
lookaside buer (TLB) misses are more frequent due to tile expansion. Points (ii) and (iii) will be
elaborated upon.
Table 5 reports execution times and speed-ups, indicated with the symbol Ω, of the tiled version
over the original version for the three metrics OT, ACT and ACCT. We report the best speed-up
obtained after varying a number of parameters (tile size, fusion scheme and other optimizations
discussed below).
The parameters that we empirically varied to obtain these results were: (i) the fusion scheme,
fsX, X ∈ {1, 2, 3, 4, 5}, see Table 4; (ii) the seed tile size – for each fs and q, up to four dierent
values chosen to maximize the likelihood of tting in L2 or L3 cache, were tried3.
Further, a smaller experimentation varying (i) type of indirection maps (local or global, see
Section 5) and (ii) tile shape (through dierent mesh partitioning algorithms), as well as introducing
(iii) software prefetching and (iv) extended boundary region (to minimize redundant computation)
led to the conclusion that these optimizations may either improve or worsen the execution time, in
ways that are too dicult to predict beforehand. We therefore decided to exclude these parameters
from the full search space. This simplies the analysis that will follow and also allowed the
3A less extensive experimentation with “more adverse” tile sizes showed that: (i) a very small value causes dramatic
slow-downs (up to 8× slower than the original versions); (ii) larger values cause proportionately greater performance drops.
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
1:22 F. Luporini et al.
140 250 320 400
Tile size factor
1.00
1.05
1.10
1.15
1.20
1.25
1.30
A
C
T
 s
p
e
e
d
-u
p
[q = 1, h = 1.0]
fs1 fs2 fs3 fs4 fs5
70 140 200 300
Tile size factor
1.00
1.05
1.10
1.15
1.20
1.25
1.30
A
C
T
 s
p
e
e
d
-u
p
[q = 2, h = 1.0]
fs1 fs2 fs3 fs4 fs5
45 60 75
Tile size factor
1.00
1.05
1.10
1.15
1.20
1.25
1.30
A
C
T
 s
p
e
e
d
-u
p
[q = 3, h = 1.0]
fs1 fs2 fs3 fs4 fs5
20 45 70
Tile size factor
1.00
1.05
1.10
1.15
1.20
1.25
1.30
A
C
T
 s
p
e
e
d
-u
p
[q = 4, h = 1.0]
fs1 fs2 fs3 fs4 fs5
140 250 320 400
Tile size factor
1.00
1.05
1.10
1.15
1.20
1.25
1.30
A
C
T
 s
p
e
e
d
-u
p
[q = 1, h = 1.2]
fs1 fs2 fs3 fs4 fs5
70 140 200 300
Tile size factor
1.00
1.05
1.10
1.15
1.20
1.25
1.30
A
C
T
 s
p
e
e
d
-u
p
[q = 2, h = 1.2]
fs1 fs2 fs3 fs4 fs5
45 60 75
Tile size factor
1.00
1.05
1.10
1.15
1.20
1.25
1.30
A
C
T
 s
p
e
e
d
-u
p
[q = 3, h = 1.2]
fs1 fs2 fs3 fs4 fs5
20 45 70
Tile size factor
1.00
1.05
1.10
1.15
1.20
1.25
1.30
A
C
T
 s
p
e
e
d
-u
p
[q = 4, h = 1.2]
fs1 fs2 fs3 fs4 fs5
Fig. 12. Search space exploration on Erebus, with h ∈ {1.0, 1.2} and q ∈ {1, 2, 3, 4}. Each plot shows the
average compute time (ACT) speed-up achieved by multiple tiled versions over the original (non-tiled) version.
The seed tile size of a loop chain in an fs, in terms of seed loop iterations, is the product of the “tile size
factor” (x-axis) and a pre-established multiplier (an integer in the range [1, 4]; full list available at [46]).
execution of the whole test suite in less than ve days. The interested reader is invited to refer to
the Supplementary Materials attached to this article for more information.
In Figure 12, we summarize the results of the search space exploration. A plot shows the ACT
speed-ups achieved by multiple tiled versions over the non-tiled version for given q and h. In these
single-node experiments, the ACCT trend was basically identical (as one can infer from Table 5),
with variations in speed-up smaller than 1%.
PyOP2 was enhanced with a loop chain analyzer (LCA) capable of estimating the best- and
worst-case tile memory footprint, as well as the percentage of data reuse ideally achievable4. We use
this tool, as well as Intel VTune, to explain the results shown in Figure 12. We make the following
observations.
• fs, unsurprisingly, is the parameter having largest impact on the ACT. By inuencing the
fraction of data reuse convertible into data locality, the amount of redundant computation
and the data volume exchanged, fusion schemes play a fundamental role in sparse tiling.
This makes automation much more than a desired feature: without any changes to the
source code, multiple sparse tiling strategies could be studied and tuned. Automation is one
of our major contributions, and this performance exploration justies the implementation
eort.
• There is a non-trivial relationship between ACT, q and fs. The aggressive fusion schemes
are more eective with high q – that is, with larger memory footprints – while they tend
to be less ecient, or even deleterious, when q is low. The extreme case is fs5, which fuses
two long sequences of loops (twelve and thirteen loops each). In Figure 12 (Erebus), fs5 is
never a winning choice, although the dierence between fs3/fs4 and fs5 decreases as q
grows. If this trend continued with q > 4, then the gain from sparse tiling could become
increasingly larger.
4It is part of our future plans to integrate this tool with SLOPE, where the eects of tile expansion can be taken into account
to provide better estimates.
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
Automated Tiling of Unstructured Mesh Computations 1:23
• A non-aggressive scheme fuses only a few small subsets of loops. As discussed in later
sections, these fusion schemes, despite aording larger tile sizes than the more aggressive
ones (due to the smaller memory footprint), suer from limited cross-loop data reuse. For
fs1, LCA determines that the percentage of reusable data in the three fused loop chains
decreases from 25% (q = 1) to 13% (q = 4). The drop is exacerbated by the fact that no
reuse can be exploited for the maps. Not only are these ideal values, but also a signicant
number of loops are left outside of loop chains. The combination of these factors motivate
the lack of substantial speed-ups. With fs2, a larger proportion of loops are fused and the
amount of shared data increases. The peak ideal reuse in a loop chain reaches 54%, which
translates into better ACTs. A similar growth in data reuse can be appreciated in more
aggressive fusion schemes, with a peak of 61% in one of the fs5’s loop chains. Nevertheless,
the performance of fs5 is usually worse than fs4. As we shall clarify in Section 7.8, this
is mainly due to the excessive memory footprint, which in turn leads to very small tiles.
Speculatively, we tried running a sixth fusion scheme: a single loop chain including all of
the 25 loops in a time iteration. In spite of an ideal data reuse of about 70%, the ACT was
always signicantly higher than all other schemes.
• Figure 12 shows the ACT for a “good selection” of tile size candidates. Our approach was as
follows. We took a very small set of problem instances and we tried a large range of seed
tile sizes. Very small tile sizes caused dramatic slow-downs, mostly because of ineective
hardware prefetching and TLB misses. Tile sizes larger than a certain fraction of L3 cache
(usually slightly larger than what a core should ideally own) also led to increasingly higher
ACTs. If we consider q = 4 in Figure 12, we observe that the ACT of fs2, fs3, and fs4
grows when the initial number of iterations in a tile is as big as 70. Here, LCA shows that
the tile memory footprint is, in multiple loop chains, higher than 3MB, with a peak of 6MB
in fs3. This exceeds the proportion of L3 cache that a process owns (on average), which
explains the performance drop.
7.6 Multi-node Experimentation
Weak-scaling experiments were carried out on thirty-two nodes split over two racks in the Helen
cluster at Imperial College London [33, 38]. The node architecture as well as the employed software
stack are summarized in Table 3. The rationale of these experiments was to assess whether the
change in communication pattern and amount of redundant computation caused by sparse tiling
could aect the run-time performance.
For eachq in the usual range [1−4], fs3 generally resulted in the best performance improvements,
due to its trade-o between gained data locality and restrained redundant computation (whose
eect obviously worsen as q grows). Figure 13 summarizes the ACCT speed-ups achieved by the
best tiled version (i.e., the one found by empirically varying the tile size, with same tile size factor
as in Figure 12) over the original version. The weak scaling trend is remarkable, with only a small
drop in the case q = 1 when switching from one rack (448 cores) to two racks (896 cores), which
disappears as soon as the per-process workload becomes signicant. For instance, with q = 2, each
process already computes over more than 150k degrees of freedom for the velocity and stress elds.
The peak speed-up over the original version, 1.28×, was obtained in the test case q = 3 when
running with 448 processes. The performance achieved was 1.84 TFLOPs/s (158 GFLOPs required
at each time step, for a total of 1000 time steps and an ACCT of 86 seconds); this corresponds
to roughly 15% of the theoretical machine peak (assuming AVX base frequency; the architecture
ideally performs 16 double-precision FLOPs per cycle).
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
1:24 F. Luporini et al.
28
(119K)
56
(240K)
112
(477K)
224
(953K)
448
(1.91M)
896
(3.80M)
Number of processes
(Mesh elements)
0.0
0.2
0.4
0.6
0.8
1.0
1.2
1.4
A
C
C
T
 s
p
e
e
d
-u
p
 (
o
ri
g
in
a
l 
/ 
ti
lin
g
)
Weak scaling on Helen
q = 1 q = 2 q = 3 q = 4
Fig. 13. Weak scaling performance of Seigen’s explosive_source on the Helen cluster. The ACCT speed-up
is relative to a single process. Results for q ∈ {1, 2, 3, 4} are displayed. The simulations ran for 1000 timesteps.
7.7 Negligible Inspection Overhead
As explained in Section 7.4, the inspection cost was included in OT. In this section, we quantify
this overhead in a representative problem instance. In all other problem instances, the overhead
was either signicantly smaller than or essentially identical (for reasons discussed below) to that
reported here. We consider explosive_source on Erebus with h = 1.0, q = 4, and fs5. With this
conguration, the time step was ∆t = 481 · 10−6 (we recall from Section 7.4 that ∆t is a function
of h). Given the simulation duration T = 2.5, in this test case 5198 time steps were performed. A
time step took on average 1.15 seconds. In each time step, twenty-ve loops, fused as specied by
fs5, are executed. We know that in fs5 there are two sub loop chains, which respectively consist
of thirteen and twelve loops. To inspect these two loop chains, 1.4 and 1.3 seconds were needed
(average across the four MPI ranks, with negligible variance). Roughly 98% of the inspection time
was due to the projection and tiling functions, while only 0.2% was spent on the tile initialization
phase (see Section 5). These proportions are consistent across other fusion schemes and test cases.
After 200 time steps (less than 4% of the total) the inspection overhead was already close to 1%.
Consequently, the inspection cost, in this test case, was eventually negligible. The inspection cost
increases with the number of fused loops, which motivates the choice of fs5 for this analysis.
7.8 On the Main Performance Limiting Factor
The tile structure, namely its shape and size, is the key factor aecting sparse tiling in Seigen.
The seed loop partitioning and the mesh numbering determine the tile structure. The simplest
way of creating tiles consists of “chunking” the seed iteration space every ts iterations, with ts
being the initial tile size (see Section 5). Hence, the chunk partitioning inherits the original mesh
numbering. In Firedrake, and therefore in Seigen, meshes are renumbered during initialization
applying the reverse Cuthill-McKee (RCM) algorithm. Using chunk partitioning on top of an RCM-
renumbered mesh has the eect of producing thin, rectangular tiles, as displayed in Figure 14a.
This dramatically aects tile expansion, as a large proportion of elements will lie on the tile border.
There are potential solutions to this problem. The most promising would consist of using a Hilbert
curve, rather than RCM, to renumber the mesh. This would lead to more regular polygonal tiles
when applying chunk partitioning. Figures 14b and 14c show the tile structure that we would
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
Automated Tiling of Unstructured Mesh Computations 1:25
(a) chunk partitioning in Seigen. (b) Ideal Hilbert partitioning. (c) Zoomed in Hilbert partitioning.
Fig. 14. Representation of seed iteration space partitioning strategies via SLOPE and Paraview.
ideally want in Seigen, from a Hilbert-renumbered mesh produced outside of Firedrake. As later
discussed, introducing support for Hilbert curves in Firedrake is part of our future work.
The memory footprint of a tile grows quite rapidly with the number of fused loops. In particular,
the matrices accessed in the velocity and stress solver loops have considerable size. The larger
the memory footprint, the smaller ts for the tile to t in some level of cache. Allocating small tiles
has unfortunately multiple implications. First, the proportion of iterations lying on the border grows,
which worsen the tile expansion phenomenon discussed, thus aecting data locality. Secondly, a
small ts impairs hardware prefetching, since the virtual address streams become more irregular.
Finally, using small tiles implies that a proportionately larger number of tiles are needed to “cover”
the iteration space; in the case of shared-memory parallelism, this increases the probability of
coloring conicts, which result in higher inspection costs.
As reiterated in the previous sections, the maximum length of a loop chain is dictated by the
extent of the boundary region. Unfortunately, a long loop chain has two undesirable eects. First,
the redundant computation overhead tends to be non-negligible if some of the fused loops are
compute-bound. Sometimes, the overhead could be even larger than the gain due to increased data
locality. Second, the size of an S-level (a “strip” of boundary elements) grows with the depth of the
boundary region, as outer levels include more elements than inner levels. This increases not only
the amount of redundant computation, but also the volume of data to be communicated. Overall,
this suggests that multiple, short loop chains may guarantee the best performance improvement,
as indicated by the results in Sections 7.5 and 7.6.
8 RELATED WORK
The data dependence analysis that we have developed in this article is based on the loop chain
abstraction, which was originally presented in [21]. This abstraction is suciently general to
capture data dependencies in programs structured as arbitrary sequences of loops, particularly to
create inspector/executor schemes for many unstructured mesh application. Inspector/executor
strategies were rst formalized by [32]. They have been used to exploit data reuse and to expose
shared-memory parallelism in several studies [9, 10, 20, 36].
Sparse tiling is a technique based upon inspection/execution. The term was coined by [36, 37] in
the context of the Gauss-Seidel algorithm and also used in [35] in the Moldyn benchmark. However,
the technique was initially proposed by [10] to parallelize computations over unstructured meshes,
taking the name of unstructured cache blocking. In this work, the mesh was initially partitioned; the
partitioning represented the tiling in the rst sweep over the mesh. Tiles would then shrink by one
layer of vertices for each iteration of the loop. This shrinking represented what parts of the mesh
could be accessed in later iterations of the outer loop without communicating with the processes
executing other tiles. The unstructured cache blocking technique also needed to execute a serial
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
1:26 F. Luporini et al.
clean-up tile at the end of the computation. [1] also developed an algorithm very similar to sparse
tiling, to parallelize Gauss-Seidel computations. The main dierence between [36, 37] and [10] was
that in the former work the tiles fully covered the iteration space, so a sequential clean-up phase at
the end could be avoided. All these approaches were either specic to individual benchmarks or
not capable of scheduling across heterogeneous loops (e.g., one over cells and another over degrees
of freedom). These limitations had been addressed in [34].
The automated code generation technique presented in [29] examines the data anity among
loops and performs partitioning with the goal of minimizing inter-process communication, while
maintaining load balancing. This technique supports unstructured mesh applications (being based
on an inspector/executor strategy) and targets distributed memory systems, although it does not
exploit the loop chain abstraction and does not introduce any sort of loop reordering transformation.
Automated code generation techniques based on polyhedral compilers have been applied to
structured grid benchmarks or proxy applications [5]. However, there has been very little eort in
providing evidence that these tools can be eective in real-world applications. Time-loop diamond
tiling was applied in [4] to a proxy application, but experimentation was limited to shared-memory
parallelism. In [31], instead, an approach based on raising the level of abstraction, similar to the
one presented in this paper, is described and evaluated. The experimentation is conducted using
realistic stencil codes ported to the OPS library. The main dierence with respect to our work is
the focus on structured grids (i.e., dierent types of numerical methods are targeted).
In structured codes, multiple layers of halo, or “ghost” elements, are often used to reduce
communication [3]. Overlapped tiling exploits the very same idea: trading communication for
redundant computation along the boundary [50]. Several works tackle overlapped tiling within
single regular loop nests (mostly stencil-based computations), for example [6, 22, 25]. Techniques
known as “communication avoiding” [9, 26] also fall in this category. To the best of our knowledge,
overlapped tiling for unstructured mesh applications has only been studied analytically, by [13].
Further, we are not aware of any prior techniques for automation.
9 FUTURE WORK
Using a Hilbert curve numbering will lead to dramatically better tile shapes, thus mitigating the
performance penalties due to tile expansion, TLB misses and hardware prefetching described in
Section 7. This extension is at the top of our future work priorities.
Shared-memory parallelism was not as carefully tested as distributed-memory parallelism. First
of all, we would like to replace the current OpenMP implementation in SLOPE with the MPI Shared
Memory (SHM) model introduced in MPI-3. Not only does a unied programming model provide
signicant benets in terms of maintainability and complexity, but the performance may also be
greater as suggested by recent developments in the PETSc community. Secondly, some extra work
would be required for a fair comparison of this new hybrid MPI+MPI programming model with
and without sparse tiling.
The experimentation was carried out on a number of “conventional” Intel Xeon architectures;
we aim to repeat the experimentation on the Intel’s Knights Landing soon.
Finally, a cost model for automatic derivation of fusion schemes and tile sizes is still missing.
10 CONCLUSIONS
Sparse tiling aims to turn the data reuse in a sequence of loops into data locality. In this article,
three main problems have been addressed: the specialization of sparse tiling to unstructured mesh
applications via a revisited loop chain abstraction, automation via DSLs, eective support for
shared- and distributed-memory parallelism. The major strength of this work lies in the fact that
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
Automated Tiling of Unstructured Mesh Computations 1:27
all algorithmic and technological contributions presented derive from an in-depth study of realistic
application needs. The performance issues we found through Seigen would never have been exposed
by a set of simplistic benchmarks, as often used in the literature. Further experimentation will be
necessary when 3D domains and high-order discretizations will be supported by Seigen. In essence,
the performance experimentation shows systematic speed-ups, in the range 1.10×-1.30×. This is
hopefully improvable by switching to Hilbert curve numberings and by exploiting shared memory
through a suitable paradigm. Finally, our opinion is that sparse tiling is an “extreme” optimization:
at least for unstructured mesh application, it is unlikely that it will lead to speed-ups in the order
of magnitudes. However, through automation via DSLs, and with suitable optimization and tuning,
it may still play a key role in improving the performance of real-world computations.
11 SUPPLEMENTARY MATERIALS
11.1 Summary of the Optimizations Aempted in Seigen
A number of potential optimizations were attempted when applying sparse tiling to Seigen. The
experimentation with these optimizations was, however, inconclusive. Although performance
improvements were observed in various problem instances, in a signicant number of other cases
either a detrimental eect or no benets were noticed at all. Below we briey discuss the impact of
four dierent execution strategies: (i) use of global maps, (ii) variation in tile shape, (iii) software
prefetching, (iv) use of an extended boundary region to minimize redundant computation.
Global and local maps Algorithm 1 computes so called local maps to avoid an extra level
of indirection in the executor. Although no data reuse is available for the local maps (each
fused loop has its own local maps), there might be benets from improved hardware
prefetching and memory latency. We compared the use of global and local maps (i.e., the
former are normally constructed by Firedrake and provided in the loop chain specication,
the latter are computed by SLOPE), but no denitive conclusion could be drawn, as both
performance improvements and deteriorations within a 5% range were observed.
Tile shape In Section 7.8 we have explained that a Hilbert-renumbered mesh might sub-
stantially improve the tile shape quality. Hilbert curves, however, are not supported in
Firedrake yet. An alternative consists of partitioning the seed iteration space with a library
like METIS [18] before applying RCM. We experimented and discovered that this approach
too was not exempt from side eects. The main cause was increased translation lookaside
buer (TLB) misses, which occur whenever the CPU cannot retrieve the mapping to the
physical page corresponding to a given virtual page. Since the page table has a hierarchical
structure, handling a TLB miss usually requires multiple accesses to memory. Hence, TLB
misses are much more costly than cache misses. Sparse tiling increases the TLB miss/hit
ratio because of the fragmented streams of virtual addresses. This is evident (and more pro-
nounced) when the tile size is small, in which case a TLB miss is quite likely to occur when
jumping to executing a new loop. This problem is exacerbated by the metis partitioning (in
contrast to chunk), which leads to irregular tile shapes. Here, tile expansion may eventually
incorporate iterations living in completely dierent virtual pages. VTune experimentation
with q = 1 and q = 2 versions of explosive_source showed that chunk- and metis-based
sparse tiling suer from an increase in TLB misses of roughly 16% and 35%, respectively.
To mitigate this issue, we explored the possibility of using larger virtual pages through
Linux’s Transparent Huge Pages mechanism, which was enabled to automatically allocate
memory in virtual pages of 2MB (instead of the default 4KB) – as long as the base array
addresses were properly aligned. However, no signicant dierences were observed, and a
deeper investigation is still necessary.
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
1:28 F. Luporini et al.
Software prefetching In a loop, there is usually more than a single stream of memory ac-
cesses amenable to hardware prefetching (e.g., accesses to the indirection maps; direct
accesses to data values; indirect accesses to data values if the mesh has a good numbering).
Sparse tiling, unfortunately, impairs hardware prefetching for two reasons: (i) the virtual
addresses streams are considerably shorter; (ii) tile expansion introduces irregularities in
these streams. Software prefetching can be used together with hardware prefetching to
minimize memory stalls. PyOP2 and SLOPE have been extended to emit intrinsics instruc-
tions to prefetch the iteration i’s maps and data values while executing the iteration i − k
at distance k . No compelling evidence that this further transformation could systematically
improve the performance was found.
Extended boundary region The special non-exec tileTne (see Sections 3 and 5) reduces the
amount of redundant computation in long loop chains by expanding over boundary tiles.
There are two ways of creating Tne : either an extra layer of data is added to the boundary
region (e.g., see Figure 6) or during inspection, by searching for mesh boundaries. The
current implementation only supports the rst option. Manually derivingTne would be not
only algorithmically complex, but also potentially very expensive.
REFERENCES
[1] M. F. Adams and J. Demmel. 1999. Parallel Multigrid Solver Algorithms and Implementations for 3D Unstructured
Finite Element Problem. In Proceedings of SC99. Portland, Oregon.
[2] Utkarsh Ayachit. 2015. The ParaView Guide (Full Color Version): A Parallel Visualization Application (paraview 4.3 ed.).
Kitware, Incorporated. http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/1930934300
[3] Federico Bassetti, Kei Davis, and Daniel J. Quinlan. 1998. Optimizing Transformations of Stencil Operations for Parallel
Object-Oriented Scientic Frameworks on Cache-Based Architectures. In Proceedings of the Second International
Symposium on Computing in Object-Oriented Parallel Environments (ISCOPE ’98). Springer-Verlag, London, UK, UK,
107–118. http://dl.acm.org/citation.cfm?id=646894.709707
[4] Uday Bondhugula, Vinayaka Bandishti, Albert Cohen, Guillain Potron, and Nicolas Vasilache. 2014. Tiling and
Optimizing Time-iterated Computations on Periodic Domains. In Proceedings of the 23rd International Conference on
Parallel Architectures and Compilation (PACT ’14). ACM, New York, NY, USA, 39–50. DOI:https://doi.org/10.1145/
2628071.2628106
[5] Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. 2008. A Practical Automatic Polyhedral
Parallelizer and Locality Optimizer. In Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language
Design and Implementation (PLDI ’08). ACM, New York, NY, USA, 101–113. DOI:https://doi.org/10.1145/1375581.
1375595
[6] Li Chen, Zhao-Qing Zhang, and Xiao-Bing Feng. 2002. Redundant computation partition on distributed-memory
systems. In Algorithms and Architectures for Parallel Processing, 2002. Proceedings. Fifth International Conference on.
252–260. DOI:https://doi.org/10.1109/ICAPP.2002.1173583
[7] A. T. Chronopoulos. 1991. s-Step Iterative Methods for (Non)Symmetric (In)Denite Linear Systems. SIAM J. Numer.
Anal. 28, 6 (1991), 1776–1789. DOI:https://doi.org/10.1137/0728088 arXiv:http://dx.doi.org/10.1137/0728088
[8] Sarah Delcourte, Loula Fezoui, and Nathalie Glinsky-Olivier. 2009. A high-order Discontinuous Galerkin method for
the seismic wave propagation. ESAIM: Proceedings 27 (2009), 70–89. DOI:https://doi.org/10.1051/proc/2009020
[9] James Demmel, Mark Hoemmen, Marghoob Mohiyuddin, and Katherine Yelick. 2008. Avoiding Communication in
Sparse Matrix Computations. In Proceedings of International Parallel and Distributed Processing Symposium (IPDPS).
IEEE Computer Society.
[10] Craig C. Douglas, Jonathan Hu, Markus Kowarschik, Ulrich Rüde, and Christian Weiß. 2000. Cache Optimization for
Structured and Unstructured Grid Multigrid. Electronic Transaction on Numerical Analysis (February 2000), 21–40.
[11] Denys Dutykh, Raphaël Poncet, and Frédéric Dias. 2011. The VOLNA code for the numerical modeling of tsunami
waves: Generation, propagation and inundation. European Journal of Mechanics - B/Fluids 30, 6 (2011), 598 – 615.
Special Issue: Nearshore Hydrodynamics.
[12] W. W. Garvin. 1956. Exact Transient Solution of the Buried Line Source Problem. Proceedings of the Royal Society of
London, Series A 234, 1199 (1956). DOI:https://doi.org/10.1098/rspa.1956.0055
[13] M. B. Giles, G. R. Mudalige, C. Bertolli, P. H. J. Kelly, E. Laszlo, and I. Reguly. 2012. An Analytical Study of Loop
Tiling for a Large-Scale Unstructured Mesh Application. In Proceedings of the 2012 SC Companion: High Performance
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
Automated Tiling of Unstructured Mesh Computations 1:29
Computing, Networking Storage and Analysis (SCC ’12). IEEE Computer Society, Washington, DC, USA, 477–482. DOI:
https://doi.org/10.1109/SC.Companion.2012.68
[14] M. B. Giles, G. R. Mudalige, Z. Sharif, G. Markall, and P. H. J. Kelly. 2011. Performance Analysis of the OP2 Framework
on Many-core Architectures. SIGMETRICS Perform. Eval. Rev. 38, 4 (March 2011), 9–15. DOI:https://doi.org/10.1145/
1964218.1964221
[15] Gerard Gorman, Marcos de Aguiar, David Ham, Felix Herrmann, Christian Jacobs, Paul Kelly, Michael Lange, Chris
Pain, Matthew Piggott, Spencer Sherwin, Felippe Vieira Zacarias, Mike Warner, Fabio Luporini, and Navjot Kukreja.
2015. OPESCI project web page. http://www.opesci.org. (2015).
[16] Intel Corporation. 2016. Intel VTune Performance Analyzer. software.intel.com/en-us/intel-vtune-amplier-xe. (2016).
[17] Christian T. Jacobs, Michael Lange, Fabio Luporini, and Gerard J. Gorman. 2017. Application of code generation to
high-order seismic modelling with the discontinuous Galerkin nite element method. (2017). In preparation.
[18] George Karypis and Vipin Kumar. 2011. MeTis: Unstructured Graph Partitioning and Sparse Matrix Ordering System,
Version 5.0. http://www.cs.umn.edu/~metis. (2011).
[19] Matthew G. Knepley, Michael Lange, and Gerard Gorman. 2015. Unstructured Overlapping Mesh Distribution in
Parallel. Submitted to ACM Transactions on Mathematical Software (TOMS) (2015). arXiv:1506.06194 http://arxiv.org/
abs/1506.06194
[20] Christopher D. Krieger and Michelle Mills Strout. 2012. Executing Optimized Irregular Applications Using Task Graphs
Within Existing Parallel Models. In Proceedings of the Second Workshop on Irregular Applications: Architectures and
Algorithms (IA3) held in conjunction with SC12.
[21] Christopher D. Krieger, Michelle Mills Strout, Catherine Olschanowsky, Andrew Stone, Stephen Guzik, Xinfeng
Gao, Carlo Bertolli, Paul Kelly, Gihan Mudalige, Brian Van Straalen, and Sam Williams. 2013. Loop Chaining: A
Programming Abstraction For Balancing Locality and Parallelism. In Proceedings of the 18th International Workshop on
High-Level Parallel Programming Models and Supportive Environments (HIPS). Boston, Massachusetts, USA.
[22] Sriram Krishnamoorthy, Muthu Baskaran, Uday Bondhugula, J. Ramanujam, Atanas Rountev, and P Sadayappan.
2007. Eective Automatic Parallelization of Stencil Computations. SIGPLAN Not. 42, 6 (June 2007), 235–244. DOI:
https://doi.org/10.1145/1273442.1250761
[23] Michael Lange, Lawrence Mitchell, Matthew G. Knepley, and Gerard J. Gorman. 2016. Ecient mesh management in
Firedrake using PETSc-DMPlex. SIAM Journal on Scientic Computing 38, 5 (2016), S143–S155. DOI:https://doi.org/10.
1137/15M1026092 arXiv:cs.MS/1506.07749
[24] Fabio Luporini, Ana Lucia Varbanescu, Florian Rathgeber, Gheorghe-Teodor Bercea, J. Ramanujam, David A. Ham,
and Paul H. J. Kelly. 2015. Cross-Loop Optimization of Arithmetic Intensity for Finite Element Local Assembly. ACM
Trans. Archit. Code Optim. 11, 4, Article 57 (Jan. 2015), 25 pages. DOI:https://doi.org/10.1145/2687415
[25] Jiayuan Meng and Kevin Skadron. 2009. Performance Modeling and Automatic Ghost Zone Optimization for Iterative
Stencil Loops on GPUs. In Proceedings of the 23rd International Conference on Supercomputing (ICS ’09). ACM, New
York, NY, USA, 256–265. DOI:https://doi.org/10.1145/1542275.1542313
[26] Marghoob Mohiyuddin, Mark Hoemmen, James Demmel, and Katherine Yelick. 2009. Minimizing communication
in sparse matrix solvers. In Proceedings of the Conference on High Performance Computing Networking, Storage and
Analysis (SC ’09). ACM, Article 36, 12 pages. DOI:https://doi.org/10.1145/1654059.1654096
[27] Florian Rathgeber. 2014. Productive and ecient computational science through domain-specic abstractions. Ph.D.
Dissertation. Imperial College London.
[28] Florian Rathgeber, David A. Ham, Lawrence Mitchell, Michael Lange, Fabio Luporini, Andrew T. T. Mcrae, Gheorghe-
Teodor Bercea, Graham R. Markall, and Paul H. J. Kelly. 2016. Firedrake: Automating the Finite Element Method by
Composing Abstractions. ACM Trans. Math. Softw. 43, 3, Article 24 (Dec. 2016), 27 pages. DOI:https://doi.org/10.1145/
2998441
[29] Mahesh Ravishankar, John Eisenlohr, Louis-Noël Pouchet, J. Ramanujam, Atanas Rountev, and P. Sadayappan. 2012.
Code generation for parallel execution of a class of irregular loops on distributed memory systems. In Proc. Intl. Conf.
on High Perf. Comp., Net., Sto. & Anal. Article 72, 11 pages.
[30] I. Z. Reguly, G. R. Mudalige, C. Bertolli, M. B. Giles, A. Betts, P. H. J. Kelly, and D. Radford. 2016. Acceleration of a
Full-Scale Industrial CFD Application with OP2. IEEE Transactions on Parallel and Distributed Systems 27, 5 (May 2016),
1265–1278. DOI:https://doi.org/10.1109/TPDS.2015.2453972
[31] István Z Reguly, Gihan R Mudalige, and Mike B Giles. 2017. Loop Tiling in Large-Scale Stencil Codes at Run-time with
OPS. arXiv preprint arXiv:1704.00693 (2017).
[32] Joel H. Salz, Ravi Mirchandaney, and Kay Crowley. 1991. Run-Time Parallelization and Scheduling of Loops. IEEE
Trans. Comput. 40, 5 (1991), 603–612.
[33] Imperial College High Performance Computing Service. 2017. The cx2/Helen cluster. (2017). DOI:https://doi.org/10.
14469/hpc/2232
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
1:30 F. Luporini et al.
[34] M.M. Strout, F. Luporini, C.D. Krieger, C. Bertolli, G.-T. Bercea, C. Olschanowsky, J. Ramanujam, and P.H.J. Kelly. 2014.
Generalizing Run-Time Tiling with the Loop Chain Abstraction. In Parallel and Distributed Processing Symposium,
2014 IEEE 28th International. 1136–1145. DOI:https://doi.org/10.1109/IPDPS.2014.118
[35] Michelle Mills Strout, Larry Carter, and Jeanne Ferrante. 2003. Compile-time Composition of Run-time Data and
Iteration Reorderings. In Proc. ACM SIGPLAN Conf. Prog. Lang. Des. & Impl. (PLDI). ACM, New York, NY, USA.
[36] Michelle Mills Strout, Larry Carter, Jeanne Ferrante, Jonathan Freeman, and Barbara Kreaseck. 2002. Combining
Performance Aspects of Irregular Gauss-Seidel via Sparse Tiling. In Proceedings of the 15th Workshop on Languages
and Compilers for Parallel Computing (LCPC). Springer.
[37] Michelle Mills Strout, Larry Carter, Jeanne Ferrante, and Barbara Kreaseck. 2004. Sparse Tiling for Stationary Iterative
Methods. International Journal of High Performance Computing Applications 18, 1 (February 2004), 95–114.
[38] TOP500. 2016. cx2/Helen cluster specication in the TOP500 ranking. https://www.top500.org/system/178845. (2016).
[39] Jean Virieux. 1986. P-SV wave propagation in heterogeneous media; velocity-stress nite-
dierence method. Geophysics 51, 4 (1986), 889–901. DOI:https://doi.org/10.1190/1.1442147
arXiv:http://geophysics.geoscienceworld.org/content/51/4/889.full.pdf
[40] Zenodo/COFFEE. 2017. coneoproject/COFFEE: A Compiler for Fast Expression Evaluation. (July 2017). DOI:https:
//doi.org/10.5281/zenodo.836678
[41] Zenodo/FIAT. 2017. redrakeproject/at: The Finite Element Automated Tabulator. (July 2017). DOI:https://doi.org/
10.5281/zenodo.836679
[42] Zenodo/Firedrake. 2017. redrakeproject/redrake: an automated nite element system. (July 2017). DOI:https:
//doi.org/10.5281/zenodo.836680
[43] Zenodo/PETSc. 2017. redrakeproject/petsc: Portable, Extensible Toolkit for Scientic Computation. (July 2017). DOI:
https://doi.org/10.5281/zenodo.836685
[44] Zenodo/PETSc4Py. 2017. redrakeproject/petsc4py: The Python interface to PETSc. (July 2017). DOI:https://doi.org/
10.5281/zenodo.836684
[45] Zenodo/PyOP2. 2017. OP2/PyOP2: Framework for performance-portable parallel computations on unstructured
meshes. (July 2017). DOI:https://doi.org/10.5281/zenodo.836688
[46] Zenodo/Seigen. 2017. OPESCI/Seigen: An elastic wave equation solver for seismological problems based on the nite
element method . (Aug. 2017). DOI:https://doi.org/10.5281/zenodo.840000
[47] Zenodo/SLOPE. 2017. coneoproject/SLOPE: A run-time system to tile irregular loops. (July 2017). DOI:https:
//doi.org/10.5281/zenodo.836738
[48] Zenodo/TSFC. 2017. redrakeproject/tsfc: The Two Stage Form Compiler redrakeproject/tsfc: The Two Stage Form
Compiler. (July 2017). DOI:https://doi.org/10.5281/zenodo.836677
[49] Zenodo/UFL. 2017. redrakeproject/u: The Unied Form Language. (July 2017). DOI:https://doi.org/10.5281/zenodo.
836683
[50] Xing Zhou, Jean-Pierre Giacalone, María Jesús Garzarán, Robert H. Kuhn, Yang Ni, and David Padua. 2012. Hierarchical
Overlapped Tiling. In Proceedings of the Tenth International Symposium on Code Generation and Optimization (CGO
’12). ACM, New York, NY, USA, 207–218. DOI:https://doi.org/10.1145/2259016.2259044
, Vol. 1, No. 1, Article 1. Publication date: June 2019.
