Sparse tiling is a technique to fuse loops that access common data, thus increasing data locality. Unlike traditional loop fusion or blocking, the loops may have different iteration spaces and access shared datasets through indirect memory accesses, such as A[map[i]]-hence the name "sparse." One notable example of such loops arises in discontinuous-Galerkin finite element methods, because of the computation of numerical integrals over different domains (e.g., cells, facets). The major challenge with sparse tiling is implementationnot only is it cumbersome to understand and synthesize, but it is also onerous to maintain and generalize, as it requires a complete rewrite of the bulk of the numerical computation. In this article, we propose an approach to extend the applicability of sparse tiling based on raising the level of abstraction. Through a sequence of compiler passes, the mathematical specification of a problem is progressively lowered, and eventually sparse-tiled C for-loops are generated. Besides automation, we advance the state-of-the-art by introducing a revisited, more efficient sparse tiling algorithm; support for distributed-memory parallelism; a range of fine-grained optimizations for increased runtime performance; implementation in a publicly available library, SLOPE; and an in-depth study of the performance impact in Seigen, a real-world elastic wave equation solver for seismological problems, which shows speed-ups up to 1.28× on a platform consisting of 896 Intel Broadwell cores.
INTRODUCTION
In many unstructured mesh applications, for example those approximating the solution of partial differential equations (PDEs) using the finite volume or the finite element method, sequences of numerical operators accessing common fields need to be evaluated. Usually, these operators are implemented by iterating over sets of mesh elements and computing a kernel in each element. In languages such as C or Fortran, the resulting sequence of loops is typically characterized by heterogeneous iteration spaces and accesses to shared datasets (reads, writes, increments) through indirect pointers, like A[map[i]]. One notable example of such operators/loops arises in discontinuous-Galerkin finite element methods, in which numerical integrals over different domains (e.g., cells, facets) are evaluated; here, A could represent a discrete function, whereas map could store connectivity information (e.g., from mesh elements to degrees of freedom). In this article, we devise compiler theory and technology to automate a sophisticated version of sparse tiling, a technique to maximize data locality when accessing shared fields (like the A and map arrays in the earlier example), which consists of fusing a sequence of loops by grouping iterations such that all data dependencies are honored. The goal is to improve the overall application performance with minimal disruption (none, if possible) to the source code.
Three motivating real-world applications for this work are Hydra, Volna, and Seigen. Hydra [30] is a finite-volume computational fluid dynamics application used at Rolls Royce for the simulation of next-generation components of jet engines. Volna [11] is a finite-volume computational fluid dynamics application for the modeling of tsunami waves. Seigen [17] aims to solve the elastic wave equation using the discontinuous-Galerkin finite element method for seismic exploration purposes. All these applications are characterized by the presence of a time-stepping loop, in which several loops over the mesh (33 in Hydra, 10 in Volna, 25 in Seigen) are repeatedly executed. These loops are characterized by the irregular dependence structure mentioned earlier, with, for example, indirect increments in one loop (e.g., A[m[i]] += f(...)) followed by indirect reads in one of the subsequent loops (e.g., b = g(A[n[j]])). The performance achievable by Seigen through sparse tiling will extensively be studied in Section 7.
Although our work is general in nature, we are particularly interested in supporting increasingly sophisticated seismological problems that will be developed on top of Seigen. This has led to the following strategic decisions:
Automation, but no interest in legacy codes. Sparse tiling is an "extreme optimization." An implementation in a low level language requires a great deal of effort, as a thoughtful restructuring of the application is necessary. In common with many other low level transformations, it also makes the source code impenetrable, affecting maintenance and extensibility. We therefore aim for a fully automated system based on domain-specific languages (DSLs), which abstracts sparse tiling through a simple interface (i.e., a single construct to define a scope of fusible loops) and a tiny set of parameters for performance tuning (e.g., the tile size). We are not interested in automating sparse tiling in legacy codes, in which the key computational aspects (e.g., mesh iteration, distributed-memory parallelism) are usually hidden for software modularity, thus making such a transformation almost impossible.
Unstructured meshes require mixed static/dynamic analysis. Unstructured meshes are often used to discretize the computational domain, since they allow for an accurate representation of complex geometries. Their connectivity is stored by means of adjacency lists (or equivalent data structure), which leads to indirect memory accesses within the loop nests. Indirections break static analysis, thus making purely compiler-based approaches insufficient. Runtime data dependence analysis is essential for sparse tiling, so integration of compiler and runtime tracking algorithms becomes necessary. Realistic datasets not fitting in a single node. Real-world simulations often operate on terabytes of data, hence execution on multi-node systems is often required. We have extended the original sparse tiling algorithm to enable distributed-memory parallelism.
Sparse tiling does not change the semantics of a numerical method-only the order in which some iterations are executed. Therefore, if most sections of a PDE solver suffer from computational boundedness and standard optimizations such as vectorization have already been applied, then sparse tiling, which targets memory-boundedness, will only provide marginal benefits (if any). Likewise, if a global reduction is present in between two loops, then there is no way for sparse tiling to be applied, unless the numerical method itself is rethought. This is regardless of whether the reduction is explicit (e.g., the first loop updates a global variable that is read by the second loop) or implicit (i.e., within an external function, as occurs, for example, in most implicit finite element solvers). These are probably the two greatest limitations of the technique; otherwise, sparse tiling may provide substantial performance benefits.
The rest of the article is structured as follows: in Section 2, we present the abstraction on which sparse tiling relies. We then show, in Section 3, examples of how the algorithm works on sharedand distributed-memory systems. This is followed by the formalization of the algorithms (Sections 4 and 5) and the implementation of the compiler that automates sparse tiling (Section 6). The experimentation is described in Section 7. A discussion on the limitations of the algorithms and the future work that we expect to carry out in the years to come conclude the article.
THE LOOP CHAIN ABSTRACTION FOR UNSTRUCTURED MESH APPLICATIONS
The loop chain is an abstraction introduced in [21] . Informally, a loop chain is a sequence of loops with no global synchronization points, enriched with information to enable runtime data dependence analysis-necessary since indirect memory accesses inhibit common static approaches to loop optimization. The idea is to replace static with dynamic analysis, exploiting the information carried by a loop chain. Loop chains must somehow be added to or automatically derived (e.g., exploiting a DSL) from the input code. A loop chain will then be used by an inspector/executor scheme [32] . The inspector is an algorithm performing data dependence analysis using the information carried by the loop chain, which eventually produces a sparse tiling schedule. This schedule is used by the executor, a piece of code semantically equivalent to the original sequence of loops (i.e., computing the same result) executing the various loop iterations in a different order.
Before diving into the description of the loop chain abstraction, it is worth observing two aspects.
-The inspection phase introduces an overhead. In many scientific computations, the data dependence pattern is static-or, more informally, "the topology does not change over time." This means that the inspection cost may be amortized over multiple iterations of the executor. If instead the mesh changes over time (e.g., in case of adaptive mesh refinement), a new inspection must be performed. -To adopt sparse tiling in a code there are two options. One possibility is to provide a library and leave the application specialists with the burden of carrying out the implementation (re-implementation in case of legacy code). A more promising alternative consists of raising the level of abstraction: programs can be written using a DSL; loop chain, inspector, and executor can then be automatically derived at the level of the intermediate representation.
As we shall see in Section 6, the tools developed in this article enable both approaches, though our primary interest is in the automated approach (i.e., via DSLs).
These points will be further elaborated in later sections. The loop chain abstraction was originally defined as follows:
is an ordered sequence of n loops. There are no global synchronization points in between the loops. Although there may be dependencies between successive loops in the chain, the execution order of a loop's iterations does not influence the result.
is a collection of m disjoint data spaces. Each loop accesses (reads from, writes to) a subset of these data spaces. An access can be either direct (e.g.,
They indicate which locations in the data space D d an iteration i ∈ L l reads from and writes to, respectively. A loop chain must provide all access relations for all loops.
We here refine this definition, and specialize it for unstructured mesh applications. This allows the introduction of new concepts, necessary to extend the sparse tiling algorithm presented in [34] . Some terminology and ideas are inspired by the programming model of OP2, a library for unstructured mesh applications [14] used to implement the already mentioned Hydra code.
-A loop chain L = [L 0 , L 1 , . . . , L n−1 ] is an ordered sequence of n loops. There are no global synchronization points in between the loops. Although there may be dependencies between successive loops in the chain, the execution order of a loop's iterations does not influence the result. -S = {S 0 , S 1 , . . . , S m−1 } is a collection of m disjoint iteration spaces. Possible iteration spaces are the topological entities of the mesh (e.g., cells, vertices) or the degrees of freedom associated with a function. When using distributed-memory parallelism, an iteration space S is logically split into three contiguous regions: core, boundary, and non-exec (see also Figure 1 ). Given a generic process P executing a loop over S, these regions represent the following: core: the subset of iterations computed by P that does not depend on halo exchanges. In other words, these are P's local iterations. boundary: the union of two sub-regions, owned and exec, which are defined next. The boundary region requires up-to-date halo data. Like core, owned contains iterations owned by P; the data produced by owned are sent out through a halo exchange. The exec iterations, instead, are executed because they indirectly write (increment) data in P's owned sub-region. non-exec: the subset of iterations not computed by P mapping read-only data sent over to P during a halo exchange. An iteration space is uniquely identified by a name and the sizes of its three regions.
-The depth is an integer indicating the extent of the boundary region. This is constant across all iteration spaces in S.
A map of arity a is a vector-valued function M : S i → S a j connecting elements in different iteration spaces. For example, we can express Fig. 1 . Two partitions of a mesh distributed to two neighboring processes, P 0 and P 1 . The core region includes all iterations that can be processed without reading halo data. The owned iterations can be processed only by reading halo data. Exec is the set of iterations that must be executed because they indirectly write (increment) the owned iterations. The union of the owned and exec regions is referred to as the boundary. The non-exec region includes halo data which is indirectly read during the exec computation. The iterations in P 0 's (P 1 's) exec region are, logically, the same iterations in P 1 's (P 0 's) owned region; thus, we say that these iterations are "redundantly executed." Matching colors across the two processes represent identical subsets of iterations in the non-partitioned mesh. The image was inspired by an example in [27] .
the mapping of a triangular cell c to three vertices
; here cells and vertices are iteration spaces, while c, v 0 , v 1 , v 2 are iteration identifiers (i.e., natural numbers). -A loop L i over the iteration space S is associated with one or more descriptors. A descriptor is a 2-tuple <M, mode>. M is either a map from S to some other iteration spaces or the special placeholder ⊥. In the former case, L i is accessing data associated with M (S ) indirectly; in the latter case, the data accesses are direct. mode is one of [r , w, i], indicating whether a memory access is a read, write, or increment.
There are a few crucial differences in this refined definition for the unstructured mesh case. One of them is the presence of iteration spaces in place of data spaces. In unstructured mesh applications, loops tend to access multiple data spaces associated with the same iteration space. A key observation is that if a loop is writing to some data spaces, then it is extremely likely that at least a subset of them will be accessed by the subsequent loop in the chain. The idea, therefore, is to rely on iteration spaces, rather than data spaces, to perform dependence analysis. This can substantially reduce the inspection cost, since typically |S| << |D|. Obviously, this relaxation might also create "false dependences", thus potentially affecting data communication. This would be the case if, for example, two consecutive, independent loops accessed different data fields associated with the same iteration space (e.g., pressure and velocity defined over the same set of degrees of freedom). In our experience, however, this rarely happens in practice (never in the case of the already mentioned Volna, Hydra, and Seigen).
Another fundamental addition is the characterization of iteration spaces into the three regions core, boundary, and non-exec. As we shall see, this separation is essential to enable distributedmemory parallelism. The extent of the boundary regions is captured by the depth of the loop chain. Informally, the depth tells how many extra "strips" of elements are provided by the neighboring processes. This allows some redundant computation along the partition boundary and also limits the depth of the loop chain (i.e., how many loops can be fused). The role of the parameter depth will be clear by the end of Section 5. On the left, a "toy" program used as running example in Section 3 to illustrate the loop chain abstraction and show how the sparse tiling algorithms (inspection, execution) work. Note that all parameters passed to the kernels are pointers. On the right, a code snippet showing the loop chain corresponding to the program on the left. The syntax is very close to the actual API of SLOPE, the sparse tiling library that we have implemented, described in Section 6.
LOOP CHAIN, INSPECTION, AND EXECUTION EXAMPLES
Using the example in Figure 2 (a), we describe the actions performed by our sparse tiling inspector. The inspector takes as input the loop chain illustrated in Figure 2 (b). We show two variants, for shared-memory and distributed-memory parallelism. The value of the variable mode in Figure 2 (b) determines the variant to be executed.
Overview
The inspector starts with partitioning the iteration space of a seed loop, for example L 0 . Partitions are used to initialize tiles: the iterations of L 0 falling in P i -or, in other words, the edges in partition P i -are assigned to the tile T i . Figure 3 displays the situation after the initial partitioning of L 0 for a given input mesh. There are four partitions, two of which (P 0 and P 3 ) are not connected through any edge or cell. These four partitions correspond to four tiles,
As detailed in the next two sections, the inspection proceeds by populating T i with iterations from L 1 and L 2 . The challenge of this task is guaranteeing that all data dependencies-read after write, write after read, write after write-are honored. The output of the inspector is eventually passed to the executor. The inspection carries sufficient information for computing sets of tiles in parallel. T i is always executed by a single thread/process and the execution is atomic; that is, it does not require communication with other threads/processes. When executing T i , first all iterations from L 0 are executed, then all iterations from L 1 , and finally those from L 2 .
Inspection for Shared-Memory Parallelism
Similarly to OP2, to achieve shared-memory parallelism we use coloring. Two tiles that are given the same color can be executed in parallel by different threads. Two tiles can have the same color if they are not connected, because this ensures the absence of race conditions through indirect memory accesses during parallel execution. In the example, we can use three colors: red (R), green (G), and blue (B). T 0 and T 3 are not connected, so they are assigned the same color. The colored tiles are shown in Figure 4 (a). In the following, with the notation T c i we indicate that the i-th tile has color c.
To
with iterations from L 1 and L 2 , we first have to establish a total ordering for the execution of partitions with different colors. Here, we assume the following order: green (G), blue (B), and red (R). This implies, for instance, that all iterations assigned to T B 1 must be executed before all iterations assigned to T R 2 . By "all iterations" we mean the iterations from L 0 (determined by the seed partitioning) as well as the iterations that will later be assigned from tiling L 1 and L 2 . We assign integer positive numbers to colors to reflect their ordering, where a smaller number means higher execution priority. We can assign, for example, 0 to green, 1 to blue, and 2 to red.
To schedule the iterations of
, we need to compute a projection for any write or local reduction performed by L 0 . The projection required by L 0 is a function ϕ : V → T Figure 2 (a) for shared-memory parallelism can lead to conflicts. Here, the two green tiles eventually become adjacent, creating race conditions. mapping the vertices in V -as indirectly incremented during the execution of L 0 (see Figure 2 (a))to a tile T c i ∈ T . Consider the vertex v 0 in Figure 4 (b) . v 0 has seven incident edges, two of which belong to T G 0 , while the remaining five belong to T B 1 . Since we established that G ≺ B, v 0 can only be read after T B 1 has finished executing the iterations from L 0 (i.e., the five incident blue edges). We express this condition by setting ϕ (v 0 ) = T B 1 . Observe that we can compute ϕ by iterating over V and, for each vertex, applying the maximum function (MAX) to the color of the adjacent edges.
We now use ϕ to schedule L 1 , a loop over cells, to the tiles. Consider again v 0 and the adjacent cells [c 0 , c 1 , c 2 ] in Figure 4 (b). These three cells have in common the fact that they are adjacent to both green and blue vertices. For c 1 , and similarly for the other cells, we compute
This establishes that c 1 must be assigned to T B 1 , because otherwise (c 1 assigned instead to T G 0 ) a read to v 0 would occur before the last increment from T B 1 took place. Indeed, we recall that the execution order, for correctness, must be "all iterations from [L 0 , L 1 , L 2 ] in the green tiles before all iterations from [L 0 , L 1 , L 2 ] in the blue tiles." The scheduling of L 1 to tiles is displayed in Figure 4 (c).
To schedule
we employ a similar process. Vertices are again written by L 1 , so a new projection over V will be necessary. Figure 4 Conflicting Colors. It is worth noting how T R 2 "consumed" the frontier elements of all other tiles every time a new loop was scheduled. Tiling a loop chain consisting of k loops has the effect of expanding the frontier of a tile of at most k vertices. With this in mind, we re-inspect the loop chain of the running example, although this time employing a different execution order-blue (B), red (R), and green (G)-and a different seed partitioning. Figure 5 shows that, by applying the same procedure described in this section, T G 0 and T G 3 will eventually become adjacent. This violates the precondition that tiles can be given the same color, and thus run in parallel, as long as they are not adjacent. Race conditions during the execution of iterations belonging to L 2 are now possible. This problem will be solved in Section 5.
Inspection for Distributed-Memory Parallelism
In the case of distributed-memory parallelism, the mesh is partitioned and distributed to a set of processes. Neighboring processes typically exchange (MPI) messages before executing a loop L j . A message includes all "dirty" dataset values required by L j modified by any L k , with L k ≺ L j . In the running example, L 0 writes to vertices, so a subset of values associated with border vertices must be communicated prior to the execution of L 1 . To apply sparse tiling, the idea is to push all communications to the beginning of the loop chain: as we shall see, this increases the amount of data to be communicated at a time, but also reduces the number of synchronizations (only one synchronization between each pair of neighboring processes per loop chain execution). 6 . A snapshot of the two mesh partitions on Process 0 and Process 1 after inspecting the seed loop L 0 for distributed-memory parallelism. On each process, there are five tiles in total: two in the core region (green and violet), two in the boundary region (red and light blue), and T ne . The boundary tiles can safely cross the owned and exec sub-regions (i.e., the local iterations and the iterations to be redundantly computed, respectively). However, no tile can include iterations from both the core and the boundary regions.
From Section 2 it is known that, in a loop chain, a set is logically split into three regions: core, boundary, and non-exec. The boundary tiles, which originate from the seed partitioning of the boundary region, will include all iterations that cannot be executed until the communications have terminated. The procedure described for shared-memory parallelism-now performed individually by each process on a partition of the input mesh-is modified as follows:
(1) The core region of the seed loop L 0 is partitioned into tiles. Unless aiming for a mixed distributed/shared-memory scheme, there is no need to assign identical colors to unconnected tiles, as a process will execute its own tiles sequentially. Colors are assigned increasingly, with T i given color i. As long as tiles with contiguous ID are also physically contiguous in the mesh, this assignment retains memory access locality when "jumping" from executing T i to T i+1 . (2) The same process is applied to the boundary region. Thus, a situation in which a tile includes iterations from both the core and the boundary regions is prevented by construction. Further, all tiles within the boundary region are assigned colors higher than those used for the core tiles. This constrains the execution order: no boundary tiles will be executed until all core tiles are computed. (3) We map the whole non-exec region of L 0 to a single special tile, T ne . This tile has the highest color and will actually never be executed.
From this point on, the inspection proceeds as in the case of shared-memory parallelism. The application of the MAX function when scheduling L 1 and L 2 makes higher color tiles (i.e., those having lower priority) "expand over" lower color ones.
In Figure 6 , a mesh is partitioned over two processes and a possible seed partitioning and tiling of L 0 illustrated. We observe that the two boundary tiles (the red and light blue ones) will expand over the core tiles as L 1 and L 2 are tiled, which eventually results in the scheduling illustrated in Figure 7 . Roughly speaking, if a loop chain consists of n loops and, on each process, n − 1 extra layers of iterations are provided (the exec regions in Figure 6 ), then all boundary tiles are correctly computed.
The schedule produced by the inspector is subsequently used by the executor. On each process, the executor starts with triggering the MPI communications required for the computation of boundary tiles. All core tiles are then computed, since no data from the boundary region is necessary. Hence, computation is overlapped with communication. As all core tiles are computed and the MPI communications terminated, the boundary tiles can finally be computed. Fig. 7 . A snapshot of the two mesh partitions on Process 0 and Process 1 at the end of the inspection for distributed-memory parallelism. T ne expands over the boundary region, which minimizes the amount of redundant computation to be performed. At the end of the execution phase, the orange edges will contain "dirty values," but correctness is not affected as the exec region only includes off-process data. The boundary tiles expand over the core region: this is essential for correctness since none of the red and blue entities from [L 0 , L 1 , L 2 ] can be executed until the MPI communications have terminated.
Efficiency Considerations. The underlying hypothesis is that the increase in data locality will outweigh the overhead induced by the redundant computation and by the bigger volume of data exchanged. This is motivated by several facts: (i) the loops being memory-bound; (ii) the core region being much larger than the boundary region; and (iii) the amount of redundant computation being minimized through the special tile T ne , which progressively expands over the boundary region, thus avoiding unnecessary calculations.
DATA DEPENDENCY ANALYSIS
The loop chain abstraction, described in Section 2, provides the information to construct an inspector capable of analyzing data dependencies and thus build a legal sparse tiling schedule. The dependencies between two different loops may be of type flow (read after write), anti (write after read), or output (write after write). Further, there may be "reduction dependencies" between iterations of the same loop.
Cross-Loop Dependences
Assume that loop L x , having iteration space S x , precedes loop L y , having iteration space S y , in the loop chain. Let e x be a generic iteration in S x . Let M be a map of arity a between two iteration spaces. Let mode ∈ {w, r , i} indicate whether an iteration is written, read, or incremented. We represent direct and indirect accesses as follows.
A direct access is a special case of indirect access when y = x and M S x →S y is the identity mapping. However, we here keep the distinction between the two types of access explicit due to their relevance in the sparse tiling algorithms, as explained in Section 5.
By considering pairs of points (e x , e y ) in the iteration spaces of the two loops L x and L y , namely, e x ∈ S x and e y ∈ S y , we can enumerate all possible dependences. For brevity, we do not distinguish between increments and writes; we also assume that at least one of the two loops accesses data indirectly. Let S z be a generic iteration space in the loop chain. Hence, the flow dependences are
In essence, there is a flow dependence between two iterations from different loops when one of those iterations writes to an element and the other iteration reads from the same element, directly or indirectly. To capture all these flow dependences, the inspection algorithm builds projections from one loop to another. We saw an example in Section 3.2: the loop over cells (S x ) performed an indirect increment to a dataset associated with vertices (S z ), which was then read by the subsequent loop over edges (S y ). Such flow dependence was of type ➂ (see definition above). For each vertex e z ∈ S z , the projection (illustrated in Figure 4 (b)) captured the last tile indirectly writing to (incrementing) e z , exploiting the color (i.e., the scheduling priority) of the source iterations. A flow dependence of type ➀ would be even simpler to deal with, as it would not require the use of the indirect map M S x →S z to update the color of the iterations. Likewise, we can enumerate the anti and output dependences:
Projections for this type of dependences are built analogously to that described above. The inspection algorithm building projections for all flow, anti, and output dependences is provided in Algorithm 3 and discussed in Section 5.1.4. How the inspector leverages data dependence analysis (i.e., projections) to schedule iterations to tiles (i.e., the tiling function) is formalized in Algorithm 4 and commented in Section 5.1.5.
Intra-Loop Dependences
There also are local reductions, or "reduction dependencies," between two or more iterations of the same loop when those iterations increment the same location(s); that is, when they read, modify with a commutative and associative operator, and write to the same location(s). The reduction dependencies in L x are
A reduction dependency between two iterations within the same loop indicates that those two iterations must be executed atomically with respect to each other. As we explained in Section 3.2, the inspection algorithm uses coloring to ensure atomic increments.
ALGORITHMS
The pseudo-code for the sparse tiling inspector is shown in Algorithm 1. Given a loop chain and a seed tile size, the algorithm produces a schedule suitable for mixed distributed/shared-memory parallelism. This schedule-in essence, a set of populated tiles-is used by the executor to perform the sparse-tiled computation. The executor pseudo-code is displayed in Algorithm 2. The next two sections elaborate on the main steps of these algorithms. The notation is summarized in Table 1 ; the syntax Ax:y is a shortcut for "Algorithm x, line y." The implementation is discussed in Section 6. 
The core, boundary, and non-exec regions of S j D A descriptor of a loop T
The collection of tiles
The core, boundary, and non-exec regions of
The seed tile size C
The matrix of conflicting colors Ax:y Algorithm x, line y The seed loop L seed is used to initialize the tiles. Theoretically, any loop in the chain can be chosen as seed. Supporting distributed-memory parallelism, however, is cumbersome if L seed L 0 . This is because more general schemes for partitioning and coloring would be needed to ensure that no iterations in any S b j are assigned to a core tile. A limitation of our inspector algorithm in the case of distributed-memory parallelism is that it must be L seed = L 0 .
In the special case in which there is no need to distinguish between core and boundary tiles (because a program is executed on a single shared-memory system), L seed can be chosen arbitrarily. If we however pick L seed in the middle of the loop chain, that is L 0 ≺ . . . ≺ L seed ≺ . . . ≺ L n−1 , a mechanism for constructing tiles in the reverse direction ("backward"), from L seed toward L 0 , is necessary. In [34] , we propose two "symmetric" algorithms to solve this problem, forward tiling and backward tiling, with the latter using the MIN function in place of MAX when computing 
Among all possible legal partitionings, we choose the one that splits S c seed into blocks of ts contiguous iterations, with P 0 = {0, . . . , ts − 1}, P 1 = {ts, . . . , 2ts − 1}, and so on. We analogously partition S b seed into k subsets. We create m + k + 1 tiles, one for each of these partitions and one extra tile for S ne seed , namely,
. At this point we have an assignment of iterations to tiles for L seed ; that is, a tiling function σ seed : S seed → T . This initial partitioning phase occurs at A1:2.
A tile T i has four fields, as summarized in Table 2 .
-The region is used by the executor to schedule tiles in a given order. This field is set right after the partitioning of L seed , as a tile (by construction) exclusively belongs to S c seed , S b seed , or S ne seed . -The iteration lists contain the iterations in L that T i will have to execute. There is one iteration list for each L j ∈ L, indicated as [T i ] j . At this stage of the inspection, we have
-Local maps may be used for performance optimization by the executor in place of the global maps provided through the loop chain; this will be discussed in more detail in Section 7 and in the Supplementary Materials. -The color provides a tile with a scheduling priority. If shared-memory parallelism is requested, adjacent tiles are given different colors (the adjacency relation is determined through the maps available in L). Otherwise, colors are assigned in increasing order (i.e., T i is given color i). The boundary tiles are always given colors higher than that of core tiles; the non-exec tile has the highest color. The assignment of colors is carried out by A1:6.
The Inspection
Loop. The inspection loop, starting at A1:7, schedules the remaining L j ∈ L by alternating dependence analysis and construction of tiling functions. The input is σ seed . As seen in the previous sections, a projection is a function ϕ S : S → T that captures data dependences across loops. Initially, the projections set Φ is empty (A1:1). Once a new loop is tiled, Φ is updated by adding new projections or changing existing ones (see Section 5.1.4). Using Φ, a new tiling function σ j for L j is derived (see Section 5.1.5).
Deriving a Projection from a Tiling
Function. Algorithm 3 takes as input (the descriptors of) an L j and its tiling function σ j : S j → T to update Φ. The algorithm also updates the conflicts matrix C ∈ N m×m , which indicates whether two tiles having the same color will become adjacent. Fig. 8 . Representation of an inverse map. The original map shows that the triangular cell 1 is adjacent to three vertices, namely, 3, 7, and 9. The inverse map associates vertices to cells. Since the mesh is unstructured, different vertices can be incident to a different number of cells. The array offset provides the distance between two consecutive vertices in the inverse map. For instance, all entries in the inverse map between offset [3] and offset [4] are cells incident to vertex 3.
A new projection ϕ S is needed if S is written by L j . As explained in Section 4, ϕ S carries the necessary information to tile a subsequent loop accessing S. Let us consider the non-trivial case in which L j writes indirectly to S through a map M : S j → S a . To compute ϕ S , we first determine the inverse map M −1 (A3:5; an example is shown in Figure 8 ). Then, we iterate over all elements e ∈ S and we set ϕ S [e] to the last tile writing to e. This is accomplished by applying the MAX function over the color of the tiles accessing e (see A3:11), obtained through M −1 . This procedure was used, for example, to compute the projection in Figure 4 Using Φ, we compute σ j as described in Algorithm 4. The algorithm is similar to the projection of a tiled loop (e.g., maps are used to access the neighborhood of a target iteration). The main difference is that now the projections in Φ are used, rather than computed, to schedule iterations to tiles so that data dependences are honored. Finally, σ j is used to populate the iteration lists [T i ] j , for all T i ∈ T (see A1:10).
Detection of Conflicts.
If C indicates the presence of at least one conflict, say between T i 1 and T i 2 , we add a "fake connection" between these two tiles and loop back to the coloring stage. T i 1 and T i 2 are now connected, so they will be assigned different colors.
5.1.7
On the History of the Algorithm. The first algorithm for generalized sparse tiling inspection was introduced in [34] . In this section, a new, enhanced version of that algorithm has been presented. In essence, the major differences are (i) support for distributed-memory parallelism; (ii) use of coloring instead of a task graph for tile scheduling; (iii) speculative inspection with backtracking if a coloring conflict is detected; (iv) use of sets, instead of datasets, for data dependency analysis; (v) use of inverse maps for parallelization of the projection and tiling routines; and (vi) computation of local maps. Most of these changes contributed to reduce the inspection cost, as discussed later.
Executor
The sparse tiling executor is illustrated in Algorithm 2. It consists of four main phases: (i) exchange of halo regions amongst neighboring processes through non-blocking communications; (ii) execution of core tiles (in overlap with communication); (iii) wait for the termination of the communications; and (iv) execution of boundary tiles.
As explained in Section 3.3, a sufficiently deep halo region enables correct computation of the boundary tiles. Further, tiles are executed atomically, meaning that all iterations in a tile are computed without ever synchronizing with other processes. The depth of the boundary region, which affects the amount of off-process data to be redundantly computed, increases with the number n of loops to be fused. In the example in Figure 6 , there are n = 3 loops, and three "strips" of extra vertices are necessary for correctly computing the fused loops without tile-to-tile synchronizations.
We recall from Section 2 that the depth of the loop chain indicates the extent of the boundary region. This parameter imposes a limit to the number of fusible loops. If L includes more loops than the available boundary region-that is, if n > depth-then L will have to be split into shorter loop chains, to be fused individually. As we shall see (Section 6), in our inspector/executor implementation the depth is controlled by the Firedrake's DMPlex module.
IMPLEMENTATION: SLOPE, PYOP2, AND FIREDRAKE
The implementation of automated sparse tiling is distributed over three open-source software modules.
Firedrake: An established framework for the automated solution of PDEs through the finite element method [28] . PyOP2: A module used by Firedrake to apply numerical kernels over sets of mesh components. Parallelism is handled at this level. SLOPE: A library for writing inspector/executor schemes, with primary focus on sparse tiling.
PyOP2 uses SLOPE to apply sparse tiling to loop chains.
The SLOPE library is an open source embodiment of the algorithms presented in this article. The interplay amongst Firedrake, PyOP2, and SLOPE is outlined in Figure 9 and discussed in more detail in the following sections.
SLOPE: A Library for Sparse Tiling Irregular Computations
SLOPE is an open source software that provides an interface to build loop chains and to express inspector/executor schemes for sparse tiling. 2 Fig. 9 . Sparse tiling in the Firedrake-PyOP2-SLOPE framework. There are three ways of sparse tiling a loop chain: decorating a Firedrake program (1A), decorating a sequence of loops in a PyOP2 program (1B), and writing both the loop chain and the inspector/executor codes explicitly in C through calls to SLOPE (1C). Both (1A) and (1B) use the loop_chain interface (details in Section 6.2). PyOP2 derives the loop chain (2), essentially sets, maps, and loops (see Section 2 and example in Figure 2(b) ), from the kernels produced within the loop_chain context. The loop chain is provided to SLOPE through its Python interface (3) . SLOPE performs the inspection and returns its output, a tiles schedule, to PyOP2 (4). Eventually, the executor is generated and run by PyOP2. The loop chain abstraction implemented by SLOPE has been formalized in Section 2. In essence, a loop chain comprises some sets (including the separation into core, boundary, and non-exec regions), maps between sets, and a sequence of loops. Each loop has one or more descriptors specifying what and how different sets are accessed. The example in Figure 2 (b) illustrates the interface exposed by the library. SLOPE implements Algorithms 1, 3, and 4 from Section 5. Further, it provides additional features to estimate the effectiveness and to verify the correctness of sparse tiling, including generation of VTK files (suitable for Paraview [2] ), to visualize the partitioning of the mesh into colored tiles, as well as insightful inspection summaries (showing, e.g., number and average size of tiles, total number of colors used, time spent in critical inspection phases).
In the case of shared-memory parallelism, the following sections of code are parallelized through OpenMP:
Inspection: The projection and tiling algorithms; in particular, the loops over set elements in Algorithms 3 and 4. Execution: The computation of same colored tiles; that is, the two loops over tiles in Algorithm 2.
PyOP2: Lazy Evaluation and Interfaces
PyOP2 is a Python library offering abstractions to model an unstructured mesh-in terms of sets (e.g., vertices, edges), maps between sets (e.g., a map from edges to vertices to express the mesh topology), and datasets associating data to each set element (e.g., 3D coordinates to each vertex)and applying numerical kernels to sets of entities [28] . In this section, we focus on the three relevant contributions to PyOP2 made through our work: (i) the interface to identify loop chains;
(ii) the lazy evaluation mechanism that allows loop chains to be built; and (iii) the interaction with SLOPE to automatically build and execute inspector/executor schemes. To apply sparse tiling to a sequence of loops, the loop_chain interface was added to PyOP2. This interface, exemplified in Figure 10 , is also exposed to the higher layers, for example in Firedrake. In the listing, the name uniquely identifies a loop chain. Other parameters (most of them optional) are useful for performance evaluation and performance tuning. Amongst them, the most important are the tile_size and the fusion_scheme. The tile_size specifies the initial average size for the seed partitions. The fusion_scheme allows one to specify how to break a long sequence of loops into smaller loop chains, which makes it possible to experiment with a full set of sparse tiling strategies without having to modify the source code.
PyOP2 exploits lazy evaluation of parallel loops to generate an inspector/executor scheme. The parallel loops encountered during the program execution-or, analogously, those generated through Firedrake-are pushed into a queue, instead of being executed immediately. The sequence of parallel loops in the queue is called the trace. If a dataset f needs to be read, for example because a user wants to inspect its values or a global linear algebra operation is performed, then the trace is traversed-from the most recent parallel loop to the oldest one-and a new sub-trace produced. The sub-trace includes all parallel loops that must be executed to evaluate f correctly. The sub-trace can then be executed or further pre-processed.
All loops in a trace that were created within a loop_chain scope are sparse tiling candidates. In detail, the interaction between PyOP2 and SLOPE is as follows:
(1) Figure 10 shows that a loop_chain defines a new scope. As this scope is entered, a stamp s 1 of the trace is generated. This happens "behind the scenes," because the loop_chain is a Python context manager, which can execute pre-specified routines prior and after the execution of the body. As the loop_chain's scope is exited, a new stamp s 2 of the trace is computed. All parallel loops in the trace generated between s 1 and s 2 are placed into a sub-trace for pre-processing. (2) The pre-processing consists of two steps: (i) "simple" fusion-consecutive parallel loops iterating over the same iteration space that do not present indirect data dependencies are merged; (ii) generation of a loop chain representation for SLOPE. (3) In (ii), PyOP2 inspects the sequence of parallel loops and translates their metadata (sets, maps, loops) into a format suitable for the SLOPE's Python interface. SLOPE performs an inspection for the received loop chain and returns a tiles schedule to PyOP2 (i.e., it runs Algorithm 1). (4) A "software cache" mapping loop_chains to inspections is used. This whole process needs, therefore, to be executed only once for a given loop_chain. (5) The executor is generated, compiled, and run directly by PyOP2, with the help of an API provided by SLOPE. To run the executor, the tiles schedule produced in (3) is used.
Firedrake/DMPlex: The S-Depth Mechanism for Extended Halo Regions
Firedrake uses DMPlex [23] to handle meshes. DMPlex is responsible for partitioning, distributing over multiple processes, and locally reordering a mesh. The MPI parallelization is therefore managed through Firedrake/DMPlex. During the start-up phase, each MPI process receives a contiguous partition of the original mesh from DMPlex. The required PyOP2 sets, which can represent either topological components (e.g., cells, vertices) or function spaces, are created. As intuitively shown in Figure 1 , these sets distinguish between multiple regions: core, owned, exec, and non-exec. Firedrake initializes the four regions exploiting the information provided by DMPlex.
To support the loop chain abstraction, Firedrake must be able to allocate arbitrarily deep halo regions. Both Firedrake and DMPlex have been extended to support this feature [19] . A parameter called S-depth (the name has historical origins; see, for instance, [7] ) regulates the extent of the halo regions. A value S-depth = n indicates the presence of n strips of off-process data elements in each set. The default value is S-depth = 1, which enables computation-communication overlap when executing a single loop at the price of a small amount of redundant computation along partition boundaries. This is the default execution model in Firedrake.
PERFORMANCE EVALUATION 7.1 The Seigen Computational Framework
Seigen is a seismological modeling framework capable of solving the elastic wave equation on unstructured meshes. Exploiting the well-known velocity-stress formulation [39] , the seismic model is expressible as two first-order linear PDEs, which we refer to as velocity and stress. These governing equations are discretized in space through the discontinuous-Galerkin finite element method. The evolution over time is obtained by using a fourth-order explicit leapfrog scheme based on a truncated Taylor series expansion of the velocity and stress fields. The particular choice of spatial and temporal discretizations has been shown to be non-dissipative [8] . More details can be found in [17] . Seigen, which is built on top of Firedrake, is part of OPESCI, an ecosystem of software for seismic imaging based on automated code generation [15] .
Seigen has a set of test cases, which differ in various aspects, such as the initial conditions of the system and the propagation of waves. However, they are all based upon the same seismological model; from a computational viewpoint, this means that, in a time step, the same sequence of loops is executed. In the following, we focus on the explosive_source test case (see the work by [12] for background details).
Implementation and Validation
In a time loop iteration, eight linear systems need to be solved, four from velocity and four from stress. Each solve consists of three macro-steps: assembling a global matrix A; assembling a global vector b; and computing x in the system Ax = b. There are two global "mass" matrices, one for velocity and one for stress. Both matrices are time invariant, so they are assembled before entering the time loop, and block-diagonal, as a consequence of the spatial discretization employed (a block belongs to an element in the mesh). The inverse of a block-diagonal matrix is again block-diagonal and is determined by computing the inverse of each block. The solution of the linear system Ax = b, expressible as x = bA −1 , can therefore be evaluated by looping over the mesh and computing a "small" matrix-vector product in each element, where the matrix is a block in A −1 . Assembling the global vectors boils down to executing a set of loops over mesh entities, particularly over cells, interior facets, and exterior facets. Overall, 25 loops are executed in a time loop iteration. Thanks to the hierarchy of "software caches" employed by Firedrake, the translation from mathematical syntax into loops is only performed once.
Introducing sparse tiling into Seigen was relatively straightforward. Three steps were required: (i) embedding the time stepping loop in a loop_chain context (see Section 6.2), (ii) propagating user input relevant for performance tuning, and (iii) creating a set of fusion schemes. A fusion scheme establishes which sub-sequences of loops within a loop_chain will be fused and the respective seed tile sizes. If no fusion schemes were specified, all of the 25 loops would be fused using a default tile size. As we shall see, operating with a set of small loop chains and heterogeneous tile sizes is often more effective than fusing long sequences of loops.
Seigen has several mechanisms to validate the correctness of the seismological model and the test cases. The numerical results of all code versions (with and without tiling) were checked and compared. Paraview was also used to verify the simulation output. Snapshots of the simulation output are displayed in Figure 11 .
Computational Analysis of the Loops
We here discuss the computational aspects of the 25 fusible loops. The following considerations derive from an analytical study of the data movement in the loop chain, extensive profiling through the Intel VTune Amplifier tool [16] , and roofline models (available in [17] ). Fig. 11 . Three phases of the explosive_source test case in Seigen, following an explosion at a point source.
Our initial hypothesis was that Seigen would have benefited from sparse tiling. Not only does data reuse arise within single loops (e.g., by accessing vertex coordinates from adjacent cells), but also across consecutive loops, through indirect data dependencies. This seemed to make Seigen a natural fit for sparse tiling. The eight "solver" loops perform matrix-vector multiplications in each mesh element. It is well established that linear algebra operations of this kind are memory-bound. Four of these loops arise from velocity, the others from stress. There is significant data reuse amongst the four velocity loops and amongst the four stress loops, since the same blocks in the global inverse matrices are accessed. We therefore hypothesized performance gains if these loops were fused through sparse tiling.
We also observed that because of the particular mesh employed, the exterior facet loops, which implement the boundary conditions of the variational problems, have negligible cost. However, the cells and facets loops do have significant cost, and data reuse across them arises. Six loops iterate over the interior mesh facets to evaluate facet integrals, which ensure the propagation of information between adjacent cells in discontinuous-Galerkin methods. The operational intensity of these loops is much lower than that of cell loops, and memory-boundedness is generally expected. Consecutive facet and cell integral loops share fields, which creates cross-loop data reuse opportunities, thus strengthening the hypothesis about the potential of sparse tiling in Seigen.
All computational kernels generated in Seigen are optimized through COFFEE [24] , a system used in Firedrake that, in essence, (i) minimizes the operation count by restructuring expressions and loop nests, and (ii) maximizes auto-vectorization opportunities by applying transformations such as array padding and by enforcing data alignment.
Setup and Reproducibility
There are two parameters that we can vary in explosive_source: the polynomial order of the method, q, and the input mesh. We test the spectrum q ∈ {1, 2, 3, 4}. To test higher polynomial orders, changes to both the spatial and temporal discretizations would be necessary. For the spatial discretization, the most obvious choice would be tensor product function spaces, which at the moment of writing is still unavailable in Firedrake. We use as input a two-dimensional rectangular domain of fixed size 300 × 150 tessellated with unstructured triangles (explosive_source only supports two-dimension domains); to increase the number of elements in the domain, thus allowing to weak scale, we vary the mesh spacing h.
The generality of the sparse tiling algorithms, the flexibility of the loop_chain interface, and the S-depth mechanism made it possible to experiment with a variety of fusion schemes without changes to the source code. Five fusion schemes were devised, based on the following criteria: (i) amount of data reuse, (ii) amount of redundant computation over the boundary region, and (iii) memory footprint (the larger, the smaller the tile size to fit in cache). The fusion schemes are summarized in Table 4 . The full specification, along with the seed tile size for each sub loop chain, is available at [46] . The experimentation was conducted on two platforms, whose specification is reported in Table 3 . The two platforms, Erebus (the "development machine") and Helen (a Broadwell-based architecture in the Helen cluster [33] ) were idle and exclusive access had been obtained when the runs were performed. Support for shared-memory parallelism is discontinued in Firedrake, so only distributed-memory parallelism with one MPI process per physical core was tested. The MPI processes were pinned to cores. The hyperthreading technology was tried, but found to be generally non-profitable. The code used for running the experiments was archived with the Zenodo data repository service: Firedrake [42] , PETSc [43] , PETSc4py [44] , FIAT [41] , UFL [49] , TSFC [48] , PyOP2 [45] , COFFEE [40] , SLOPE [47] , and Seigen [46] .
In the following, for each experiment, we collect three measurements.
Overall completion time -OT: Used to compute the maximum application speed-up when sparse tiling is applied. Average compute time -ACT: Sparse tiling impacts the kernel execution time by increasing data locality. Communication is also influenced, especially in aggressive fusion schemes: the rounds of communication decrease, while the data volume exchanged may increase. ACT isolates the gain due to increased data locality from (i) the communication cost and (ii) any other action performed in Firedrake (executing Python code) between the invocation of kernels. This value is averaged across the processes. Average compute and communication time -ACCT: As opposed to ACT, the communication cost is also included in ACCT. By comparing ACCT and ACT, the communication overhead can be derived.
As we shall see, all of these metrics will be essential for a complete understanding of the sparse tiling performance. To collect OT, ACT, and ACCT, the following configuration was adopted. All experiments were executed with "warm cache"; that is, with all kernels retrieved directly from the Firedrake's software cache of compiled kernels, so code generation and compilation times are not counted. All of the non-tiled explosive_source tests were repeated three times. The minimum times are reported (negligible variance). The cost of global matrix assembly-an operation that takes place before entering the time loop-is not included in OT. Firedrake needs to be extended to assemble block-diagonal matrices directly into vectors (an entry in the vector would represent a matrix block). Currently, this is instead obtained in two steps: first, by assembling into a CSR matrix; then, by explicitly copying the diagonal into a vector (a Python operation). The assembly per se never takes more than 3 seconds, so it was reasonable to exclude this temporary overhead from our timing. The inspection cost due to sparse tiling is included in OT, and its overhead will be discussed appropriately in a later section. Extra costs were minimized: no check-pointing, only two I/O sessions (at the beginning and at the end of the computation), and minimal logging. The time loop has a fixed duration, while the time step depends on the mesh spacing h to satisfy the Courant-Friedrichs-Lewy necessary condition (i.e., CFL limit) for the numerical convergence. In essence, finer meshes require proportionately smaller time steps to ensure convergence. We now proceed with discussing the achieved results. Below, a generic instance of explosive_ source optimized with sparse tiling will be referred to as a "tiled version," otherwise the term "original version" will be used.
Single-Node Experimentation
Hundreds of single-node experiments were executed on Erebus, which was easily accessible in exclusive mode and much quicker to use for VTune profiling. The rationale of these experiments was to assess how sparse tiling impacts the application performance by improving data locality.
We only show parallel runs at maximum capacity (one MPI process per physical core); the benefits of sparse tiling in sequential runs tend to be negligible (if any), because (i) the memory access latency is only marginally affected when a large proportion of bandwidth is available to a single process, (ii) hardware prefetching is impaired by the small iteration space of tiles, and (iii) translation lookaside buffer (TLB) misses are more frequent due to tile expansion. Points (ii) and (iii) will be elaborated upon. Table 5 reports execution times and speed-ups, indicated with the symbol Ω, of the tiled version over the original version for the three metrics OT, ACT, and ACCT. We report the best speed-up obtained after varying a number of parameters (tile size, fusion scheme, and other optimizations discussed below).
The parameters that we empirically varied to obtain these results were (i) the fusion scheme, fsX , X ∈ {1, 2, 3, 4, 5} (see Table 4 ); and (ii) the seed tile size-for each fs and q, up to four different values chosen to maximize the likelihood of fitting in L2 or L3 cache, were tried. 3 Further, a smaller experimentation varying (i) type of indirection maps (local or global; see Section 5) and (ii) tile shape (through different mesh partitioning algorithms), as well as introducing (iii) software prefetching and (iv) extended boundary region (to minimize redundant computation) led to the conclusion that these optimizations may either improve or worsen the execution time, in ways that are too difficult to predict beforehand. We therefore decided to exclude these parameters from the full search space. This simplifies the analysis that will follow and also allowed the execution of the whole test suite in less than 5 days. The interested reader is invited to refer to the Supplementary Materials attached to this article for more information. The seed tile size of a loop chain in an fs, in terms of seed loop iterations, is the product of the "tile size factor" (x-axis) and a pre-established multiplier (an integer in the range [1, 4] ; full list available at [46] ).
In Figure 12 , we summarize the results of the search space exploration. A plot shows the ACT speed-ups achieved by multiple tiled versions over the non-tiled version for given q and h. In these single-node experiments, the ACCT trend was basically identical (as one can infer from Table 5) , with variations in speed-up smaller than 1%.
PyOP2 was enhanced with a loop chain analyzer (LCA) capable of estimating the best-and worstcase tile memory footprint, as well as the percentage of data reuse ideally achievable. 4 We use this tool, as well as Intel VTune, to explain the results shown in Figure 12 . We make the following observations.
fs, unsurprisingly, is the parameter having the largest impact on the ACT. By influencing the fraction of data reuse convertible into data locality, the amount of redundant computation and the data volume exchanged, fusion schemes play a fundamental role in sparse tiling. This makes automation much more than a desired feature: without any changes to the source code, multiple sparse tiling strategies could be studied and tuned. Automation is one of our major contributions, and this performance exploration justifies the implementation effort. -There is a non-trivial relationship between ACT, q and fs. The aggressive fusion schemes are more effective with high q-that is, with larger memory footprints-while they tend to be less efficient, or even deleterious, when q is low. The extreme case is fs5, which fuses two long sequences of loops (12 and 13 loops each). In Figure 12 (Erebus), fs5 is never a winning choice, although the difference between fs3/fs4 and fs5 decreases as q grows. If this trend continued with q > 4, then the gain from sparse tiling could become increasingly larger. -A non-aggressive scheme fuses only a few small subsets of loops. As discussed in later sections, these fusion schemes, despite affording larger tile sizes than the more aggressive ones (due to the smaller memory footprint), suffer from limited cross-loop data reuse. For fs1, LCA determines that the percentage of reusable data in the three fused loop chains decreases from 25% (q = 1) to 13% (q = 4). The drop is exacerbated by the fact that no reuse can be exploited for the maps. Not only are these ideal values, but also a significant number of loops are left outside of loop chains. The combination of these factors motivate the lack of substantial speed-ups. With fs2, a larger proportion of loops are fused and the amount of shared data increases. The peak ideal reuse in a loop chain reaches 54%, which translates into better ACTs. A similar growth in data reuse can be appreciated in more aggressive fusion schemes, with a peak of 61% in one of the fs5's loop chains. Nevertheless, the performance of fs5 is usually worse than fs4. As we shall clarify in Section 7.8, this is mainly due to the excessive memory footprint, which in turn leads to very small tiles. Speculatively, we tried running a sixth fusion scheme: a single loop chain including all of the 25 loops in a time iteration. In spite of an ideal data reuse of about 70%, the ACT was always significantly higher than all other schemes. - Figure 12 shows the ACT for a "good selection" of tile size candidates. Our approach was as follows. We took a very small set of problem instances and we tried a large range of seed tile sizes. Very small tile sizes caused dramatic slow-downs, mostly because of ineffective hardware prefetching and TLB misses. Tile sizes larger than a certain fraction of L3 cache (usually slightly larger than what a core should ideally own) also led to increasingly higher ACTs. If we consider q = 4 in Figure 12 , we observe that the ACT of fs2, fs3, and fs4 grows when the initial number of iterations in a tile is as big as 70. Here, LCA shows that the tile memory footprint is, in multiple loop chains, higher than 3MB, with a peak of 6MB in fs3. This exceeds the proportion of L3 cache that a process owns (on average), which explains the performance drop.
Multi-Node Experimentation
Weak-scaling experiments were carried out on 32 nodes split over two racks in the Helen cluster at Imperial College London [33, 38] . The node architecture as well as the employed software stack are summarized in Table 3 . The rationale of these experiments was to assess whether the change in communication pattern and amount of redundant computation caused by sparse tiling could affect the runtime performance. For each q in the usual range [1] [2] [3] [4] , fs3 generally resulted in the best performance improvements, due to its tradeoff between gained data locality and restrained redundant computation (whose effect obviously worsen as q grows). Figure 13 summarizes the ACCT speed-ups achieved by the best tiled version (i.e., the one found by empirically varying the tile size, with the same tile size factor as in Figure 12 ) over the original version. The weak scaling trend is remarkable, with only a small drop in the case q = 1 when switching from one rack (448 cores) to two racks (896 cores), which disappears as soon as the per-process workload becomes significant. For instance, with q = 2, each process already computes more than 150k degrees of freedom for the velocity and stress fields. The peak speed-up over the original version, 1.28×, was obtained in the test case q = 3 when running with 448 processes. The performance achieved was 1.84 TFLOPs/s (158 GFLOPs required at each time step, for a total of 1,000 time steps and an ACCT of 86 seconds); this corresponds to roughly 15% of the theoretical machine peak (assuming AVX base frequency; the architecture ideally performs 16 double-precision FLOPs per cycle).
Negligible Inspection Overhead
As explained in Section 7.4, the inspection cost was included in OT. In this section, we quantify this overhead in a representative problem instance. In all other problem instances, the overhead was either significantly smaller than or essentially identical (for reasons discussed below) to that reported here. We consider explosive_source on Erebus with h = 1.0, q = 4, and fs5. With this configuration, the time step was Δt = 481 · 10 −6 (we recall from Section 7.4 that Δt is a function of h). Given the simulation duration T = 2.5, in this test case 5,198 time steps were performed. A time step took on average 1.15 seconds. In each time step, 25 loops, fused as specified by fs5, are executed. We know that in fs5 there are two sub loop chains, which respectively consist of 13 and 12 loops. To inspect these two loop chains, 1.4 and 1.3 seconds were needed (average across the four MPI ranks, with negligible variance). Roughly 98% of the inspection time was due to the projection and tiling functions, while only 0.2% was spent on the tile initialization phase (see Section 5). These proportions are consistent across other fusion schemes and test cases. After 200 time steps (less than 4% of the total) the inspection overhead was already close to 1%. Consequently, the inspection cost, in this test case, was eventually negligible. The inspection cost increases with the number of fused loops, which motivates the choice of fs5 for this analysis.
On the Main Performance Limiting Factor
The tile structure, namely, its shape and size, is the key factor affecting sparse tiling in Seigen.
The seed loop partitioning and the mesh numbering determine the tile structure. The simplest way of creating tiles consists of "chunking" the seed iteration space every ts iterations, with ts being the initial tile size (see Section 5) . Hence, the chunk partitioning inherits the original mesh numbering. In Firedrake, and therefore in Seigen, meshes are renumbered during initialization applying the reverse Cuthill-McKee (RCM) algorithm. Using chunk partitioning on top of an RCMrenumbered mesh has the effect of producing thin, rectangular tiles, as displayed in Figure 14(a) . This dramatically affects tile expansion, as a large proportion of elements will lie on the tile border.
There are potential solutions to this problem. The most promising would consist of using a Hilbert curve, rather than RCM, to renumber the mesh. This would lead to more regular polygonal tiles when applying chunk partitioning. Figure 14(b) and (c) shows the tile structure that we would ideally want in Seigen, from a Hilbert-renumbered mesh produced outside of Firedrake. As later discussed, introducing support for Hilbert curves in Firedrake is part of our future work.
The memory footprint of a tile grows quite rapidly with the number of fused loops. In particular, the matrices accessed in the velocity and stress solver loops have considerable size. The larger the memory footprint, the smaller ts for the tile to fit in some level of cache. Allocating small tiles has unfortunately multiple implications. First, the proportion of iterations lying on the border grows, which worsen the tile expansion phenomenon discussed, thus affecting data locality. Secondly, a small ts impairs hardware prefetching, since the virtual address streams become more irregular. Finally, using small tiles implies that a proportionately larger number of tiles are needed to "cover" the iteration space; in the case of shared-memory parallelism, this increases the probability of coloring conflicts, which result in higher inspection costs.
As reiterated in the previous sections, the maximum length of a loop chain is dictated by the extent of the boundary region. Unfortunately, a long loop chain has two undesirable effects. First, the redundant computation overhead tends to be non-negligible if some of the fused loops are compute-bound. Sometimes, the overhead could be even larger than the gain due to increased data locality. Second, the size of an S-level (a "strip" of boundary elements) grows with the depth of the boundary region, as outer levels include more elements than inner levels. This increases not only the amount of redundant computation, but also the volume of data to be communicated. Overall, this suggests that multiple, short loop chains may guarantee the best performance improvement, as indicated by the results in Sections 7.5 and 7.6.
RELATED WORK
The data dependence analysis that we have developed in this article is based on the loop chain abstraction, which was originally presented in [21] . This abstraction is sufficiently general to capture data dependencies in programs structured as arbitrary sequences of loops, particularly to create inspector/executor schemes for many unstructured mesh applications. Inspector/executor strategies were first formalized by [32] . They have been used to exploit data reuse and to expose shared-memory parallelism in several studies [9, 10, 20, 36] .
Sparse tiling is a technique based upon inspection/execution. The term was coined by [36, 37] in the context of the Gauss-Seidel algorithm and also used in [35] in the Moldyn benchmark. However, the technique was initially proposed by [10] to parallelize computations over unstructured meshes, taking the name of unstructured cache blocking. In this work, the mesh was initially partitioned; the partitioning represented the tiling in the first sweep over the mesh. Tiles would then shrink by one layer of vertices for each iteration of the loop. This shrinking represented what parts of the mesh could be accessed in later iterations of the outer loop without communicating with the processes executing other tiles. The unstructured cache blocking technique also needed to execute a serial clean-up tile at the end of the computation. [1] also developed an algorithm very similar to sparse tiling, to parallelize Gauss-Seidel computations. The main difference between [36, 37] and [10] was that in the former work the tiles fully covered the iteration space, so a sequential clean-up phase at the end could be avoided. All these approaches were either specific to individual benchmarks or not capable of scheduling across heterogeneous loops (e.g., one over cells and another over degrees of freedom). These limitations had been addressed in [34] .
The automated code generation technique presented in [29] examines the data affinity among loops and performs partitioning with the goal of minimizing inter-process communication, while maintaining load balancing. This technique supports unstructured mesh applications (being based on an inspector/executor strategy) and targets distributed memory systems, although it does not exploit the loop chain abstraction and does not introduce any sort of loop reordering transformation.
Automated code generation techniques based on polyhedral compilers have been applied to structured grid benchmarks or proxy applications [5] . However, there has been very little effort in providing evidence that these tools can be effective in real-world applications. Time-loop diamond tiling was applied in [4] to a proxy application, but experimentation was limited to shared-memory parallelism. In [31] , instead, an approach based on raising the level of abstraction, similar to the one presented in this article, is described and evaluated. The experimentation is conducted using realistic stencil codes ported to the OPS library. The main difference with respect to our work is the focus on structured grids (i.e., different types of numerical methods are targeted).
In structured codes, multiple layers of halo, or "ghost" elements, are often used to reduce communication [3] . Overlapped tiling exploits the very same idea: trading communication for redundant computation along the boundary [50] . Several works tackle overlapped tiling within single regular loop nests (mostly stencil-based computations), for example [6, 22, 25] . Techniques known as "communication avoiding" [9, 26] also fall in this category. To the best of our knowledge, overlapped tiling for unstructured mesh applications has only been studied analytically, by [13] . Further, we are not aware of any prior techniques for automation.
FUTURE WORK
Using a Hilbert curve numbering will lead to dramatically better tile shapes, thus mitigating the performance penalties due to tile expansion, TLB misses, and hardware prefetching described in Section 7. This extension is at the top of our future work priorities.
Shared-memory parallelism was not as carefully tested as distributed-memory parallelism. First of all, we would like to replace the current OpenMP implementation in SLOPE with the MPI Shared Memory (SHM) model introduced in MPI-3. Not only does a unified programming model provide significant benefits in terms of maintainability and complexity, but the performance may also be greater as suggested by recent developments in the PETSc community. Secondly, some extra work would be required for a fair comparison of this new hybrid MPI+MPI programming model with and without sparse tiling.
The experimentation was carried out on a number of "conventional" Intel Xeon architectures; we aim to repeat the experimentation on Intel's Knights Landing soon.
Finally, a cost model for automatic derivation of fusion schemes and tile sizes is still missing.
CONCLUSIONS
Sparse tiling aims to turn the data reuse in a sequence of loops into data locality. In this article, three main problems have been addressed: the specialization of sparse tiling to unstructured mesh applications via a revisited loop chain abstraction, automation via DSLs, and effective support for shared-and distributed-memory parallelism. The major strength of this work lies in the fact that all algorithmic and technological contributions presented derive from an in-depth study of realistic application needs. The performance issues we found through Seigen would never have been exposed by a set of simplistic benchmarks, as often used in the literature. Further experimentation will be necessary when 3D domains and high-order discretizations will be supported by Seigen. In essence, the performance experimentation shows systematic speed-ups, in the range 1.10×-1.30×. This is hopefully improvable by switching to Hilbert curve numberings and by exploiting shared memory through a suitable paradigm. Finally, our opinion is that sparse tiling is an "extreme" optimization: at least for unstructured mesh application, it is unlikely that it will lead to speed-ups in the order of magnitudes. However, through automation via DSLs, and with suitable optimization and tuning, it may still play a key role in improving the performance of real-world computations.
SUPPLEMENTARY MATERIALS

Summary of the Optimizations Attempted in Seigen
A number of potential optimizations were attempted when applying sparse tiling to Seigen. The experimentation with these optimizations was, however, inconclusive. Although performance improvements were observed in various problem instances, in a significant number of other cases either a detrimental effect or no benefits were noticed at all. Below, we briefly discuss the impact of four different execution strategies: (i) use of global maps, (ii) variation in tile shape, (iii) software prefetching, and (iv) use of an extended boundary region to minimize redundant computation.
Global and local maps: Algorithm 1 computes so-called local maps to avoid an extra level of indirection in the executor. Although no data reuse is available for the local maps (each fused loop has its own local maps), there might be benefits from improved hardware prefetching and memory latency. We compared the use of global and local maps (i.e., the former are normally constructed by Firedrake and provided in the loop chain specification, and the latter are computed by SLOPE), but no definitive conclusion could be drawn, as both performance improvements and deteriorations within a 5% range were observed. Tile shape: In Section 7.8, we have explained that a Hilbert-renumbered mesh might substantially improve the tile shape quality. Hilbert curves, however, are not supported in Firedrake yet. An alternative consists of partitioning the seed iteration space with a library like METIS [18] before applying RCM. We experimented and discovered that this approach, too, was not exempt from side effects. The main cause was increased translation lookaside buffer (TLB) misses, which occur whenever the CPU cannot retrieve the mapping to the physical page corresponding to a given virtual page. Since the page table has a hierarchical structure, handling a TLB miss usually requires multiple accesses to memory. Hence, TLB misses are much more costly than cache misses. Sparse tiling increases the TLB miss/hit ratio because of the fragmented streams of virtual addresses. This is evident (and more pronounced) when the tile size is small, in which case a TLB miss is quite likely to occur when jumping to executing a new loop. This problem is exacerbated by the metis partitioning (in contrast to chunk), which leads to irregular tile shapes. Here, tile expansion may eventually incorporate iterations living in completely different virtual pages. VTune experimentation with q = 1 and q = 2 versions of explosive_source showed that chunk-and metis-based sparse tiling suffer from an increase in TLB misses of roughly 16% and 35%, respectively. To mitigate this issue, we explored the possibility of using larger virtual pages through Linux's Transparent Huge Pages mechanism, which was enabled to automatically allocate memory in virtual pages of 2MB (instead of the default 4KB)-as long as the base array addresses were properly aligned. However, no significant differences were observed, and a deeper investigation is still necessary. Software prefetching: In a loop, there is usually more than a single stream of memory accesses amenable to hardware prefetching (e.g., accesses to the indirection maps; direct accesses to data values; indirect accesses to data values if the mesh has a good numbering). Sparse tiling, unfortunately, impairs hardware prefetching for two reasons: (i) the virtual addresses streams are considerably shorter; and (ii) tile expansion introduces irregularities in these streams. Software prefetching can be used together with hardware prefetching to minimize memory stalls. PyOP2 and SLOPE have been extended to emit intrinsics instructions to prefetch the iteration i's maps and data values while executing the iteration i − k at distance k. No compelling evidence that this further transformation could systematically improve the performance was found. Extended boundary region: The special non-exec tile T ne (see Sections 3 and 5) reduces the amount of redundant computation in long loop chains by expanding over boundary tiles. There are two ways of creating T ne : either an extra layer of data is added to the boundary region (e.g., see Figure 6 ) or during inspection, by searching for mesh boundaries. The current implementation only supports the first option. Manually deriving T ne would be not only algorithmically complex, but also potentially very expensive.
