Modern polyhedral compilers excel at aggressively optimizing codes with static control parts, but the state-of-practice to find high-performance polyhedral transformations especially for different hardware targets still largely involves auto-tuning. In this work we propose a novel polyhedral scheduling technique, with the aim to reduce the need for auto-tuning while allowing to build customizable and specific transformation strategies. We design constraints and objectives that model several crucial aspects of performance such as stride optimization or the trade-off between parallelism and reuse, while taking into account important architectural features of the target machine. The developed set of objectives embody a Performance Vocabulary for loop transformation. The goal is to use this vocabulary, consisting of performance idioms, to construct transformation recipes adapted to a number of program classes. We evaluate our work using the PolyBench/C benchmark suite and experimentally validate it against large optimization spaces generated with the Pluto compiler on a 10-core Intel Core-i9 (Skylake-X). Our results show that we can achieve comparable or superior performance to Pluto on the majority of benchmarks, without implementing tiling in the source code nor using experimental autotuning.
Introduction
Several program optimizations such as loop tiling and locality enhancements [3, 21] , automatic coarse-grained parallelization [1, 15] , SIMD-vectorization [12, 14, 24] and accelerator targeting [2, 10, 16, 27] have been devised and implemented in research and production compilers using the polyhedral compilation framework [7, 9] . A typical limitation of most, if not all, of these strategies is their limited ability to adjust the program transformation approach to the specifics of the input program, and of the target architecture. In practice, typically knobs to control the internal heuristics are exposed (e.g., the loop fusion scheme, the tile sizes, etc.), and auto-tuning remains the de-facto approach to achieve good performance on a variety of targets and programs.
This problem is compounded when considering the variability and trends of upcoming multi-and many-core processors where the balance between the hardware parallelism available and the data bandwidth at different memory levels can very significantly differ across architectures, in turn requiring different optimization strategies. Consequently, performance portability for new hardware targets is often addressed via exploring a search space of candidate transformations, or at worst by a redesign of the program optimization scheme.
The exploration space that a compiler, auto-tuner or expert would have to consider typically includes loop tiling (and the associated tile size selection process), selecting a good loop fusion/distribution structure, the loop permutation order and often some unrolling factor.
Polyhedral compilers provide the capability to automatically find an arbitrarily complex loop transformation sequence by reasoning instead on the scheduling of operations, thereby alleviating the phase ordering issue of composing loop transformations. But severe issues remain, such as how to achieve seemingly opposing (or at least conflicting) goals: outer parallelism, inner parallelism for SIMD-vectorization and locality improvements. Each of these topics has been extensively studied, and several constraints and optimization criteria have been proposed, e.g. [1, 3, 9, 10, 12, 14, 21, 24, 27] . Nevertheless, these techniques still employ typically a onesize-fits-all scheduling approach, suffering from a lack of portability to different architectures.
We propose to tackle this problem by designing and implementing a performance oriented vocabulary for affine loop transformations. In this vocabulary we define several performance idioms, each of which addresses a specific performance aspect of the program. The vocabulary can then be used to customize the set of scheduling constraints and objectives to the particularities of the program and the target architecture. In practice, a subset of the performance idioms are embedded into a Single Integer Linear Program (ILP) [25] . This type of scheduling technique starts by constructing the legal space of transformations. In this space, each point represents a complete set of transformations applied to the code.
Some of the variants in this space could look very similar, and not differ in anything, differ only in the actual coefficients chosen (for instance a skewing factor) or be drastically different (e.g. two fused serial loops vs. two distributed parallel loops). We use simple metrics and properties of the code to classify benchmarks into broad categories or program classes, and then select the objectives and constraints that must be embedded into the system. The ILP is then optimally solved to find in a single step the final program transformation. This one-shot scheduling approach [20, 25] facilitates the encoding of the properties that we desire to produce on the transformed code. And essentially, this also allows to order and prioritize the cost functions according to the set of properties expected on each classification category.
In summary, we make the following contributions: i) we model in a single ILP numerous novel constraints and objectives for simultaneously extracting outer and inner (SIMDvector) parallelism, trading parallelism for reuse, referencelevel stride optimization and a fusion/distribution approach; ii) we develop a customized scheduling approach that modifies the set of objectives and constraints used in the ILP as a function of the specifics of the benchmark and the target architecture, and provide evidence that classifying the type of kernels according to simple metrics extracted from affine programs can be utilized to judiciously enable or disable constraints and objectives of the ILP; iii) we propose a novel unroll-and-jam cost model that maximizes cache-line usage and that accounts for outer parallelism and inner data reuse; iv) we perform an extensive experimental evaluation on a modern multi-core processor, and demonstrate that we can achieve comparable or superior performance to the Pluto compiler for most benchmarks evaluated, all without resorting to source code tiling nor autotuning.
The rest of this paper is organized as follows. The next section discusses the motivation of our work. Section 3 presents the background concepts of the polyhedral model. Section 4 describes the ILP formulation implemented in our transformation engine. Section 5 evaluates the efficiency of our transformations. Finally, Section 6 and Section 7 discuss the related work and the conclusions of this paper, respectively.
Motivation
Modern multi and many-core processors such as IBM Power9, Intel KNLs and Skylakes, exhibit sufficiently distinct architectural features from their earlier multi-core predecessors. Some of the most strikingly different features are much larger L2-caches, which in Skylake's becomes a 1MB private cache per core. Substantially larger L3 caches are also on the rise. Another important feature to consider is hardware parallelism, i.e. the number of cores and the width of the vector registers. Figure 1 shows the source code of the FDTD-2D kernel from the PolyBench/C benchmark suite. The plot on this figure represents the performance distribution of 567 different tile configurations by using the Pluto polyhedral compiler, and applying the wave-front parallelization strategy. These variants were ran on a 10-core Intel Core-i9 7900X. This performance looks far from impressive. As a reference, a 1024 3 DGEMM using LAPACK on this machine achieves 0.6 TF/s.
To achieve this performance, we have designed and developed a novel set of performance constraints and objectives on a polyhedral compiler. Each of the implemented objectives fulfills a specific role and purpose in the overall approach. Collectively, they comprise a Performance Vocabulary for Affine Transformations, and are meant to be used for a wide variety of program classes and architectures. Although this set is not comprenhensive yet, they can be seemlessly combined and used to build more powerful transformations. These are embedded into a single integer linear program (ILP) and allow to model several relevant features such as: a) inner-vector parallelism, b) inner-loop stride access, c) selection of outer parallel loop dimension and trade-off between inner data reuse, d) skewing minimization for stencil computations, e) automatic loop fusion/distribution selection, f) separation of independent statements and non-flow related statements, g) true dependence distance minimization. Unlike previous works, these objectives and constraints are not applied indiscriminately. Instead, we carefully craft and apply "transformation recipes" depending on a number of metrics extracted from the input program as well as the target architecture. To give a simple example, objective (c) is not essential for performance on a many-core processor such as Intel's KNL or IBM's Power9, but it is fundamental to achieve a high-fraction of the machine's effective peak performance on the Skylake processor. More importantly, our approach does not require extensive and costly empirical space exploration or autotuning in order to achieve this performance, nor does it use the loop tiling (blocking) transformation to achieve high-locality.
The hardware counters in (Table in Fig. 1 ) compare our transformed program to the best performing pluto variant in the space explored. Some differences are rather surprising. For instance, the best pluto variant incurs in 5x more memory loads and 4.2x more stores than our transformed code. Note in contrast to Pluto we do not use loop tiling, which often result in code more difficult for the back-end compiler to optimize, losing vectorization potential. Nonetheless, the ratio between incoming and evicted data for our optimized code is 3, 5 and 22, for the L1, L2 and L3 caches. This is a much higher ratio than Pluto's variant, which achieves ratios of 2, 0.6 and 12. This means that our transformation strategy, despite inducing a higher total volume, makes better use of the data brought in. Next, despite a 4x higher L2 request rate, we achieve 3x fewer misses on the L2 cache.
Background
Here we recap background concepts of the polyhedral model that we use in Section 4, namely iteration domains, access functions, dependence polyhedra, program schedules and the notation used to model our constraints and objectives. Iteration domains: In the polyhedral model, each syntactic statement S is associated to a set D S ∈ Z + comprised by the dynamic instances of the statement. Then, via Access functions, each iteration domain is mapped to the (multidimensional) data-space of arrays referenced in them. Each access function F A is represented as a M A × N S 2D array, for (t = 0; t < NT ; t ++){ for (j = 0; j < NY ; j ++) R :
for (i = 1; i < NX ; i ++) for (j = 0; j < NY ; j ++) S :
for (i = 0; i < NX ; i ++) for (j = 1; j < NY ; j ++) T :
for (i = 0; i < NX -1; i ++) for (j = 0; j < NY -1; j ++) U : hz [ 1  18  35  52  69  86  103  120  137  154  171  188  205  222  239  256  273  290  307  324  341  358  375  392  409  426  443  460  477  494  511  528  545 where M A is the number of space dimensions of the array and N S is the number of loops surrounding statement S. In our FDTD-2D example, the first read reference of array ex in statement U is represented as
Dependence polyhedra embody the semantic orderings of a program. Every program dependence in a SCoP is represented by one or more dependence polyhedra D R,S . These polyhedra define the ordering among points ì x R and ì y S from the iteration domains D R and D S , respectively. Lastly, program transformations are represented in the polyhedral model by scattering functions. These functions assign to each point of the iteration domain D S a timestamp which determines its execution order. Timestamps can be multidimensional and are implemented as matrices. The number of rows of the matrix must be defined a priori. For this work we use the 2d+1 representation, which assumes that the scattering matrix will consist of 2d+1 rows and d+1 columns, where d is the maximum loop depth. The advantage of this representation is its natural mapping to the syntactic nesting structure. Assuming the schedule ie zero-offset, the even rows are known as scalar dimensions and represent the nestingness and order of statements and loops, whereas the odd dimensions represent linear transformations (e.g. skewing or loop permutation) applied to the input code. Below we show the scheduling matrices for statements R and S of our FDTD-2D example. 
Transformations
Prior approaches for polyhedral scheduling typically aims for the best one-size-fits-all formulation: irrespective of the input program, the same scheduling problem is solved [3, 14] . Our work takes a significantly different approach: instead of attempting to find the best unique set of constraints/objectives to optimize for, we instead develop a catalogue of individual smaller scheduling formulations, each targetting a specific performance objective. We then select, as a function of the input program, which of these shall be used, and in which order of priority they shall be solved. We consider the following optimization objectives: outer parallelism, inner (SIMD) parallelism, parallel dimension selection and inner reuse tradeoff, stride access minimization and stencil skewing parallelization, independent statement separation and dependence guided fusion. We next briefly recap the single-ILP legal space construction we build our work on [14, 20] , before detailing the formulation of these individual objectives.
In this work we seek to avoid the hassle of experimental exploration very often required in HPC and in polyhedral compilation frameworks. In particular, as we will see in Sec. 5, we intentionally compare a single non-tiled variant to an autotuned space which includes 3 distinct fusion and heuristics and several tile sizes for each of these (Note: a fully integrated approach to combine our current goals plus the tiling constraints are left for future work). The nature of our main goals, i.e. achieving outer and innermost parallelism in addition to good locality usually embody opposing and conflicting goals. A scheduling approach such as Feautrier's [8, 9] which maximizes the dependence satisfaction at the outermost scheduling dimensions directly favor schedules which could potentially have multiple degrees of inner parallelism. Nevertheless, it also impacts the generated code by very often resulting in skewing transformations. On the other hand, the state-of-the-art Pluto polyhedral compiler and its tiling hyperplane method in order to find maximal tiling hyperplanes, push the dependence satisfaction into the innermost dimensions. This commonly results in tiled codes which are inherently serial. Approaches such as [14] take a decoupled approach which combines tiling and exposing coarse grained parallelism proceeded by a second stage which restructures the full and partial tiles of the parametrically tiled code. As we will see, the same set of constraints and objectives are not always necessary to achieve good performance. A good and simple metric and few decisions that select that nature of the constraints to be embedded should suffice to explore the legal space which in itself is an optimization / tuning space.
Building the legal space
We first construct the convex set of semantics preserving transformations [20, 25] . The ILP system built for this purpose integrates legal constraints for all dependence polyhedra of the original program. Given a point ⟨x R , y S ⟩ ∈ D R,S , the semantics of the original program state that the statement instance x R executes before y S . Thus, in the transformed program, the schedule applied to both of these statements, Θ R and Θ S must preserve the ordering condition
, which states that the timestamps assigned to them must maintain their relative order. As we use multidimensional schedules of depth 2d+1, this means that the difference between the two schedules will be a lexicographic positive vector:
In practice, we define 2d+1 δ variables for each dependence polyhedron, and bound each δ D R, S i , i ∈ {1, .., 2d + 1} to the set {0,1}, making it a boolean variable. This way, a dependence is satisfied when any of the δ i variables become positive. These constraints are formalized as:
In order to not over-constrain the legal space, as soon as a dependence is satisfied at some schedule level l, we nullify the constraints for all subsequent dimensions c > l. This is achieved in Equation 1 by the term K · ì n + K, K ∈ Z. Coefficients θ S i, j are bounded, for a sufficiently large K [20] , this expression upper bounds the time elapsed between any two statement instances. Thus, when δ l = 1, for some l, the sum of the right-hand side of the inequality becomes in practice −∞. On the contrary, all levels c < l would have δ c = 0, so the sum becomes zero, making the constraint valid. Lastly, we apply the affine form of the Farkas Lemma [9] to compute all the non-negative functions over the candidate system. Additional constraints and objectives can then be embedded to then find good program schedules [9, 14] .
Outer Parallelism (OP)
As we will see, in some scenarios it's obvious that extracting OP is possible (e.g. DGEMM), while in others (e.g. LU), is necessary to use the second outermost loop. In both cases we use the δ variables embedded into the convex space, and minimize their sum at a predefined level selected in the following fashion:
Dimension-p is parallel when this sum is 0. However, a nonzero sum doesn't necessarily preclude extracting parallelism.
Stride Optimization (SO)
To maximize the vectorization potential and program locality we drive the selection of transformations to minimize highstrides of innermost loops. The program objective adds the costs of all individual statements that have dimensionality 2 or higher, while the statement stride cost is built from a weighted sum of its references. Unlike previous works [14, 26, 28] we don't aim to maximize stride-1/0 references. Rather, we aim to minimize high-stride penalties as a whole by using higher penalties for stride-0 references than for stride-1, and doubling the penalty for write references. The reason for this is that if the iterator originally bound to the fastest varying dimension (FVD) of the array does not appear
, then some other iterator will take its place, thereby inducing a high-stride if the new iterator appears in any non-FVD. This is possible when, for instance, a loop permutation is applied.
To minimize high-strides we compute a set of weights for each loop iterator and reference. These weights represent the stride cost when the loop associated to a designated iterator is moved into the innermost loop dimension:
Next, we create two new variables per statement, one that minimizes the sum of θ coefficients, and another to aggregate the stride cost. These variables are placed in the leading position of our system to give them the highest possible priority. Thus, the objective for each statement becomes:
Inner Parallelism (IP)
Inner parallelism is modeled in a manner analogous to outer parallelism, except for the schedule dimension(s) corresponding to inner loops. The number of loop-carried dependences at the inner-most loop level shall be 0 if possible to achieve a parallel loop, which means they shall be carried at outer levels [8] . We only seek inner-parallelism when the loop depth is 3 or more, as 1D and 2D parallel loop nests are already covered by the outer parallelism case. Note an outer-parallel loop can always be sunk inner-most, possibly after using strip-mining [3] .
Outer Parallelism and Inner Reuse (OPIR)
To model the performance trade-off between parallelism and reuse we focus on the schedule dimensions of each statement S that model loops (i.e., the odd rows), and bind these to the spatial dimensions of each array reference F ∈ S. In practice, this novel formulation usually boils down to finding the best loop permutation that optimizes reuse while simultaneously selecting the best schedule dimension for outermost parallelism. We construct a set of auxiliary functions for each non-scalar reference:
ot her wise
The above matrices are constructed for each array reference F ∈ S in the SCoP. Their roles are the following: matrix M F computes weights of the iterators on reference F , matrix G F represents the parallelism component and its relation with F 's space dimensions, and R F embodies the potential reuse in an array w.r.t. the schedule of statement S:
Bounds on the schedule θ S are computed by taking the identity schedule (I S ) as the reference, and schedule rows of non-full dimensional statements are explicitly padded. Not doing so allows the scheduler to pad any of the outer rows, disrupting our objective. These constraints are shown below:
Q: between (P)arallelism and (R)euse We now have all the necessary ingredients to complete our formulation for trading parallelism for inner data reuse. For each (non scalar) array reference F (appearing in some statement S) and linear scheduling dimension i, we create a pair of variables, Q +F i and Q −F i . These variables relate three aspects of our formulation: parallelism, schedule-to-data-space mapping and the reuse potential. The first component is represented in our formulation by the 1 − δ D S, S i term, which equals 0 when a dependence is satisfied at schedule dimension i. The second component makes use of the auxiliary matrix G(F , M F ) i, j . Its role is two-fold: i) mapping outer loop dimensions of θ S to spatial dimensions of array F ; ii) reward or penalize transformations depending on the appearance of specific iterators in F and that are used (or not) in θ S . The third component further rewards transformations where the loop iterators that appear in θ S are not used in reference F , with higher rewards given to outer loop dimensions (see definition of array R F j above). This essentially is the reuse component of the formulation. Lastly, the paddinд variable is computed per statement, and is only required when statement S is not
Here we give a very brief example. Suppose we have the statement . Between these two options, the latter is the one that maximizes the overall cost of the statement (See next set of constraints). This pushes up the values of its Q F i , and in turn, make δ 1 = 1. When a statement carries no self-dependence, the 1 − δ term is removed, and the constraints attempt to find the best loop order that minimizes the number of high-strides in statement S. Next, the cost functions of parallelism-vs-reuse are aggregated per statement S. Each Q −S and Q +S are bounded by a compile-time constant that varies per statement (U B S ):
Finally, the overall program is optimized by minimizing the sum of Q −S variables into Q pr oд and inserting it into the leading position of the current system. This, in turn, maximizes the values of the Q +S variables in our formulation:
Dependence Guided Fusion (DGF)
To achieve even better locality, we enhance our formulation with constraints and objectives for fusion driven only by inter-statement flow-dependences (ISFD). WAR and WAW dependences are not considered to alleviate pressure on the back-end register-scheduler (In the next section we actually increase the lexicographic distance between statements), while we consider RAR dependences to be unprofitable in our scheme unless full fusion is achieved. We thus exclude it as well. The equations below are embedded in the ILP for each candidate dependence in the program. We skip all other types since these have little influence in the producer-toconsumer distance between two statements in a dependence relation. Before constructing the below constraints, we build the strongly connected components (SCC) graph, and query it to decide if two statements are separable This step is independent of the actual fusion structure of the input program. The focus of this objective is to capture inter-statement reuse, since our OPIR and SO objectives already account for intrastatement reuse. Moreover, we have observed that intra-SCC + inter-statement reuse is not beneficial since the statements will already share the outermost loop dimension. In practice, this is not a problem unless using extremely large problem sizes (e.g. 32KB or 64KB on a single array dimension).
We assign to each component of ì W a distinct power-of-2 weight. A similar scheme was first used by Vasilache [25] . If an ISFD involves statement R and S, then the above equation assigns a higher cost to the outer scalar dimensions. This cost function is minimized by bringing closer the scalar dimensions of the respective statements R and S. Clearly, when two statements share the same scalar coefficient at some level k, the term associated to this level becomes zero. As a final touch, when the array reference of the flow dependence is simultaneously a write-reference of the dependence target, then we double the cost of every term in ì W . Finally, we minimize the sum of ∆ variables and insert the sum variable in the leading position of the current system:
Separation of Independent Statements (SIS)
A weakness of dependence distance optimization schemes based on the minimization of the maximal reuse distance [3] is the inability to distinguish between dependences that affect performance and those that do not. Fusion of unrelated statements can be very detrimental to performance as these can essentially flush the cache. Thus, we introduce the concept of independence distance. The intuition is the following: dependence distance minimization attempts to bring closer all statement instances in a dependence relation, zipping them together from the outermost scalar dimension to the innermost. Now, as a given statement S can be the sink of several dependences (RAW, WAR, WAW and RAR), the statement that ends up closer to S could be practically anyone which is a source of a dependence. This order will ultimately be determined by several factors including the order in which the dependences are processed.
To maximize the good fusion potential in the program we increase the distance between independent statements. Coupling SIS with the DGF objective has proven to produce robust schedules with good locality in our evaluation. We follow three criteria to decide when to introduce our separation constraints and objectives: i) ∃D R,S and it is not a flow dependence, or ¬∃D R,S ; ii) R S; and iii) both R and S do not belong to the same SCC. (Note: in the equations below, R and S are both statements as well as natural numbers with values equal to their program order location):
We build a set I with all the pairs of statements which fulfill the above conditions. The above equation imposes an order between statements R and S. This direction is chosen from their relative order in the program, i.e. if S appears after R or if R appears after S. Finally, the overall independence distance cost function is assembled from the cost of all pairs (R, S) ∈ I with: min {∇ sum }
Parallelism on Stencil Computations
Traditionally, parallelism in time-iterated stencil computations has been primarily achieved through skewing transformations (to enable wavefront parallelization). The approach we take is to decompose the stencil optimization into 3 tasks: i) Stencil Dependence Classification (SDC), ii) Stencil Parallelism Constraints (SPAR) and, iii) Stencil Minimization of Vector Skewing (SMVS).
Stencil Dependence Classification (SDC) We carefully push dependence satisfaction into specific scheduling dimensions based on the type of dependences encountered. The classification applied is quite simple and stems directly from the inherent structure of the stencil. The cases that we consider are the following: i) Non-self forward dependences (NSFD): satisfy at the scheduling level corresponding to the time loop (outermost linear dimension); ii) Non-self backward dependence (NSBD): satisfy at some inner scalar dimension; iii) Self-dependences and No. statements > 1 (SDN): satisfy at the second linear schedule dimension (corresponds to the first spatial dimension) iv) Self-dependences and No. statements = 1 (SD1): can only be satisfied by the linear scheduling dimensions, proceed greedily preferring outer linear dimensions .
MU LT I _S K EW ≡ T r ue :
MU LT I _S K EW ≡ T r ue, ∀k ∈ 0..d − 1, k even :
With this simple classification we embed the above constraints and objectives, and link the δ k variables to Ψ k variables. The latter are inserted in the leading position of the system from the outermost to innermost dimension: Above we have used the predicate MU LT I _SKEW := No.cores < 2 × OPV , OPV = operations per vector , which allows to adjust the degree of skewing w.r.t the time dimension. The previous constraints allow to decide where to satisfy dependences. Next, we will discuss how to satisfy them.
Stencil Parallelism Objectives (SPAR)
We now explain how must inter-and intra-dependences be satisfied. In the intra-case, we apply a shift along the time dimension. This boils down to peeling an iteration from the dependence sources. In addition, it also serves the purpose of prefetching the data for subsequent iterations. Next, we enforce a fixed shift along the first space dimension of the stencil when the number of cores is "too large", and avoid resorting to skewing the iteration space in this scenario. The motivation is to avoid large skewing factors (or completely avoiding them) to minimize the wavefront start-up time required.
Intra-statement dependences, or self-dependences, can only be handled by manipulating the coefficients of the linear scheduling dimensions. Our approach favors transformations where the skewing degree reduces level-by-level, from row-1 to row d-2, and introduce a constraint to push the skewing factor of the first linear dimension above a specific threshold (the number of full-dimensional statements, card(FDS)). This last constraint is nullified when the statement in question is not full dimensional. Moreover, we enforce that the skewing is performed only w.r.t each space dimension of the stencil:
∀S, MU LT I _S K EW ≡ T r ue, k ∈ 0.. ⌊d /2⌋ − 2, min_dist := {1 : k > 0; 0 : ot her wise } :
The next step is to decide if skewing is in fact necessary to satisfy the self dependence. If δ D S, S 3 is set to 1, then we force a time skewing on the same dimension, otherwise, the time coefficient of the first spatial dimension can remain as zero. The reader can observe that if the skewing at level-3 is initiated, this will force the first linear dimension to have a larger skewing factor. The ∆ − SU M variable is inserted in the leading position of the current system, while ∆ + SU M is appended at the end.
Stencil Minimization of Vector Skewing (SMVS) Finally, to avoid the detrimental skewing transformations that derive in high-stride accesses, we create a cost function that minimizes that skewing factor of the iterator and scheduling dimension associated to the fastest-varying dimension of the dominant array of each statement. To limit the vector skewing, we use a set of Φ S variables which are inserted in the leading position of the current system. We call this the "Stencil Minimization for Vector Skewing (SMVS)",
Skewed Parallelism (SKEWPAR)
A loop is parallel iff it carries no dependence. To model parallelism, for each statement S and scheduling dimension k, we use one binary variable π S k ∈ {0, 1}. These variables are associated to all δ R,S k and δ S, R k dependence satisfaction variables that involve them. We recall that when a dependence D R,S is satisfied at some schedule level k, the corresponding δ R,S k is set to 1. In accordance, a loop dimension surrounding a statement S can be parallelized when δ R,S k = 0 for all possible δ . We thus upper bound each π S k in the following manner:
The benefit of using the introduced π S k variables is to centralize the parallelism property and to simplify several of the proceeding ILP constraints and performance objectives. Consider for instance, the statement
in the gemm kernel. Its schedule will have 7 dimensions. The cannonical loop order for this benchmark, (i,j,k), will exhibit two outermost parallel loops, with π 1 = 1∧π 3 = 1∧π 5 = 0. If however, a transformation seeks inner parallelism for SIMDvectorization, then π 3 = 0 ∧ π 5 = 1.
It's not always possible to extract outermost parallelism. The alternative is to focus on extracting this property at the second outermost loop dimension, one surrounded by an outer serial loop (as in the cholesky benchmark). As a precondition, we require the π variables to be previously inserted. In the event that no other performance objective has previously created them, we do so here. To structure the code so as to have parallel loops at the second linear schedule dimension, we optimize 3 cost functions: i) maximization of dependence satisfaction at schedule dimension 1; ii) minimization of sum of coefficients (to minimize potential skewing transformations induced by (i); iii) maximization of π values at the second linear dimension (e.q. the second outermost loop). The combination of these three cost functions allows the scheduler to choose between loop permutations and skewing transformations to maximize sync free loops at the second outermost level. Eq. 9 summarizes these objectives:
4.10 Space Narrowing (SN)
In some cases, particularly when we don't expect to be able to extract multiple levels of parallelism and to simplify the ILP solving process, we preset some scalar schedule coefficients and restrict the maximum values of the linear coefficients. The former has absolutely no impact on the correctness of the schedule, whereas the latter only limits potential skewing transformations.
Resource Constrained Optimal Unrolling (RCOU)
The final step we take is a quick analytical exploration to find optimal unroll and jam factors. This pass is applied after the polyhedral code generation process, i.e. once we recover the classical AST representation. At this stage, parallel and permutable loops are marked, and we collect some statement and reference metrics:
wr it e S j = 1 : j appear s in F ∧ F is wr it e r e f er ence 0 : ot her wise
Then we proceed as follows. Program P consists of a set of disjoint multi-dimensional loop nests {l 1 , l 2 , . . . , l n }. For each loop nest we first check if its unrollable. The conditions for deeming a loop nest unrollable are: i) must have at least one parallel or permutable loop dimension with constant bounds; ii) l i can be imperfectly nested, but statements should only appear in innermost loops, i.e. l i can have multiple innermost loops. If l i is unrollable we proceed to build the resource, reuse and write matrices for all the statements contained in l i . Next, we produce a list of nodes contained in l i , and set the exploration space of unrolling factors for each loop dimension to the set {1, 2, 4, 8, 16}. If some loop dimension in l i doesn't meet the criteria of condition (i), then the set becomes the singleton {1}. The space of candidate unrolling factors for the entire loop nest l i will be the cartesian product U F 1 × U F 2 × . . . U F n , where n is the number of loops in the loop nest. Function explore_space (Algorithm 1) evaluates each point in space by computing both its resource usage (the number of array references contained in each inner loop of l i , and the reuse yielded by the candidate unrolling factors. The resource usage is product-wise accumulative w.r.t the unrolling factor of each loop surrounding pair {statement, reference}. For example, if reference A[i][k] is surrounded by 3 loops, i, j, k, and this are unrolled by {2,4,3}, then the resource usage increases from 1 to 2 × 3. In addition, our cost model favors unrolling of non-innermost loops (the innermost loop already has inherent reuse) and dimensions which impact write-references, while also penalizing unrolling of loops that induce high-strides on array references. Once the optimal set of unrolling factors has been determined, a new loop nest l ′ is produced by unrolling and jamming every loop in l by opt_u f . The new program results from concatenating all the unrolled loop nests. Here we make two notes: i) in our experiments, we set N _V EC_REG to 32; ii) the purpose of bounding the product of unroll factors by N _V EC_REG/2 is to account for two FMA units on the SKX processor.
Putting it all together
Not all the constraints presented here are always embedded into the ILP. The main conclusion of this work is that when using a single ILP to find a transformation schedule, resorting to an exhaustive empirical evaluation should not be needed, but additional information on the target architecture and program are needed to select the right objectives and their optimization order. We thus adopt the classification strategy shown in Eq. 10, where we use SCoP metrics to classify the program into one of four types: stencils (STEN), 2-dimensional kernels (LDLC), dense linear algebra (HPFP), and other. The metrics we use are the number of dependences in the SCoP (N dep ), the number of scheduling dimensions (dim(Θ)) and the number of self-dependences (N sel f .dep ). The function is_stencil returns True when at least half of the statements in the SCoP refer to at least 2 neighboring points in the grid.
Once we have determined the class of a program, we select and order the performance objectives depending on the target architecture. Table 1 lists the performance objectives embedded into the ILP for each target platform and class of benchmarks. When necessary, some objectives are embedded only when the number of dependences is under a particular threshold.
Class
Priority of Performance Objectives (+ ←→ −) STEN SMVS, SDC, SPAR LDLC SO, IP, OPIR, SIS, DGF, OP HPFP {SO, IP, OPIR}(N sel f .dep ≤ N SCC ), SIS, DGF, OP OTHER SO (N dep < 50), OP, SN Table 1 . Cost functions set and priority per program class
The overall selection of performance objectives, and more importantly, their order of application is mostly determined by a mix of the target architecture, specific traits of the program and the "hardness" of the objective. Table 2 recaps the peformance idioms described together with the part of the schedule they affect and the purpose/goal of the idiom.
The rule of thumb followed to set the order of objectives is to give higher priority to more specific performance idioms. In practice, this translates to placing first the objectives that affect fewer schedule dimensions. For instace SMVS, SO, and IP only affect the innermost schedule dimensions, whereas OPIR affects all the linear dimensions except the innermost. Intuitively, it is easier to extract outer croase-grained parallelism than innermost/vector-SIMD parallelism because we have more dimensions to play with. The SDC and SPAR constraints and objectives follow the same rationale. On the other hand, objectives related to locality, SIS and DGF, impact exclusively the scalar dimensions. Moreover, as d + 1 of the 2d + 1 schedule dimensions are scalar, these set of transformations are "easier" than extracting inner parallelism with IP.
Aside of the above rules of thumb, a priori knowledge of the schedule maneuverability is also used to select the objectives. For instance, iterative stencil computations will consist typically of a single Strongly Connected Component (SCC). As such, attempting to separate unrelated statements with the SIS objective is unnecessary. In a similar fashion, applying the DGF idiom will likey reduce the potential parallelism or could make it much harder to extract both coarse and finegrained parallelism. We note here that in the stencil case, the outermost linear dimension of the schedule will represent a linear combination of the original loop iterators. Thus, this loop dimension will be inherently serial. Clearly, these objectives are not needed for the HPFP and LDLC program classes; both of these strongly benefit from the locality objectives (SIS and DGF). However, the LDLC program class has only two linear schedule dimensions, and are very sensitive to Table 2 . Summary of performance vocabulary fusion/distribution transformations. In general, fusing a serial and a parallel loop will yield a serial loop. Although SO and IP only affect the innermost linear dimension, in most cases there will be fewer transformations that minimize the number of high-strides than schedules which produce parallel innermost loops. As an example, consider the DGEMM benchmark: exchanging loop-i or loop-j with loop-k will produce an innermost parallel loop, but only loop-j will minimze the sum of high-strides. We also point out that some of these transformations could serve correcting purposes (e.g. OP in the HPFP and LDLC program classes). Lastly, the OTHER program class exists mainly to cover kernels/benchmarks which are too complex to be handled by the linear solver used in our experiments, PIP. A better customized recipe for these benchmarks require more powerful tools such as GAMS or CPLEX. We thus defer this to later work.
Experimental Evaluation
The entire scheduling approach presented here has been implemented in a fully automated fashion, in the PoCC compiler [17] . PoCC is to our knowledge the only open-source compiler implementing the multidimensional legal scheduling space we rely on [20] , hence our implementation choice. We evaluate the effectiveness of our scheduling approach with the Polybench/C (v3.2) benchmark suite [18] on 3 representative processors. As baseline for our experiments we use the Pluto compiler [3] (Version 0.11.4-85). For each benchmark we generate a number of variants for each fusion/distribution heuristic available in Pluto, namely for maximum fusion (MF), no fusion or maximum distribution (NF) and smart fusion (SF). The set of tile sizes chosen vary substantially depending on the values of the problem sizes, and we include powers of 2 and non-powers of 2. Roughly, the tile sizes chosen for each loop dimension belong to the set {1,2,32,50,64,100,128,...} up to approximately a fourth part of the problem size, always trying to have at least one nonpower-of-2 tile size for every power-of-2 evaluated. The final footprint of the tile sizes considered are such that they all fit in L1 or in L2. We use the STANDARD Polybench Table 3 . Polybench evaluation on Intel Skylake (10 cores) with GCC 7.2 dataset, where the total kernel footprint fits in L3. Experiments were conducted on a 3.3 GHz, 10-core Intel i9-7900X (codename Skylake-X), with 32 KB L1, 1 MB L2/core, 13.25 MB L3 (shared). Benchmarks were compiled using GCC 7.2. As a reference, on this machine the LAPACK DGEMM kernel achieves 0.64 GF/s. The results of our evaluation are shown in Table 3 . For each benchmark we report our achieved GF/s, the Pluto space size (No.tiled variants × 3 fusion heuristics), the best Pluto GF/s in the space, the default Pluto tile sizes and heuristic (32x32x...32, --parallel and SmartFuse heuristic), the percentage of the outperformed Pluto space (i.e., the fraction of points in the search space that performs slower than the version we produce), and finally the speedup over the best Pluto variant. Overall, our transformation schemes show very strong performance on practically all dense linear algebra (per our classification), low-dimensional (LDLC loop nests) and stencil-type of kernels. The lowest impact that we achieve are in kernels where practically no parallelism can be extracted without resorting to transformations that increase the dimensionality of the loopnest (i.e trmm, trisolv, ludcmp, floyd-warshall and cholesky). Moreover, the standard dataset of the reg_detect kernel uses a time loop that performs 10000 iterations while using extremely small space grids. Given these constraints, both our approach and the Pluto variants were unable to surpass the original kernel performance.
Regarding stencil computations (adi, fdtd-apml, fdtd-2d, jacobi-1d-imper, jacobi-2d-imper, seidel-2d), we observe that although Pluto applies time tiling to maximize locality, the requirement of skewing the iteration domain to extract parallelism has a 3-fold impact on the transformed code. First, uncontrolled skewing induces the creation of complex loop bounds that back-end compilers (e.g. GCC or ICC) are unable to analyze, producing very low vectorization ratios. Second, tiling produces much coarser (and fewer) units of work, making it unprofitable to use multiple threads per core. Third, the parallelism obtained by skewing transformations takes much longer to be exploited, as multiple time iterations might be required to reach the steady state.
Next, we focus our attention on the lower dimensional benchmarks (atax, bicg, durbin, gemver, gesummv, mvt and trisolv). These kernels have the property that only a single level of parallelism can be extracted (without stripmining or tiling). Hence, choosing between inner-vectorizable loops and outer-parallel loops is usually necessary. Another phenomenon observed in our experiments is that Pluto's MF heuristic was extremely detrimental to parallelism, while NF achieved the worst locality. Hence, Pluto's MF heuristic offered the best trade-off between these properties. These benchmarks are also bandwidth-bound, so tiling them improves the program's locality at the cost of reducing the available parallelism. In this benchmark class, we often fall 20% below the best Pluto variant while outperforming 60-90% of the exploration space.
Turning to the dense linear algebra kernels (gemm, 2mm, 3mm, correlation, covariance, doitgen, gramschmidt, lu, syrk, syr2k, symm), our approach consistently outperforms 93%-100% of the Pluto variants, and usually being 5%-7% away from the best Pluto performance when we don't outperform the whole space. These benchmarks exhibit several degrees of data parallelism which is properly exploited by our strategy. In particular, we note that trading parallelism for reuse (OPIR objectives) yields very strong performance on the Skylake processor, effectively making the computation to boil down to dot-products in all cases except for doitgen which presents an outermost parallel loop. Another crucial aspect to achieving high-performance is our unrolling cost model for data-reuse at the innermost loop level, parallel dimension to unroll and the potential penalization for inducing high-strides.
Performance Influence of Transformations To complement our study, we present in Figure 2 the cumulative effect of our transformations on a subset of benchmarks of the STEN and HPFP classes. For each benchmark (bar cluster), we show up to 6 levels of transformations. The exact objectives applied in each case depends on the architecture and benchmark class (See Table 1 ). The legend shows objectives added incrementally, from the highest priority (e.g. SO for doitgen) to the lowest one (e.g OP). The aggregated effect of several performance objectives defined in our vocabulary are easily observable in the doitgen benchmark, a tensor contraction. In contrast, the same set of objectives yield marginal performance improvements for covariance. It is interesting to see, however, that such a large set of cost functions does not degrade performance despite seeking seemingly opposing goals. We also observe a slight performance degradation due to the RCOU post-processing, which practically doubles covariance's performance while slightly decreasing doitgen's. For benchmarks fdtd-2d and jacobi2d-imper we see that our stencil-specific objectives achieve synergistic cooperation that translate to almost 4× speedup. Tuning Time As last contribution, we show in Table 4 Table 4 . Tuning time (sec) for selected benchmarks using GCC 7.2
Related Work
This work builds largely on previous research that constructs the convex space of semantics preserving transformations [20, 25] , also known as the Single-ILP scheduling approach. In contrast to the level-by-level scheduling strategy used in compilers such as Pluto [3] and PPCG [27] , leveraging the function space created by applying the affine form of Farkas lemma, allows to select transformation schedules that exhibit strongly desired program properties that impact performance. Older seminal works such as Feautrier's [9] minimal-delay schedules greedily satisfy dependences at the outermost loop dimensions, favoring transformations that exhibit inner parallelism. Trifunovic's et al. [24] proposed transformations for vectorization, but that only accounted for loop permutation. Kong et al. [14] developed a complex transformation framework with the goal of exploiting multi-dimensional reuse of point loops in tiled programs by integrating permutability, inner parallelism and stride-0/1 constraints into a single ILP. However, unlike us, they relied on a highly tuned and architecture specific SIMD code generator. Vasilache et al. [26] proposed a scheduling strategy to simultaneously model inner parallelism, contiguity and the data layout for vectorization. Similar to our work, it first built the convex affine space of all legal schedules [25] . Their formulation favors stride patterns amenable to SIMD-vectorization (stride 1 or 0) while aggregating costs of all array references. Only a handful of works resemble our dependence-oriented objectives. Zinenko's et al. [29, 30] introduced "spatial proximity" relations in a unified template to model contiguity, cache line access and avoid false sharing. Previously, Bondhugula's work [1] exploited the uniformity of dependences in stencil computations to produce "diamond" shaped tiles that maximized its concurrent start and execution.
CHiLL [4, 11, 23 ] is a polyhedral autotuning framework designed for extensive experimental exploration via the specification of transformation recipes (including parameters such as tile sizes, unrolling factors, among others). Other approaches combine (experimental) iterative and model driven schemes. Pouchet et al. [19] , explored fusion heuristics and statement interleavings that lead to substantial performance improvements. In a similar line, [20] proposed a technique that operated on the legal space to build all the possible tileable interleavings of statements.
Domain specific autotuning has also been a subject of research. Works such as [5, 13, 22] implemented autotuning frameworks for stencil computations targeting accelerators and multi-core processors. Their auto-tuning space included domain decomposition for GPUs, register blocking, NUMAaware memory allocation. Shirako et al. [21] combined polyhedral and AST transformations by using a cache line cost model, while Vasilache et al. [6] designed a mechanism for decomposing, isolating and recomposing the transformation effects of a schedule.
Conclusion
Traditional approaches for loop transformations via polyhedral scheduling attempt to find the best one-size-fits-all set of objectives to achieve good overall performance, yet the state of practice is to employ auto-tuning to ensure the best performance on different architectures. In this work we take a significantly different approach, by instead creating a common set of performance objectives, or idioms, which serve as bases for a vocabulary. Using this high-level vocabulary, we craft transformation recipes by selecting and embedding specific idioms into in a single-ILP. The objectives selected are made so to exploit both program characteristics and architectural target traits. Although building the transformation recipes entail a mix of performance engineering and a bit of empirical evaluation, we are still able to avoid, to a large extent, large experimental evaluation. We also provide several high-level intuitions and rationales that justify both the selection of performance idioms and their order of application.
Another result of our work is how simple SCoP metrics (e.g. No. of SCC or No. of self dependences) can be used to easily categorize programs. This classification, in tandem with some prior knowledge of the target architecture allows us to wisely select the adequate objectives to embed in the ILP, and more importantly, to prioritize them. We proposed novel objectives such as the trade-off between Outer ParallelismInner Reuse (OPIR), Stride Optimization (SO), Separation of Independent Statements (SIS), Dependence Guided Fusion (DGF), together with constraints specific to stencils computations that limit skewing, select the dimension at which a dependence is satisfied and how to satisfy it. Our experimental evaluation shows that strong performance can be achieved at almost no tuning cost by using the performance vocabulary presented here.
