A Performance Vocabulary for Affine Loop Transformations by Kong, Martin & Pouchet, Louis-Noël
A Performance Vocabulary for Affine Loop Transformations
Martin Kong
Brookhaven National Laboratory
Upton, New York
Louis-Noël Pouchet
Colorado State University
Fort Collins, Colorado
Abstract
Modern polyhedral compilers excel at aggressively optimiz-
ing codes with static control parts, but the state-of-practice
to find high-performance polyhedral transformations espe-
cially for different hardware targets still largely involves
auto-tuning. In this work we propose a novel polyhedral
scheduling technique, with the aim to reduce the need for
auto-tuning while allowing to build customizable and spe-
cific transformation strategies. We design constraints and
objectives that model several crucial aspects of performance
such as stride optimization or the trade-off between paral-
lelism and reuse, while taking into account important archi-
tectural features of the target machine. The developed set
of objectives embody a Performance Vocabulary for loop
transformation. The goal is to use this vocabulary, consisting
of performance idioms, to construct transformation recipes
adapted to a number of program classes. We evaluate our
work using the PolyBench/C benchmark suite and experi-
mentally validate it against large optimization spaces gen-
erated with the Pluto compiler on a 10-core Intel Core-i9
(Skylake-X). Our results show that we can achieve compa-
rable or superior performance to Pluto on the majority of
benchmarks, without implementing tiling in the source code
nor using experimental autotuning.
1 Introduction
Several program optimizations such as loop tiling and lo-
cality enhancements [3, 21], automatic coarse-grained par-
allelization [1, 15], SIMD-vectorization [12, 14, 24] and ac-
celerator targeting [2, 10, 16, 27] have been devised and
implemented in research and production compilers using
the polyhedral compilation framework [7, 9]. A typical limi-
tation of most, if not all, of these strategies is their limited
ability to adjust the program transformation approach to the
specifics of the input program, and of the target architecture.
In practice, typically knobs to control the internal heuristics
are exposed (e.g., the loop fusion scheme, the tile sizes, etc.),
and auto-tuning remains the de-facto approach to achieve
good performance on a variety of targets and programs.
This problem is compounded when considering the vari-
ability and trends of upcoming multi- and many-core pro-
cessors where the balance between the hardware parallelism
available and the data bandwidth at different memory lev-
els can very significantly differ across architectures, in turn
requiring different optimization strategies. Consequently,
performance portability for new hardware targets is often
addressed via exploring a search space of candidate transfor-
mations, or at worst by a redesign of the program optimiza-
tion scheme.
The exploration space that a compiler, auto-tuner or expert
would have to consider typically includes loop tiling (and the
associated tile size selection process), selecting a good loop
fusion/distribution structure, the loop permutation order and
often some unrolling factor.
Polyhedral compilers provide the capability to automat-
ically find an arbitrarily complex loop transformation se-
quence by reasoning instead on the scheduling of operations,
thereby alleviating the phase ordering issue of composing
loop transformations. But severe issues remain, such as how
to achieve seemingly opposing (or at least conflicting) goals:
outer parallelism, inner parallelism for SIMD-vectorization
and locality improvements. Each of these topics has been
extensively studied, and several constraints and optimization
criteria have been proposed, e.g. [1, 3, 9, 10, 12, 14, 21, 24, 27].
Nevertheless, these techniques still employ typically a one-
size-fits-all scheduling approach, suffering from a lack of
portability to different architectures.
We propose to tackle this problem by designing and im-
plementing a performance oriented vocabulary for affine loop
transformations. In this vocabulary we define several per-
formance idioms, each of which addresses a specific perfor-
mance aspect of the program. The vocabulary can then be
used to customize the set of scheduling constraints and ob-
jectives to the particularities of the program and the target
architecture. In practice, a subset of the performance idioms
are embedded into a Single Integer Linear Program (ILP)
[25]. This type of scheduling technique starts by construct-
ing the legal space of transformations. In this space, each
point represents a complete set of transformations applied
to the code.
Some of the variants in this space could look very similar,
and not differ in anything, differ only in the actual coeffi-
cients chosen (for instance a skewing factor) or be drastically
different (e.g. two fused serial loops vs. two distributed paral-
lel loops). We use simple metrics and properties of the code
to classify benchmarks into broad categories or program
classes, and then select the objectives and constraints that
must be embedded into the system. The ILP is then optimally
solved to find in a single step the final program transforma-
tion. This one-shot scheduling approach [20, 25] facilitates
the encoding of the properties that we desire to produce on
the transformed code. And essentially, this also allows to
ar
X
iv
:1
81
1.
06
04
3v
2 
 [c
s.D
C]
  9
 A
pr
 20
19
order and prioritize the cost functions according to the set
of properties expected on each classification category.
In summary, we make the following contributions: i) we
model in a single ILP numerous novel constraints and objec-
tives for simultaneously extracting outer and inner (SIMD-
vector) parallelism, trading parallelism for reuse, reference-
level stride optimization and a fusion/distribution approach;
ii) we develop a customized scheduling approach that mod-
ifies the set of objectives and constraints used in the ILP
as a function of the specifics of the benchmark and the tar-
get architecture, and provide evidence that classifying the
type of kernels according to simple metrics extracted from
affine programs can be utilized to judiciously enable or dis-
able constraints and objectives of the ILP; iii) we propose a
novel unroll-and-jam cost model that maximizes cache-line
usage and that accounts for outer parallelism and inner data
reuse; iv) we perform an extensive experimental evaluation
on a modern multi-core processor, and demonstrate that
we can achieve comparable or superior performance to the
Pluto compiler for most benchmarks evaluated, all without
resorting to source code tiling nor autotuning.
The rest of this paper is organized as follows. The next sec-
tion discusses the motivation of our work. Section 3 presents
the background concepts of the polyhedral model. Section
4 describes the ILP formulation implemented in our trans-
formation engine. Section 5 evaluates the efficiency of our
transformations. Finally, Section 6 and Section 7 discuss the
related work and the conclusions of this paper, respectively.
2 Motivation
Modernmulti andmany-core processors such as IBMPower9,
Intel KNLs and Skylakes, exhibit sufficiently distinct archi-
tectural features from their earlier multi-core predecessors.
Some of themost strikingly different features aremuch larger
L2-caches, which in Skylake’s becomes a 1MB private cache
per core. Substantially larger L3 caches are also on the rise.
Another important feature to consider is hardware paral-
lelism, i.e. the number of cores and the width of the vector
registers.
Figure 1 shows the source code of the FDTD-2D kernel
from the PolyBench/C benchmark suite. The plot on this fig-
ure represents the performance distribution of 567 different
tile configurations by using the Pluto polyhedral compiler,
and applying the wave-front parallelization strategy. These
variants were ran on a 10-core Intel Core-i9 7900X. This per-
formance looks far from impressive. As a reference, a 10243
DGEMM using LAPACK on this machine achieves 0.6 TF/s.
To achieve this performance, we have designed and de-
veloped a novel set of performance constraints and objec-
tives on a polyhedral compiler. Each of the implemented
objectives fulfills a specific role and purpose in the overall
approach. Collectively, they comprise a Performance Vocabu-
lary for Affine Transformations, and are meant to be used for a
wide variety of program classes and architectures. Although
this set is not comprenhensive yet, they can be seemlessly
combined and used to build more powerful transformations.
These are embedded into a single integer linear program
(ILP) and allow to model several relevant features such as: a)
inner-vector parallelism, b) inner-loop stride access, c) selec-
tion of outer parallel loop dimension and trade-off between
inner data reuse, d) skewing minimization for stencil com-
putations, e) automatic loop fusion/distribution selection,
f) separation of independent statements and non-flow re-
lated statements, g) true dependence distance minimization.
Unlike previous works, these objectives and constraints are
not applied indiscriminately. Instead, we carefully craft and
apply “transformation recipes” depending on a number of
metrics extracted from the input program as well as the tar-
get architecture. To give a simple example, objective (c) is not
essential for performance on a many-core processor such as
Intel’s KNL or IBM’s Power9, but it is fundamental to achieve
a high-fraction of the machine’s effective peak performance
on the Skylake processor. More importantly, our approach
does not require extensive and costly empirical space explo-
ration or autotuning in order to achieve this performance,
nor does it use the loop tiling (blocking) transformation to
achieve high-locality.
The hardware counters in (Table in Fig. 1) compare our
transformed program to the best performing pluto variant
in the space explored. Some differences are rather surprising.
For instance, the best pluto variant incurs in 5xmorememory
loads and 4.2x more stores than our transformed code. Note
in contrast to Pluto we do not use loop tiling, which often
result in code more difficult for the back-end compiler to
optimize, losing vectorization potential. Nonetheless, the
ratio between incoming and evicted data for our optimized
code is 3, 5 and 22, for the L1, L2 and L3 caches. This is a
much higher ratio than Pluto’s variant, which achieves ratios
of 2, 0.6 and 12. This means that our transformation strategy,
despite inducing a higher total volume, makes better use of
the data brought in. Next, despite a 4x higher L2 request rate,
we achieve 3x fewer misses on the L2 cache.
3 Background
Here we recap background concepts of the polyhedral model
that we use in Section 4, namely iteration domains, access
functions, dependence polyhedra, program schedules and
the notation used to model our constraints and objectives.
Iteration domains: In the polyhedral model, each syntactic
statement S is associated to a set DS ∈ Z+ comprised by
the dynamic instances of the statement. Then, via Access
functions, each iteration domain is mapped to the (multi-
dimensional) data-space of arrays referenced in them. Each
access function FA is represented as a MA × NS 2D array,
2
for(t = 0; t < NT; t++){
for (j = 0; j < NY; j++)
R: ey[0][j] = data[t];
for (i = 1; i < NX; i++)
for (j = 0; j < NY; j++)
S: ey[i][j] = ey[i][j] - 0.5*(hz[i][j]-hz[i-1][j]);
for (i = 0; i < NX; i++)
for (j = 1; j < NY; j++)
T: ex[i][j] = ex[i][j] - 0.5*(hz[i][j]-hz[i][j-1]);
for (i = 0; i < NX -1; i++)
for (j = 0; j < NY -1; j++)
U: hz[i][j] = hz[i][j] - 0.7 * (ex[i][j+1]
- ex[i][j] + ey[i+1][j] - ey[i][j]);
}
0.000
5.000
10.000
15.000
20.000
25.000
30.000
35.000
40.000
45.000
50.000
1 18 35 52 69 86 10
3
12
0
13
7
15
4
17
1
18
8
20
5
22
2
23
9
25
6
27
3
29
0
30
7
32
4
34
1
35
8
37
5
39
2
40
9
42
6
44
3
46
0
47
7
49
4
51
1
52
8
54
5
56
2
GF
LO
P/
s
Tiled Variants
FDTD-2D Double Precision - 20 x 1024 x 1024
Pluto-SKX Our-SKX
Metric our pluto
Load to store ratio 3.1368 3.6534
Retired mem.loads 1.14E+7 5.65E+7
Retired mem.stores 3.69E+6 1.55E+7
L2 LINES IN 9.86E+5 7.08E+5
L2 to L1 load [GBytes] 0.4298 0.1745
L1 to L2 evict [GBytes] 0.1327 0.0866
L3 to L2 load [GBytes] 0.0631 0.0453
L2 to L3 evict [GBytes] 0.0112 0.0745
System to L3 [GBytes] 0.0585 0.0371
L3 to system [GBytes] 0.0026 0.0030
L2 request rate 0.4303 0.1045
L2 miss ratio 0.0806 0.2378
Vectorization ratio 94.0256 0.0022
Operational intensity 0.3782 0.4626
Avg stall duration [cycles] 25.2594 15.8386
Figure 1. FDTD-2D kernel: [left] source code; [center] performance distribution of tiled variants on Core-i9 (SKX); [right]
hardware counters comparison of best Pluto tiled variant and our apporach, on a 10-core i9 (SKX)
where MA is the number of space dimensions of the array
and NS is the number of loops surrounding statement S. In
our FDTD-2D example, the first read reference of array ex
in statementU is represented as
F ex ( ®x ) =
[
0 1 0 0 0 0 0
0 0 1 0 0 1 0
]
(t i j NT NX NY 1)T =
[
i
j + 1
]
Dependence polyhedra embody the semantic orderings
of a program. Every program dependence in a SCoP is repre-
sented by one or more dependence polyhedra DR,S . These
polyhedra define the ordering among points ®xR and ®yS from
the iteration domains DR and DS , respectively. Lastly, pro-
gram transformations are represented in the polyhedral
model by scattering functions. These functions assign to
each point of the iteration domain DS a timestamp which
determines its execution order. Timestamps can be multi-
dimensional and are implemented as matrices. The number
of rows of the matrix must be defined a priori. For this work
we use the 2d+1 representation, which assumes that the scat-
tering matrix will consist of 2d+1 rows and d+1 columns,
where d is the maximum loop depth. The advantage of this
representation is its natural mapping to the syntactic nest-
ing structure. Assuming the schedule ie zero-offset, the even
rows are known as scalar dimensions and represent the nest-
ingness and order of statements and loops, whereas the odd
dimensions represent linear transformations (e.g. skewing
or loop permutation) applied to the input code. Below we
show the scheduling matrices for statements R and S of our
FDTD-2D example.
ΘR =
©­­­­­­­«
0 0 0
1 0 0
0 0 0
0 1 0
0 0 0
0 0 0
0 0 0
ª®®®®®®®¬
(t
j
1
)
=
©­­­­­­­«
0
t
0
j
0
0
0
ª®®®®®®®¬
T
, ΘS =
©­­­­­­­«
0 0 0 0
1 0 0 0
0 0 0 1
0 1 0 0
0 0 0 0
0 0 1 0
0 0 0 0
ª®®®®®®®¬
©­­«
t
i
j
1
ª®®¬ =
©­­­­­­­«
0
t
1
i
0
j
0
ª®®®®®®®¬
T
Notation In the rest of this paper, R and S denote arbitrary
statements, NS is the number of statements in the SCoP,
F , FA are generic access functions, d and dim(θ ) are the No.
of rows in θ , dim is a function that returns the number of
loops surrounding S and the number of space dimensions
in F, IS is the identity schedule for statement S. A linear
dimension of θ is any row-i s.t. i mod 2 = 1, while for
a scalar dimension this is i mod 2 = 0. Lastly, βS is the
portion of the schedule corresponding to the last column of
θS .
4 Transformations
Prior approaches for polyhedral scheduling typically aims
for the best one-size-fits-all formulation: irrespective of the
input program, the same scheduling problem is solved [3, 14].
Our work takes a significantly different approach: instead
of attempting to find the best unique set of constraints/ob-
jectives to optimize for, we instead develop a catalogue of
individual smaller scheduling formulations, each targetting a
specific performance objective. We then select, as a function
of the input program, which of these shall be used, and in
which order of priority they shall be solved.
We consider the following optimization objectives: outer
parallelism, inner (SIMD) parallelism, parallel dimension
selection and inner reuse tradeoff, stride access minimization
and stencil skewing parallelization, independent statement
separation and dependence guided fusion. We next briefly
recap the single-ILP legal space construction we build our
work on [14, 20], before detailing the formulation of these
individual objectives.
In this work we seek to avoid the hassle of experimental
exploration very often required in HPC and in polyhedral
compilation frameworks. In particular, as we will see in Sec.
5, we intentionally compare a single non-tiled variant to an
autotuned space which includes 3 distinct fusion and heuris-
tics and several tile sizes for each of these (Note: a fully
integrated approach to combine our current goals plus the
tiling constraints are left for future work). The nature of our
main goals, i.e. achieving outer and innermost parallelism
in addition to good locality usually embody opposing and
conflicting goals. A scheduling approach such as Feautrier’s
[8, 9] which maximizes the dependence satisfaction at the
outermost scheduling dimensions directly favor schedules
which could potentially have multiple degrees of inner par-
allelism. Nevertheless, it also impacts the generated code
3
by very often resulting in skewing transformations. On the
other hand, the state-of-the-art Pluto polyhedral compiler
and its tiling hyperplane method in order to find maximal
tiling hyperplanes, push the dependence satisfaction into
the innermost dimensions. This commonly results in tiled
codes which are inherently serial. Approaches such as [14]
take a decoupled approach which combines tiling and expos-
ing coarse grained parallelism proceeded by a second stage
which restructures the full and partial tiles of the parametri-
cally tiled code. As we will see, the same set of constraints
and objectives are not always necessary to achieve good
performance. A good and simple metric and few decisions
that select that nature of the constraints to be embedded
should suffice to explore the legal space which in itself is an
optimization / tuning space.
4.1 Building the legal space
We first construct the convex set of semantics preserv-
ing transformations [20, 25]. The ILP system built for this
purpose integrates legal constraints for all dependence poly-
hedra of the original program. Given a point ⟨xR ,yS ⟩ ∈ DR,S ,
the semantics of the original program state that the state-
ment instance xR executes before yS . Thus, in the trans-
formed program, the schedule applied to both of these state-
ments, ΘR and ΘS must preserve the ordering condition
ΘR (xR ) ≺ ΘS (xS ), which states that the timestamps assigned
to them must maintain their relative order. As we use multi-
dimensional schedules of depth 2d+1, this means that the
difference between the two schedules will be a lexicographic
positive vector: ΘS (yS ) −ΘR (xR ) ≻ (δ1,δ2, ..,δ2d+1). In prac-
tice, we define 2d+1 δ variables for each dependence poly-
hedron, and bound each δDR,Si , i ∈ {1, .., 2d + 1} to the set
{0,1}, making it a boolean variable. This way, a dependence is
satisfied when any of the δi variables become positive. These
constraints are formalized as:
∀DR,S , ∀l ∈ 0..d − 1, δDR,Sl ∈ {0, 1}, ∀DR,S ,
d−1∑
l=0
δ
DR,S
l = 1
∀DR,S , ∀l ∈ {0, .., d − 1}, ∀⟨ ®xR, ®yS ⟩ ∈ DR,S :
ΘSl ( ®xS ) − ΘRl ( ®yR ) ≥ −
l−1∑
c=0
δ
DR,S
c · (K · ®n + K ) + δDR,Sl
(1)
In order to not over-constrain the legal space, as soon
as a dependence is satisfied at some schedule level l , we
nullify the constraints for all subsequent dimensions c > l .
This is achieved in Equation 1 by the term K · ®n + K ,K ∈ Z.
Coefficients θSi, j are bounded, for a sufficiently large K [20],
this expression upper bounds the time elapsed between any
two statement instances. Thus, when δl = 1, for some l ,
the sum of the right-hand side of the inequality becomes in
practice −∞. On the contrary, all levels c < l would have
δc = 0, so the sum becomes zero, making the constraint valid.
Lastly, we apply the affine form of the Farkas Lemma [9] to
compute all the non-negative functions over the candidate
system. Additional constraints and objectives can then be
embedded to then find good program schedules [9, 14].
4.2 Outer Parallelism (OP)
As we will see, in some scenarios it’s obvious that extracting
OP is possible (e.g. DGEMM), while in others (e.g. LU), is
necessary to use the second outermost loop. In both cases
we use the δ variables embedded into the convex space, and
minimize their sum at a predefined level selected in the
following fashion:
min
∑
DR,S
δ
DR,S
p : i f NSCC ≥ Nsel f .dep p = 1 else p = 3 (2)
Dimension-p is parallel when this sum is 0. However, a non-
zero sum doesn’t necessarily preclude extracting parallelism.
4.3 Stride Optimization (SO)
Tomaximize the vectorization potential and program locality
we drive the selection of transformations to minimize high-
strides of innermost loops. The program objective adds the
costs of all individual statements that have dimensionality
2 or higher, while the statement stride cost is built from
a weighted sum of its references. Unlike previous works
[14, 26, 28] we don’t aim to maximize stride-1/0 references.
Rather, we aim to minimize high-stride penalties as a whole
by using higher penalties for stride-0 references than for
stride-1, and doubling the penalty for write references. The
reason for this is that if the iterator originally bound to the
fastest varying dimension (FVD) of the array does not appear
in θSd−2, then some other iterator will take its place, thereby
inducing a high-stride if the new iterator appears in any non-
FVD. This is possible when, for instance, a loop permutation
is applied.
To minimize high-strides we compute a set of weights for
each loop iterator and reference. These weights represent the
stride cost when the loop associated to a designated iterator
is moved into the innermost loop dimension:
∀S, ∀it ∈ 0..dim(S ) − 1 :W (S, it ) =
∑
F ∈S
W (F , it ) . P (F )
W (F , it ) =

1 : it ∈ FV D(F );
3 : it < F ;
10 : it ∈ F ∧ it < FV D(F )
 , P (F ) =
{
1 : F is read
2 : F is wr ite
}
∀S : cost (S ) =
dim(S )−1∑
k=0
θ Sd−2,k .W (S, k )
Next, we create two new variables per statement, one that
minimizes the sum of θ coefficients, and another to aggregate
the stride cost. These variables are placed in the leading
position of our system to give them the highest possible
priority. Thus, the objective for each statement becomes:
min
{dim(S )−1∑
k=0
θSd−2,k ,
∑
S
cost(S)
}
(3)
4
4.4 Inner Parallelism (IP)
Inner parallelism is modeled in a manner analogous to outer
parallelism, except for the schedule dimension(s) correspond-
ing to inner loops. The number of loop-carried dependences
at the inner-most loop level shall be 0 if possible to achieve a
parallel loop, which means they shall be carried at outer lev-
els [8]. We only seek inner-parallelism when the loop depth
is 3 or more, as 1D and 2D parallel loop nests are already
covered by the outer parallelism case. Note an outer-parallel
loop can always be sunk inner-most, possibly after using
strip-mining [3].
min
∑
δDR,S
δ
DR,S
2d−1 (4)
4.5 Outer Parallelism and Inner Reuse (OPIR)
To model the performance trade-off between parallelism and
reuse we focus on the schedule dimensions of each statement
S that model loops (i.e., the odd rows), and bind these to the
spatial dimensions of each array reference F ∈ S . In practice,
this novel formulation usually boils down to finding the best
loop permutation that optimizes reuse while simultaneously
selecting the best schedule dimension for outermost paral-
lelism. We construct a set of auxiliary functions for each
non-scalar reference:MF1×dim(S ), G
F
d×dim(S ) and R
F
1×dim(S ).
∀S, ∀F ∈ S, k ∈ 0..dim(S ) − 1 : M Fk =
dim(F )−1∑
i=0
|Fi,k |
∀S, ∀F ∈ S, ∀i ∈ 0..min{dim(F ), dim(S )} − 1, ∀j ∈ 0..dim(S ) − 1 :
G(F , M F )i, j =

M Fj : M
F
j > 0 ∧ Fi, j , 0
−1 : M Fj > 0 ∧ Fi, j = 0
0 : otherwise

∀F , ∀j ∈ 0..dim(S ) − 1 : R(M F )j =
{ ⌊d/2⌋ − j : M Fj > 0
0 : otherwise
}
The above matrices are constructed for each array refer-
ence F ∈ S in the SCoP. Their roles are the following: matrix
MF computes weights of the iterators on reference F , matrix
GF represents the parallelism component and its relation
with F ’s space dimensions, and RF embodies the potential
reuse in an array w.r.t. the schedule of statement S:
Bounds on the schedule θS are computed by taking the
identity schedule (IS ) as the reference, and schedule rows of
non-full dimensional statements are explicitly padded. Not
doing so allows the scheduler to pad any of the outer rows,
disrupting our objective. These constraints are shown below:
∀S, k ∈ 0..d − 1, k odd : ©­«
dim(S )−1∑
j=0
I S2k+1, j
ª®¬ − ©­«
dim(S )−1∑
j=0
θ S2k+1, j
ª®¬ ≥ 0
∀S, dim(S ) < d, ∀i ∈ 0..d − dim(S ) − 1, ∀j ∈ 0..dim(S ) − 1 :
θ S2i+1, j = θ
S
2(d−dim(S ))+1, j
Q: between (P)arallelism and (R)euse We now have all
the necessary ingredients to complete our formulation for
trading parallelism for inner data reuse. For each (non scalar)
array reference F (appearing in some statement S) and linear
scheduling dimension i , we create a pair of variables, Q+Fi
and Q−Fi . These variables relate three aspects of our formu-
lation: parallelism, schedule-to-data-space mapping and the
reuse potential. The first component is represented in our
formulation by the 1 − δDS,Si term, which equals 0 when a
dependence is satisfied at schedule dimension i . The second
component makes use of the auxiliary matrix G(F ,MF )i, j .
Its role is two-fold: i) mapping outer loop dimensions of θS
to spatial dimensions of array F ; ii) reward or penalize trans-
formations depending on the appearance of specific iterators
in F and that are used (or not) in θS . The third component
further rewards transformations where the loop iterators
that appear in θS are not used in reference F , with higher
rewards given to outer loop dimensions (see definition of
array RFj above). This essentially is the reuse component of
the formulation. Lastly, the paddinд variable is computed
per statement, and is only required when statement S is not
full-dimensional (i.e. dim(θS ) < d).
∀S, F ∈ S, C ≡min {dim(S ), dim(F )} − 1, i ∈ 0..C :
Q Fi ≤
(
1 − δDS,S2i+1
)
+
©­«
dim(S )−1∑
j=0
G(F , M F )i, j . θ S2i+1, j ª®¬ +©­«
dim(S )−1∑
j=0
C∑
k=i+1
R(M F )j . θ S2(paddinд+k )+1, j
ª®¬
Here we give a very brief example. Suppose we have the
statement C[i][j] += A[i][k] * B[k][j] from DGEMM,
and that row-5 of θS has been set to [0,1,0] by the previous
objectives. The M-matrices for C, A and B would be [1,1,0],
[1,0,1] and [0,1,1], respectively. Similarly, the G-matrices
would be [1,-1, 0;-1,1,0], [1,0,-1;-1,0,1] and [0,-1,1;0,1,-1]. The
reader can notice that each row of GF can also be obtained by
using MF as a mask over each row of F: if the corresponding
values are positive, the value of M is used; if M is positive
and F is 0, then -1 is set in G; for any other case, then entry
of G becomes 0. The R-matrices simply represent the weight
of each schedule dimension. Hence we have RC = [3, 2, 0],
RA = [3, 0, 1] and RB = [0, 2, 1]. All these factors are used
in the above constraint. The reader can observe that when
δ1 = 0 (outermost parallel loop), the first two rows of θS
become [1,0,0;0,0,1]. On the other hand, when δ1 = 1, θS
becomes [0,0,1;1,0,0]. Between these two options, the latter
is the one that maximizes the overall cost of the statement
(See next set of constraints). This pushes up the values of its
QFi , and in turn, make δ1 = 1.
When a statement carries no self-dependence, the 1 − δ
term is removed, and the constraints attempt to find the best
loop order that minimizes the number of high-strides in state-
ment S. Next, the cost functions of parallelism-vs-reuse are
aggregated per statement S . Each Q−S and Q+S are bounded
5
by a compile-time constant that varies per statement (UBS ):
∀S : U BS =
∑
F
C∑
i=0
(2 + ⌊d/2⌋ − i) ∧Q+S =
∑
F
∑
i
Q Fi ∧
0 ≤ Q−S , Q+S ≤ U BS ∧Qproд =
∑
S
Q−S
Finally, the overall program is optimized by minimizing the
sum ofQ−S variables intoQproд and inserting it into the lead-
ing position of the current system. This, in turn, maximizes
the values of the Q+S variables in our formulation:
min
{
Qproд
}
(5)
4.6 Dependence Guided Fusion (DGF)
To achieve even better locality, we enhance our formula-
tion with constraints and objectives for fusion driven only
by inter-statement flow-dependences (ISFD). WAR and WAW
dependences are not considered to alleviate pressure on the
back-end register-scheduler (In the next section we actu-
ally increase the lexicographic distance between statements),
while we consider RAR dependences to be unprofitable in
our scheme unless full fusion is achieved. We thus exclude it
as well.
The equations below are embedded in the ILP for each
candidate dependence in the program. We skip all other
types since these have little influence in the producer-to-
consumer distance between two statements in a dependence
relation. Before constructing the below constraints, we build
the strongly connected components (SCC) graph, and query
it to decide if two statements are separable This step is inde-
pendent of the actual fusion structure of the input program.
The focus of this objective is to capture inter-statement reuse,
since our OPIR and SO objectives already account for intra-
statement reuse. Moreover, we have observed that intra-SCC
+ inter-statement reuse is not beneficial since the statements
will already share the outermost loop dimension. In practice,
this is not a problem unless using extremely large problem
sizes (e.g. 32KB or 64KB on a single array dimension).
∀DR,S , R , S, FLOW (DR,S ) ≡ T rue, SCC(R) , SCC(S ),
dim(R, S ) ≡min {dim(R), dim(S )} − 1 :
∆DR,S =
dim(R,S )∑
i=0
(βS2i − βR2i ) .Wi
∀∆DR,S : 0 ≤ ∆DR,S ≤ No .stmts . 2⌈d/2⌉+1
∀i ∈ 0..dim(R, S ) :Wi = 2⌈d/2⌉−i−1
We assign to each component of ®W a distinct power-of-2
weight. A similar scheme was first used by Vasilache [25]. If
an ISFD involves statement R and S, then the above equation
assigns a higher cost to the outer scalar dimensions. This cost
function is minimized by bringing closer the scalar dimen-
sions of the respective statements R and S. Clearly, when two
statements share the same scalar coefficient at some level
k, the term associated to this level becomes zero. As a final
touch, when the array reference of the flow dependence is
simultaneously a write-reference of the dependence target,
then we double the cost of every term in ®W . Finally, we min-
imize the sum of ∆ variables and insert the sum variable in
the leading position of the current system:
min{
∑
DR,S
∆DR,S } (6)
4.7 Separation of Independent Statements (SIS)
A weakness of dependence distance optimization schemes
based on the minimization of the maximal reuse distance
[3] is the inability to distinguish between dependences that
affect performance and those that do not. Fusion of unre-
lated statements can be very detrimental to performance as
these can essentially flush the cache. Thus, we introduce
the concept of independence distance. The intuition is the
following: dependence distance minimization attempts to
bring closer all statement instances in a dependence relation,
zipping them together from the outermost scalar dimension
to the innermost. Now, as a given statement S can be the
sink of several dependences (RAW, WAR, WAW and RAR),
the statement that ends up closer to S could be practically
anyone which is a source of a dependence. This order will
ultimately be determined by several factors including the
order in which the dependences are processed.
IR,S =

T rue : (FLOW (DR,S ) ≡ T rue∧
R , S, SCC(R) , SCC(S )) ∨ (¬∃DR,S )
False : otherwise

∀S : 0 ≤ βS0 ≤ NS ∧
∑
S
βS0 ≤ NS × (NS + 1)/2
∀R, S, IR,S ≡ T rue, R < S : 0 ≤ ∇−R,S , ∇+R,S ≤ S − R
∇−R,S + ∇+R,S = S − R ∧ ∇+R,S = βS0 − βR0
∇sum =
∑
∇−R,S ∧ 0 ≤ ∇sum ≤ NS
To maximize the good fusion potential in the program
we increase the distance between independent statements.
Coupling SIS with the DGF objective has proven to produce
robust schedules with good locality in our evaluation. We
follow three criteria to decide when to introduce our separa-
tion constraints and objectives: i) ∃DR,S and it is not a flow
dependence, or ¬∃DR,S ; ii) R , S ; and iii) both R and S do
not belong to the same SCC. (Note: in the equations below,
R and S are both statements as well as natural numbers with
values equal to their program order location):
We build a set I with all the pairs of statements which
fulfill the above conditions. The above equation imposes an
order between statements R and S. This direction is chosen
from their relative order in the program, i.e. if S appears after
R or if R appears after S. Finally, the overall independence
distance cost function is assembled from the cost of all pairs
(R, S) ∈ I with:
min {∇sum} (7)
6
4.8 Parallelism on Stencil Computations
Traditionally, parallelism in time-iterated stencil computa-
tions has been primarily achieved through skewing transfor-
mations (to enable wavefront parallelization). The approach
we take is to decompose the stencil optimization into 3 tasks:
i) Stencil Dependence Classification (SDC), ii) Stencil
Parallelism Constraints (SPAR) and, iii) Stencil Mini-
mization of Vector Skewing (SMVS).
Stencil Dependence Classification (SDC) We carefully
push dependence satisfaction into specific scheduling di-
mensions based on the type of dependences encountered.
The classification applied is quite simple and stems directly
from the inherent structure of the stencil. The cases that
we consider are the following: i) Non-self forward depen-
dences (NSFD): satisfy at the scheduling level corresponding
to the time loop (outermost linear dimension); ii) Non-self
backward dependence (NSBD): satisfy at some inner scalar
dimension; iii) Self-dependences and No. statements > 1
(SDN): satisfy at the second linear schedule dimension (cor-
responds to the first spatial dimension) iv) Self-dependences
and No. statements = 1 (SD1): can only be satisfied by the
linear scheduling dimensions, proceed greedily preferring
outer linear dimensions .
MULT I_SKEW ≡ T rue : Ψ+1 =
∑
DR,S ∈NSFD
δ
DR,S
1
MULT I_SKEW ≡ T rue, ∀k ∈ 0..d − 1, k even :
Ψ+k =
∑
DR,S ∈NSBD
δ
DR,S
k ∧ βSk − βRk − δ
DR,S
k ≥ 0
∀k ∈ 0..d − 1, k odd, k , 3 : Ψ+k =
∑
DR,S ∈SD1
δ
DR,S
k
Ψ+3 =
∑
DR,S ∈SDN
δ
DR,S
3 ,
d∑
k=0
Ψ+k = No .deps
∀k ∈ 0..d : Ψ−k + Ψ+k = No .deps ∧ 0 ≤ Ψ−k , Ψ+k ≤ No .deps
With this simple classification we embed the above con-
straints and objectives, and link the δk variables to Ψk vari-
ables. The latter are inserted in the leading position of the
system from the outermost to innermost dimension: Above
we have used the predicate MULTI_SKEW := No.cores <
2 × OPV , OPV = operations per vector , which allows to
adjust the degree of skewing w.r.t the time dimension. The
previous constraints allow to decide where to satisfy depen-
dences. Next, we will discuss how to satisfy them.
Stencil Parallelism Objectives (SPAR) We now explain
how must inter- and intra-dependences be satisfied. In the
intra- case, we apply a shift along the time dimension. This
boils down to peeling an iteration from the dependence
sources. In addition, it also serves the purpose of prefetching
the data for subsequent iterations. Next, we enforce a fixed
shift along the first space dimension of the stencil when the
number of cores is “too large”, and avoid resorting to skew-
ing the iteration space in this scenario. The motivation is to
avoid large skewing factors (or completely avoiding them)
to minimize the wavefront start-up time required.
∀DR,S , R , S, I S_FLOW (DR,S ) ≡ T rue : βS1 − βR1 ≥ 1
∀DR,S , R , S, I S_FLOW (DR,S ) ≡ T rue,
MULT I_SKEW ≡ False : βS3 − βR3 ≥ 2 ×OPV
Intra-statement dependences, or self-dependences, can
only be handled by manipulating the coefficients of the lin-
ear scheduling dimensions. Our approach favors transfor-
mations where the skewing degree reduces level-by-level,
from row-1 to row d-2, and introduce a constraint to push
the skewing factor of the first linear dimension above a spe-
cific threshold (the number of full-dimensional statements,
card(FDS)). This last constraint is nullified when the state-
ment in question is not full dimensional. Moreover, we en-
force that the skewing is performed only w.r.t each space
dimension of the stencil:
∀S, MULT I_SKEW ≡ T rue, k ∈ 0.. ⌊d/2⌋ − 2,
min_dist := {1 : k > 0; 0 : otherwise } :
©­«
dim(S )−1∑
j=0
θ S2k+1, j
ª®¬ − ©­«
dim(S )−1∑
j=0
θ S2(k+1)+1, j
ª®¬ ≥ min_dist ∧
dim(S )−1∑
j=0
θ S1, j ≥ card (FDS ) − BIG_K .(d − dim(S ))
∀S, k ∈ 1..dim(S ) − 1 : θ S2k+1,k ≥ 1
The next step is to decide if skewing is in fact necessary
to satisfy the self dependence. If δDS,S3 is set to 1, then we
force a time skewing on the same dimension, otherwise, the
time coefficient of the first spatial dimension can remain as
zero. The reader can observe that if the skewing at level-3
is initiated, this will force the first linear dimension to have
a larger skewing factor. The ∆−SUM variable is inserted in
the leading position of the current system, while ∆+SUM is
appended at the end.
∀S, DS,S : θ S3,0 − δDS,S3 ≥ 0
∆−SUM + ∆
+
SUM =
∑
δ
DS,S
3 ∧ 0 ≤ ∆−SUM , ∆+SUM ≤ No .sel f .deps
Stencil Minimization of Vector Skewing (SMVS) Fi-
nally, to avoid the detrimental skewing transformations that
derive in high-stride accesses, we create a cost function that
minimizes that skewing factor of the iterator and scheduling
dimension associated to the fastest-varying dimension of
the dominant array of each statement. To limit the vector
skewing, we use a set of ΦS variables which are inserted in
the leading position of the current system. We call this the
"Stencil Minimization for Vector Skewing (SMVS)",
∀S : ΦS =
dim(S )−1∑
j=0
θ S2(d−1)+1, j +
d−2∑
k=0
θ S2k+1,dim(S )−1 (8)
7
4.9 Skewed Parallelism (SKEWPAR)
A loop is parallel iff it carries no dependence. To model paral-
lelism, for each statement S and scheduling dimension k , we
use one binary variable πSk ∈ {0, 1}. These variables are asso-
ciated to all δR,Sk and δ
S,R
k dependence satisfaction variables
that involve them. We recall that when a dependence DR,S
is satisfied at some schedule level k, the corresponding δR,Sk
is set to 1. In accordance, a loop dimension surrounding a
statement S can be parallelized when δR,Sk = 0 for all possible
δ . We thus upper bound each πSk in the following manner:
∀k ∈ {0..d − 1}, ∀δR,Sk : πRk ≤ 1 − δR,Sk ∧ π Sk ≤ 1 − δR,Sk
The benefit of using the introduced πSk variables is to cen-
tralize the parallelism property and to simplify several of the
proceeding ILP constraints and performance objectives. Con-
sider for instance, the statement C[i][j]+ = A[i][k] ∗ B[k][j]
in the gemm kernel. Its schedule will have 7 dimensions. The
cannonical loop order for this benchmark, (i,j,k), will exhibit
two outermost parallel loops, with π1 = 1∧π3 = 1∧π5 = 0. If
however, a transformation seeks inner parallelism for SIMD-
vectorization, then π3 = 0 ∧ π5 = 1.
It’s not always possible to extract outermost parallelism.
The alternative is to focus on extracting this property at
the second outermost loop dimension, one surrounded by
an outer serial loop (as in the cholesky benchmark). As a
precondition, we require the π variables to be previously
inserted. In the event that no other performance objective
has previously created them, we do so here. To structure
the code so as to have parallel loops at the second linear
schedule dimension, we optimize 3 cost functions: i) maxi-
mization of dependence satisfaction at schedule dimension 1;
ii) minimization of sum of coefficients (to minimize potential
skewing transformations induced by (i); iii) maximization of
π values at the second linear dimension (e.q. the second out-
ermost loop). The combination of these three cost functions
allows the scheduler to choose between loop permutations
and skewing transformations to maximize sync free loops
at the second outermost level. Eq. 9 summarizes these objec-
tives:
{
max
∑
δ
DR,S
1 ,min
∑
S ∈Ci
dim(S )−1∑
j=0
θS1, j ,max
∑
S ∈Ci
πS3
}
(9)
4.10 Space Narrowing (SN)
In some cases, particularly whenwe don’t expect to be able to
extract multiple levels of parallelism and to simplify the ILP
solving process, we preset some scalar schedule coefficients
and restrict the maximum values of the linear coefficients.
The former has absolutely no impact on the correctness of
the schedule, whereas the latter only limits potential skewing
transformations.
∀S, NSCC = 1 : βSd−1 = S ∧ βS0 = 0
∀S, NSCC = 1, ∀j ∈ 0..dim(S ) − 1 : ΘS1, j = I S1, j
∀S, ∀ odd i, ∀j ∈ 0..dim(S ) − 1 : ΘSi, j ≤ 2
4.11 Resource Constrained Optimal Unrolling
(RCOU)
The final step we take is a quick analytical exploration to
find optimal unroll and jam factors. This pass is applied after
the polyhedral code generation process, i.e. once we recover
the classical AST representation. At this stage, parallel and
permutable loops are marked, and we collect some statement
and reference metrics:
r esourceSj =
∑
F ∈S
dim(F )−1∑
i=0
|Fi, j |, r euseSj =
∑
F ∈S
|Fdim(F )−1, j |,
writeSj =
{
1 : j appears in F ∧ F is wr ite ref erence
0 : otherwise
}
Then we proceed as follows. Program P consists of a set of
disjointmulti-dimensional loop nests {l1, l2, . . . , ln}. For each
loop nest we first check if its unrollable. The conditions for
deeming a loop nest unrollable are: i) must have at least one
parallel or permutable loop dimension with constant bounds;
ii) li can be imperfectly nested, but statements should only
appear in innermost loops, i.e. li can have multiple innermost
loops. If li is unrollable we proceed to build the resource,
reuse and write matrices for all the statements contained
in li . Next, we produce a list of nodes contained in li , and
set the exploration space of unrolling factors for each loop
dimension to the set {1, 2, 4, 8, 16}. If some loop dimension
in li doesn’t meet the criteria of condition (i), then the set
becomes the singleton {1}. The space of candidate unrolling
factors for the entire loop nest li will be the cartesian product
UF1 ×UF2 × . . .UFn , where n is the number of loops in the
loop nest.
Function explore_space (Algorithm 1) evaluates each point
in space by computing both its resource usage (the number
of array references contained in each inner loop of li , and
the reuse yielded by the candidate unrolling factors. The re-
source usage is product-wise accumulative w.r.t the unrolling
factor of each loop surrounding pair {statement, reference}.
For example, if reference A[i][k] is surrounded by 3 loops,
i, j,k , and this are unrolled by {2,4,3}, then the resource us-
age increases from 1 to 2 × 3. In addition, our cost model
favors unrolling of non-innermost loops (the innermost loop
already has inherent reuse) and dimensions which impact
write-references, while also penalizing unrolling of loops
that induce high-strides on array references. Once the opti-
mal set of unrolling factors has been determined, a new loop
nest l ′ is produced by unrolling and jamming every loop in l
by opt_u f . The new program results from concatenating all
the unrolled loop nests. Here we make two notes: i) in our
experiments, we set N_VEC_REG to 32; ii) the purpose of
8
Algorithm1: explore_space(space,T,resource,reuse,write)
Input: space: unrolling factors to consider; T: loop nest to unroll;
resource, reuse, and write matrices
Output: optimal unrolling factors for all loops in loop nest T
1 opt_UF = {1 × 1 × . . . × 1}; opt_reuse = 0;
2 for each UF ∈ space do
3 val_resource = 0; val_reuse = 0;
4 for each inner loop Li ∈ loop_nest do
5 iters = iterators of loops from Li to root of T;
6 for each statement S ∈ Li do
7 for each array reference F ∈ S do
8 resource_F = 1;
9 for each it ∈ iters do
10 if (it appears in F) then
11 resource_F *= UF(it);
12 if (it is iterator of inner loop) then
13 val_reuse -= UF(it) × (resource[S][it] - reuse[S][it]);
14 else
15 val_reuse += (MAX_DEPTH - depth(it) + 1) × UF(it) × ( 3
× reuse[S][it] + write[S][it]);
16 if (∃ iterator jt ∈ iter s s.t. it→ jt or jt→ it) then
17 val_reuse = −∞;
18 val_resource += resource_F;
19 if (
∏
i UF(i) ≥ N _V EC_REG/2) then
20 opt_reuse = −∞;
21 if (val_resource ≤ N_VEC_REG and val_reuse > opt_reuse) then
22 opt_UF = UF; opt_reuse = val_reuse;
23 return opt_UF ;
bounding the product of unroll factors by N_VEC_REG/2 is
to account for two FMA units on the SKX processor.
4.12 Putting it all together
Not all the constraints presented here are always embed-
ded into the ILP. The main conclusion of this work is that
when using a single ILP to find a transformation schedule,
resorting to an exhaustive empirical evaluation should not be
needed, but additional information on the target architecture
and program are needed to select the right objectives and
their optimization order. We thus adopt the classification
strategy shown in Eq. 10, where we use SCoP metrics to
classify the program into one of four types: stencils (STEN),
2-dimensional kernels (LDLC), dense linear algebra (HPFP),
and other. Themetrics we use are the number of dependences
in the SCoP (Ndep ), the number of scheduling dimensions
(dim(Θ)) and the number of self-dependences (Nsel f .dep ).
The function is_stencil returns True when at least half of the
statements in the SCoP refer to at least 2 neighboring points
in the grid.
class(proд) =

ST EN : if(is_stencil (proд) ∧ Ndep ≤ 3 × dim(Θ))
LDLC : elseif(dim(Θ) ≤ 5)
HPFP : elseif(NSCC ≥ Nsel f .dep )
OTHER : otherwise

(10)
Once we have determined the class of a program, we se-
lect and order the performance objectives depending on the
target architecture. Table 1 lists the performance objectives
embedded into the ILP for each target platform and class
of benchmarks. When necessary, some objectives are em-
bedded only when the number of dependences is under a
particular threshold.
Class Priority of Performance Objectives (+←→ −)
STEN SMVS, SDC, SPAR
LDLC SO, IP, OPIR, SIS, DGF, OP
HPFP {SO, IP, OPIR}(Nsel f .dep ≤ NSCC ), SIS, DGF, OP
OTHER SO (Ndep < 50), OP, SN
Table 1. Cost functions set and priority per program class
The overall selection of performance objectives, and more
importantly, their order of application is mostly determined
by a mix of the target architecture, specific traits of the
program and the “hardness” of the objective. Table 2 recaps
the peformance idioms described together with the part of
the schedule they affect and the purpose/goal of the idiom.
The rule of thumb followed to set the order of objectives is
to give higher priority to more specific performance idioms.
In practice, this translates to placing first the objectives that
affect fewer schedule dimensions. For instace SMVS, SO, and
IP only affect the innermost schedule dimensions, whereas
OPIR affects all the linear dimensions except the innermost.
Intuitively, it is easier to extract outer croase-grained par-
allelism than innermost/vector-SIMD parallelism because
we have more dimensions to play with. The SDC and SPAR
constraints and objectives follow the same rationale. On the
other hand, objectives related to locality, SIS and DGF, impact
exclusively the scalar dimensions. Moreover, as d + 1 of the
2d + 1 schedule dimensions are scalar, these set of transfor-
mations are “easier” than extracting inner parallelism with
IP.
Aside of the above rules of thumb, a priori knowledge of
the schedule maneuverability is also used to select the objec-
tives. For instance, iterative stencil computations will consist
typically of a single Strongly Connected Component (SCC). As
such, attempting to separate unrelated statements with the
SIS objective is unnecessary. In a similar fashion, applying
the DGF idiom will likey reduce the potential parallelism or
could make it much harder to extract both coarse and fine-
grained parallelism. We note here that in the stencil case, the
outermost linear dimension of the schedule will represent a
linear combination of the original loop iterators. Thus, this
loop dimension will be inherently serial. Clearly, these objec-
tives are not needed for the HPFP and LDLC program classes;
both of these strongly benefit from the locality objectives
(SIS and DGF). However, the LDLC program class has only
two linear schedule dimensions, and are very sensitive to
9
Performance Schedule Goal
Idiom Target
SMVS Last linear dimension Minimize skewing that could affect the vectorization potential
SDC Two outermost linear dimensions; 2nd scalar dimension Map dependence satisfaction to specific dimensions
SPAR Two outermost linear dimensions Produce a linear combination on outermost linear dimension;
free the 2nd linear dimension to extract coarse-grained parallelism
SO Innermost linear dimension Minimization of high-strides
OPIR All linear dimensions except the innermost Trade parallelism for data reuse
OP Outermost linear dimension Attempt to parallelize the outermost loop dimension
IP Innermost linear dimension Maximize inner parallelism by minimizing
∑
δ
SIS Outermost scalar dimension Increase the difference between the values of βR and βS
DGF All scalar dimensions Minimize the lexicographic distance between two statements R and S
Table 2. Summary of performance vocabulary
fusion/distribution transformations. In general, fusing a se-
rial and a parallel loop will yield a serial loop. Although SO
and IP only affect the innermost linear dimension, in most
cases there will be fewer transformations that minimize the
number of high-strides than schedules which produce par-
allel innermost loops. As an example, consider the DGEMM
benchmark: exchanging loop-i or loop-j with loop-k will pro-
duce an innermost parallel loop, but only loop-j will minimze
the sum of high-strides. We also point out that some of these
transformations could serve correcting purposes (e.g. OP in
the HPFP and LDLC program classes). Lastly, the OTHER
program class exists mainly to cover kernels/benchmarks
which are too complex to be handled by the linear solver
used in our experiments, PIP. A better customized recipe
for these benchmarks require more powerful tools such as
GAMS or CPLEX. We thus defer this to later work.
5 Experimental Evaluation
The entire scheduling approach presented here has been
implemented in a fully automated fashion, in the PoCC com-
piler [17]. PoCC is to our knowledge the only open-source
compiler implementing the multidimensional legal schedul-
ing space we rely on [20], hence our implementation choice.
We evaluate the effectiveness of our scheduling approach
with the Polybench/C (v3.2) benchmark suite [18] on 3 repre-
sentative processors. As baseline for our experiments we use
the Pluto compiler [3] (Version 0.11.4-85). For each bench-
mark we generate a number of variants for each fusion/dis-
tribution heuristic available in Pluto, namely formaximum
fusion (MF), no fusion or maximum distribution (NF)
and smart fusion (SF). The set of tile sizes chosen vary
substantially depending on the values of the problem sizes,
and we include powers of 2 and non-powers of 2. Roughly,
the tile sizes chosen for each loop dimension belong to the
set {1,2,32,50,64,100,128,...} up to approximately a fourth part
of the problem size, always trying to have at least one non-
power-of-2 tile size for every power-of-2 evaluated. The
final footprint of the tile sizes considered are such that they
all fit in L1 or in L2. We use the STANDARD Polybench
Benchmark Our Pluto Pluto Pluto %Space Overall
GF Space Best Default Below Speedup
Size (GF/s) (GF/s)
2mm 185.13 2188 197.97 21.75 99.41% 0.94
3mm 153.39 2188 165.03 14.13 99.59% 0.93
adi 5.24 568 5.63 4.29 96.48% 0.93
atax 4.57 769 5.81 4.61 63.07% 0.79
bicg 4.57 769 5.95 4.65 65.67% 0.77
cholesky 6.43 1537 6.18 6.11 100.00% 1.04
correlation 90.91 2188 100.39 41.43 98.58% 0.91
covariance 111.11 2188 102.74 41.65 100.00% 1.08
doitgen 76.70 7204 75.13 29.70 100.00% 1.02
durbin 0.23 769 0.28 0.23 48.11% 0.82
dynprog 3.02 7204 3.58 1.38 99.99% 0.84
fdtd-2d 45.83 568 18.34 6.58 100.00% 2.50
fdtd-apml 10.07 1537 10.44 6.42 97.66% 0.96
floyd-warshall 3.54 1537 12.01 9.94 43.79% 0.30
gemm 201.33 2188 226.71 101.69 97.62% 0.89
gemver 20.25 769 11.09 7.68 100.00% 1.83
gesummv 3.20 769 8.79 5.20 12.61% 0.36
gramschmidt 5.49 2188 14.18 2.49 73.49% 0.39
jacobi-1d-imper 3.75 145 11.87 2.55 53.10% 0.32
jacobi-2d-imper 18.52 568 19.74 4.17 99.30% 0.94
lu 132.56 1702 71.67 45.54 100.00% 1.85
ludcmp 2.05 2188 2.03 2.01 100.00% 1.01
mvt 5.33 769 6.47 4.44 94.28% 0.82
reg_detect 0.00 73 2.55 1.84 0.00% 0.00
seidel-2d 12.00 568 12.80 1.15 99.82% 0.94
symm 55.89 1108 81.01 0.46 93.41% 0.69
syr2k 54.55 2188 33.51 29.50 100.00% 1.63
syrk 45.45 2188 42.06 32.51 100.00% 1.08
trisolv 4.00 769 8.11 7.67 59.69% 0.49
trmm 2.93 730 6.74 6.05 21.10% 0.43
Table 3. Polybench evaluation on Intel Skylake (10 cores)
with GCC 7.2
dataset, where the total kernel footprint fits in L3. Experi-
ments were conducted on a 3.3 GHz, 10-core Intel i9-7900X
(codename Skylake-X), with 32 KB L1, 1 MB L2/core, 13.25
MB L3 (shared). Benchmarks were compiled using GCC 7.2.
As a reference, on this machine the LAPACK DGEMM kernel
achieves 0.64 GF/s.
The results of our evaluation are shown in Table 3. For
each benchmarkwe report our achieved GF/s, the Pluto space
size (No.tiled variants × 3 fusion heuristics), the best Pluto
GF/s in the space, the default Pluto tile sizes and heuristic
(32x32x...32, - -parallel and SmartFuse heuristic), the per-
centage of the outperformed Pluto space (i.e., the fraction
of points in the search space that performs slower than the
10
version we produce), and finally the speedup over the best
Pluto variant. Overall, our transformation schemes show
very strong performance on practically all dense linear al-
gebra (per our classification), low-dimensional (LDLC loop
nests) and stencil-type of kernels. The lowest impact that we
achieve are in kernels where practically no parallelism can be
extracted without resorting to transformations that increase
the dimensionality of the loopnest (i.e trmm, trisolv, ludcmp,
floyd-warshall and cholesky). Moreover, the standard dataset
of the reg_detect kernel uses a time loop that performs 10000
iterations while using extremely small space grids. Given
these constraints, both our approach and the Pluto variants
were unable to surpass the original kernel performance.
Regarding stencil computations (adi, fdtd-apml, fdtd-2d,
jacobi-1d-imper, jacobi-2d-imper, seidel-2d), we observe that
although Pluto applies time tiling to maximize locality, the
requirement of skewing the iteration domain to extract par-
allelism has a 3-fold impact on the transformed code. First,
uncontrolled skewing induces the creation of complex loop
bounds that back-end compilers (e.g. GCC or ICC) are unable
to analyze, producing very low vectorization ratios. Second,
tiling produces much coarser (and fewer) units of work, mak-
ing it unprofitable to use multiple threads per core. Third, the
parallelism obtained by skewing transformations takes much
longer to be exploited, as multiple time iterations might be
required to reach the steady state.
Next, we focus our attention on the lower dimensional
benchmarks (atax, bicg, durbin, gemver, gesummv, mvt and
trisolv). These kernels have the property that only a single
level of parallelism can be extracted (without stripmining or
tiling). Hence, choosing between inner-vectorizable loops
and outer-parallel loops is usually necessary. Another phe-
nomenon observed in our experiments is that Pluto’s MF
heuristic was extremely detrimental to parallelism, while
NF achieved the worst locality. Hence, Pluto’s MF heuristic
offered the best trade-off between these properties. These
benchmarks are also bandwidth-bound, so tiling them im-
proves the program’s locality at the cost of reducing the
available parallelism. In this benchmark class, we often fall
20% below the best Pluto variant while outperforming 60-90%
of the exploration space.
Turning to the dense linear algebra kernels (gemm, 2mm,
3mm, correlation, covariance, doitgen, gramschmidt, lu, syrk,
syr2k, symm), our approach consistently outperforms 93%-
100% of the Pluto variants, and usually being 5%-7% away
from the best Pluto performance when we don’t outperform
the whole space. These benchmarks exhibit several degrees
of data parallelism which is properly exploited by our strat-
egy. In particular, we note that trading parallelism for reuse
(OPIR objectives) yields very strong performance on the Sky-
lake processor, effectively making the computation to boil
down to dot-products in all cases except for doitgen which
presents an outermost parallel loop. Another crucial aspect
to achieving high-performance is our unrolling cost model
for data-reuse at the innermost loop level, parallel dimen-
sion to unroll and the potential penalization for inducing
high-strides.
Performance Influence of Transformations To com-
plement our study, we present in Figure 2 the cumulative
effect of our transformations on a subset of benchmarks of
the STEN and HPFP classes. For each benchmark (bar clus-
ter), we show up to 6 levels of transformations. The exact
objectives applied in each case depends on the architecture
and benchmark class (See Table 1). The legend shows ob-
jectives added incrementally, from the highest priority (e.g.
SO for doitgen) to the lowest one (e.g OP). The aggregated
effect of several performance objectives defined in our vo-
cabulary are easily observable in the doitgen benchmark,
a tensor contraction. In contrast, the same set of objectives
yield marginal performance improvements for covariance.
It is interesting to see, however, that such a large set of cost
functions does not degrade performance despite seeking
seemingly opposing goals. We also observe a slight perfor-
mance degradation due to theRCOU post-processing, which
practically doubles covariance’s performance while slightly
decreasing doitgen’s. For benchmarks fdtd-2d and jacobi-
2d-imper we see that our stencil-specific objectives achieve
synergistic cooperation that translate to almost 4× speedup.
covariance doitgen
Benchmarks
16.0
32.0
64.0
128.0
G
F
L
O
P
/s
Influence of Cost Functions on Intel i9 7900X (Skylake-X) for HPFP Class
SO
SO+IP
SO+IP+OPIR
SO+IP+OPIR+SIS
SO+IP+OPIR+SIS+DGF
SO+IP+OPIR+SIS+DGF+OP
SO+IP+OPIR+SIS+DGF+OP+RCOU
fdtd-2d jacobi-2d-imper
Benchmarks
4.0
8.0
16.0
32.0
G
F
L
O
P
/s
Influence of Cost Functions on Intel i9 7900X (Skylake-X) for Stencil Class
SMVS
SDCS+SMVS
SPAR+SDC+SMVS
Figure 2. Performance influence of transformations on Intel
Skylake
Tuning Time As last contribution, we show in Table 4 the
time (sec) to tune a subset of benchmarks with our frame-
work and with the Pluto compiler. For our approach, the
tuning time consists of the Gen (transformation), Bin (GCC
compilation) and Exec (kernel execution time for 5 runs). For
Pluto, we show the average Gen, Bin and Exec times, which
11
scaled by the space size, gives the total tuning time. The last
column shows the speedup achieved by our framework w.r.t
Pluto’s tuning time.
Benchmark
Our Approach Pluto Space Speed-
Total Time ( Gen + Total Tuning Time up
Bin + Exec) (Gen + Bin + Exec)
gemm 3.1 (2+0.29+0.8) 2856.6 (0.16+0.27+0.88) 924.5
3mm 88.8 (88+0.55+0.21) 12794.7 (0.55+0.39+4.91) 144.2
correlation 675.5 (675+0.37+0.055) 3578.9 (0.77+0.44+0.43) 5.3
doitgen 15 (14.45+0.43+0.035) 5806.8 (0.31+0.32+0.18) 389.4
fdtd-2d 21.9 (21.2+0.55+0.06) 1220.1 (1.46+0.35+0.34) 56
jacobi-2d-imper 7.4 (7.3+0.036+0.027) 1782.3 (0.63+2.43+0.08) 242.1
fdtd-apml 1793.7 (1793+0.4+0.3) 3813.1 (1.65+0.39+0.44) 2.2
lu 2.6 (2.2+0.27+0.0405) 1600.2 (0.21+0.31+0.42) 637.4
gramschmidt 277.6 (277+0.32+0.2445) 2626.8 (0.37+0.26+0.57) 9.5
symm 8.1 (7.3+0.51+0.24) 60962.4 (9.11+44.4+1.51) 7573
gemver 2.7 (2.3+0.27+0.0395) 819.6 (0.16+0.25+0.66) 314.1
Table 4. Tuning time (sec) for selected benchmarks using
GCC 7.2
6 Related Work
This work builds largely on previous research that constructs
the convex space of semantics preserving transformations
[20, 25], also known as the Single-ILP scheduling approach.
In contrast to the level-by-level scheduling strategy used
in compilers such as Pluto [3] and PPCG [27], leveraging
the function space created by applying the affine form of
Farkas lemma, allows to select transformation schedules
that exhibit strongly desired program properties that im-
pact performance. Older seminal works such as Feautrier’s
[9] minimal-delay schedules greedily satisfy dependences
at the outermost loop dimensions, favoring transformations
that exhibit inner parallelism. Trifunovic’s et al. [24] pro-
posed transformations for vectorization, but that only ac-
counted for loop permutation. Kong et al. [14] developed a
complex transformation framework with the goal of exploit-
ing multi-dimensional reuse of point loops in tiled programs
by integrating permutability, inner parallelism and stride-
0/1 constraints into a single ILP. However, unlike us, they
relied on a highly tuned and architecture specific SIMD code
generator. Vasilache et al. [26] proposed a scheduling strat-
egy to simultaneously model inner parallelism, contiguity
and the data layout for vectorization. Similar to our work,
it first built the convex affine space of all legal schedules
[25]. Their formulation favors stride patterns amenable to
SIMD-vectorization (stride 1 or 0) while aggregating costs
of all array references.
Only a handful of works resemble our dependence-oriented
objectives. Zinenko’s et al. [29, 30] introduced “spatial prox-
imity” relations in a unified template to model contiguity,
cache line access and avoid false sharing. Previously, Bond-
hugula’s work [1]exploited the uniformity of dependences
in stencil computations to produce “diamond” shaped tiles
that maximized its concurrent start and execution.
CHiLL [4, 11, 23] is a polyhedral autotuning framework
designed for extensive experimental exploration via the spec-
ification of transformation recipes (including parameters
such as tile sizes, unrolling factors, among others). Other ap-
proaches combine (experimental) iterative and model driven
schemes. Pouchet et al. [19], explored fusion heuristics and
statement interleavings that lead to substantial performance
improvements. In a similar line, [20] proposed a technique
that operated on the legal space to build all the possible
tileable interleavings of statements.
Domain specific autotuning has also been a subject of re-
search. Works such as [5, 13, 22] implemented autotuning
frameworks for stencil computations targeting accelerators
and multi-core processors. Their auto-tuning space included
domain decomposition for GPUs, register blocking, NUMA-
aware memory allocation. Shirako et al. [21] combined poly-
hedral and AST transformations by using a cache line cost
model, while Vasilache et al. [6] designed a mechanism for
decomposing, isolating and recomposing the transformation
effects of a schedule.
7 Conclusion
Traditional approaches for loop transformations via polyhe-
dral scheduling attempt to find the best one-size-fits-all set
of objectives to achieve good overall performance, yet the
state of practice is to employ auto-tuning to ensure the best
performance on different architectures. In this work we take
a significantly different approach, by instead creating a com-
mon set of performance objectives, or idioms, which serve
as bases for a vocabulary. Using this high-level vocabulary,
we craft transformation recipes by selecting and embedding
specific idioms into in a single-ILP. The objectives selected
are made so to exploit both program characteristics and
architectural target traits. Although building the transfor-
mation recipes entail a mix of performance engineering and
a bit of empirical evaluation, we are still able to avoid, to
a large extent, large experimental evaluation. We also pro-
vide several high-level intuitions and rationales that justify
both the selection of performance idioms and their order of
application.
Another result of our work is how simple SCoP metrics
(e.g. No. of SCC or No. of self dependences) can be used to eas-
ily categorize programs. This classification, in tandem with
some prior knowledge of the target architecture allows us
to wisely select the adequate objectives to embed in the ILP,
and more importantly, to prioritize them. We proposed novel
objectives such as the trade-off between Outer Parallelism -
Inner Reuse (OPIR), Stride Optimization (SO), Separation of In-
dependent Statements (SIS), Dependence Guided Fusion (DGF),
together with constraints specific to stencils computations
that limit skewing, select the dimension at which a depen-
dence is satisfied and how to satisfy it. Our experimental
evaluation shows that strong performance can be achieved at
12
almost no tuning cost by using the performance vocabulary
presented here.
References
[1] Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula. 2012.
Tiling stencil computations to maximize parallelism. In High Perfor-
mance Computing, Networking, Storage and Analysis (SC), 2012 Inter-
national Conference for. IEEE, 1–11.
[2] MuthuManikandan Baskaran, Uday Bondhugula, SriramKrishnamoor-
thy, Jagannathan Ramanujam, Atanas Rountev, and Ponnuswamy Sa-
dayappan. 2008. A compiler framework for optimization of affine
loop nests for GPGPUs. In Proceedings of the 22nd annual international
conference on Supercomputing. ACM, 225–234.
[3] Uday Bondhugula, A Hartono, J Ramanujam, and P Sadayappan. 2008.
Pluto: A practical and fully automatic polyhedral program optimiza-
tion system. In Proceedings of the ACM SIGPLAN 2008 Conference on
Programming Language Design and Implementation (PLDI 08), Tucson,
AZ (June 2008).
[4] Chun Chen, Jacqueline Chame, and Mary Hall. 2005. Combining
models and guided empirical search to optimize for multiple levels
of the memory hierarchy. In Code Generation and Optimization, 2005.
CGO 2005. International Symposium on. IEEE, 111–122.
[5] Matthias Christen, Olaf Schenk, and Helmar Burkhart. 2011. Patus: A
code generation and autotuning framework for parallel iterative stencil
computations on modern microarchitectures. In Parallel & Distributed
Processing Symposium (IPDPS), 2011 IEEE International. IEEE, 676–687.
[6] Albert Cohen, Marc Sigler, Sylvain Girbal, Olivier Temam, David Par-
ello, and Nicolas Vasilache. 2005. Facilitating the search for composi-
tions of program transformations. In Proceedings of the 19th annual
international conference on Supercomputing. ACM, 151–160.
[7] Paul Feautrier. 1991. Dataflow analysis of array and scalar references.
International Journal of Parallel Programming 20, 1 (1991), 23–53.
[8] Paul Feautrier. 1992. Some efficient solutions to the affine scheduling
problem. I. One-dimensional time. International journal of parallel
programming 21, 5 (1992), 313–347.
[9] Paul Feautrier. 1992. Some efficient solutions to the affine scheduling
problem. Part II. Multidimensional time. International journal of parallel
programming 21, 6 (1992), 389–420.
[10] Tobias Grosser and Torsten Hoefler. 2016. Polly-ACC Transparent
compilation to heterogeneous hardware. In Proceedings of the 2016
International Conference on Supercomputing. ACM, 1.
[11] Mary Hall, Jacqueline Chame, Chun Chen, Jaewook Shin, Gabe Rudy,
and Malik Murtaza Khan. 2009. Loop transformation recipes for code
generation and auto-tuning. In International Workshop on Languages
and Compilers for Parallel Computing. Springer, 50–64.
[12] Tom Henretty, Richard Veras, Franz Franchetti, Louis-Noël Pouchet,
Jagannathan Ramanujam, and Ponnuswamy Sadayappan. 2013. A
stencil compiler for short-vector SIMD architectures. In Proceedings of
the 27th international ACM conference on International conference on
supercomputing. ACM, 13–24.
[13] Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, and Samuel
Williams. 2010. An auto-tuning framework for parallel multicore
stencil computations. In Parallel & Distributed Processing (IPDPS), 2010
IEEE International Symposium on. IEEE, 1–12.
[14] Martin Kong, Richard Veras, Kevin Stock, Franz Franchetti, Louis-
Noël Pouchet, and Ponnuswamy Sadayappan. 2013. When polyhedral
transformations meet SIMD code generation. InACM SIGPLAN Notices,
Vol. 48. ACM, 127–138.
[15] Sriram Krishnamoorthy, Muthu Baskaran, Uday Bondhugula, Jagan-
nathan Ramanujam, Atanas Rountev, and Ponnuswamy Sadayappan.
2007. Effective automatic parallelization of stencil computations. ACM
Sigplan Notices 42, 6 (2007), 235–244.
[16] Allen Leung, Nicolas Vasilache, BenoîtMeister, Muthu Baskaran, David
Wohlford, Cédric Bastoul, and Richard Lethin. 2010. A mapping path
for multi-GPGPU accelerated computers from a portable high level pro-
gramming abstraction. In Proceedings of the 3rd Workshop on General-
Purpose Computation on Graphics Processing Units. ACM, 51–61.
[17] PoCC. 2018. The Polyhedral Compiler Collection. (2018). https:
//sourceforge.net/projects/pocc/ Online; accessed on September 2018.
[18] Louis-Noël Pouchet. 2012. Polybench: The polyhedral benchmark suite.
URL: http://www. cs. ucla. edu/pouchet/software/polybench (2012).
[19] Louis-Noël Pouchet, Uday Bondhugula, Cédric Bastoul, Albert Co-
hen, Jagannathan Ramanujam, and Ponnuswamy Sadayappan. 2010.
Combined iterative and model-driven optimization in an automatic
parallelization framework. In High Performance Computing, Network-
ing, Storage and Analysis (SC), 2010 International Conference for. IEEE,
1–11.
[20] Louis-Noël Pouchet, Uday Bondhugula, Cédric Bastoul, Albert Co-
hen, Jagannathan Ramanujam, Ponnuswamy Sadayappan, and Nicolas
Vasilache. 2011. Loop transformations: convexity, pruning and opti-
mization. ACM SIGPLAN Notices 46, 1 (2011), 549–562.
[21] Jun Shirako, Louis-Noël Pouchet, and Vivek Sarkar. 2014. Oil and water
can mix: an integration of polyhedral and AST-based transformations.
In High Performance Computing, Networking, Storage and Analysis,
SC14: International Conference for. IEEE, 287–298.
[22] Kevin Stock, Martin Kong, Tobias Grosser, Louis-Noël Pouchet, Fabrice
Rastello, Jagannathan Ramanujam, and Ponnuswamy Sadayappan.
2014. A framework for enhancing data reuse via associative reordering.
In ACM SIGPLAN Notices, Vol. 49. ACM, 65–76.
[23] Ananta Tiwari, ChunChen, Jacqueline Chame,MaryHall, and Jeffrey K
Hollingsworth. 2009. A scalable auto-tuning framework for compiler
optimization. In Parallel & Distributed Processing, 2009. IPDPS 2009.
IEEE International Symposium on. IEEE, 1–12.
[24] Konrad Trifunovic, Dorit Nuzman, Albert Cohen, Ayal Zaks, and Ira
Rosen. 2009. Polyhedral-model guided loop-nest auto-vectorization.
In Parallel Architectures and Compilation Techniques, 2009. PACT’09.
18th International Conference on. IEEE, 327–337.
[25] Nicolas Vasilache. 2007. Scalable Program Optimization Techniques in
the Polyhedra Model. Ph.D. Dissertation. University of Paris-Sud 11.
[26] Nicolas Vasilache, Benoit Meister, Muthu Baskaran, and Richard Lethin.
2012. Joint scheduling and layout optimization to enable multi-level
vectorization. IMPACT, Paris, France (2012).
[27] Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, Jose Igna-
cio Gomez, Christian Tenllado, and Francky Catthoor. 2013. Polyhedral
parallel code generation for CUDA. ACM Transactions on Architecture
and Code Optimization (TACO) 9, 4 (2013), 54.
[28] Sven Verdoolaege and Alexandre Isoard. 2018. Extending Pluto-style
polyhedral scheduling with consecutivity. IMPACT, Manchester, UK
(2018).
[29] Oleksandr Zinenko, Sven Verdoolaege, Chandan Reddy, Jun Shirako,
Tobias Grosser, Vivek Sarkar, and Albert Cohen. 2017. Unified Polyhe-
dral Modeling of Temporal and Spatial Locality. (2017).
[30] Oleksandr Zinenko, Sven Verdoolaege, Chandan Reddy, Jun Shirako,
Tobias Grosser, Vivek Sarkar, and Albert Cohen. 2018. Modeling the
Conflicting Demands of Parallelism and Temporal/Spatial Locality in
Affine Scheduling. In Proceedings of the 27th International Conference
on Compiler Construction (CC 2018). ACM, New York, NY, USA, 3–13.
https://doi.org/10.1145/3178372.3179507
13
