Exact and Approximated Data-Reuse Optimizations for Tiling with Parametric Sizes by Darte, Alain & Isoard, Alexandre
Exact and Approximated Data-Reuse Optimizations for
Tiling with Parametric Sizes
Alain Darte, Alexandre Isoard
To cite this version:
Alain Darte, Alexandre Isoard. Exact and Approximated Data-Reuse Optimizations for Tiling
with Parametric Sizes. [Research Report] RR-8671, LIP - ENS Lyon; CNRS; Inria; UCBL.
2015, pp.28. <hal-01103460>
HAL Id: hal-01103460
https://hal.inria.fr/hal-01103460
Submitted on 14 Jan 2015
HAL is a multi-disciplinary open access
archive for the deposit and dissemination of sci-
entific research documents, whether they are pub-
lished or not. The documents may come from
teaching and research institutions in France or
abroad, or from public or private research centers.
L’archive ouverte pluridisciplinaire HAL, est
destine´e au de´poˆt et a` la diffusion de documents
scientifiques de niveau recherche, publie´s ou non,
e´manant des e´tablissements d’enseignement et de
recherche franc¸ais ou e´trangers, des laboratoires
publics ou prive´s.
IS
S
N
02
49
-6
39
9
IS
R
N
IN
R
IA
/R
R
--
86
71
--
FR
+E
N
G
RESEARCH
REPORT
N° 8671
January 2015
Project-Team Compsys
Exact and Approximated
Data-Reuse
Optimizations for Tiling
with Parametric Sizes
Alain Darte, Alexandre Isoard

RESEARCH CENTRE
GRENOBLE – RHÔNE-ALPES
Inovallée
655 avenue de l’Europe Montbonnot
38334 Saint Ismier Cedex
Exact and Approximated
Data-Reuse Optimizations for Tiling with
Parametric Sizes
Alain Darte, Alexandre Isoard
Project-Team Compsys
Research Report n° 8671 — January 2015 — 28 pages
This report is an improved version of the work presented at the IMPACT’14
workshop [11] and is published, without the appendix, at the CC’15 conference [12].
Work partially supported by the ManycoreLabs project PIA-6394 led by Kalray.
Abstract:
Loop tiling is a loop transformation widely used to improve spatial and temporal data locality, to
increase computation granularity, and to enable blocking algorithms, which are particularly useful
when oﬄoading kernels on computing units with smaller memories. When caches are not available
or used, data transfers and local storage must be software-managed, and some useless remote
communications can be avoided by exploiting data reuse between tiles. An important parameter
of tiling is the sizes of the tiles, which impact the size of the required local memory. However,
for most analyses involving several tiles, which is the case for inter-tile data reuse, the tile sizes
induce non-linear constraints, unless they are numerical constants. This complicates or prevents a
parametric analysis with polyhedral optimization techniques.
This paper shows that, when tiles are executed in sequence along tile axes, the parametric (with
respect to tile sizes) analysis for inter-tile data reuse is nevertheless possible, i.e., one can de-
termine, at compile-time and in a parametric fashion, the copy-in and copy-out data sets for all
tiles, with inter-tile reuse, as well as sizes for the induced local memories. When approximations
of transfers are performed, the situation is much more complex, and involves a careful analysis
to guarantee correctness when data are both read and written. We provide the mathematical
foundations to make such approximations possible. Combined with hierarchical tiling, this result
opens perspectives for the automatic generation of blocking algorithms, guided by parametric cost
models, where blocks can be pipelined and/or can contain parallelism. Previous work on FPGAs
and GPUs already showed the interest and feasibility of such automation with tiling, but in a
non-parametric fashion.
Key-words: Loop Tiling, Memory Hierarchy, Software-Managed Caches, Polyhedral Analysis
and Optimizations.
Optimisation pour la re´utilisation des donne´es,
me´thode exacte et me´thode approche´e, pour le
“tiling” avec tailles parame´triques
Re´sume´ :
Le tuilage de boucles (”loop tiling”) est une transformation de code lar-
gement utilise´e pour ame´liorer la localite´ spatiale et temporelle des donne´es,
augmenter la granularite´ des calculs, et ge´ne´rer des algorithmes par blocs, parti-
culie`rement utiles pour de´porter des portions de code sur des unite´s de calcul a`
petite me´moire. En l’absence de me´canisme automatique de cache, les transferts
de donne´es et le stockage local doivent eˆtre ge´re´s au niveau logiciel, et certaines
de ces communications peuvent eˆtre e´vite´es en re´utilisant des donne´es entre
tuiles. Un parame`tre important du tuilage est la taille des tuiles qui influe sur la
taille de la me´moire locale requise. Mais, pour la plupart des analyses impliquant
plusieurs tuiles, ce qui est le cas pour la re´utilisation des donne´es entre tuiles,
ces tailles de tuiles induisent des contraintes non-line´aires, sauf si ce sont des
constantes nume´riques. Ceci complique ou empeˆche une analyse parame´trique
par des techniques polye´driques.
Ce rapport montre que, lorsque les tuiles sont exe´cute´es en se´quence le long
des axes des tuiles, l’analyse parame´trique (au sens des tailles de tuiles), pour la
re´utilisation des donne´es entre tuiles, est ne´anmoins possible, c’est-a`-dire qu’on
peut de´terminer, de fac¸on statique et parame´trique, les ensembles de donne´es
entrantes et sortantes pour chaque tuile, avec re´utilisation entre tuiles, ainsi que
des tailles pour les me´moires locales induites. Lorsque des approximations des
transferts sont faites, la situation devient plus complexe, et requiert une ana-
lyse attentive pour garantir la correction de la transformation dans le cas ou` des
donne´es peuvent eˆtre a` la fois lues et e´crites. Nous apportons des e´le´ments de base
the´oriques pour rendre de telles approximations possibles. Combine´s avec du tui-
lage hie´rarchique, ces re´sultats ouvrent des perspectives pour la ge´ne´ration auto-
matique d’algorithmes par blocs, guide´e par des mode`les de couˆt parame´triques,
ou` les blocs peuvent eˆtre pipeline´es ou contenir du paralle´lisme. Des travaux
ante´rieurs pour FPGA et GPU ont de´ja` montre´ l’inte´reˆt et la faisabilite´ d’une
telle automatisation par tuilage, mais de fac¸on non-parame´trique.
Mots-cle´s : Tuilage de boucles, hie´rarchie me´moire, caches a` gestion logicielle,
analyses et optimisations polye´driques.
4 Darte & Isoard
Exact and Approximated Data-Reuse
Optimizations for Tiling with Parametric Sizes
Alain Darte and Alexandre Isoard
Compsys, Computer Science Lab (LIP), CNRS, INRIA, ENS-Lyon, UCB-Lyon
firstname.lastname@ens-lyon.fr
Loop tiling is a loop transformation widely used to improve spatial and tem-
poral data locality, to increase computation granularity, and to enable blocking
algorithms, which are particularly useful when oﬄoading kernels on comput-
ing units with smaller memories. When caches are not available or used, data
transfers and local storage must be software-managed, and some useless remote
communications can be avoided by exploiting data reuse between tiles. An im-
portant parameter of tiling is the sizes of the tiles, which impact the size of the
required local memory. However, for most analyses involving several tiles, which
is the case for inter-tile data reuse, the tile sizes induce non-linear constraints,
unless they are numerical constants. This complicates or prevents a parametric
analysis with polyhedral optimization techniques.
This paper shows that, when tiles are executed in sequence along tile axes,
the parametric (with respect to tile sizes) analysis for inter-tile data reuse is nev-
ertheless possible, i.e., one can determine, at compile-time and in a parametric
fashion, the copy-in and copy-out data sets for all tiles, with inter-tile reuse, as
well as sizes for the induced local memories. When approximations of transfers
are performed, the situation is much more complex, and involves a careful anal-
ysis to guarantee correctness when data are both read and written. We provide
the mathematical foundations to make such approximations possible. Combined
with hierarchical tiling, this result opens perspectives for the automatic gener-
ation of blocking algorithms, guided by parametric cost models, where blocks
can be pipelined and/or can contain parallelism. Previous work on FPGAs and
GPUs already showed the interest and feasibility of such automation with tiling,
but in a non-parametric fashion.
1 Introduction
Todays’ hardware diversity increases the need for optimizing compilers and run-
time systems. A difficulty when using hardware accelerators (FPGA, GPU,
dedicated boards) is to automatically perform kernel/function oﬄoading (a.k.a.
outlining as opposed to inlining) between the host and the accelerator, and to
organize data transfers between the different memory layers (e.g., in a GPU,
from remote to global memory, and from global to shared memory, or even reg-
isters). This requires static analysis to identify the kernel input (data read) and
output (data produced), and code generation for transfers, synchronizations,
and computations. In general, such tasks are done by the programmer who has
Inria
Data-Reuse Optimizations for Tiling with Parametric Sizes 5
to express the communications, to allocate and size the intermediate buffers,
and to decompose the kernel into fitting chunks of computation. When each
kernel is oﬄoaded in a three-phase process (i.e., upload, compute, store back),
such programming remains feasible. For GPUs, developers can use OpenCL or
CUDA, or they can rely on higher-level abstractions (e.g., compilation direc-
tives as in OpenACC or garbage collector mechanisms as in [9]), static analysis
as in OpenMPC [25], runtime approaches as in [24], or mixed compile/runtime
optimizations as in [27]. These approaches mainly work at the granularity of
variable names, still defined by the programmer, but they can be used to op-
timize remote transfers when several kernels are successively launched. Things
get more complicated when a given kernel is decomposed into smaller kernels
(and the initial arrays into array regions) to get blocking algorithms, thanks to
loop tiling. Indeed, iteration-wise loop analysis and element-wise array analysis
are needed to enable intra- and inter-tile data reuse. Moreover, the choice of
tile sizes is driven by hardware capabilities such as memory bandwidth, size,
and organization, computational power, and such codes are very hard to obtain
without automation and some cost model. With this objective, our contribution
is a parametric (w.r.t. tile sizes) polyhedral analysis technique for inter-tile
data reuse and a mathematical framework to reason with approximations of
data accesses and transfers.
Loop tiling is a well-known transformation used to improve data locality [36],
increase computation granularity, and control the use and size of local memories
for out-of-core computations (see [38] for details on semantics, validity condi-
tions, and code generation). It was first introduced for a set of perfectly nested
loops, as a grouping of iterations into supernodes [21], which are atomic (i.e., can
be executed without any communication/synchronization with other supernodes
except for live-in/live-out data at beginning/end of a tile execution), identical by
translation, bounded, and form a partition of the whole iteration space. Validity
conditions were given in terms of dependence cones and hyperplane partition-
ing, which define tiles as hyper-rectangles (after some possible change of basis)
and establish a link with affine scheduling and the generation of permutable
loops. Now, tiling is also used for non-perfectly nested loops [7], thanks to
multi-dimensional affine loop transformations: as in the perfectly nested case,
some permutable dimensions can be used to perform tiling, even if not all in-
structions have the same iteration domain. Analysis and code generation may
involve more complex sets, but the principles are similar. Today, loop tiling is
still a key loop transformation for performance (speed, memory, locality) and the
subject of many new advanced developments, including non-rectangular tiling.
Loop tiling can be viewed as a composition of strip-mining and loop in-
terchange, after a preliminary change of basis. It transforms n nested loops
into n tile loops iterating over the tiles, surrounding n intra-tile loops iterat-
ing within a tile. Dependence analysis and code generation for loop tiling is
well-established in the polyhedral model [15], i.e., for a set of nested for loops,
writing and reading multi-dimensional arrays and scalar variables, where loop
bounds, if conditions, and array access functions are affine expressions of sur-
RR n° 8671
6 Darte & Isoard
rounding loop counters and structure parameters. In this case, loop iterations
can be represented by a polyhedral iteration domain. When tile sizes are numer-
ical constants, parametric (w.r.t. program counters and structural parameters)
polyhedral optimizations (e.g., linear programming) can be used although loop
tiling transforms n loops into 2n loops. Indeed, the image by tiling of an n-
dimensional polyhedral iteration domain can be expressed as a 2n-dimensional
polyhedral iteration domain, because the set of points after tiling with fixed sizes
can be described by affine inequalities.1 In general, parametric tiling refers to
the case where tile sizes are parameters too. Parametric analysis within a tile is
in general feasible as the set of points in a tile is defined with affine constraints
from the tile sizes and the tile origin (first corner of the tile). However, when an
analysis involves several tiles, it becomes more intricate, if not unsolvable, as a
priori expressing the tiled space with tile sizes as parameters induces quadratic
constraints. For example, the tiling theory developed in [37], the code gener-
ation schemes of [21,16,7], the data movement and scratch-pad optimizations
of [23,22,6,4,29,35] are not parametric. Recently, efficient code generation for
parametric tiling [31,20] as well as some form of symbolic scheduling for tiled
codes [8] have been developed.
In the context of high-level synthesis (HLS), inter-tile data reuse was pro-
posed [2] (then automated [4]), as a source-to-source process on top of Altera
C2H HLS tool, to oﬄoad small computation kernels to FPGAs while optimizing
communications from a remote (in this case external) DDR memory. Simi-
lar results with data reuse between two successive tiles only were then demon-
strated for AutoESL Xilinx tool [29]. Different (and more restricted) forms of
inter-tile data reuse were also designed for programmable accelerators such as
GPUs [5,18,35]. None of these approaches are parametric w.r.t. tile sizes. In
this paper, we show that maximal inter-tile data reuse can be expressed in the
parametric case, even in an approximated situation. The trick to get around a
quadratic formulation is to work with all possible tiles – not just the tiles that
are part of the iteration space partitioning and whose origins belong to a lattice –
but the difficulty is to make sure that exactness and correctness are maintained.
Our contributions, mostly at the level of code analysis, are the following:
– When read/write accesses can be described in an exact way using polyhedral
representations, we show how to derive, thanks to manipulations of integer
sets, the copy-in and copy-out sets for each tile, with parametric tile sizes.
This gives a full parametric generalization of the inter-tile data reuse of [4].
– We extend this parametric analysis to handle approximations, which make
the analysis more complex when some data may be both read and written
by the tiles, as loading too much may not be safe. We introduce the concept
of pointwise functions for which no additional loss of accuracy is induced.
– Using similar analysis principles, we show how such a parametric analysis
can be exploited in the following steps of the compilation, in particular to
perform parametric array contraction for the definition of local arrays.
1 However, difficulties due to large coefficients are possible.
Inria
Data-Reuse Optimizations for Tiling with Parametric Sizes 7
2 Prerequisites
2.1 Notations and Definitions
We write all vectors with bold letters such as i, with components i1, . . . , in.
The vector 0 (resp. 1) has all components equal to 0 (resp. 1) and a ◦ b is the
product (component-wise) of a and b. We denote by  the lexicographic total
order on vectors of arbitrary size and by ≤ the component-wise partial order on
vectors with same size, defined by i ≤ j if and only if (iff) ik ≤ jk for all k.
We will not elaborate on how to build and interpret the different affine func-
tions for tiling non-perfectly nested loops. To simplify the discussion and no-
tations, we only focus on the n dimensions to be tiled. We assume that each
statement S with polyhedral iteration domain DS (scanned with the iteration
vector i) is tiled, after a first affine mapping i 7→ i′ = θ(S, i), by canonical tiles
whose sizes are specified by a vector s. In other words, a point i is mapped
to the tile indexed by T where Tk = b i
′
k
sk
c, or equivalently skTk ≤ (θ(S, i))k <
sk(Tk + 1), for k ∈ [1..n], i.e., 0 ≤ θ(S, i)−s◦T ≤ s−1. Also, we restrict to the
case where the original and the tiled programs are both executed sequentially.2
Several orders of iterations in the tiled program are possible, we consider that the
tiled code is executed following the lexicographic order on the 2n-dimensional
vectors (T , i′). The tiled iteration domain for statement S is then:
TS = {(T , i′) | ∃i ∈ DS , i′ = θ(S, i), 0 ≤ i′ − s ◦ T ≤ s− 1}
If θ is a one-to-one mapping and DS the set of integer points in a polyhedron,
then i can be eliminated and TS is also the set of integer points in a polyhedron.
Example We illustrate the concepts and steps of our technique with the kernel
jacobi_1d_imper from PolyBench [30], with a time loop, and tiled in 2D. For
the code in Fig. 1, the Pluto compiler [28] generates the following mapping:
θ(S1, (t, i)) = (t, 2t+ i, 0) θ(S2, (t, j)) = (t, 2t+ j + 1, 1)
DS1 = DS2 = {(t, i) | 0 ≤ t ≤M − 1, 0 ≤ i ≤ N − 2}
for (t = 0; t < M; t++) {
for (i = 1; i < N - 1; i++)
S1: B[i] =
(A[i-1] + A[i] + A[i+1])/3;
for (j = 1; j < N - 1; j++)
S2: A[j] = B[j];
}
Figure 1. Original kernel.
for (t = 0; t < M; t++)
for (i’ = 2t+1; i’ < 2t + N; i’++) {
S0: i = i’-2t;
S1: if (i<N-1) B[i] =
(A[i-1] + A[i] + A[i+1])/3;
S2: if (i>1) A[i-1] = B[i-1];
}
Figure 2. transformed kernel.
2 However, parallelism inside a tile is possible, as well as hierarchical tiling, which
enables to play with the extent of the tiled domain. Parallel execution are also pos-
sible by defining a partial execution order, if execution follows the axes defining tiles.
Other cases seem possible but with additional complications and approximations.
RR n° 8671
8 Darte & Isoard
This means shifting S2 by 1 in the j loop, fusing the i and j loops, then skewing
by 2 the inner loop, to get the code of Fig. 2. Then, several tiled code generations
are possible depending on how iterators are defined and how tiles are aligned,
i.e., what the underlying lattice of the tiling is. With the relation Tk = b iksk c,
tiles are aligned with the canonical basis obtained after the transformation θ
(see Fig. 3 for tiles of size 2×3, drawn in the original basis to save space). With
the “outset” code generation scheme of [31], for tile sizes s1 × s2, we get:
for (T1 = 0; T1 < M; T1+=s1) {
lb = 2T1+1-(s2-1); lb = s2*ceiling(lb/s2);
for (T2 = lb; T2 < 2T1 + N + 2(s1 - 1); T2+=s2)
for (t=max(0,T1); t<min(M,T1+s1); t++)
for (i’=max(2t+1,T2); i’<min(2t+N,T2+s2); i’++) {
S0: i = i’-2t;
S1: if (i<N-1) B[i] = (A[i-1] + A[i] + A[i+1])/3;
S2: if (i>1) A[i-1] = B[i-1];
}
}
For our scheme, it would also be valid to shift, after tiling, the inner tile-loop
w.r.t. the outer tile-loop, i.e., to move up or down each column in Fig. 3. 
2.2 Inter-Tile Data Reuse
The inter-tile reuse problem we formalize here is the kernel offloading with op-
timized remote accesses presented in [2,4], even if other variations are possible.
A kernel is tiled and oﬄoaded, tile by tile, to a computing accelerator (a FPGA
in [2,4]). Initially, all data are in remote memory, while all computations are
performed on the accelerator. Each tile T consists of three successive phases:
a loading phase where data are copied from remote memory to local memory,
enabling burst communications, then a compute phase where the original compu-
tations corresponding to the tile are performed on the local memory, and finally
a storing phase where data are copied to remote memory. In addition, all com-
pute (resp. loading and storing) phases are performed in sequence, following the
lexicographic order on tile indices. Nevertheless, loads and stores can be done
t
i
Non-empty 2 × 3 tiles drawn
w.r.t the original space. In-
struction S1 in red. Instruc-
tion S2 in green.
Are also shown some flow de-
pendences, due to reads of B,
at distance (0, 1), and reads
of A, at distance (1, 0), (1,−1),
(1,−2) in the (t, i) space.
Figure 3. jacobi1d kernel and skewed tiling.
Inria
Data-Reuse Optimizations for Tiling with Parametric Sizes 9
concurrently with the computations of other tiles, enabling pipelining, compu-
tation/communication overlapping, and execution similar to double buffering.
Inter-tile reuse makes this possible even when data are both read and written.3
Then, the “maximal inter-tile data reuse problem” is to define the loading
and storing sets Load(T ) and Store(T ) for each tile T so that a data element
is never loaded from remote memory if it is already available in local memory,
i.e., if it has already been loaded or computed (as, in this latter case, the remote
memory is not necessarily up-to-date). This inter-tile reuse is performed for each
tile strip (subspace of tiles corresponding to inner tile dimensions). In [4], a tile
strip is one-dimensional, but the technique can be applied to multi-dimensional
strips. This choice however impacts the size of the local memory.
Note: there are some similarities with the reuse analysis of [17]. Given a
“sliding window” of iterations, one analyzes the data that each iteration needs
to bring because they were not already present due to previous iterations in the
sliding window. But the communications are not coalesced out of the tile, they
are still at the iteration level. In other words, this is a reuse analysis at constant
(possibly parametric) distance (the sliding window), but with no granularity or
scheduling (through tiling) reorganization, which makes the problem different.
The technique of [4], based on parametric linear programming [14], consists
in performing loads (resp. stores) as late (resp. as soon) as possible, i.e., a data
element is loaded just before the first tile that accesses it, if this access is a
read, and is stored just after the last tile that writes it. Among all schemes that
exploit a full inter-tile reuse in a strip, this tends to reduce the size of the local
memory. We illustrate this technique again on the jacobi_1d_imper example.
Example (cont’d) For the tiling of Fig. 3, a 1D tile strip is vertical, indexed by
T1 = b ts1 c. To simplify explanations, we only consider the array A (the array B
is not live-in of a tile strip). We compute the first operation (following the
order defined by the tiling) that accesses A[m]. This means computing, with
(i1, i2) = (t, i) and parameters M , N , m, T1, the lexicographic minimum of
(T2, i
′
1, i
′
2, k, i1, i2) in a set defined by a disjunction of two conjunctions of affine
inequalities derived from the program (iteration domains and access functions):{−1 ≤ m− i2 ≤ 1, 0 ≤ i1 ≤M − 1, 1 ≤ i2 ≤ N − 2, k = 0,
i′1 = i1, i
′
2 = 2i1 + i2, 0 ≤ i′1 − 2T1 ≤ 1, 0 ≤ i′2 − 3I2 ≤ 2
∨{
m = i2, 0 ≤ i1 ≤M − 1, 1 ≤ i2 ≤ N − 2, k = 1, i′1 = i1,
i′2 = 2i1 + i2 + 1, 0 ≤ i′1 − 2T1 ≤ 1, 0 ≤ i′2 − 3T2 ≤ 2
The first set of constraints corresponds to reads in S1 and specifies that A[m] is
A[i-1], A[i], or A[i+1], that iterations in tiles are valid ((T1, T2, i
′
1, i
′
2) ∈ TS),
and k = 0 is the third component of θ(S1, (t, i)) (i.e., S1 is the first executed
statement in the loop body). The second set of constraints corresponds to writes
3 Without inter-tile reuse, full pipelining of tiles is not always possible if a data is
locally written, then read in a subsequent tile. Indeed, one would then need to wait
for the data to be stored in remote memory before loading it again. Inter-tile reuse
enables to break such a cycle of synchronizations and avoid considering latencies.
RR n° 8671
10 Darte & Isoard
in S2 (with k = 1, i.e., second executed statement in the loop body). The lex-
icographic minimum is expressed as a disjunction of cases (a QUAST or quasi
affine solution tree [14]). Then, all solutions (i.e., leaves of the tree) that cor-
respond to a write operation are removed. Here, all first accesses are reads, no
simplification is needed. It remains to project out the variables i′1, i
′
2, i1, i2,
k, to get a relation between tile index T and array element m, which describes
Load(T ) as a union:
Load(T ) =
{m | 0 ≤ 2T1 ≤M − 1, 2 ≤ m ≤ N − 1, 1 ≤ m + 4T1 − 3T2 ≤ 3}
∪
{m | 0 ≤ m ≤ 1, 3 ≤ N, 0 ≤ 2T1 ≤M − 1, −1 ≤ 4T1 − 3T2 ≤ 1}
The second set loads the additional A[0] and A[1] for the unique tile in the
strip that contains an iteration (t, 1) on its first column (squares in Fig. 3). 
As can be seen from the inequalities involved in the previous example with
s = (2, 3) (and in the definition of TS), considering the components of the size
vector s as parameters generates quadratic constraints. In other words, this
formulation is inherently not linear in the tile sizes. The goal of this paper is to
show that, surprisingly, the problem can nevertheless be solved, both for exact
inter-tile reuse (as in the previous example) and with approximations.
3 Dealing with Unaligned Tiles
The first key idea to break the non-linearity constraint is to represent each tile
not with its tile index T defined earlier, but with the index I of its origin (first
element in the tile in the lexicographic order). The first difference is that tiles
are scanned with loops with increments equal to 1 when T is used and equal
to s when I is used. The second difference is that, when I is used instead of T ,
the set of elements i in a tile is affine in s: this is the set of all i such that
I ≤ i ≤ I + s− 1. In other words, parametric analysis inside a tile is possible.
This representation is not new, it is used for analysis in PIPS [19, Fig. 6] and
for the parametric code generation [31] used for the tiled code of Section 2.1.
However, when reasoning with different tiles, the non-linearity is coming back.
Indeed, in a given execution, the tile origins I are restricted to the lattice L
defined by I ∈ L iff I = s ◦ J for some integer vector J . The second key
idea is to show how these quadratic constraints can nevertheless be ignored, by
reasoning on the set of all tiles of size s, not just those restricted to L. The
inter-tile reuse problem then becomes (piece-wise) affine in s as we will show.
Note that, with standard conditions for tiling (i.e., when all dependence
distances are non-negative along the dimensions being tiled [21]), if a tiling is
valid, any translation of it is valid too. In other words, considering all tile origins
I = s ◦ J + I0 for some vector I0 defines a valid tiling too. This has the same
effect as defining the tiling from the shifted mapping i 7→ σ(S, i)− I0 for all S.
Hereafter, we say that two tiles are aligned if they belong to the same tiling.
Inria
Data-Reuse Optimizations for Tiling with Parametric Sizes 11
3.1 Exact Approach with Set Equations
In Section 2.2, maximal inter-tile data reuse was expressed as a linear program-
ming optimization, following [4]. It can be equivalently formulated with set
equations [3], expressed in terms of In(T ) and Out(T ), the standard live-in and
live-out sets for tile T , as defined for example for array region analysis [10]:
Load(T ) = In(T ) \
⋃
T ′≺T
(In ∪Out)(T ′) = In(T ) \ (In ∪Out)(T ′ ≺ T )}
Store(T ) = Out(T ) \
⋃
T ′T
Out(T ′) = Out(T ) \Out(T ′  T )
Here, as indicated in the previous formulas, X(T ′ ≺ T ) is a shortcut to denote
the union of all sets X(T ′) for all tiles T ′ executed before T (lexicographic order)
in the same tile strip as T . Expressing X(T ′ ≺ T ) from X(T ′) is done simply by
adding the constraint T ′ ≺ T and specifying that T ′ is in the strip where reuse
is exploited. The previous set equations state that we load what is live-in for T
and not previously live-in (redundant load) or live-out (defined locally), and we
store what is live-out, but not again live-out later (redundant store). One could
expect to rather subtract Load(T ′ ≺ T ) from Load(T ) and Store(T ′  T ) from
Store(T ), but such recursive implicit definitions are not usable.
We now rephrase these equations when tiles T are represented by their tile
origins I as previously explained. We also consider all tiles with size s, not
just those whose origins belong to the lattice L, i.e., even those that will not
be executed in a given tiling. These tiles contain valid iterations (which will
be executed as part of an aligned tile), but their Load and Store sets will not
generate transfers during the execution. We define two relations on tiles:
– I′ @s I iff I′ ≺ I and I − I′ ∈ L. This is equivalent to the lexicographic
order T ′ ≺ T for the corresponding tile indices.
– I′ ≺s I iff, for some k ∈ [1..n], I ′i ≤ Ii for all i < k and I ′k ≤ Ik− sk where n
is the dimension of I and I′. This is a variation of the lexicographic order.
The standard reflexive extensions vs and s of these relations are clearly partial
orders. Fig. 4 shows all tile origins I′ strictly smaller (in blue) or strictly larger
(in red) than the tile origin I (in yellow), for the orders vs and s. Note that
tiles comparable for vs are always aligned with each other. An alternate, maybe
more intuitive, definition of ≺s is as follows: I′ ≺s I iff, in the tiling induced
by I (the same is true with I′, this is symmetric), every point in the tile I′ is
executed before any point in the tile I (but I and I′ may not be aligned).
With tile origins, the previous Load/Store equations can be rewritten as:
Load(I) = In(I) \ (In ∪Out)(I′ @s I) (1)
Store(I) = Out(I) \Out(I′ As I) (2)
The key is now to show that these sets can also be defined equivalently as:
Load(I) = In(I) \ (In ∪Out)(I′ ≺s I) (3)
Store(I) = Out(I) \Out(I′ s I) (4)
RR n° 8671
12 Darte & Isoard
This is not obvious as the contribution of unaligned tiles (i.e., not in the same
tiling as I) is also subtracted, thus the Load/Store sets could now be too small.
Nicely, these sets only involve affine constraints as the relation ≺s is, by
definition, piece-wise affine (this is also the case for a similar “happens-before”
relation defined on iteration points). They can thus be computed with a library
such as isl [33]. Before proving these formulas, we first illustrate their use.
Example (cont’d) The following sets were computed thanks to the isl calculator
iscc [34] with the generic script of Fig. 5, for jacobi_1d_imper (see Fig. 3).
Load(I) = {A(m) | 1 ≤ m + 2I1 − I2 ≤ s2, s1 ≥ 1, I1 ≥ 0, m ≥ 1, I1 ≤ −1 + M,
I2 ≥ 2− s2 + 2I1, m ≤ −1 + N, N ≥ 3}
∪ {A(m) | m ≥ 1 + I2, m ≥ 1, M ≥ 1, m ≤ −1 + N, I1 ≤ −1,
I1 ≥ 1− s1, I2 ≥ 2− s2, N ≥ 3, m ≤ s2 + I2}
∪ {A(1) | I2 = 1 + 2I1 ∧ 0 ≤ I1 ≤ −1 + M, N ≥ 3, s1 ≥ 1, s2 ≥ 1}
∪ {A(m) | 0 ≤ m ≤ 1, I2 = 1 ≤ s2, 1− s1 ≤ I1 ≤ −1, M ≥ 1, N ≥ 3}
∪ {A(0) | 0 ≤ I1 ≤M − 1, N ≥ 3, s1 ≥ 1, 1 ≤ I2 − 2I1 ≥ 2− s2 }
∪ {A(0) | 1− s1 ≤ I1 ≤ −1, M ≥ 1, N ≥ 3, I2 ≥ 2− s2, I2 ≤ 0}
Store(I) = {B(m) | m ≥ 1, m ≥ 2− 2M + s2 + I2, m ≤ −2 + N,
I1 ≥ 1− s1, 2 ≤ m + 2s1 + 2I1 − I2 ≤ 1 + s2, s1 ≥ 1}
∪ {B(m) | m ≥ 1, s1 ≥ 1, m ≤ −2 + N, I1 ≤ −1 + M, m ≤ 1− 2M + s2 + I2,
m ≥ 2− 2s1 − 2I1 + I2, I1 ≥ 1− s1, M ≥ 1, m ≥ 2− 2M + I2}
∪ {A(m) | m ≥ 1, m ≥ 1− 2M + s2 + I2, m ≤ −2 + N,
I1 ≥ 1− s1, 1 ≤ m + 2s1 + 2I1 − I2 ≤ s2, s1 ≥ 1}
∪ {A(m) | m ≥ 1, s1 ≥ 1, m ≤ −2 + N, I1 ≤ −1 + M, m ≤ −2M + s2 + I2,
m ≥ 1− 2s1 − 2I1 + I2, I1 ≥ 1− s1, M ≥ 1, m ≥ 1− 2M + I2}
The fact that the array B appears in the Store set may be surprising as B is
recomputed in each tile strip (this is why it does not appear in the Load set).
This is because the script of Fig. 5 considers each tile strip in isolation. To be
able to remove B from the Store set, one would need a similar analysis on tile
strips to discover that B is actually overwritten by subsequent tile strips. Then,
only the last tile strip should store B, in case it is live-out of the program.
It can be checked (e.g., with iscc) that the set Load(I) above is indeed a
generalization of the set Load(T ) derived earlier for the canonical tiling with
s = (2, 3). It is the complete expression, parameterized by s, of all cases,
Here s = (2, 2)I @s I′I′ @s I
I
I2
I1
I′ ≺s I I ≺s I′
I
I2
I1
Figure 4. Orders vs and s. Points are tile origins.
Inria
Data-Reuse Optimizations for Tiling with Parametric Sizes 13
# Inputs
Params := [M, N, s_1, s_2] -> { : s_1 >= 0 and s_2 >= 0 };
Domain := [M, N] -> { # Iteration domains
S_1[i_1, i_2] : 1 <= i_2 <= N-2 and 0 <= i_1 <= M-1;
S_2[i_1, i_2] : 1 <= i_2 <= N-2 and 0 <= i_1 <= M-1; } * Params;
Read := [M, N] -> { # Read access functions
S_1[i_1, i_2] -> A[m] : -1 + i_2 <= m <= 1 + i_2;
S_2[i_1, i_2] -> B[i_2]; } * Domain;
Write := [M, N] -> { # Write access functions
S_1[i_1, i_2] -> B[i_2];
S_2[i_1, i_2] -> A[i_2]; } * Domain;
Theta := [M, N] -> { # Preliminary mapping
S_1[i_1, i_2] -> [i_1, 2 i_1 + i_2, 0];
S_2[i_1, i_2] -> [i_1, 1 + 2 i_1 + i_2, 1]; };
# Tools for set manipulations
Tiling := [s_1, s_2] -> { # Two dimensional tiling
[[I_1, I_2] -> [i_1, i_2, k]] -> [i_1, i_2, k] :
I_1 <= i_1 < I_1 + s_1 and I_2 <= i_2 < I_2 + s_2 };
Coalesce := { [I_1, I_2] -> [[I_1, I_2] -> [i_1, i_2, k]] };
Strip := { [I_1, I_2] -> [I_1, I_2’] };
Prev := { # Lexicographic order
[[I_1, I_2] -> [i_1, i_2, k]] -> [[I_1, I_2] -> [i_1’, i_2’, k’]] :
i_1’ <= i_1 - 1 or (i_1’ <= i_1 and i_2’ <= i_2 - 1)
or (i_1’ <= i_1 and i_2’ <= i_2 and k’ <= k - 1) };
TiledPrev := [s_1, s_2] -> { # Special ‘‘lexicographic’’ order
[I_1, I_2] -> [I_1’, I_2’] : I_1’ <= I_1 - s_1 or
(I_1’ <= I_1 and I_2’ <= I_2 - s_2) } * Strip;
TiledNext := TiledPrev^-1;
TiledRead := Tiling.(Theta^-1).Read; TiledWrite := Tiling.(Theta^-1).Write;
# Set/relation computations
In := Coalesce.(TiledRead - (Prev.TiledWrite)); Out := Coalesce.TiledWrite;
Load := In - ((TiledPrev.In) + (TiledPrev.Out)); Store := Out - (TiledNext.Out);
print coalesce (Load % Params); print coalesce (Store % Params);
Figure 5. Script iscc for the Jacobi1D example.
including incomplete tiles, and even tilings obtained by translation of L. Note
that simply changing the object Strip (see Fig. 5) from {[I_1,I_2]->[I_1,I_2’]}
to {[I_1,I_2]->[I_1’,I_2’]} gives 2D inter-tile reuse, i.e., in the whole space,
as the first dimension is not a fixed parameter anymore. The strict order ≺s is
defined by TiledPrev while Load and Store, at the end of the script, express
Eq. (3) and (4). Constraints on parameters or on I can be added in Params,
e.g., to get simplified Load/Store sets for complete tiles, for large tiles, etc.
Note however that isl uses coalescing heuristics to simplify expressions and,
depending on the constraints, the outcome can be simpler or more complicated
(although equivalent). Here, replacing s1 ≥ 0 by s1 > 0 changes the final
expression. 
RR n° 8671
14 Darte & Isoard
To prove that we can use ≺s (in Eq (3) and (4)) instead of @s (in Eq (1)
and (2)), we define the concept of pointwise functions. This is a bit more than
what we need for the proofs, but this concept makes easier to understand the
underlying problems, related to the equality (or not) of some unions of images
of sets, which will be even more subtle when dealing with approximations.
3.2 Pointwise Functions
If A is a set, P(A) denotes the set of subsets of A (sometimes also written 2A).
Hereafter, the function F is typically a function such as Out, which maps a tile,
i.e., a subset of the tile strip (A), to a subset of all data elements (B).
Definition 1. Let A and B be two sets, C ⊆ P(A). The function F : C → P(B)
is pointwise iff there exists f : A → P(B) such that ∀X ∈ C, F (X) =
⋃
x∈X
f(x).
In other words, a function F is pointwise if the image of any set where F is defined
(not necessarily all sets) can be summarized by the contributions (through f) of
the points it contains. In our case, A is the set of iterations in the tile strip to be
analyzed and C is the set of all tiles (aligned or unaligned) intersected with A.
If all written values are live-out, Out(I) = Write(I), the values written in I.
Otherwise, this set should be intersected with Liveout, the set of all elements
live-out of the tile strip. The function Write is, by definition, pointwise, because
it is the union, for all points i in I, of the set of values write(i) written at
iteration i. Also, even if I 7→ In(I) may not be pointwise, any element read but
not written in I is live-in for I, thus (In∪Write)(I) = (Read∪Write)(I), which
is pointwise, by introducing read(i) the set of points read at iteration i. We get:
Load(I) = In(I) \ (In ∪Write)(I′ @s I) = In(I) \
⋃
I′@sI
⋃
i∈I′
(read ∪ write)(i)
= In(I) \
⋃
I′≺sI
⋃
i∈I′
(read ∪ write)(i) = In(I) \ (In ∪Write)(I′ ≺s I)
This is because ∪I′≺sII′ = ∪I′@sII′. Indeed, since all tiles aligned with I form
a partition of A, the points covered by the two unions are the same: these are
all the points executed before any point in I. The same is true for Store(I),
which is equal to Liveout∩ (Write(I) \Write(I′ As I)), or equivalently equal to
Liveout ∩ (Write(I) \Write(I′ s I)). This concludes the proof in the exact
case.
In summary, because tiles represent points exactly and because the “happens-
before” relation (the fact that a point, resp. a tile, happens, during tiled execu-
tion, before another point, resp. tile) can be represented by a piece-wise affine
relation, it is possible to perform a parametric analysis of inter-tile data reuse.
The equality of the unions of the images for I′ @s I and for I′ ≺s I is actu-
ally a general property, and even a characterization, of pointwise functions. As
the following theorem shows, pointwise functions are exactly those that induce
the desired “stability” property on union of sets, i.e., if two unions of sets cover
Inria
Data-Reuse Optimizations for Tiling with Parametric Sizes 15
Figure 6. “Double squares” (red), F (image of red & green), non pointwise situations.
the same points, then the union of their contributions through F are the same.
This a more general property than distributive functions (for ∪), those for which
F (A ∪B) = F (A) ∪ F (B) because, in our case, F (A ∪B) may not be defined.
Theorem 1. F : C → P(B) is pointwise if and only if ∀C′ ⊆ C, ∀C′′ ⊆ C,⋃
X∈C′ X =
⋃
X∈C′′ X ⇒
⋃
X∈C′ F (X) =
⋃
X∈C′′ F (X).
Note that the previous property on unions is equivalent to ∀X ∈ C, ∀C′ ⊆ C,
X ⊆ ⋃X′∈C′ X ′ ⇒ F (X) ⊆ ⋃X′∈C′ F (X ′), i.e., if a set is covered by a union of
sets, then its image is contained in the union of the images of these sets.
A third equivalent characterization is possible, which explicitely builds a
function f for a pointwise function F . If F and G are from C to P(B), we
write F ⊆ G if ∀X ∈ C, F (X) ⊆ G(X). Theorem 2 also identifies the “largest”
pointwise under-approximation of F . All missing proofs are provided in the
appendix.
Theorem 2. For F : C ⊆ P(A) → P(B), let F◦ be the pointwise function
defined from f◦(x) =
⋂
Y ∈C, x∈Y F (Y ). Then F◦ is the largest pointwise under-
approximation of F , i.e., F◦ ⊆ F and, if F ′ is pointwise, F ′ ⊆ F ⇒ F ′ ⊆ F◦.
In particular, F is pointwise if and only if F = F◦.
To get the intuition for these concepts, it is simpler to consider objects more
general than rectangular tiles. Let C be the set of all possible “double squares”
(in 2D) defined as two diagonally-neighboring squares as depicted on the left of
Fig. 6 (red points in two boxes). Suppose each point i has an image f(i). If
F (I) is defined for a “double-square” I as the union of all f(i) for i ∈ I, it is
pointwise by definition. Now, suppose F (I) is defined as the union of all f(i)
for i in the convex hull of I (red + green points). The first situation on the
right of Fig. 6 shows that each point i is included in two “double-squares” whose
images by F have only f(i) in common. Thus F0 is not equal to F (the image of
green points are missing) unless f has some additional property and, according
to Theorem 2, F is not pointwise. The second situation on the right of Fig. 6
shows that a “double-square” is fully contained in two “double-squares”, but the
image of its green points (if f is injective) is not covered by the image of these
two “double-squares” so, according to Theorem 1, F is not pointwise.
RR n° 8671
16 Darte & Isoard
3.3 The Case of Approximations
We will use the previous properties of pointwise functions for approximations.
There are at least four reasons why approximations of the various sets In, Out,
Load, and Store may be used in an automatic code analyzer and optimizer.
– The execution of S at iteration i is not guaranteed, for example when it
depends on a non-analyzable (e.g., data-dependent) if condition.
– The access functions are not fully analyzable (e.g., indirect accesses).
– The In/Out sets are approximated on purpose (e.g., they are restricted to
polyhedra or hyper-rectangles) due to the algorithms used for analysis.
– The Load/Store sets are approximated to make them simpler, or to get
transfer sets of some special form (e.g., vector/array communications).
In the first two cases, the approximation is pointwise, so the Read/Write func-
tions remain pointwise. In the last two cases, it is more likely that In ∪ Out is
not pointwise anymore. We first recall and extend the principles stated in [3]
for approximations, assuming that the sets In, Out, and Out are given such that
In(I) ⊆ In(I) and Out(I) ⊆ Out(I) ⊆ Out(I). Here, the under-approximations
(that could benefit from [10,32]) are not used for correctness, only for accuracy.
Non-Parametric Case. The first step is to define the Store sets, as exactly as
possible from the Out sets, i.e., the sets of data possibly written:
Store(I) = Liveout ∩ (Out(I) \Out(I′ As I)) (5)
Then, any over-approximation Store(I) of Store(I) can be used. Eq. (5) means
that a possibly-defined element is always stored to remote memory, in case it is
indeed written at runtime. But what if this is not the case? We add it to the
set of input elements so that its initial value is stored back instead of garbage:
In
′
(I) = In(I) ∪ (Store(I) \Out(I)) (6)
Following [3, Thm. 3], loads are defined, as exactly as possible, from the sets
Out, Out, and In
′
(i.e., after Store is defined). They are valid if for any tile I:
Load(I′ vs I) contains Ra(I) = In′(I) \Out(I′ @s I) (7)
Load(I) ∩Out(I′ @s I) = ∅ (8)
Eq. (7) means that all data possibly defined outside of the tile strip – the remote
accesses Ra(I) – have to be loaded before I. Eq. (8) means that data possibly
defined earlier in the tile strip should not be loaded, as this could overwrite some
valid data. Eq. (9) below gives a non-recursive definition of Load(I), simpler
(and more usable) than the formula of [3, Thm. 6] (although it is equivalent):
Load(I) = RaI ∩ ((In′ ∪Out)(I) \ (In′ ∪Out)(I′ @s I)) (9)
where RaI denotes all remote accesses for the tile strip w.r.t. I, i.e., the union
of all Ra(I′), as defined in Eq. (7), for all I′ that belong to the same tiling as I.
Inria
Data-Reuse Optimizations for Tiling with Parametric Sizes 17
The mechanism of Eq. (9) is actually simple: unlike for the exact case, a remote
access live-in for I (i.e., in In
′
(I)) cannot be loaded just before I if it may be
written earlier (i.e., in Out(I′ @s I)). Otherwise, the load will erase the right
value if, at runtime, it was indeed written earlier. Instead, the trick is to load
the element before the first tile I′ that may write it. This way, either the value is
defined locally and the read in I gets this value, or it is not defined and the read
gets the original value. Thm. 3 (proof in the appendix) states more formally the
correctness and exactness of Eq. (9). Then, any over-approximation Load(I) of
this “exact” Load(I) can be used (even if it may induce some useless loads) as
long as it still satisfies Load(I) ∩Out(I′ @s I) = ∅, as required by Eq. (8).
Theorem 3. Eq. (9) defines valid loads, which are “exact” w.r.t. the In
′
, Out,
and Out sets (no useless or redundant loads) and performed as late as possible.
We write ∆F the function defined from F by ∆F (I) = F (I)\F (I′ @s I). Then,
with F = In
′ ∪Out, we get Load(J) = RaI ∩∆F (J) for all J aligned with I.
Parametric Case. Our goal is to reformulate Eq. (5) and (9) so that the Store
and Load sets can be computed with the tile sizes s as parameter. Can we just
replace the order vs by s as in the exact case (Section 3.1)? No. Doing so may,
in general, be incorrect, resulting in missing loads or stores for I, if subtracting
the contribution of unaligned tiles (i.e., those that will not be executed) remove
additional elements. This is where pointwise functions come, again, into play.
The easy case is when approximations are at the level of iterations, i.e., the
accesses of each iteration i are approximated with write(i) ⊆ write(i) ⊆ write(i)
and read(i) ⊆ read(i), resulting in pointwise functions Write, Write, and Read.
If the sets Out, In, then Store are derived from Write and Read with no further
approximation, then, as for the exact case, Out and In
′ ∪Out are pointwise too.
Thus, a Store(I) can be computed with Eq. (5), in a parametric way, with s
instead of As. The same is true for the central part of Load(I) in Eq. (9) with ≺s
instead of @s. It remains to compute RaI from Ra(I) = In
′
(I) \Out(I′ @s I).
As the tiles in L cover the whole iteration space, RaI is the set of all data that
are maybe read (or written for stores) and possibly not written before, i.e., live-
in for the tile strip, for the schedule induced by the tiling aligned with I. But
if the mapping θ used for tiling was considered legal with the same pointwise
approximation of reads and writes, then any shifted tiling (with standard validity
conditions) preserves anti, flow, and output dependences, thus RaI does not
depend on I. It is even equal to the live-in data for the tile strip when considering
the original order of the code and, thus, can be computed, independently on s.
The previous approach can be used when Load/Store sets are computed
“exactly” but from a pointwise approximation of accesses. We now consider the
case where, in addition to this pointwise approximation, even the sets Out, In,
Store, and Load can be over-approximated further, for whatever reason. For
example, Store(I) can contain data that are not even in Out or In, and thus not
remote in the strict sense. However, transfers still need to be correct. We first
RR n° 8671
18 Darte & Isoard
consider how to handle Out in Eq. (5) and In
′ ∪Out in Eq. (9), which, a priori,
have no reason to be pointwise. We deal with the computation of RaI later.
We first mention an interesting intermediate situation that works with no
further difficulties, even if the approximations are not pointwise. If a pointwise
function F is over-approximated through its domain (the iterations) instead of
its range (the data), i.e., F (I) = F (I) with I ⊆ I, then it may be the case that,
when computing the unions (either with @s or ≺s), no new iterations are added
with the approximated domains. This is what happens with the approximated
“double-squares” of Fig. 6, typical from parallel tiles. Then F (I′ @s I) equals:⋃
I′@sI
⋃
i∈I′
f(i) =
⋃
I′@sI
⋃
i∈I′
f(i) =
⋃
I′≺sI
⋃
i∈I′
f(i) =
⋃
I′≺sI
⋃
i∈I′
f(i) = F (I′ ≺s I)
In this case, even without pointwise functions, parametric approximations can be
designed, with a careful analysis of the “shape” (the sets I) of approximations.
But, this situation does not cover the case where approximations are made in
the range of F and cannot be converted into approximations in the domain of F ,
as it is the case for pointwise functions. We now address this general case.
The key point for approximation is that loading earlier and storing later
always keeps correctness. As noticed earlier, Load(I) has the form RaI ∩∆F (I)
with ∆F (I) = F (I) \ F (I′ @s I), thus ∆F (I′ vs I) = F (I′ vs I). If we
define F ◦ pointwise such that F ⊆ F ◦, then ∆F (I′ vs I) ⊆ ∆F ◦(I′ vs I),
i.e., possibly more data are loaded (but no load is delayed), thus the validity
condition of Eq. (7) is satisfied with RaI ∩∆F ◦. The same is true for Store(I)
with ws: possibly more data are stored but no store is advanced. Finally,
Eq. (8) is satisfied too as Out(I′ @s I) ⊆ F (I′ @s I) ⊆ F ◦(I′ @s I), which is
subtracted in ∆F ◦. Thus, such an over-approximation mechanism (making F
bigger) is always valid.
Thm. 4 below shows how to build such a function F ◦ with the additional
property that loads in ∆F that correspond to “pointwise loads” are still loaded
for the same tile with ∆F ◦, i.e., not earlier (thus with no lifetime increase).
Indeed, the goal is to try to avoid the naive solution where all data are loaded
(resp. stored) before (resp. after) the whole computation of the tile strip.
Theorem 4. Let C be the set of all tiles of size s and F : C → P(B). Define F ◦
by F ◦(I) = ∪J, I∈JF (J), where I ∈ J means that I is in the tile with origin J .
Then F ⊆ F ◦ and F ◦ is pointwise. Moreover, if y is such that ∀I, y ∈ F (I)⇒
y ∈ F◦(I) (F◦ is defined in Thm. 2), then ∀I, y ∈ ∆F ◦(I) ⇒ y ∈ ∆F (I), i.e.,
over-approximating F by F ◦ does not load “pointwise” elements earlier.
The same technique can be used for Store(I) but with an expression such as
F ◦(I) = ∪J,J∈IF (J). It remains to see what to do with the set RaI . We can
compute, with s as parameter, Ra(I) = In(I) \Out(I′ ≺s I), thus replacing @s
by ≺s. We get a priori a smaller set, which could be problematic because of
the intersection in Eq. (9). However, it is still correct and, actually, even more
precise. Indeed, as Out is exact, we have In
′
(I) \ Out(I′ @s I) = In′(I) \
Out(I′ ≺s I) and what is actually important in Eq. (7) is that this set is indeed
Inria
Data-Reuse Optimizations for Tiling with Parametric Sizes 19
loaded. Thus, considering Ra(I) = In(I) \ Out(I′ ≺s I) in Eq. (7) is fine as it
is a superset. Finally, to compute RaI =
⋃
J,J−I∈LRa(J), we drop the lattice
constraint. If Ra is not pointwise, we get a possibly larger set: this is suboptimal,
but correct.
This completes the theory for parametric tiling with inter-tile reuse and
approximations. In practice, it needs to be adapted to each approximation
scheme but it still provides some general mathematical means to reason on the
correctness of approximations for parametric tiling. A possible approximation
(to reduce complexity) consists in removing, in all intermediate computations
such as Out, Store, In′, all existential variables (projection) and to manipu-
late only integer points in polyhedra. Another possibility is to rely on array
region analysis techniques [10]. This is left for future work. We point out
however that generalizing such a parametric inter-tile reuse to more general
tilings, where tiles (rectangular or not) are not executed following the axes that
define them, will be more difficult if the iteration space covered by tiles that
“happen before” a given tile cannot be defined by a piece-wise affine relation.
One can still define approximations, even not necessarily pointwise, as long as
(In
′ ∪ Out)(I′ ≺s I) = (In′ ∪ Out)(I′ @s I) (and similar equalities), as illus-
trated with the ”double-squares” of Fig. 6. However such approximations are
more difficult to define systematically and may require unacceptable (i.e., too
rough) additional over-approximations.
4 Next Step: Deriving Local Memory Sizes
One of the interests of computing the Load/Store sets in a parametric fashion
is that, now, the size of the resulting local memory (e.g., obtained by bounding
boxes or lattice-based array contraction [13]) can also be computed in a para-
metric fashion. Such a parametric scheme seems almost mandatory in a context
such as described in [4,29], for HLS from C to FPGA. Indeed, as explained
in [4], some manual (though systematic) changes must be done to the tiled code
so that it is accepted by the HLS tool. Doing these changes for all interesting
tile sizes is not reasonable. Also, as explained in [29], identifying the right tile
sizes may require executions of multiple scenarios. Parametric code generation
would help speeding up such a design space exploration. With this parametric
inter-tile reuse, combined with parametric code generation [31] and buffer siz-
ing [1], one should be able to derive a fully automatic scheme, with parametric
tile sizes. This also makes the design and use of analytical cost models possible,
in particular to explore hierarchical tiling, which impacts the local memory size.
To illustrate such applications, we extended the buffer sizing of [1] – which
requires lifetime information of array elements to use memory reuse for ar-
ray contraction – to the case where s is a parameter, and for partial orders
of computations, e.g., those expressing pipeline executions. As for inter-tile
reuse, we consider all tiles, not just those aligned w.r.t. a given lattice. Again,
one can make sure that no rough approximation is performed that would re-
sult in an over-estimated memory size. These results are out of the scope
RR n° 8671
20 Darte & Isoard
of this paper. We only report here some examples, for two schedules, as il-
lustration. The first schedule performs all computations in sequence: tiles
are serialized and each tile performs its loads, then its computations, then its
stores before a new tile is computed. The second one is a double-buffering-
style schedule (in each tile strip) defined with the following precedences: a) if
I1, I2, I3 are three successive tiles for vs, transfer requests are serialized as
Load(I2)→ Store(I1)→ Load(I3)→ Store(I2)→ . . ., b) tile computations are
done sequentially following vs, and c) each tile I loads its set Load(I), then
computes, then stores its set Store(I). All other overlappings (in particular par-
allelism between computations and transfers) can arise at runtime, achieving a
kind of double-buffering-style computation.
Example (cont’d) The jacobi_1d_imper code of Fig. 1 has two parameters N
and M defining the loop bounds. The proposed tiling has also two tile size
parameters s1 and s2. There could be a 5th parameter to specify each tile strip,
but we chose to derive mappings valid for all tile strips (as for all examples
hereafter). After Load/Store analysis and memory folding with modulos, we get
(after simplification) to following sizes for A and B, for the sequential schedule:
– size(B) = min(N − 2, 2M + s2 − 1, 2s1 + s2 − 1).
– size(A) = min(N, 2M + s2, 2s1 + s2).
and, with the pipeline schedule:
– size(B) = min(N − 2, 2M + 2s2 − 2, 2s1 + 2s2 − 2).
– size(A) = min(N, 2M + 2s2, 2s1 + 2s2).
These expressions are actually expressed as disjunctions, each term that con-
tributes to the minimum being specified by conditions on parameters. One can
also of course easily retrieve (this time in a parametric fashion) the expression
of the memory size for the product of 2 polynomials analyzed in [4]. 
We are currently working on an automated implementation of the described
algorithm with isl, with an integration into PPCG [35], an optimizer for GPUs.
For the moment, we manually adapted an iscc script for each PolyBench [30]
example. The results are given in Table 1 (see appendix). The transforma-
tions θ were given by the isl scheduler, which gives results similar to those of
Pluto [28]. We tiled the largest consecutive tilable dimensions (underlined in
Table 1) for which dependences are nonnegative. Some examples were omitted,
either because the schedule provided by isl did not exhibit any “tileability”4
– at least without preliminary transformations such as array expansion –, or
simply because they had too many instructions5 or variables6 and will not fit in
the table anyway (these examples were not tried: they may – but maybe not –
reveal complexity issues, which will be explored with the automatic implemen-
tation in isl, as well as different approximation schemes). Moreover, in Table 1,
4 Kernels durbin, ludcmp, cholesky, and symm
5 Kernels adi, fdtd-apml, gramschmidt, 2mm, 3mm, correlation, and covariance
6 Kernels bicg, gemver, and gesummv
Inria
Data-Reuse Optimizations for Tiling with Parametric Sizes 21
parameters were restricted so that each kernel domain contains at least one strip
with at least two consecutive full tiles, and tile sizes are at least 2: this avoids
many special cases (their generation is possible however) that, again, would not
fit in the table.
The results shown in Table 1 are the array sizes after memory folding. We
computed a memory allocation compatible for all tile strips, depending on the
program parameters and the counters of the loops surrounding the tiled loops.
Another choice could have been to compute a memory allocation depending
on the strip, potentially saving space for boundary strips. The memory size
was computed for both sequential and pipelined (double buffering) execution
with inter-tile data reuse, using the successive modulo approach of Lefebvre and
Feautrier [26]. We are still working on the approximations, not provided in the
table, as well as on techniques to speed-up and simplify both the expressions of
intermediate sets such as In
′
and the final ones such as Load and memory sizes.
Double buffering, as expected, usually doubles the local memory size in terms
of the innermost tile size. Some arrays require almost all data to be live during
a strip, thus causing the whole array to be stored into local memory (e.g., x in
trisolv). Furthermore, modulo allocation has limitations. It is really apparent
on floyd warshall where memory conflicts are spread in such a way that only
a modulo bigger than k + 1 and n− k on both dimensions is valid. Thus, while
the number of conflicting memory addresses is proportional to the tile area, the
allocation is not. A tighter memory allocation could be obtained with a piece-
wise modulo allocation scheme, allocating accesses to path[i, k] and path[k, j]
differently from the accesses to path[i, j]. More generally, it is more likely that
automating such schemes, with pipelining, parallelism, and hierarchical trans-
fers, will require more advanced communication and allocation strategies.
5 Conclusion
This work provides the first parametric solution for generating memory trans-
fers with data reuse when a kernel is offloaded to a distant accelerator, tile by
tile after loop tiling, and when all intermediate results are stored locally on the
accelerator. In this case, when a value has been loaded or defined in a previous
tile, it is read from the local memory and not loaded from the remote memory,
which is not yet up-to-date. Our solution is parametric in the sense that we
can derive the copy-in/copy-out sets for each tile, exploiting both intra- and
inter-tile data reuse, with tile sizes as parameters. Such a result is quite sur-
prising as parametric tiling is often considered as necessarily involving quadratic
constraints, i.e., not analyzable within the polyhedral model. We solve it in
an affine way with a different reasoning that considers, in the analysis, all (un-
aligned) possible tiles obtained by translation and not just the tiles of a given
tiling. A similar technique can be used to parameterize the computations of
local memory sizes, thanks to parametric lifetime analysis and array contraction
with parametric modulos (or bounded boxes), even for pipeline schedules similar
to double buffering.
RR n° 8671
22 Darte & Isoard
This reasoning can also be extended in the case of approximations, which are
needed when dealing with kernels that are not fully affine, or because approxi-
mations of communications are desired for code simplicity, complexity issues, or
architectural constraints (e.g., vector communication). The main difficulty with
approximation is that, when some data can be both read and written, loading
blindly from remote memory, in an over-approximate way, is not safe as it may
not be up-to-date. We address the problem thanks to the introduction of the
concept of pointwise functions, well suited to deal with unaligned tiles. This con-
cept may be useful for other applications linked to extensions of the polyhedral
model as it turns out to be fairly powerful. For the moment, our study provides
the mathematical foundations to discuss the correctness of approximation tech-
niques that still need to be designed, even if some simple schemes are already
possible. The full implementation, from the analysis down to code generation,
is still a development challenge. Full experiments will be needed to validate the
approach and help designing cost models for tile size selection. Nevertheless, the
different performance studies with inter-tile data reuse for GPUs [17,18,35] or
FPGAs [4,29], for non-parametric tile sizes, already demonstrate its interest.
“Guessing” the right size of the tiles can be laborious, especially when deal-
ing with multi-level tiling and multi-level caches. The search space can become
so wide that even iterative compilation might not be sufficient. As said, our
parametric technique provides a direct expression of the copy-in/copy-out sets
for each tile, and can then be used for performing array contraction on the accel-
erator still in a parametric fashion. It is only with such a parametric description
that we can hope to design cost models for compile-time tile size selection in the
context of tiling with inter-tile data reuse. Such static compilation techniques
could then be integrated on top of intermediate languages such as OpenACC
or OpenCL, or directly generate lower-level code, providing an automatic way
to derive blocking algorithms for accelerators. Other applications are certainly
possible, as soon as data reuse among tiles or pages has to be analyzed.
Acknowledgements We thank Sven Verdoolaege for his help in using isl and
iscc as well as for his suggestion that set differences and relations could solve the
non-parametric problem as efficiently as linear programming optimizations [4].
References
1. Christophe Alias, Fabrice Baray, and Alain Darte. Bee+Cl@k: An implementation
of lattice-based array contraction in the source-to-source translator Rose. In ACM
SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded
Systems (LCTES’07), San Diego, June 2007.
2. Christophe Alias, Alain Darte, and Alexandru Plesco. Optimizing DDR-SDRAM
communications at C-level for automatically-generated hardware accelerators. An
experience with the Altera C2H HLS tool. In 21st IEEE Int. Conf. on Application-
specific Systems, Architectures and Processors (ASAP’10), pages 329–332, Rennes,
July 2010. IEEE Computer Society.
Inria
Data-Reuse Optimizations for Tiling with Parametric Sizes 23
3. Christophe Alias, Alain Darte, and Alexandru Plesco. Kernel oﬄoading with op-
timized remote accesses. Technical Report RR-7697, Inria, July 2011.
4. Christophe Alias, Alain Darte, and Alexandru Plesco. Optimizing remote accesses
for oﬄoaded kernels: Application to HLS for FPGA. In Design, Automation and
Test in Europe (DATE’13), pages 575–580, Grenoble, March 2013.
5. Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, J. Ra-
manujam, Atanas Rountev, and P. Sadayappan. Automatic data movement and
computation mapping for multi-level parallel architectures with explicitly managed
memories. In 13th ACM SIGPLAN Symp. on Principles and Practice of Parallel
Programming (PPoPP’08), pages 1–10, 2008.
6. Muthu Manikandan Baskaran, Nicolas Vasilache, Benoˆıt Meister, and Richard
Lethin. Automatic communication optimizations through memory reuse strate-
gies. In 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming (PPoPP’12), pages 277–278, New Orleans, February 2012.
7. Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan. A practi-
cal automatic polyhedral parallelizer and locality optimizer. In ACM International
Conference on Programming Languages Design and Implementation (PLDI’08),
pages 101–113, Tucson, June 2008.
8. Srinivas Boppu, Franck Hannig, and Ju¨rgen Teich. Loop program mapping
and compact code generation for programmable hardware accelerators. In 24th
Int. Conference on Application-Specific Systems, Architectures and Processors
(ASAP’13), pages 10–17, Washington, DC, June 2013.
9. Mathias Bourgoin, Emmanuel Chailloux, and Jean Luc Lamotte. Efficient abstrac-
tions for GPGPU programming. International Journal of Parallel Programming,
42(4):583–600, 2014.
10. Be´atrice Creusillet and Franc¸ois Irigoin. Interprocedural array region analyses.
In International Workshop on Languages and Compilers for Parallel Computing
(LCPC’96), volume 1033 of LNCS, pages 46–60. Springer, 1996.
11. Alain Darte and Alexandre Isoard. Parametric tiling with inter-tile data reuse. In
Sanjay Rajopadhye and Sven Verdoolaege, editors, 4th International Workshop on
Polyhedral Compilation Techniques (IMPACT’14), Vienna, Austria, January 2014.
12. Alain Darte and Alexandre Isoard. Exact and Approximated Data-Reuse Opti-
mizations for Tiling with Parametric Sizes. In 24th International Conference on
Compiler Construction (CC’15), part of ETAPS’15, London, United Kingdom,
April 2015.
13. Alain Darte, Robert Schreiber, and Gilles Villard. Lattice-based memory alloca-
tion. IEEE Transactions on Computers, 54(10):1242–1257, October 2005.
14. Paul Feautrier. Parametric integer programming. RAIRO Recherche
Ope´rationnelle, 22(3):243–268, 1988. Corresponding software tool PIP: http:
//www.piplib.org/.
15. Paul Feautrier and Christian Lengauer. The polyhedron model. In David Padua,
editor, Encyclopedia of Parallel Programming. Springer, 2011.
16. Georgios I. Goumas, Maria Athanasaki, and Nectarios Koziris. An efficient code
generation technique for tiled iteration spaces. IEEE Transactions on Parallel and
Distributed Systems, 14(10):1021–1034, 2003.
17. Armin Gro¨ßlinger. Precise management of scratchpad memories for localising ar-
ray accesses in scientific codes. In 18th International Conference on Compiler
Construction (CC’09), pages 236–250, 2009.
18. Serge Guelton, Mehdi Amini, and Be´atrice Creusillet. Beyond do loops: Data
transfer generation with convex array regions. In Hironori Kasahara and Keiji
RR n° 8671
24 Darte & Isoard
Kimura, editors, International Workshop on Languages and Compilers for Parallel
Computing (LCPC’13), volume 7760 of LNCS, pages 249–263. Springer, 2013.
19. Serge Guelton, Ronan Keryell, and Franc¸ois Irigoin. Compilation pour cible
he´te´roge`nes: automatisation des analyses, transformations et de´cisions ne´cessaires.
In 20e`me Rencontres Franc¸aises du Paralle´lisme (Renpar’11), Saint Malo, France,
May 2011.
20. Albert Hartono, Muthu Manikandan Baskaran, J. Ramanujam, and P. Sadayap-
pan. DynTile: Parametric tiled loop generation for parallel execution on multicore
processors. In 24th IEEE International Symposium on Parallel and Distributed
Processing (IPDPS’10), pages 1–12, Atlanta, April 2010.
21. Franc¸ois Irigoin and Re´mi Triolet. Supernode partitioning. In 15th
ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
(POPL’88), pages 319–329, San Diego, California, 1988. ACM.
22. Ilya Issenin, Erik Borckmeyer, Miguel Miranda, and Nikil Dutt. DRDU: A data
reuse analysis technique for efficient scratch-pad memory management. ACM
Transactions on Design Automation of Electronics Systems (ACM TODAES),
12(2), April 2007. Article 15.
23. Mahmut Kandemir, Ismail Kadayif, Alok Choudhary, J. Ramanujam, and Ibrahim
Kolcu. Compiler-directed scratch pad memory optimization for embedded multi-
processors. IEEE Transactions on VLSI Systems, 12(3):281–287, March 2004.
24. Jungwon Kim, Honggyu Kim, Joo Hwan Lee, and Jaejin Lee. Achieving a single
compute device image in OpenCL for multiple GPUs. In 16th ACM Symposium
on Principles and Practice of Parallel Programming (PPoPP’11), pages 277–288.
ACM, 2011.
25. Seyong Lee and Rudolf Eigenmann. OpenMPC: Extended OpenMP programming
and tuning for GPUs. In ACM/IEEE International Conference for High Perfor-
mance Computing, Networking, Storage and Analysis (SC’10), pages 1–11. IEEE
Computer Society, 2010.
26. Vincent Lefebvre and Paul Feautrier. Automatic storage management for parallel
programs. Parallel Computing, 24:649–671, 1998.
27. Sreepathi Pai, R. Govindarajan, and Matthew J. Thazhuthaveetil. Fast and effi-
cient automatic memory management for GPUs using compiler-assisted runtime
coherence scheme. In 21st International Conference on Parallel Architectures and
Compilation Techniques (PACT’12), pages 33–42. ACM, 2012.
28. PLUTO: An automatic polyhedral parallelizer and locality optimizer for multi-
cores. http://pluto-compiler.sourceforge.net.
29. Louis-Noe¨l Pouchet, Peng Zhang, P. Sadayappan, and Jason Cong. Polyhedral-
based data reuse optimization for configurable computing. In ACM/SIGDA Inter-
national Symposium on Field Programmable Gate Arrays (FPGA’13), pages 29–38.
ACM, 2013.
30. Louis-Noe¨l Pouchet. PolyBench/C, the polyhedral benchmark suite. http:
//sourceforge.net/projects/polybench/.
31. Lakshminarayanan Renganarayanan, DaeGon Kim, Sanjay V. Rajopadhye, and
Michelle Mills Strout. Parameterized tiled loops for free. In ACM SIGPLAN 2007
Conference on Programming Language Design and Implementation (PLDI’07),
pages 405–414, San Diego, June 2007.
32. RamakrishnaUpadrasta andAlbert Cohen. Sub-polyhedral scheduling using (unit-)
two-variable-per-inequality polyhedra. In The 40th Annual ACM SIGPLAN-
SIGACT Symposium on Principles of Programming Languages (POPL’13), pages
483–496, Roma, Italy, January 2013.
Inria
Data-Reuse Optimizations for Tiling with Parametric Sizes 25
33. Sven Verdoolaege. isl: An integer set library for the polyhedral model. In Komei
Fukuda, Joris Hoeven, Michael Joswig, and Nobuki Takayama, editors, Mathemat-
ical Software - ICMS 2010, volume 6327 of LNCS, pages 299–302. Springer, 2010.
http://freecode.com/projects/isl/.
34. Sven Verdoolaege. Counting affine calculator and applications. In 1st Interna-
tional Workshop on Polyhedral Compilation Techniques (IMPACT’11), Chamonix,
France, April 2011.
35. Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, Jose´ Ignacio Go´mez, Chris-
tian Tenllado, and Francky Catthoor. Polyhedral parallel code generation for
CUDA. ACM Transactions on Architecture and Code Optimization (TACO),
9(4):54, 2013.
36. Michael Wolf and Monica Lam. A data locality optimizing algorithm. In ACM
SIGPLAN Conference on Programming Language Design and Implementation
(PLDI’91), pages 30–44, Toronto, Ontario, Canada, 1991. ACM.
37. Jingling Xue. On tiling as a loop transformation. Parallel Processing Letters,
7(4):409–424, 1997.
38. Jingling Xue. Loop Tiling for Parallelism. Kluwer Academic Publishers, 2000.
RR n° 8671
26 Darte & Isoard
A Proofs for Pointwise Functions
We first recall the definition of pointwise functions.
Definition 1. Let A and B be two sets, C ⊆ P(A). The function F : C → P(B)
is pointwise iff there exists f : A → P(B) such that ∀X ∈ C, F (X) =
⋃
x∈X
f(x).
Thus, F is pointwise if the image of any set where F is defined can be summa-
rized by the contributions (through f) of the points it contains. We first prove
Theorem 2, then Theorem 1 (theorems are presented in the opposite order in the
text). If F and G are from C to P(B), we write F ⊆ G if ∀X ∈ C, F (X) ⊆ G(X).
Theorem 2. For F : C ⊆ P(A) → P(B), let F◦ be the pointwise function
defined from f◦(x) =
⋂
Y ∈C, x∈Y F (Y ). Then F◦ is the largest pointwise under-
approximation of F , i.e., F◦ ⊆ F and, if F ′ is pointwise, F ′ ⊆ F ⇒ F ′ ⊆ F◦.
In particular, F is pointwise if and only if F = F◦.
Proof. Let X ∈ C and y ∈ F◦(X) = ∪x∈Xf◦(x): ∃xy ∈ X such that y ∈ f◦(xy).
With Y = X in the definition of f◦, we get f◦(xy) ⊆ F (X), thus y ∈ F (X),
and F◦ ⊆ F . If F ′ is pointwise and F ′ ⊆ F , then f ′(x) ∈ F ′(Y ) ⊆ F (Y ) for
all Y ∈ C such that x ∈ Y . Thus f ′(x) ⊆ f◦(x) by definition of f◦. Finally, if
the function F is pointwise, F ⊆ F◦, thus F = F◦ since F◦ ⊆ F . Conversely, if
F = F◦, F is pointwise with f◦. uunionsq
Theorem 1. F : C → P(B) is pointwise if and only if ∀C′ ⊆ C, ∀C′′ ⊆ C,⋃
X∈C′ X =
⋃
X∈C′′ X ⇒
⋃
X∈C′ F (X) =
⋃
X∈C′′ F (X).
Proof. Let A =
⋃
X∈C′ X and B =
⋃
X∈C′′ X. If the function F is pointwise,⋃
X∈C′ F (X) =
⋃
X∈C′
⋃
x∈X f(x) =
⋃
x∈A f(x), and the same for B. Thus, if
A = B, the two unions are equal.
Now suppose that F is not pointwise. Theorem 2 shows that there exist X ∈
C and y ∈ F (X) \ F◦(X), where F◦(X) =
⋃
x∈X
⋂
Y ∈C,x∈Y F (Y ), i.e., ∀x ∈ X,
∃Yx ∈ C such that x ∈ Yx and y /∈ F (Yx). By construction, X ⊆
⋃
x∈X Yx
thus
⋃
x∈X Yx = X ∪ (
⋃
x∈X Yx). But y /∈
⋃
x∈X F (Yx) while y ∈ F (X) thus
y ∈ F (X) ∪ (⋃x∈X F (Yx)), contradiction. uunionsq
We write ∆F the function defined from F by ∆F (I) = F (I) \ F (I′ @s I).
By induction, for all I, ∆F (I′ vs I) = F (I′ vs I) (but the first one is a disjoint
union) and, similarly, ∆F (I′ @s I) = F (I′ @s I). This implies the recursive
relation ∆F (I) = F (I)\∆F (I′ @s I). Also, ∆F (I) = F (I′ vs I)\F (I′ @s I).
Theorem 3. Eq. (9) defines valid loads, which are “exact” w.r.t. the In
′
, Out,
and Out sets (no useless or redundant loads) and performed as late as possible.
Proof. We first prove that the loads are valid. First, Eq. (8) is satisfied since
Out(I′ @s I) is subtracted in Eq. (9). By defining F = In
′ ∪ Out, we get
Load(J) = RaI ∩∆F (J) for all J aligned with I, thus Load(J ′ vs J) = RaI ∩
Inria
Data-Reuse Optimizations for Tiling with Parametric Sizes 27
∆F (J ′ vs J) = RaI ∩ F (J ′ vs J). As Ra(J) ⊆ RaI and Ra(J) ⊆ In′(J) ⊆
F (J), then Ra(J) ⊆ RaI ∩ F (J ′ vs J), thus Eq. (7) is satisfied too. Note that
the intersection with RaI in Load(I) is not needed for correctness but it makes
sure there are no useless loads. Also, Load(J) = RaI ∩ (F (J)\∆F (J ′ @s J)) =
(RaI ∩ F (J)) \ Load(J ′ @s J), thus there are no redundant loads. Finally, if
y ∈ Load(J), either y ∈ In′(J) and y must be loaded before J as it may be read
in J , or y ∈ Out(J) and it cannot be loaded later or it will overwrite the value
possibly written in J . Loads are thus done as late as possible. uunionsq
Theorem 4. Let C be the set of all tiles of size s and F : C → P(B). Define F ◦
by F ◦(I) = ∪J, I∈JF (J), where I ∈ J means that I is in the tile with origin J .
Then F ⊆ F ◦ and F ◦ is pointwise. Moreover, if y is such that ∀I, y ∈ F (I)⇒
y ∈ F◦(I) (F◦ is defined in Thm. 2), then ∀I, y ∈ ∆F ◦(I) ⇒ y ∈ ∆F (I), i.e.,
over-approximating F by F ◦ does not load “pointwise” elements earlier.
Proof. Depending of the context, we use I to represent a point in Zn but also
the tile with origin I. Of course F ⊆ F ◦ since I ∈ I. Now, let f◦ : Zn → P(B)
with f◦(J) = F (J − s + 1): J is the opposite corner in the tile whose origin
is J − s + 1. Then, ∀I ∈ Zn, ∪J∈If◦(J) = ∪J∈IF (J − s + 1). But J ∈ I iff
I ∈ J ′ = J−s+1. Thus, the previous union is equal to ∪J′,I∈J′F (J ′) = F ◦(I),
i.e., F ◦ is pointwise.
Now, suppose that for all I, y ∈ F (I) ⇒ y ∈ F◦(I). If y ∈ F ◦(I′ vs I) =
∪I′vsI ∪J, I′∈J F (J), then y ∈ F (J) for some J and I′ such that I′ vs I,
I′ ∈ J . Thus y ∈ F◦(J) and y ∈ f◦(x) for some x ∈ J because F◦ is pointwise.
Since F◦ ⊆ F and since the union of tiles ∪I′vsI ∪J, I′∈J J spans the same set of
points as the union of tiles ∪I′vsII′, this shows y ∈ F (I′ vs I). Remember that
for any function G, ∆G(I) = G(I′ vs I) \ G(I′ @s I). Thus if y ∈ ∆F ◦(I),
y ∈ F ◦(I′ vs I) \ F ◦(I′ @s I), which implies y ∈ F (I′ vs I) (as we just
showed) and y /∈ F (I′ @s I) (because F ⊆ F ◦). Thus y ∈ ∆F (I). uunionsq
RR n° 8671
28 Darte & Isoard
Sample Schedule Sequential Memory Size Pipelined Memory Size
Stencils
fdtd-2d
S0(t,j)7→(t,t,t+j,0)
S1(t,i,j)7→(t,t+i,t+i+j,1)
S2(t,i,j)7→(t,t+i,t+i+j,3)
S3(t,i,j)7→(t,t+i+1,t+i+j+1,2)
hz[s1+s2,min(s1,s2)+s3]
ex[s1+s2,min(s1,s2)+s3]
ey[s1+s2,min(s1,s2−1)+s3]
_fict_[min(s1,s2)]
hz[s1+s2,min(s1,s2)+2s3]
ex[s1+s2,min(s1,s2)+2s3]
ey[s1+s2,min(s1,s2)+2s3]
_fict_[min(s1,s2)]
jacobi-1d-imper
S0(t,i) 7→(t,2t+i,0)
S1(t,j) 7→(t,2t+j+1,1)
A[2s1+s2]
B[2s1+s2−1]
A[2s1+2s2]
B[2s1+2s2−2]
jacobi-2d-imper
S0(t,i,j)7→(t,2t+i,2t+i+j,0)
S1(t,i,j)7→(t,2t+i+1,2t+i+j+1,1)
A[2s1+s2,min(2s1,s2+1)+s3]
B[2s1+s2−1,min(2s1,s2)+s3−1]
A[2s1+s2,min(2s1,s2+1)+2s3]
B[2s1+s2−1,min(2s1,s2+1)+2s3−2]
seidel-2d S0(t,i,j)7→(t,t+i,2t+i+j) A
[
s1+s2+1,
min(2s1+2,s1+s2,2s2+2)+s3
]
A
[
s1+s2+1,
min(2s1+2,s1+s2,2s2+2)+2s3
]
Medley
floyd-warshall S0(k,i,j)7→(k,i,j) path
[
max(k+1,n−k),
max(k+1,n−k)
]
path
[
max(k+1,n−k),
max(k+1,n−k,2s2)
]
reg-detect
S0(t,j,i,cnt)7→(t,j−i,t+i,t+cnt,2)
S1(t,j,i)7→(t,j−i,t+i,t,4)
S2(t,j,i,cnt)7→(t,j−i,t+i,t+cnt,3)
S3(t,j,i)7→(t,j−i,t+i,len+t,0)
S4(t,i)7→(t,−i,t+i,len+t,5)
S5(t,j,i)7→(t,j−i,t+i,len+t,1)
diff
s1+s2+s3−3,min(s1+s3−2,s2),
min(s1,s3)+s4−1)

path
[
min(s1−1,s4)+s2+s3−1,
min(s1+s3−1,s2,s3+s4)
]
mean
[
s2+s3−1,
min(s2,s3−1)
]
sum_tang
[
s1+s2+s3−2,
min(s1+s3−1,s2)
]
sum_diff
s1+s2+s3−2,min(s1+s3−1,s2),
min(s1,s3)+s4

diff
s1+s2+s3−3,min(s1+s3−2,s2),
min(s1,s3)+s4−1)

path
[
min(s1,2s4)+s2+s3−1,
min(s1+s3,s2,s3+2s4)
]
mean
[
s2+s3−1,
min(s2,s3−1)
]
sum_tang
[
s1+s2+s3−2,
min(s1+s3−1,s2)
]
sum_diff
s1+s2+s3−2,min(s1+s3−1,s2),
min(s1,s3)+s4

Linear algebra solvers
dynprog
S0(iter,i,j)7→(iter,i,0,j,4)
S1(iter,i,j)7→(iter,i,0,j,3)
S2(iter,i,j,k)7→(iter,k,j,i+j,1)
S3(iter,i,j)7→(iter,j,j,i+j,2)
S4(iter)7→(iter,len,len,len,0)
sum_c
min(s1,s2+s3−1),s2+s3−2,
st

W
[
min(s1,s2)+s3−1,
min(s1,s2,s3)
]
c[len−1,len−2]
sum_c
min(s1,s2+2s3−1),s2+2s3−3,
st

W
[
min(s1,s2)+2s3−1,
min(s1,s2,2s3)
]
c[len−1,len−2]
lu
S0(t,i)7→(k,k,j,1)
S1(t,i,j)7→(k,i,j,0)
A[n,n] A[n,n]
Linear algebra kernels
atax
S0(i) 7→(0,i,2)
S1(i) 7→(i,0,0)
S2(i,j) 7→(i,j,1)
S3(i,j) 7→(i,ny+j,3)
A[s1,ny]
x[s2]
y[ny]
tmp[s1]
A[s1,ny]
x[2s2]
y[ny]
tmp[s1]
doitgen
S0(r,q,p)7→(r,q,p,0,0)
S1(r,q,p,s)7→(r,q,p+s,s,1)
S2(r,q,p)7→(r,q,p+np,np,2)
A[s1,s2,np]
sum[s1,s2,s3+s4−1]
C4[s4,s3]
A[s1,s2,np]
sum[s1,s2,s3+2s4−1]
C4[2s4,s3]
gemm
S0(i,j) 7→(i,j,0,0)
S1(i,j,k)7→(i,j,k,1)
A[s1,s3]
B[s3,s2]
C[s1,s2]
A[s1,2s3]
B[2s3,s2]
C[s1,s2]
mvt
S0(i,j) 7→(1,i,j)
S1(i,j) 7→(0,i,j)
for S0 for S1
A[s1,s2] A[s2,s1]
x1[s1] x2[s1]
y_1[s2] y_2[s2]
for S0 for S1
A[s1,2s2] A[2s2,s1]
x1[s1] x2[s1]
y_1[2s2] y_2[2s2]
syr2k
S0(i,j) 7→(i,j,0,0)
S1(i,j,k)7→(i,j,k,1)
S2(i,j,k) 7→(i,j,k,2)
A[ni,s3]
B[ni,s3]
C[s1,s2]
A[ni,2s3]
B[ni,2s3]
C[s1,s2]
syrk
S0(i,j) 7→(i,j,0,0)
S1(i,j,k)7→(i,j,k,1)
A[ni,s3]
C[s1,s2]
A[ni,2s3]
C[s1,s2]
trisolv
S0(i) 7→(0,i,0)
S1(i,j) 7→(j,i,1)
S2(i) 7→(i,i,2)
A[s2,s1]
x[n]
c[s2]
A[2s2,s1]
x[n]
c[2s2]
trmm S0(i,j,k)7→(i,j+k,j)
A[1,min(k,s1+s2−1)]
B
[
max(ni−k,k+1),
min(ni,s1+k,s2+k)
] A[1,min(k,s1+2s2)]
B
[
max(ni−k,k+1),
min(ni,s1+k,2s2+k)
]
Table 1. Memory sizes for different arrays, with sequential & pipelined schedules (PolyBench examples).
Inria
RESEARCH CENTRE
GRENOBLE – RHÔNE-ALPES
Inovallée
655 avenue de l’Europe Montbonnot
38334 Saint Ismier Cedex
Publisher
Inria
Domaine de Voluceau - Rocquencourt
BP 105 - 78153 Le Chesnay Cedex
inria.fr
ISSN 0249-6399
