PCOT: Cache Oblivious Tiling of Polyhedral Programs by Ranasinghe, Waruna et al.
PCOT: Cache Oblivious Tiling of Polyhedral Programs
Waruna Ranasinghe
Nirmal Prajapati
Colorado State University
Department of Computer Science
Fort Collins, CO 80523, USA
Tomofumi Yuki
INRIA
Rennes, France
Sanjay Rajopadhye
Colorado State University
Department of Computer Science
Fort Collins, CO 80523, USA
ABSTRACT
This paper studies two variants of tiling: iteration space tiling (or
loop blocking) and cache oblivious methods that recursively split
the iteration space with divide-and-conquer. The key question to
answer is when we should be using one over the other. The answer
to this question is complicated for modern architecture due to a
number of reasons.
In this paper, we present a detailed empirical study to answer this
question for a range of kernels that fit the polyhedral model. Our
study is based on a generalized cache oblivious code generator that
support this class, which is a superset of those supported by existing
tools. The conclusion is that cache oblivious code is most useful
when the aim is to have reduced off-chip memory accesses, e.g.,
lower energy, albeit certain situations that diminish its effectiveness
exist.
CCS CONCEPTS
• Theory of computation → Shared memory algorithms; •
Computing methodologies→ Shared memory algorithms;
KEYWORDS
Tiling, Cache Oblivious, Polyhedral Model
ACM Reference format:
Waruna Ranasinghe, Nirmal Prajapati, Tomofumi Yuki, and Sanjay Rajopad-
hye. 2018. PCOT: Cache Oblivious Tiling of Polyhedral Programs. In Pro-
ceedings of ACM Conference, Washington, DC, USA, July 2017 (Conference’17),
14 pages.
https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Modern multicore processors are complicated, and programming
them is a challenge, especially when seeking the best performance.
Many different, and often conflicting factors need to be optimized,
notably, parallelism and data locality, and that too, at multiple levels
of the memory/processor hierarchy (vectors-registers cores-caches).
Indeed, in the exascale era, the very notion of “performance” may
refer to execution time (i.e., speed) or to energy (product of the
average power and time), or even the energy-delay product.
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).
Conference’17, July 2017, Washington, DC, USA
© 2018 Copyright held by the owner/author(s).
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM.
https://doi.org/10.1145/nnnnnnn.nnnnnnn
Iteration space tiling [35, 67, 68] (variously called loop block-
ing [56] or partitioning [8, 15, 61]) is a critical transformation, used
for multiple objectives: balancing granularity of communication
to computation across nodes in a distributed machine, improving
data locality on a single node, controlling locality and parallelism
among multiple cores on a node, and, at the finest grain, exploiting
vectorization while avoiding register pressure on a single core. It is
an essential strategy used by compilers and automatic parallelizers.
Another approach to optimizing memory transfers is provided
by cache oblivious algorithms [25, 51]. They come with elegant
theoretical bounds on the number of cache misses, and are claimed
to require less tuning than iteration space tiling. When applied to
specific computation patterns like “stencils,” cache-oblivious the
strategy may also be viewed as tiling, where a tile is recursively split
into smaller ones through the divide-and-conquer strategy [26].
The fundamental difference between the two approaches is the
schedule, or temporal order in which tiles are executed. A program
transformed with (single-level) iteration space tiling visits the tiles
in the lexicographic order. Cache oblivious methods visit (leaf) tiles
in a different order, yielding better locality—this can also be viewed
as hierarchical tiling [9] where the number of levels depends on
the problem size.
The practical impact of this difference is not fully understood
due to many complicating factors. The theoretical results for cache
oblivious methods assume fully associative caches, and therefore
do not fully account for conflict misses. Hardware prefetching plays
an important role in modern architecture, and it also has significant
impact on the choice of tile sizes [48]. Most importantly, exist-
ing tools for cache oblivious code generation are domain specific:
Pochoir targets Jacobi-style stencils [59] and AutoGen specializes
on dynamic programming [11]. This makes an “apples-to-apples”
comparison of the two techniques difficult because of the mismatch
between the classes of programs where the application of the two
techniques are automated.
Our objective is to understand the difference between iteration
space tiling and cache oblivious strategy on modern architecture.
Our study emphasizes the influence of tile sizes (or base case thresh-
olds) that strongly impact the performance of both approaches. To
do this we generalized cache oblivious code generation to all com-
putations that fit the polyhedral model, i.e., (imperfect) loop nests
with affine bounds and array accesses [21–24, 46, 53, 54]. This class
of programs include, as a proper subset, all the inputs handled by
Pochoir and Autogen, and many others such as dense linear algebra
algorithms and Gauss-Seidel style stencils. Our study reveals the
following:
• The absolute number of cache misses, as well as the vari-
ability of this number as a function of tile sizes, are both
ar
X
iv
:1
80
2.
00
16
6v
1 
 [c
s.P
L]
  1
 Fe
b 2
01
8
Conference’17, July 2017, Washington, DC, USA Waruna Ranasinghe, Nirmal Prajapati, Tomofumi Yuki, and Sanjay Rajopadhye
significantly lower with cache oblivious codes. This is con-
sistent with theeory.
• Both approaches show similar variability in speed, implying
that the tuning effort for both methods is similar. The main
reason is that speed does not depend on cache behavior alone.
The leaf tile sizes influence other aspects such as recursion
overhead, register allocation, vectorization, and prefetcher
effectiveness.
• The additional levels of tiling in cache oblivious codes do
not contribute to speed, especially for polyhedral programs.
Once there the code is compute-bound at a certain level of
cache, additional locality in slower caches does not improve
speed.
• The benefit of cache oblivious approaches diminish on ar-
chitectures with low set-associativity of caches, and with
prefetching. The tile execution order of cache oblivious codes
can increase conflict misses with high dimensional data.
Hardware prefetching may favor large tile sizes in the in-
nermost dimension, if speed is the primary objective, where
the leaf tile sizes are already targeting the LLC (Last Level
Cache).
These results are, in part, confirmation of the work by Yotov et
al. [70], who focused on sequential executions of matrix computa-
tions. We extended the study to a broader range of kernels, and to
parallel execution. The main takeaway is that if the main interest is
in speed, then cache oblivious codes are unlikely to help. However,
if the goal is to reduce off-chip memory traffic, e.g., for better en-
ergy efficiency, then cache oblivious codes give this for “free:” no
additional tuning is required beyond that required for speed. Even
then, its effectiveness is constrained by various aspects of the code
and the platform, and care must be taken to select the appropriate
strategy. We provide a generalized cache oblivious code generator
to offer this option to any polyhedral program, and give insights
on when it will be effective.
2 MOTIVATIONS FOR EMPIRICAL STUDY
The key question that we try to answer is when and why we should
use standard iteration space tiling over cache oblivious tiling. The
two approaches perform similar partitioning of the iteration space,
but the schedules given to the partitions are different. Theoretically,
cache oblivious code seems to have advantages over iteration space
tiling. However, many factors complicate the actual performance,
which made our initial experiments difficult to interpret. In this
section, we describe the obstacles between the theory and practice
we have identified.
We use Single-Level Tiling (SLT) for iteration space tiling, and
Cache Oblivious Tiling (COT) for cache oblivious techniques in this
paper, which are further described in Section 3.
Recursion Overhead. This is a well-known overhead of COT [70].
The recursion introduces overheads, such as function call overhead,
and increased register pressure. Furthemore, the functions force
inter-procedural analysis/optimization, known to be more difficult
for compilers well. Thus, the leaf tiles must be “sufficiently large”
to avoid excessive overhead due to the recursion.
Recursive Split Constraints the Tile Sizes. In typical cache oblivi-
ous algorithms, the problem is recursively split into halves in each
dimension. This is in fact a rather coarse-grained exploration of
the hierarchical partitioning of the iteration space. For instance, if
the current problem size is B3, then the next sub-problem would be
(B2 )3. If the best problem size for utilizing a level of cache is (B−x)3
where x ≪ B2 then the subproblems due to divide-and-conquer
will not match the best. This is another factor that necessitates fine
tuning of leaf tile sizes even for COT, since the utilization rate of
L1 cache has strong impact on performance.
COT has more Conflict Misses. The divide-and-conquer execu-
tion order may negatively affect cache interference, especially with
high dimensional data. This happens when the memory is allo-
cated such that the accesses are contiguous along some direction
in the iteration space (typically along innermost canonical axis).
With lexicographic order of execution, this contiguity is largely
preserved in the tiled execution. However, divide-and-conquer exe-
cutes neighboring tiles in all dimensions, and many of those tiles
access some distant location in memory. In contrast to accessing
contiguous regions of memory, accessing various segments of the
memory increases the chances of conflicts.
Hardware Prefetching. Modern architectures are equipped with
hardware prefetchers that can bring data to the L1 cache. When
having sufficient locality at L2 or LLC makes the program compute-
bound, then the latency to L2/LLC can be hidden by the prefetcher.
For such programs, it is unnecessary to tile for the fastest cache, and
larger tiles targeting slower caches improve performance by maxi-
mizing prefetcher effectiveness [48]. When the primary objective
is speed, the leaf tiles for COT should also be large, which negates
the benefit of divide-and-conquer, as the leafs are already targeting
slower caches. Prefetching have little impact on parallel executions,
since prefetching is bandwidth limited. When multiple cores try to
prefetch at the same time, the bandwidth limit is quickly reached,
and the latency hiding effect is lost. Furthermore, smaller tile sizes
are better for parallel execution for load balancing reasons.
These factors limit the effectiveness of COT in various ways and
are also closely tied to the characteristics of the computation. Our
empirical study illustrate the impact of these factors on polyhedral
computations.
3 BACKGROUND
In this section, we present necessary background on tiling, cache
oblivious divide-and-conquer execution, and define the terminology
used in this paper.
3.1 Tiling
Tiling is a well-known loop transformation for partitioning com-
putations into smaller, atomic (all inputs to a tile can be computed
before its execution), units called tiles [35, 67]. The partitioning into
tiles improves data locality by altering the execution order of the
operations. Tiling also exposes coarse-grained parallelism making
it the core transformation for polyhedral automatic parallelizers
such as Pluto [7].
The natural legality condition of tiling is that the dependences
across tiles do not create a cycle. In compilers, this condition is
PCOT: Cache Oblivious Tiling of Polyhedral Programs Conference’17, July 2017, Washington, DC, USA
(a) Single-Level Tiling (b) Two-Level Tiling
Figure 1: Execution order of the smallest tiles under single-
and two-level tiling. Two-level tiling may be viewed as
CacheOblivious Tilingwhere the recursion reached the base
case after one recursive step.With hierarchical tiling, neigh-
boring tiles form a larger tile that increase intra-tile reuse.
typically expressed as fully permutability (i.e., dependences are
non-negative direction vectors), which is a sufficient condition. In
the rest of this paper, we assume that the programs have been
transformed to expose loop nests that satisfy this condition. For
polyhedral programs, scheduling techniques to expose such loop
nests are available [7].
3.2 Cache Oblivious Tiling
Cache oblivious algorithms [25, 51] are based on recursive formu-
lation into smaller subproblems (divide-and-conquer). The main
argument is that as the problem sizes are recursively made smaller,
a subproblem is going to fit on some level of the memory hierarchy
that may be caches, main memory, etc. This class of algorithms is
expected to take advantage of all memory hierarchies through this
strategy.
Cache Oblivious Tiling is a specialization of such algorithms
based on tiling. Tiles after a level of tiling can be tiled again with
smaller tile sizes to realize the divide-and-conquer execution pat-
tern. COT may be viewed as hierarchical tiling, except that the
number of tiling levels are determined at run-time through divide-
and-conquer [9].
The key effect of multi-level tiling is to change the execution
order of the tiles. As illustrated in Figure 1, the grouping of smaller
tiles; forming larger tiles increasing intra-tile reuse; is what ac-
counts for better cache utilization.
3.3 Terminology
We introduce a few terms, in addition to COT in the above.
Single-Level Tiling. SLT is when loop tiling is applied once to
improve data locality with respect to a level of cache. The reason we
restrict to SLT is because SLT and COT expose the same number of
tuning parameters. It is known that even cache oblivious algorithms
require the performance of the leaf subproblems to be tuned to have
good performance. The tuning effort required is similar to SLT with
the same number of tuning parameters, which would not be the
case if you apply multi-level tiling that multiplies the number of
tuning parameters.
Tile Size. We use tile size interchangeably with the base case
threshold (and leaf tile size) in COT. Whenever tile sizes are dis-
cussed for COT, it refers to the size of the leaf tile sizes, which is
its tuning parameter.
Off-Chip Accesses. OCA are all load accesses that read from the
main memory. This includes the accesses due to Last-Level Cache
misses due to load operations, as well as those arising from prefetch-
ing instructions.
4 CODE GENERATION OF PCOT
In this section, we describe our generalization of the cache oblivious
code generation to polyhedal programs: Polyhedral CacheOblivious
Tiling (PCOT).
4.1 Approach Overview
The input is any polyhedral loop nest that is fully permutable and
hence tiling it with hyper-rectangular tiles is a legal transformation.
We first (in Section 4.2) describe the case for tiling all dimensions
of a perfect loop nest. Other cases can be handled with additional
pre-processing, which we describe in Section 4.4.
Figure 2 illustrates the structure of our generated code. The input
loop nest is replaced by a call to start the recursion as shown in
Figure 2a. The computation of the bounding box from loop nests
are discussed in Section 4.5.
4.2 Codegen for Perfect Loop Nests
Given a perfectly nested loop with all d dimensions tilable, we seek
to generate, (i) a call to recursive function to start the recursion, (ii) a
recursive function, and (iii) a base function. The recursive function
takes origin and the size of the orthant as inputs, which are d
dimensional vectors. For the initial call to the recursive function,
the origin and the orthant size correspond to the bounding box of
the input loop nest. Bounding box of a domain is the smallest hyper-
rectangular shaped domain which encloses the given domain.
The recursive function visits the iteration domain in divide-and-
conquer order. It recursively divides iteration domain into orthants
until they are smaller than an input parameter. The call to the base
function is wrapped by a condition to check whether the size of
the current orthant is less than or equal to the base case threshold.
The orthants are visited sequentially in the lexicographic order.
For parallel execution the tasks are executed with wave-front paral-
lelism. We use the OpenMP tasks for parallel execution, where each
recursive function call is annotated with omp task pragma, and the
wave-front time boundaries are synchronized with omp taskwait.
The base function visits all points in the intersection of leaf
orthant and the input loop nest iteratively in lexicographical order.
The loop nest in the base function is identical to the loop nest for a
tile in SLT with parametric tile sizes we use. One can use any one
of the available parametrically tiled code generators [3, 34, 40, 41]
to generate parametrically tiled loops and then extract the point
loops.
Conference’17, July 2017, Washington, DC, USA Waruna Ranasinghe, Nirmal Prajapati, Tomofumi Yuki, and Sanjay Rajopadhye
void Heat2D(int T, int N, int ts[3],
double (*data)[2][N][N]) {
//Start with the initial tile 
//  being the bounding box
int ti[3] = ... //BB lexmin
int tw[3] = ... //BB width
recurHeat2D(T, N, ts, data, ti, tw);
}
void recurHeat2D(int T, int N, int ts[3], double (*data)[2][N][N], int ti[3], int tw[3]) {
//Optimization to avoid empty orthants/tiles
if (checkEmpty(ti, tw)) return;   
//Tile width reached threshold
if (tw[0] <= ts[0] && ...) return baseHeat2D(..., ti, tw);
//Recursive step halves width (same for htw[1] and htw[2])
int htw[3]; htw[0] = (tw[0] > ts[0]) ? tw[0]/2 : 0;
...
//Recurse at each new orthant with half widths
int new_ti0[3] = {ti[0], ti[1], ti[2]};                      
int new_tw0[3] = {htw[0], htw[1], htw[2]};                  recurHeat2D(..,new_ti0,new_tw0);
int new_ti1[3] = {ti[0]+htw[0], ti[1], ti[2]};               
int new_tw1[3] = {tw[0]-htw[0], htw[1], htw[2]};            recurHeat2D(..,new_ti1,new_tw1);
int new_ti2[3] = {ti[0], ti[1]+htw[1], ti[2]};               
int new_tw2[3] = {tw[0], htw[1]-htw[1], htw[2]};            recurHeat2D(..,new_ti2,new_tw2);
...
int new_ti7[3] = {ti[0]+htw[0], ti[1]+htw[1], ti[2]+htw[2]}; 
int new_tw7[3] = {tw[0]-htw[0], tw[1]-htw[1], tw[2]-htw[2]};recurHeat2D(..,new_ti7,new_tw7);
}
void baseHeat2D(...) {
//compute a tile with origin ti
for (t=max(ti[0],…); 
t<min(ti[0]+ts[0],…); t++)
for (i=max(ti[1],…); 
i<min(ti[1]+ts[1],…); i++)   
for (j=max(tj[2],…); 
j<min(ti[2]+ts[2],…); j++)   
data[t%2][i][j] = ...
}
T,N : problem sizes
ts : leaf tile size threshold 
ti : lex. min of the tile (origin)
tw : current width of the tile
(a)
(c) (b)
Figure 2: The structure of generated code for Heat2D stencils, which has loop depth d = 3. (a) The tileable loop nest is replaced
by a call to start the recursion. The input is the bounding box of the loop nest. (b) Structure of the recursive function. The input
bounding box is split into 2d new orthants by dividing each dimention in half, and the same function is recursively called for
each new orthant. When the orthant reaches the input tile size, the recursion is terminated by a call to the base function. (c)
Code for the base function that performs the computation of a tile in lexicographic order.
4.3 Optimizations
We implement a number of optimizations to improve the speed of
code. In this section we explain two important optimizations to exit
the recursion early.
4.3.1 Early Exit for Zero Orthants. The zero orthants surface
when we have tile sizes that are cubic bounding box with hyper-
rectangular tiles and vise versa. In this case, orthant size along
all the dimensions reaches the leaf size in different levels of the
recursion. But still our code generator generates 2d sequence of
recursive calls where size some of the new orthants may be zero
along dimensions where the input orthant size is already smaller
than the leaf threshold. This is a simple optimization where we
check whether a width along anyone of the dimensions of the
orthant is zero. If it is zero then exit the recursion.
4.3.2 Early Exit for Empty Orthants. Second optimization is due
to the fact we visit the bounding box of the input loop nest instead of
the actual domain. Some kernels operate on non-hyper-rectangular
domains(i.e, Cholesky operates on triangular domains) and some
kernels need iteration space skewing to enable hyper-rectangular
tiling (i.e., Heat2D). If the actual domain is not a hyper-rectangle
then the bounding box will have points where no computations are
defined. This may lead to orthants outside of the original iteration
space, analogous to empty tiles in iteration space tiling.
For example, a loop nest whose iteration space is triangular, the
boundingbox is a rectangle where the half of the points has no
computation defined. When we visit this rectangular box in divide-
and-conquer order, we will end up visiting empty orthants and we
want to identify these orthants. We generate conditions to check
whether all vertices of the orthant remain outside of the original
iteration space using isl library [65].
The checkEmpty method at the top of the Figure 2c implements
both optimizations. Without them, the code produces correct an-
swers but visits many base cases which are empty. The recursion
ends either when checkEmpty method returns true or when the
orthant reaches its input tile size.
4.4 Handling Imperfect Loop Nests
The discussion so far assumed perfectly nested loops as inputs.
We now extend this to imperfect loop nests, and loop nests where
subset of the d dimensions are marked as tilable.
The input imperfect loop nests are converted to perfect loop nests
with a pre-processing. This is called the embedding transformation
that are used to handle imperfectly nested loops in parametric tiled
code generation [41]. It involves, bringing all the statements into
the same loop depth by adding loops with one iteration as necessary.
Then affine guards are added to eliminate sequence of inner loops,
which lead to “perfect loop nest with affine guards”.
When a subset of the loops are marked as tilable, we first ex-
tract the marked band of loops parameterized by both program
input parameters as well as untiled outer loop iterators. We apply
the techniques we have discussed so far to generate code for the
extracted loop nest. In this case, the function call to start the re-
cursion is added as the body of outer untiled loop nest. The inner
untiled loop nests are added as the body of the point loop in the
base function.
4.5 Computing the Bounding Box
The bounding box is a hyper-rectangle containing the iteration
space of the loop nest. It is computed by eliminating the outer loop
indices from the loop bound expressions. There can be infinitely
many bounding boxes for a given loop nest, but we start with the
tightest (smallest) bounding box among all the possibilities.
PCOT: Cache Oblivious Tiling of Polyhedral Programs Conference’17, July 2017, Washington, DC, USA
Table 1: Machine Configuration
Architecture Parameters Haswell Broadwell
Processor E3-1231 v3 E5-1650v4
Base Frequency 3.4 GHz 3.6 GHz
Turbo Boost Frequency 3.8 GHz 4 GHz
Number of Cores 4 6
RAM 32 GB 16 GB
L3 Cache 8 MB 16-way 15 MB 20-way
L2 Cache 256 KB per core 8-way
L1 Cache 32 KB per core 8-way
Max Mem.Bandwidth 25.6 GB/s 76.8GB/s
Compiler icc 16.0.2
The bounding box also plays an important role in deciding the
leaf tile size. We want the leaf tile size to have the exact value as the
input tile size parameter to the generated code. Therefore, we pad
the size of bounding box along each dimension to the minimum
value of bi × 2ki ≥ Ni where bi input leaf size parameter, Ni
the size along the ith dimension of bounding box and ki ≥ 0 is
an integer. Now, at ki th level of recursion the orthant size along
the ith dimension will be bi . The padding of the bounding box
introduces iteration points outside of original iteration space. These
empty points get optimized away by the optimizations described
in Section 4.3
5 EMPIRICAL STUDY
In this section, we empirically study different tiling strategies. The
goal of the experiments is to characterize SLT and COT, and to
show the influence of the factors discussed in Section 2.
5.1 Experimentation Setup
For our experiments, we use two CPU architectures: Haswell and
Broadwell. The configuration of these machines are shown in Ta-
ble 1. The benchmarks are compiled with ICC 16.0.2 using the
following flags: -O3 -xHost -ipo. We use perf tool to count off-
chip memory accesses (OCA), which is the total number of LLC
misses. The tool counts the number of OCA for the entire program
that consists of kernel, timing functions and reading program in-
puts. The number of OCA is dominant in the kernel, therefore we
can safely use perf to measure OCA for the kernels.
5.1.1 Code Generators. We experiment with multiple state-of-
the-art code generators. We use D-Tiling [40, 41] implemented in
AlphaZ [71] to generate SLT code with parametric tile sizes. We
compare the performance of parametric SLT and PCOT, which
are main subject of our study. In addition, we compare speed of
PCOT with Pluto [7] for fixed size SLT code, Pochoir [59] for cache
oblivious Jacobi-style stencils, and Autogen [11] for cache oblivious
dynamic programming.
5.1.2 Benchmark Suite. We use kernels from stencils, linear al-
gebra, and dynamic programming for our experiments. Specifically,
we include FDTD-2D, GaussSeidel-2D (GS-2D), Heat-2D and Heat-
3D for stencils; Cholesky decomposition and LU decomposition
(LUD) for linear algebra, and Optimal String Parenthesizing (OSP)
from the dynamic programming. These kernels are selected to cover
the types of kernels handled by existing tools, as well as additional
polyhedral kernels that can now be supported with the PCOT gen-
erator. All the benchmarks are from PolyBench/C 1 version 4.2
except for OSP. We include OSP, since it was used to evaluate Au-
togen [11] and we obtained OSP code from authors of Autogen. All
benchmarks are scheduled using the Pluto algorithm, and use the
same memory allocation as PolyBench kernels.
Pluto algorithm is unable to find tiling hyperplanes for all three
dimensions for OSP. However, there is a well-known transforma-
tion that reorders the summation described by Guibas et al. [29]
to transform the dependences to be tileable. We apply this trans-
formation as a pre-processing to tile all three dimensions of OSP.
Autogen is able to tile all three dimensions automatically, making
this necessary to make a fair comparison of the tile execution order.
5.1.3 Problem Size Selection. We select the problem size of each
benchmark such that the memory footprint of a wavefront of tiles,
i.e., the set of tiles that may be executed in parallel with single-level
tiling code, does not fit in LLC. Therefore, during the execution
there will be off-chip memory accesses. We do not use problem
sizes that are powers of 2 since they exacerbate the occurrences of
conflict misses unless padded. Table 2 shows the chosen problem
sizes for all the benchmarks.
5.1.4 Consistent Optimization Level Across Codegen Techniques.
We want to ensure that the performance of each tile is similar
across all techniques to focus on the difference due to tile execution
order. Autogen and Pochoir apply some low-level optimizations not
supported by the PCOT generator and Pluto. These optimizations
were applied manually to SLT, PCOT, and Pluto generated code.
These optimizations are:
• Copy optimization: copying transpose of a column-major
input matrix inside a base function to a local array, so that it
can be accessed in unit stride, and
• Write optimization: if there is an accumulation in the inner-
most loop, first perform the accumulation on a local variable
and then update the array at the end.
• Indexing Simplification: array accesses are simplified by use
of pointer arithmetic instead of accesses tomulti-dimensional
arrays.
In addition we used the restrict keyword and additional compiler
options to ensure that parallel innermost loops are vectorized by
ICC.
5.1.5 Tile Size Exploration. In our study, we explore the differ-
ence between SLT and COT. An important aspect of this study is the
influence of tile sizes, which have large impact on both approaches.
We are not interested in the “best” tile size in this paper, but in the
characterization of the effect of tile sizes. Therefore, our main focus
is in the statistical property over a same set of tile sizes for both
tiling schemes.
We explore a set of tile sizes ranges from 10 to 50 in steps of
10, 100 to 500 in steps of 100 in all the dimensions, and 1000 to
problem-size in steps of 1000 along innermost dimension of a tile.
For Heat-3D, tile sizes 10, 30, 50, 100 and 300 were explored along
1Available online at https://sourceforge.net/projects/polybench/
2OSP performance numbers are in GUpdateS
Conference’17, July 2017, Washington, DC, USA Waruna Ranasinghe, Nirmal Prajapati, Tomofumi Yuki, and Sanjay Rajopadhye
Table 2: Benchmarks, problem sizes and achieved perfor-
mance in GFLOPs per second for the sequential single-level
tiling code on both architectures.
Benchmark Problem size SLT-seq GFLOPSHaswell Broadwell
Cholesky 40002 12.3 11.7
LUD 30002 4.1 4.1
FDTD-2D 500 × 20002 6.4 5.6
GS-2D 500 × 20002 1.9 2.7
Heat-2D 500 × 30002 20.5 20.4
Heat-3D 50 × 3503 16.5 17.6
OSP2 40002 21.0 22.8
all the dimensions. These ranges of tile sizes give a relatively coarse-
grained sampling of possible tile sizes that fits in different levels of
cache. Exhaustively searching the entire space of legal tile sizes is
impractical due to time limitations. For each tile size, we use the
mean over three runs, which had low variance across runs.
We do not explore tile sizes on both Pluto and Pochoir. It is
impossible to automatically explore tile sizes for Pluto since it
performs fixed sized tiling, and we had to manually modify the
output to make the performance of tiles consistent with others. The
tile sizes in Pochoir are hardwired into the code generator, and we
could not find an easy way to explore the tile sizes.
5.2 Characterization of SLT and COT
Figures 3, and 4 summarize the result of our study in the two
platforms.
We first discuss the variability of OCA with respect to tile size.
COT has much smaller number of OCA and variability compared to
SLT when we restrict to smaller tile sizes (Figures 3a and 3c). This
confirms the benefits of COT as it automatically takes advantage
of the multiple levels of memory hierarchy, owing to the recursive
structure of COT.
Heat-3D does not exhibit the expected behavior since COT intro-
duces more conflict misses for higher dimensional data as explained
below. The theoretical results of COT were derived with the as-
sumption of fully associative caches where a cache line can be
stored at any location in the cache. Therefore, there will not be any
occurrences of conflict misses but capacity misses. In reality, caches
are n-way (n=16 and 20 for the cpus used in our experiments) asso-
ciative where there are n candidate locations in the cache to store a
cache line. This leads to conflict misses. Let us consider sequential
execution of a d-dimensional stencil where all d+1 (including time
dimension) dimensions are tiled. A tile uses d of the faces as inputs
and generates d faces of outputs. Let us assume that the tile sizes
are such that data correspond to all d output faces fit in LLC. SLT
executes tiles in lexicographic order. Therefore, lexicographically
next tile will read one of the input faces through LLC. But all other
d-1 input faces were produced long time ago, therefore data is no
longer in LLC due to potential capacity misses. Hence, conflict
misses can potentially occur when the input face is read through
LLC and when accessing data while computing values within the
tile. COT visits the neighbouring tiles first. Therefore, a tile reads
more than one (ideally d) input faces through LLC. Hence, con-
flict misses can occur when all the input faces are read and when
compute values within the tile. Since COT reads more input faces
through LLC, there is a higher chance of occurring conflict misses
compared to SLT. When the number of dimensions are increased,
the dimensions of the memory footprint of a tile increase as well
as the number of input faces COT read through LLC. The higher
number of dimensions, the larger the distance between memory
footprints of neighbouring tiles. Therefore, the chance of occurring
conflict misses is further increased. Hence for Heat-3D, the number
of conflict misses for COT is as high as the number of capacity
misses for SLT. This behaviour diminish the benefits of COT for
higher dimensional stencils.
When data points correspond to all the tile sizes are considered
(Figures 3b and 3d), the absolute number of OCA remains similar
for COT since all the tile sizes fits in LLC. For SLT, the mean of
OCA is slightly decreased compared to smaller tile sizes since the
larger tiles fit in LLC reducing the number of LLC misses. Heat3D
remains an outlier due to the conflict misses as described in previous
paragraph. The behavior of OCA we discussed so far similar for
both sequential and parallel codes except that parallel codes have
more OCA due to the increased number of conflict misses.
The variability of execution time is high for both COT and SLT
on both platforms. It is related to the recursion overhead and re-
strictive tile sizes due to recursive split of COT as described in
Section 2. This shows that speed of COT is sensitive to tile sizes,
therefore, both COT and SLT need tile size exploration/tuning for
speed. For sequential runs, large tile sizes make more effective use
of prefetcher with increased bandwidth requirement as described
under hardware prefetching in Section 2. We verified this behavior
of prefetcher by counting the L2 prefetcher activities. We noticed
that for large tile sizes the number of prefetcher activities are 3
order of magnitude higher compared to small tile sizes. But for par-
allel runs, the bandwidth limit is reached faster, therefore, benefits
of prefetching diminish.
5.3 Comparison of Speed Against State of the
Art
In the following we compare the performance of SLT and PCOT
with state-of-the-art code generators. The performance is normal-
ized to the GFLOPS (giga flops per second) achieved by sequential
parametric single-level tiling (SLT-seq) code. We use the best tile
size found in the tile size exploration described above. For Pluto, we
used the best tile size for SLT. The absolute performance numbers
are shown on top each bar in Figure 5
The parallel variants are executed with the number of threads
equal to the number of physical cores (i.e., hyper-threading is not
considered) on each machine. Figure 5 shows the normalized per-
formance (GFLOPS) on both Haswell and Broadwell. The speed of
SLT and PCOT closely matches the speed of state-of-the-art.
PCOT out performs Pochoir on Heat-2D. For this benchmark,
Pochoir does not vectorize the iterations at the boundaries. Pochoir
scales better (20%) with 6 threads for Heat-3D due to the concurrent
start of tiles. For OSP, PCOT is 10-15% faster than Autogen. This
difference is attributed to the tile shapes. The Autogen generated
code is restricted to cubic tile sizes where as the PCOT generated
PCOT: Cache Oblivious Tiling of Polyhedral Programs Conference’17, July 2017, Washington, DC, USA
PCOT-seq SLT-seq PCOT-par SLT-par
0
130
260
390
520
M
e
a
n
 n
o
. 
o
f 
o
ff
-c
h
ip
 a
cc
e
ss
e
s(
m
ill
io
n
s)
(a) Haswell OCA (tile sizes<=50)
PCOT-seq SLT-seq PCOT-par SLT-par
0
130
260
390
520
M
e
a
n
 n
o
. 
o
f 
o
ff
-c
h
ip
 a
cc
e
ss
e
s(
m
ill
io
n
s)
(b) Haswell OCA (all tile sizes)
LUD Cholesky FDTD2D GS2D Heat2D Heat3D OSP
PCOT-seq SLT-seq PCOT-par SLT-par
0
75
150
225
300
M
e
a
n
 n
o
. 
o
f 
o
ff
-c
h
ip
 a
cc
e
ss
e
s(
m
ill
io
n
s)
(c) Broadwell OCA (tile sizes<=50)
PCOT-seq SLT-seq PCOT-par SLT-par
0
75
150
225
300
M
e
a
n
 n
o
. 
o
f 
o
ff
-c
h
ip
 a
cc
e
ss
e
s(
m
ill
io
n
s)
(d) Broadwell OCA (all tile sizes)
Figure 3: Mean and standard deviation of number of OCA across the tile sizes that fits in LLC for both Haswell and Broadwell.
The figures in the left (a,c) show the OCA behavior when restricted to tile sizes within 50 in all dimensions and the figures
in the right (b,d) show the behavior when all the tile sizes are used. We have to focus on both relative standard deviation and
mean number of OCA. When the tile sizes are restricted to 50, we see the expected behavior where the variability is much
lower with COT. This confirms the benefits of the recursive structure of COT as it automatically takes advantage of multiple
levels of memory hierarchy. When the tile sizes are not restricted, the number of OCA remains similar for COT. The mean
number of OCA for SLT decreases compared to smaller tiles since the larger tiles fits in LLC reducing the number of OCA. The
behavior of OCA on both cpu platforms is similar except that Haswell has relatively more OCA since its LLC size is smaller
compared to Broadwell.
code have more degrees of freedom, and we found that using large
tile sizes in the innermost dimension is better for performance in
many cases. When we try cubic tile sizes in PCOT, the speed is
similar to Autogen. Performance of SLT, Pluto and PCOT are similar
with each other.
Despite the significant reduction in off-chip memory accesses
of benchmarks in PCOT, the absolute speeds of codes from PCOT
and other techniques remain similar. For compute-bound kernels
(i.e., after applying SLT), the latency of OCA are already hidden.
Therefore, further reducing OCA does not translate into savings in
execution time.
Finally we compare the speed (GFLOPS) of the benchmarks to
the peak performance per benchmark. The compiler does clever
optimizations for both Haswell and Broadwell architectures. For
example, 5-point Heat-2D stencil has 4 muls, 2 subs and 4 adds
and the throughput of vector (4 doubles wide) instructions is 2, 1
and 1 per cycle respectively 3. The compiler replace some of the
operations with fused multiply-add/sub (FMA) operations where
the throughput is 2 per cycle. The resulting assembly code has only
2 subs, and 4 FMAs. Hence, it takes only 4 cycles to execute 6 vector
instructions which corresponds to 40 scalar operations. Therefore,
achievable peak (ceiling) is 36 GFLOPS on Broadwell. The achieved
3Data available at http://software.intel.com/sites/landingpage/IntrinsicsGuide/
Conference’17, July 2017, Washington, DC, USA Waruna Ranasinghe, Nirmal Prajapati, Tomofumi Yuki, and Sanjay Rajopadhye
PCOT-seq SLT-seq PCOT-par SLT-par
0
5
10
15
M
e
a
n
 e
x
e
cu
ti
o
n
 t
im
e
(s
)
(b)
PCOT-seq SLT-seq PCOT-par SLT-par
0
5
10
15
M
e
a
n
 e
x
e
cu
ti
o
n
 t
im
e
(s
)
(a)
LUD Cholesky FDTD2D GS2D Heat2D Heat3D OSP
Figure 4:Mean and standard deviation of execution time on (a)Haswell and (b) Broadwell across all explored tile sizes. Variance
is high for both COT and SLT on both machines. Therefore, tile size exploration/tuning is needed for speed.
LU
D
Ch
ol
es
ky
FD
TD
-2
D
GS
-2
D
He
at
-2
D
He
at
-3
D
OS
P
0
1
2
3
4
5
N
o
rm
a
liz
e
d
 G
FL
O
P
S
4 12 6 3 21 17 23
20
46
17
11
71
40
98
21
39
19
12
77
42
97
4 12
4
3
13
15 20
16
44
19
12
63
45
88
(b)
LU
D
Ch
ol
es
ky
FD
TD
-2
D
GS
-2
D
He
at
-2
D
He
at
-3
D
OS
P
0
1
2
3
4
5
N
o
rm
a
liz
e
d
 G
FL
O
P
S
4 12 6 2 20 17 26
14
33
13
5
50
25
75
13
31 17
6
55
31
79
4
11
4
2
13
16 22
11 32
14
6
46
31
73
(a)
SLT-seq PCOT-seq Other-seq SLT-par PCOT-par Other-par
Figure 5: Performance of SLT and PCOT on (a) Haswell and (b) Broadwell, normalized to sequential parametric SLT SLT-seq
(higher the better). The absolute performance numbers are shown on top of each bar in GFLOPS. Both the sequential and the
parallel performance of SLT and PCOT closely matches the performance of other code generators. Other represents Pluto for
the first 4 benchmarks, Pochoir for Heat-2D and 3D, and Autogen for OSP.
performance is 20.5 GFLOPS which is 57% of the peak. If we account
for the cycles for the non-arithmetic operations in the loop body
of the assembly code, then the achievable ceiling will be further
lowered making the achieved performance much closer to the peak.
6 RELATEDWORK
We put our work into context and discuss other approaches to
cache-oblivious methods. We also discuss previous works on tiling
and code generation.
6.1 Empirical Study of Cache-Oblivious
Methods
The most closely related work is by Yotov et al. [70] that explored
answers to the question “what is the cost of cache obliviousness?”
Their study is primarily on matrix multiplication and is only about
sequential execution. Our work extends the study to a broader range
of programs and to parallel executions. We also provide variability
of both speed and cache behavior with respect to tile size.
Some prior work on code generators for various types of tiling
have empirically compared iteration space tiling and cache-oblivious
PCOT: Cache Oblivious Tiling of Polyhedral Programs Conference’17, July 2017, Washington, DC, USA
methods [2, 11, 44, 58, 73]. The objective of these experiments are
slightly different from ours; the main target of evaluation is the code
generation tools. Our interest is in understanding the difference due
to the tile execution order. The reported performance difference in
these experiments mostly come from differences in other aspects
that influence the performance (e.g., copy optimization and other
low-level optimizations), which we carefully checked that codes
from all tools have similar behavior in our work. This explains the
apparent inconsistency with earlier results.
6.2 Cache Oblivious Code Generation
Prokop [25, 51] introduced the cache-oblivious algorithms. Later,
Frigo and Strumpen proposed the serial [26] and the parallel [27] di-
vide and conquer implementations of stencil programs. Strzodka et
al. [58] proposed cache oblivious parallelograms method (CORALS)
which is similar to the time-skewedwavefront parallelization of iter-
ative stencil computations. CORALS takes advantage of the regular
dependence patterns of stencil computations and applies oblique
cuts in both space and time dimensions simultaneously. They use a
load-balancing scheme to evenly distribute all the threads amongst
the tiles. Their technique benefits from data locality, parallelism
and vectorization simultaneously.
Pochoir [59] is a domain specific compiler for stencil programs.
It generates a divide and conquer implementation of the program
based on trapezoidal decompositions using hyperspace cuts. Pochoir
generates efficient code, but the code generated for the hyper-
trapezoidal tile shape is very complex and that the base case sizes
are fixed at compile time. Pochoir, also, cannot handle dependences
along the time dimension. The dependence should be always from
a previous time step. Therefore, Gauss-Seidel-like dependences can-
not be handled. Pochoir can generate the divide and conquer code
for periodic stencils. Periodic stencils can be handled by our COT
code generator after the smashing transformation [6, 49]. Pochoir
is capable of applying oblique cuts whereas we make canonic cuts.
Serial and parallel implementations of recursive divide and con-
quer algorithms with optimal cache complexity have been devel-
oped and evaluated for a specific dynamic programming algorithm
such as Longest Common Subsequence [10, 13], global pairwise
sequence alignment problem in bioinformatics [12], Gaussian Elimi-
nation Paradigm [14], etc. Tithi et al. [63] described a way to obtain
divide and conquer algorithms manually for a class of dynamic pro-
gramming problems. The base cases of these programs are similar
to matrix multiplication computations. The code is hand optimized
with the non-trivial Z-Morton matrix conversions. Unlike the work
of [63], Autogen does not use Z-Morton matrices. The derived
cache-oblivious implementation is parameterized by a single base
value for all the dimensions which limits the tile shape to a hy-
percube. Autogen is mostly applicable to dynamic programming
problems where the fractal property is often available.
Tang et al. [60] proposed the Cache-Oblivious-Wavefront (COW)
technique, that improves the parallelism of the cache-oblivious al-
gorithms, which also preserves locality. They use classic divide and
conquer strategy to partition the iteration space and schedule the
execution of tasks across different levels of the recursion as soon as
the data dependency constraints are satisfied. The execution pattern
of the COW algorithm is conceptually similar to classic wavefront
schedules. However, the implementation is done by hand and their
technique can handle only dynamic programming algorithms.
Autogen [11] executes an iterative implementation of the input
program using very small problem sizes to determine the access
patterns (dependences). It then determines a recursive decomposi-
tion (called fractal property) that can be used to derive a divide and
conquer implementation. The structure of the recursive function
is automatically generated but the body of the base case is hand
implemented.
Bellmania [36] is an interactive system, based on solver-aided
tactics, that re-writes the rules and generates provably correct
divide and conquer implementations of dynamic programs. The
dependences are statically analyzed, however, the derivation of the
recursive algorithm is not automatic.
The COT code generator presented in this paper takes the base
case tile sizes as input, therefore, the tile shape can be hyper-
rectangular; and supports more general class of programs, i.e. all
polyhedral programs.
6.3 Tiled Code Generation
Tiling [35, 68] is a classic iteration space partitioning technique
which combines a set of points into tiles, where each tile can be
executed atomically. Tiling comes in handy for exploiting data local-
ity [38, 50, 67], minimizing communication [1, 69] and maximizing
parallelism [2].
Time skewing [47] enables time tiling which increases the tem-
poral reuse, especially in stencils. Bondhugula et al. [7] developed
PLuTo, a system for tiling imperfectly nested affine loops with fixed
sized tiles. The PLuTo algorithm obtains tiling hyperplanes that
minimizes communication across fixed size hyper-parallelopiped
tiles. Bandishti et al. [2] proposed a diamond tiling technique that
enables the concurrent start of tiles and eliminates the pipeline fill
and flush cost of classic wavefront tiling. Many other tiling tech-
niques [28, 31–33, 37, 42] also propose ways to eliminate pipeline
start-up cost in a similar fashion.
However, in all of the above, the generated code has fixed sized
tiles and does not guarantee optimal usage of caches. Parametric
tile sizes enable efficient auto-tuning and performance portability.
Techniques [3, 16, 30, 39, 41, 55] have been thus developed to allow
parametric tiling. More studies on parametric tiling such as D-
Tiling [40, 41], P-Tiling [3] and Mono-parametric tiling [34] are
conducted where the latter is a polyhedral transformation.
Our COT code generator produces code with parametric base
tile sizes and the point loops can be generated using any of the
aforementioned parametric tiled code generators.
7 CONCLUSION AND FUTUREWORK
In this paper, we have presented an empirical study of two ap-
proaches for tiling: SLT and COT. We have developed cache obliv-
ious code generator that support any polyhedral computation to
widen the scope of study to polyhedral programs. The takeaway is
that COT does not save you from tuning effort for speed, but it gives
you reduced off-chip memory accesses without tuning. However,
when the speed is the primary objective, situations where good
utilization of hardware prefetcher is essential negate the benefit of
COT.
Conference’17, July 2017, Washington, DC, USA Waruna Ranasinghe, Nirmal Prajapati, Tomofumi Yuki, and Sanjay Rajopadhye
It is interesting to note that the access pattern of Jacobi-style
stencils, which has been one of the main beneficiary of COT, is
prefetcher-friendly, and are not fully benefiting from COT in our
study. Our code generator now offers the option to use COT instead
of other execution strategies implemented in the polyhedral tools
for a much wider class of programs than stencils.
Since the speed of PCOT and SLT are similar, the savings off-
chip memory accesses may result in saving in energy. It would be
interesting to quantify the energy savings as a future work. This
work only investigated Single-Level Tiling where multi-level tiling
is a known alternative to improve data locality at multiple levels of
the memory hierarchy. Further study that also explore the use of
multi-level tiling is a natural future work.
REFERENCES
[1] R. Andonov, S. Balev, S. Rajopadhye, and N. Yanev. 2001. Optimal Semi-Oblique
Tiling. In Proceedings of the thirteenth annual ACM symposium on Parallel algo-
rithms and architectures (SPAA ’01). ACM, New York, NY, USA.
[2] Vinayaka Bandishti, Irshad Pananilath, and Uday Bondhugula. 2012. Tiling
Stencil Computations to Maximize Parallelism. In the International conference
on high performance computing, networking, storage and analysis (SC ’12). IEEE
Computer Society Press, Los Alamitos, CA, USA, 40:1–40:11.
[3] M. M. Baskaran, A. Hartono, S. Tavarageri, T. Henretty, J. Ramanujam, and P.
Sadayappan. 2010. Parameterized tiling revisited. In CGO. 200–209.
[4] Somashekaracharya G. Bhaskaracharya, Uday Bondhugula, and Albert Cohen.
2016. Automatic Storage Optimization for Arrays. ACM Trans. Program. Lang.
Syst. 38, 3 (April 2016), 11:1–11:23.
[5] Somashekaracharya G. Bhaskaracharya, Uday Bondhugula, and Albert Cohen.
2016. SMO: An Integrated Approach to Intra-array and Inter-array Storage Opti-
mization. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium
on Principles of Programming Languages (POPL ’16). ACM, New York, NY, USA,
526–538.
[6] U. Bondhugula, V. Bandishti, A. Cohen, G. Potron, and N Vasilache. 2014. Tiling
and Optimizing Time-Iterated Computations over Periodic Domains. In PACT.
[7] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. 2008. PLuTo:
A Practical and Fully Automatic Polyhedral Program Optimization System. In
ACM Conference on Programming Language Design and Implementation. ACM
SIGPLAN, Tuscon, AZ, 101–113.
[8] Jichun Bu and Ed F. Deprettere. 1988. Converting Sequential Iterative Algorithms
to Recurrent Equations for Automatic Design of Systolic Arrays. In Proceedings
of ICASSP. 2025–2028.
[9] L. Carter, J. Ferrante, and S. F. Hummel. 1995. Hierarchical tiling for improved
superscalar performance. In Proceedings of 9th International Parallel Processing
Symposium (IPPS ’95). 239–245.
[10] Rezaul Chowdhury. 2007. Cache-efficient Algorithms and Data Structures: The-
ory and Experimental Evaluation. Ph.D. Dissertation. Department of Computer
Sciences, The University of Texas at Austin.
[11] Rezaul Chowdhury, Pramod Ganapathi, Jesmin Jahan Tithi, Charles Bachmeier,
Bradley C. Kuszmaul, Charles E. Leiserson, Armando Solar-Lezama, and Yuan
Tang. 2016. AUTOGEN: Automatic Discovery of Cache-oblivious Parallel Recur-
sive Algorithms for Solving Dynamic Programs. In Proceedings of the 21st ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP
’16). ACM, New York, NY, USA, 10:1–10:12.
[12] Rezaul Alan Chowdhury, Hai-Son Le, and Vijaya Ramachandran. 2010. Cache-
Oblivious Dynamic Programming for Bioinformatics. IEEE/ACM Trans. Comput.
Biol. Bioinformatics 7, 3 (July 2010), 495–510.
[13] Rezaul Alam Chowdhury and Vijaya Ramachandran. 2006. Cache-oblivious
Dynamic Programming. In Proceedings of the Seventeenth Annual ACM-SIAM
Symposium on Discrete Algorithm (SODA ’06). Society for Industrial and Applied
Mathematics, Philadelphia, PA, USA, 591–600.
[14] Rezaul Alam Chowdhury and Vijaya Ramachandran. 2010. The Cache-Oblivious
Gaussian Elimination Paradigm: Theoretical Framework, Parallelization and
Experimental Evaluation. Theory of Computing Systems 47, 4 (01 Nov 2010),
878–919.
[15] Alain Darte. 1991. Regular Partitioning for Synthesizing Fixed-Size Systolic
Arrays. Integration: The VLSI Journal 12 (December 1991), 293–304.
[16] Alain Darte, Alexandre Isoard, et al. 2014. Parametric tiling with inter-tile data
reuse. In Proceedings of the 2014 International Working on Polyhedral Compilation
Techniques (IMPACT ’14).
[17] Alain Darte, Alexandre Isoard, and Tomofumi Yuki. 2016. Extended Lattice-
based Memory Allocation. In Proceedings of the 25th International Conference on
Compiler Construction (CC ’16). 218–228.
[18] A. Darte, Y. Robert, and F. Vivien. 2000. Scheduling and Automatic Parallelization.
Birkhäuser.
[19] A. Darte, R. Schreiber, and G. Villard. 2005. Lattice-Based Memory Allocation.
IEEE Trans. Computers 54, 10 (2005), 1242–1257.
[20] E. De Greef, F. Catthoor, and H. De Man. 1997. Memory Size Reduction through
Storage Order Optimization for Embedded Parallel Multimedia Applications. In
Parallel Processing and Multimedia. Geneva, Switzerland.
[21] P. Feautrier. 1991. Dataflow analysis of array and scalar references. International
Journal of Parallel Programming 20, 1 (Feb 1991), 23–53.
[22] Paul Feautrier. 1992. Some Efficient Solutions to the Affine Scheduling Problem.
Part I. One-dimensional Time. International Journal of Parallel Programming 21,
5 (1992), 313–347.
[23] Paul Feautrier. 1992. Some Efficient Solutions to the Affine Scheduling Problem.
Part II. Multidimensional Time. International Journal of Parallel Programming 21,
6 (1992), 389–420.
[24] Paul Feautrier and Christian Lengauer. 2011. The Polyhedral Model. In Encyclo-
pedia of Parallel Programming, David Padua (Ed.).
[25] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. 1999. Cache-oblivious
Algorithms. In FOCS: IEEE Symposium on Foundations of Computer Science. IEEE,
New York, NY, 285–297.
[26] M. Frigo and V. Strumpen. 2005. Cache Oblivious Stencil Computations. In
International Conference on Supercomputing (ICS), 2005. Cambridge, MA, 361–
366.
[27] Matteo Frigo and Volker Strumpen. 2006. The cache complexity of multithreaded
cache oblivious algorithms. In Proceedings of the eighteenth annual ACM sympo-
sium on Parallelism in algorithms and architectures. ACM, 271–280.
[28] Tobias Grosser, Albert Cohen, Justin Holewinski, P. Sadayappan, and Sven Ver-
doolaege. 2014. Hybrid Hexagonal/Classical Tiling for GPUs. In Proceedings of
12th International Symposium on Code Generation and Optimization (CGO ’14).
66:66–66:75.
[29] L. Guibas, H. T. Kung, and Clark D. Thompson. 1979. Direct VLSI Implementation
of Combinatorial Algorithms. In Proc. Conference on Very Large Scale Integration:
Architecture, Design and Fabrication. 509–525.
[30] Albert Hartono, Muthu Manikandan Baskaran, Cédric Bastoul, Albert Cohen,
Sriram Krishnamoorthy, Boyana Norris, J. Ramanujam, and P. Sadayappan. 2009.
Parametric Multi-Level Tiling of Imperfectly Nested Loops. In Proceedings of the
23rd international conference on Supercomputing (ICS ’09). ACM, New York, NY,
USA.
[31] A. Hartono, M. M. Baskaran, J. Ramanujam, and P. Sadayappan. 2010. DynTile:
Parametric tiled loop generation for parallel execution on multicore processors.
In 2010 IEEE International Symposium on Parallel Distributed Processing (IPDPS).
1–12.
[32] Tom Henretty, Richard Veras, Franz Franchetti, Louis-Noël Pouchet, J. Ramanu-
jam, and P. Sadayappan. 2013. A Stencil Compiler for Short-vector SIMD Archi-
tectures. In Proceedings of the 27th International ACM Conference on International
Conference on Supercomputing (ICS ’13). ACM, New York, NY, USA, 13–24.
[33] Justin Holewinski, Louis-Noël Pouchet, and P. Sadayappan. 2012. High-
performance Code Generation for Stencil Computations on GPU Architectures.
In Proc. of the 26th ACM Int. Conf. on Supercomputing (ICS ’12). ACM, New York,
NY, USA, 311–320.
[34] Guillaume Iooss, Sanjay Rajopadhye, Christophe Alias, and Yun Zou. 2015. Mono-
parametric Tiling is a Polyhedral Transformation. Research Report RR-8802. INRIA
Grenoble - Rhône-Alpes ; CNRS. 40 pages.
[35] F. Irigoin and R. Triolet. 1988. Supernode Partitioning. In 15th ACM Symposium
on Principles of Programming Languages. ACM, 319–328.
[36] Shachar Itzhaky, Rohit Singh, Armando Solar-Lezama, Kuat Yessenov, Yongquan
Lu, Charles Leiserson, and Rezaul Chowdhury. 2016. Deriving Divide-and-
conquer Dynamic Programming Algorithms Using Solver-aided Transformations.
In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-
Oriented Programming, Systems, Languages, and Applications (OOPSLA 2016).
ACM, New York, NY, USA, 145–164.
[37] Tian Jin, Nirmal Prajapati, Waruna Ranasinghe, Guillaume Iooss, Yun Zou, Sanjay
Rajopadhye, and David G. Wonnacott. 2016. Hybrid Static/Dynamic Schedules
for Tiled Polyhedral Programs. CoRR abs/1610.07236 (2016). http://arxiv.org/abs/
1610.07236
[38] S. Kamil, C. Chan, L. Oliker, J. Shalf, and S. Williams. 2010. An auto-tuning
framework for parallel multicore stencil computations. In Parallel Distributed
Processing (IPDPS), 2010 IEEE International Symposium on. 1–12.
[39] Daegon Kim. 2010. Parameterized and Multi-Level Tiled Loop Generation. Ph.D.
Dissertation. Colorado State University, Fort Collins, CO, USA. Advisor(s) Ra-
jopadhye, Sanjay V.
[40] DG. Kim and S. Rajopadhye. 2009. Efficient Tiled Loop Generation: D-tiling.
In LCPC 2009: The 22nd International Workshop on Languages and Compilers for
Parallel Computing, Cavazos Gao, Pollock and Li (Eds.). Springer, LNCS, Newark,
DE, 293–307.
[41] DG Kim and S. Rajopadhye. 2010. On Parameterized Tiled Loop Generation and
Its Parallelization. Technical Report CSU Tech Report CS-10-101. Colorado
PCOT: Cache Oblivious Tiling of Polyhedral Programs Conference’17, July 2017, Washington, DC, USA
State University, Computer Science Dept. See http://www.cs.colostate.edu/
TechReports/Reports/2010/tr10-101.pdf.
[42] Sriram Krishnamoorthy, Muthu Baskaran, Uday Bondhugula, J. Ramanujam,
Atanas Rountev, and P Sadayappan. 2007. Effective Automatic Parallelization of
Stencil Computations. In Proceedings of the 2007 ACM SIGPLAN Conference on
Programming Language Design and Implementation (PLDI ’07). ACM, New York,
NY, USA, 235–244.
[43] Vincent Lefebvre and Paul Feautrier. 1998. Automatic storage management for
parallel programs. Parallel Comput. 24, 3–4 (1998), 649–671.
[44] Jonathan Lifflander and Sriram Krishnamoorthy. 2017. Cache Locality Optimiza-
tion for Recursive Programs. In Proceedings of the 38th ACM SIGPLAN Conference
on Programming Language Design and Implementation (PLDI ’17). 1–16.
[45] C. Mauras. 1989. ALPHA: un langage équationnel pour la conception et la program-
mation d’Architectures parallèles synchrones. Ph.D. Dissertation. L’Université de
Rennes I, IRISA, Campus de Beaulieu, Rennes, France.
[46] C. Mauras, P. Quinton, S. Rajopadhye, and Y. Saouter. 1990. Scheduling Affine
Parameterized Recurrences by means of Variable Dependent Timing Functions.
In International Conference on Application Specific Array Processing, S. Y. Kung and
E. Swartzlander (Eds.). IEEE Computer Society, Princeton, New Jersey, 100–110.
[47] J. McCalpin and D. Wonnacott. 1999. Time Skewing: A Value-Based Approach to
Optimizing for Memory Locality. Technical Report TR-379. Rutgers University,
Department of Computer Science.
[48] Sanyam Mehta, Rajat Garg, Nishad Trivedi, and Pen-Chung Yew. 2016. TurboTil-
ing: Leveraging Prefetching to Boost Performance of Tiled Codes. In Proceed-
ings of the 2016 International Conference on Supercomputing (ICS ’16). Article 38,
12 pages.
[49] N. Osheim, M. Strout, D. Rostrom, and S. Rajopadhye. 2008. Smashing: Folding
Space to Tile Through Time. In LCPC 2008: 15th Workshop on Languaged and
Compilers for Parallel Computing. Alberta, CA, 80–93.
[50] Liu Peng, Richard Seymour, Ken ichi Nomura, Rajiv K. Kalia, Aiichiro Nakano,
Priya Vashishta, Alexander Loddoch, Michael Netzband, William R. Volz, and
Chap C. Wong. 2009. High-order stencil computations on multicore clusters. In
IPPS.
[51] Harald Prokop. 1999. Cache-Oblivious Algorithms. Master’s thesis. Massachusetts
Institute of Technology, Department of Electrical Engineering and Computer
Science.
[52] F. Quilleré and S. Rajopadhye. 2000. Optimizing Memory Usage in the Polyhe-
dral Model. ACM Transactions on Programming Languages and Systems 22, 5
(September 2000), 773–815.
[53] P. Quinton and V. Van Dongen. 1989. The Mapping of Linear Recurrence Equa-
tions on Regular Arrays. Journal of VLSI Signal Processing 1, 2 (1989), 95–113.
[54] S. V. Rajopadhye, S. Purushothaman, and R. M. Fujimoto. 1986. On Synthesizing
Systolic Arrays from Recurrence Equations with Linear Dependencies. In Pro-
ceedings, Sixth Conference on Foundations of Software Technology and Theoretical
Computer Science. Springer Verlag, LNCS 241, New Delhi, India, 488–503.
[55] Lakshminarayanan Renganarayana. 2008. Scalable and Efficient Tools for Multi-
level Tiling. Ph.D. Dissertation. Department of Computer Science, Colorado State
University.
[56] R. Schreiber and J. Dongarra. 1990. Automatic blocking of nested loops. Technical
Report 90.38. RIACS, NASA Ames Research Center.
[57] M. Strout, L. Carter, J. Ferrante, and B. Simon. 1998. Schedule-Independent
StorageMapping for Loops. InASPLOS 1998: Proceedings of the eighth International
Conference on Architectural Support for Programming Languages and Operating
Systems. ACM/IEEE, San Jose, CS, 24 – 33.
[58] Robert Strzodka, Mohammed Shaheen, Dawid Pajak, and Hans-Peter Seidel. 2010.
Cache Oblivious Parallelograms in Iterative Stencil Computations. In Proceedings
of the 24th ACM International Conference on Supercomputing (ICS ’10). 49–59.
[59] Yuan Tang, Rezaul Alam Chowdhury, Bradley C. Kuszmaul, Chi-Keung Luk, and
Charles E. Leiserson. 2011. The Pochoir Stencil Compiler. In Proceedings of the
23rd ACM symposium on Parallelism in algorithms and architectures (SPAA ’11).
ACM, New York, NY, USA.
[60] Yuan Tang, Ronghui You, Haibin Kan, Jesmin Jahan Tithi, Pramod Ganapathi, and
Rezaul A. Chowdhury. 2015. Cache-oblivious Wavefront: Improving Parallelism
of Recursive Dynamic ProgrammingAlgorithmsWithout Losing Cache-efficiency.
In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming (PPoPP 2015). ACM, New York, NY, USA, 205–214.
[61] J. Teich and L. Thiele. 1993. Partitioning of processor arrays: A piecewise regular
approach. INTEGRATION: The VLSI Journal 14, 3 (Feb 1993), 297–332.
[62] W. Thies, F. Vivien, J. Sheldon, and S. Amarasinghe. 2001. A unified framework
for schedule and storage optimization. In Proceedings of the ACM SIGPLAN 2001
conference on Programming language design and implementation. ACM Press,
232–242.
[63] J. J. Tithi, P. Ganapathi, A. Talati, S. Aggarwal, and R. Chowdhury. 2015. High-
Performance Energy-Efficient Recursive Dynamic Programming with Matrix-
Multiplication-Like Flexible Kernels. In 2015 IEEE International Parallel and Dis-
tributed Processing Symposium. 303–312.
[64] Nicolas Vasilache, Benoit Meister, Albert Hartono, Muthu Baskaran, David
Wohlford, and Richard Lethin. 2012. Trading Off Memory For Parallelism Quality.
In IMPACT 201: 2nd International Workshop on Polyhedral Compilation Techniques.
Paris, France.
[65] Sven Verdoolaege. 2010. isl: An Integer Set Library for the Polyhedral Model.. In
ICMS, Vol. 6327. Springer, 299–302.
[66] D. Wilde and S. Rajopadhye. 1996. Memory Reuse Analysis in the Polyhedral
Model. In EUROPAR 96: Parallel Processing Conference. Springer Verlag, Lyon,
France, 389–397.
[67] Michael E. Wolf and Monica S. Lam. 1991. A Data Locality Optimizing Algo-
rithm. In the ACM SIGPLAN 1991 conference on Programming language design and
implementation (PLDI ’91). ACM, New York, NY, USA, 30–44.
[68] M. J. Wolfe. 1987. Iteration space tiling for memory hierarchies. Parallel Processing
for Scientific Computing (SIAM) (1987), 357–361.
[69] J. Xue. 1997. Communication-Minimal Tiling of Uniform Dependence Loops. J.
Parallel and Distrib. Comput. 42, 1 (January 1997), 42–59.
[70] Kamen Yotov, Tom Roeder, Keshav Pingali, John Gunnels, and Fred Gustavson.
2007. An Experimental Comparison of Cache-oblivious and Cache-conscious Pro-
grams. In Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms
and Architectures (SPAA ’07). 93–104.
[71] Tomofumi Yuki, Gautam Gupta, DaeGon Kim, Tanveer Pathan, and Sanjay Ra-
jopadhye. 2013. AlphaZ: A System for Design Space Exploration in the Polyhedral
Model. Springer Berlin Heidelberg, Berlin, Heidelberg, 17–31.
[72] T. Yuki and S. Rajopadhye. 2011. Canonic Multi-Projection: Memory Allocation for
Distributed Memory Parallelization. Technical Report CSU-CS-11-106. Colorado
State University, Computer Science Department.
[73] Yun Zou and Sanjay Rajopadhye. 2015. Automatic Energy Efficient Paralleliza-
tion of Uniform Dependence Computations. In Proceedings of the 29th ACM on
International Conference on Supercomputing (ICS ’15). 373–382.
A SCHEDULE INDEPENDENT MEMORY
ALLOCATION
We also address a memory-based limitation of polyhedral compi-
lation tools. It is well known that in any parallelization (of any
program), it is essential to respect (only) the true or flow depen-
dences. Other (memory-based) dependences can be ignored if one
can re-allocate memory. In practice, this is limited by the fact that
the associated memory expansion may be prohibitively expensive,
and there has beenwork onmitigating this expansion [43, 52, 64, 66].
We propose a novel yet simple schedule-independent memory allo-
cation strategy. Our work also generalizes polyhedral compilation
by enabling polyhedral tools to use alternate, hybrid schedules con-
sisting of affine loops for certain parts of the iteration space and
cache-oblivious divide-and-conquer schedules for others.
A.1 Background
In this section, we introduce the necessary background of our work.
We first give a brief description of the polyhedral representation
of programs, and the general flow of a polyhedral compiler. Then,
we discuss the legality of tiling, which is related to the input of our
code generator.
A.1.1 Polyhedral Compilation and Representation. Figure 6 shows
the flow of polyhedral compilation. First, dependence analysis of
an input program (or a “polyhedral section” thereof) produces an
intermediate representation (IR) in the form of [18] a Polyhedral
Reduced Dependence (hyper) Graph (PRDG). Various analyses are
performed on the PRDG to choose a number of mappings in the
form of Piecewise Quasi-Affine Functions (PQAFs) that specify the
schedule as a set of multi-dimensional vectors. The PQAFs come
with annotations to indicate whether each dimension is sequential
or parallel, and also whether it is part of a tilable band, i.e., whether
tiling this band of dimensions is legal. The transformations may
be applied to the PRDG iteratively, and (eventually) the PRDG and
Conference’17, July 2017, Washington, DC, USA Waruna Ranasinghe, Nirmal Prajapati, Tomofumi Yuki, and Sanjay Rajopadhye
Fortran
C source
PRDG
Array 
Dataflow 
Analysis
Stencil 
DSL
DSLs
Polyhedral 
analyses
Scheduler
Parallelizer
Tiler
Memory (re) 
Allocator
Code 
Generator(s)
PQAF(s)
C+OPenMP
C+MPI
C+CUDA
PGAS
FPGA
Transformation 
Engine
Figure 6: Polyhedral Compilation: the Polyhedral Reduced Dependence (hyper) Graph (PRDG) serves as the intermediate representation.
Piecewise Quasi-Affine Functions (PQAFs) describe transformations.
QLAF are provided to a code-generator that produces code for
various targets.
One of the strengths of the polyhedral model is that a parametric
program may be concisely represented with a PRDG with finite
number of nodes (statements) and edges (dependences). The poten-
tially unbounded sets of instances of a statement are represented
in abstract forms of integer sets, called domains, and dependences
between them as affine functions (or relations, which are viewed
as a set-valued function) over these statement domains. Indeed,
every edge, e from node v tow , in the PRDG is annotated with two
objects: (i) a domain, De specifying the (subset of) the domain, Dv
of its source node, where the dependence occurs, and (ii) the affine
function, f , such that for any point z ∈ De , the (set of) point(s) in
Dw on which it depends is given by f (z). De is called the context of
the edge, and f is its dependence function. We also use the notation
f (De ) to denote the set valued image of De by f .
An affine function Zn → Zm may be expressed as f (x) = A®x + ®b,
where ®x , function domain, is an integer vector of size n; A, linear
part, is an n ×m matrix; and ®b, constant part, is an integer vector
of sizem. A dependence is said to be uniform if the dependence
function is only a constant offset, i.e., when the linear part A is the
identity.
A.1.2 Legality of Tiling. Tiling is a well-known loop transforma-
tion for partitioning computations into smaller, atomic (all inputs to
a tile can be computed before its execution), units called tiles [35, 67].
The natural legality condition is that the dependences across tiles
do not create a cycle. In compilers, this condition is typically ex-
pressed as fully permutability (i.e., dependences are non-negative
direction vectors), which is a sufficient condition. Our transforma-
tion for cache oblivious tiling takes as inputs a loop nest that is
f o r ( i = 0 ; i < N ; i ++ ) {
S0 : X[ 0 , i ] = A[ i ] ; } / / I n i t i a l i z e
f o r ( t = 1 ; t <= 2 ∗N; t ++ ) { / / Note : ub i s even
f o r ( i = 1 ; i < N−1 ; i ++ ) {
S1 : X[ t %2] [ i ] = f (X [ ( t −1 )%2] [ i −1] ,
X [ ( t −1 )%2] [ i ] , X [ ( t −1 )%2] [ i +1 ] ,
X [ ( t − 1 ) % 2 ] [ 0 ] ) ;
}
S2 : X[ t %2 ] [ 0 ] = g (X [ ( t − 1 ) % 2 ] [ 0 ] ) ;
S3 : X[ t %2] [N−1] = g (X [ ( t −1 )%2] [N− 1 ] ) ;
}
f o r ( i = 0 ; i < N ; i ++ ) {
S4 : Aout [ i ] = X[ 0 , i ] ; } / / Output copy
Figure 7:Neither Pochoir norAutogen canhandle the computation
performed by this simple loop. Moreover, it has a memory based de-
pendence that prevents polyhedral compilers like Pluto from tiling
both dimensions. However, the true dependences of the program ad-
mit a tilable schedule, but at the potentical cost of O (N 2) memory.
Our scheme reduces this to O (N )
fully permutable. For polyhedral programs, scheduling techniques
to expose such loop nests are available [7].
A.2 Memory Allocation
In this section, we first describe how memory based dependences
prevent tiling, using our motivating example (Fig. 7), and show that
simply ignoring these (false) dependences would lead to memory
explosion. After formulating our problem, we next propose a simple,
schedule independent memory allocation scheme that resolves it.
PCOT: Cache Oblivious Tiling of Polyhedral Programs Conference’17, July 2017, Washington, DC, USA
Consider the statement S1, and note that its domain, D1 is the
polyhedral set, {t , i | 1 ≤ t ≤ 2N ∧ 1 ≤ i ≤ N − 1}. S1 has four true
dependences (for points sufficiently far from the boundaries), three
of which are S1[t −1, i−1], S1[t −1, i] and S1[t −1, i+1], the typical,
1D-Jacobi stencil dependences, and the fourth one is S2[t − 1, 0],
which is a truly affine dependence on the most recent writer to
the memory location X[(t − 1)%2, 0] when the statement ⟨S1, [t , i]⟩
is being executed. All these dependences are captured as edges
with affine functions in the PRDG. In addition, there is a memory
based dependence, that wemust also respect. Consider statement S2,
whose domain,D2 = {t | 1 ≤ t ≤ 2N } is just one dimensional. The t-
th instance of S2 (over) writes X[t%2, 0], therefore all computations
that read the previous value must be executed before it. In this
sense, S2[t] “depends on” the set S1[t , i], for all 1 ≤ i ≤ N − 1. This
dependence (which is a relation rather than a function) is captured
by another a special edge in the PRDG.
The only schedule that respects all these dependences is the
family of lines parallel to the t axis (provided all iterations of S1 are
done first). Although this has maximal parallelism, it has very poor
locality. Note that the Pluto scheduler does not seek maximal par-
allelism, but rather, to maximize the number of linearly independent
tiling hyperplanes. Unfortunately, the t = const is the only legal
tiling hyperplane for this set of dependences, and the tilable band
obtained by Pluto is only 1-dimensional.
What if we did not have the memory-based dependences, i.e.,
what if we ignored the memory allocation of the original program,
and stored each computed value in a distinct memory location? In
this case, there would be no memory based dependences, and we
can indeed find another family of (actually, infinitely many) legal
tiling hyperplanes: say, the lines i + t = const. As a result, if we use
the mapping (t , i) 7→ (t , i + t) as our “schedule,” the new loops in
the transformed program would be fully permutable, and could be
legally tiled.
Thus, the problem we seek to solve is: how to avoid memory
based dependences, but without the cost of memory expansion that it
seems to imply.
Memory allocation for polyhedral programs is a well studied
problem, and there are two main approaches. One either does mem-
ory allocation after the schedule is chosen [4, 5, 19, 20, 43, 52, 64, 66]
since it often leads to a smaller memory footprint, or else uses a
schedule independent memory allocation, based on the so called uni-
versal occupancy vectors (UOV). This problem is solved when the
program has uniform dependences, i.e., when each dependence can
be described by a constant vector, and for some simple extensions
of this [57, 72].
It is important to note that tiling actually modifies a schedule: the
so called, “schedule dimensions” become fully permutable loops,
and indeed, these loops are actually permuted in the generated
tiled code. So, when a tiling schedule specified by a family of d
tiling hyperplanes is finally implemented by the generated code,
the actual time-stamps are not really d-dimensional vectors, but
rather 2d-dimensional ones obtained as some complicated function
of these indices. Furthermore, we will see when we generate cache-
oblivious tiled codes, these tilable loops will actually be visited in
the divide-and-conquer order of execution, as required by COT.
As a result, finding a memory map that takes into account such a
rather complicated final schedule is a tricky problem. We therefore
seek and propose schedule-independent memory allocations.
The intuition behind our solution is (deceptively) simple, and we
first illustrate it on our motivating example (Fig. 7). Rather than the
so-called “single assignment” program for the entire iteration space
of the program (i.e., full memory expansion), could we find lower-
dimensional subsets, such that a single assignment memory for
only these subsets is sufficient? A careful examination of the code
reveals that the memory based dependences arise due to statement
S2, and its domain is only 1-dimensional. So we store the results of
this statement into an auxiliary array, Y, and modify the program
so that the fourth dependence simply reads Y[t-1], rather than
X[(t-1)%2,0]. For the variable, X, we use the old (t , i) 7→ (t%2, i)
memory allocation that was used in the original code. This results
in 4N memory, which is a polynomial degree better than quadratic.
Of course, the challenge is how to discover this automatically.
We now outline how this is done. At a high level, our algorithm
takes a PRDG as input, applies some (piecewise) affine transfor-
mations to it, and outputs the transformed PRDG together with
a separate memory map for each node in the transformed PRDG.
More specifically, it works as follows.
• Preprocessing. For each edge, e , in the PRDG, with context
De , and function, f , we first identify whether f is uniform
in context in the sense that, for all points, z ∈ De , the value
of z − f (z) is a constant vector, independent of z.
For example, consider a dependence function, (i, j) 7→ (i −
1, i − 1) which, maps any point [i, j] in the plane to a point
on the diagonal, and is clearly not uniform. However, what
if De = {i, j | i = j − 1}? With this contextual information,
the dependence is actually uniform: (i, j) 7→ (i − 1, j − 2).
All edges/dependences that are neither uniform to begin
with, not uniform in context, are marked as truly affine.
• Affine Split. For every node, v , in the PRDG that has at least
one truly affine edge e incident on it, we create a new node,
v ′. Its domain Dv ′ is the union of f (De ) of all such incident
edges.
The edges in the PRDG are modified as follows. All the truly
affine edges that were incident on v are now made incident
on v ′; and v ′ has a single outgoing edge e ′, annotated with
⟨Dv ′ , I ⟩ (its dependence function is the identity map) and
whose destination is v .
It is easy to see that we have not changed the program seman-
tics. In effect, we have simply copied the value of every point
inDv that was the target of any truly affine dependence over
to a new variable v ′, and “diverted” all the truly affine edges
that used to be incident on v over to v ′. Moreover, since the
identity function is uniform by definition, all edges incident
on v are now either uniform, or uniform in context.
• We now use existing UOV based methods [57, 72] to choose
a schedule-independent memory allocation for all the orig-
inal nodes in the PRDG, and a single-assignment memory
allocation for all the newly introduced variables.
The key insight into why this leads to significant memory sav-
ings, is the fact that in all polyhedral programs that we encountered,
truly affine dependences are almost always rank deficient, i.e., are
Conference’17, July 2017, Washington, DC, USA Waruna Ranasinghe, Nirmal Prajapati, Tomofumi Yuki, and Sanjay Rajopadhye
many-to-one mappings from the consumer index points to the pro-
ducers. The only exceptions are either pathological programs, or
programs that do multi-dimensional data reorganizations via bijec-
tions (e.g., matrix transpose, tensor permutations, etc.) where here
is no scope nor need to reduce the total memory footprint. As a
result, f (De ) is almost always a lower dimensional polyhedron, and
requires significantly less memory, even when stored supposedly
inefficiently.
A.3 Related Work
Memory allocation for polyhedral programs is a well studied prob-
lem for almost two decades. DeGreef and Cathoor [20] tackled the
problem of sharing the memory across multiple arrays in the pro-
gram.the so called inet-array memory reuse problem, and proposed
an ILP based solution. Wilde and Rajopadhye, in dealing with an
intrinsically memory-inefficient functional language Alpha [45]
(one can think of this as a program after full expansion) first ad-
dressed the memory reuse for points of an iteration space [66].
They gave necessary and sufficient conditions for the legality of
a memory allocation fucntion, which they allowed to be “in any
direction.” but they did not provide any insight into how to choose
the mapping. Lefebvre and Feautrier [43] on the other hand, con-
sidered only canonic projections, combined with a modulo factor,
but showed how to choose the mapping optimally. Later, Quilleré
and Rajopadhye [52] revisited multiprojections, extended them to
quasi-affine functions, and proved a tight bound on the number of
dimensions of reuse. They also showed that cananic projections
with modulo factors was sometimes a constant factor better, and
sometimes a constant factor worse. Darte at al. [19] took a fresh and
elegant approach to the problem, and formulated the conditions
for legal memory allocations by defining the conflict set. This led
to techniques for choosing provably optimal memory allocations,
initially for non-parameterized iteration spaces, and recently in the
context of FPGA acelerators, for parametrically tiled spaces [16, 17].
Vasilache et al. [64] developed a tool to combine the scheduling and
limiting memory expansion using an ILP formulation, implemented
in the R-Stream compiler. Recently, Bhaskaracharya et al. [4] devel-
oped methods to optimally choose quasi-affine memory allocations,
and showed how they are beneficial for tiled codes, especialy with
live-out data. Furthermore, they also showed [5] how to combine
iner-and intra array reuse in a unifying framework.
The other schedule independent memory allocation was pio-
neered by Strout et al. [57]. Here, the memory allocation is chosen
based only on the dependences, and is guaranteed to be legal, regard-
less of the schedule. This problem is solved when the program has
uniform dependences, i.e., when each dependence can be descibed
by a constant vector, and for some simple extensions of this [57, 72].
Thies at al. [62] have also formulated the problem of simultane-
ously choosing the schedule and memory allocation as a combined
optimization problem.
