Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories by Muthu Manik et al.
Automatic Data Movement and Computation Mapping for
Multi-level Parallel Architectures with Explicitly Managed
Memories
Muthu Manikandan Baskaran1 Uday Bondhugula1 Sriram Krishnamoorthy1
J. Ramanujam2 Atanas Rountev1 P. Sadayappan1
1Dept. of Computer Science and Engineering 2Dept. of Electrical & Computer Engg. and
The Ohio State University Center for Computation & Technology
2015 Neil Ave. Columbus, OH, USA Louisiana State University
{baskaran,bondhugu,krishnsr,rountev,saday}@cse.ohio-state.edu jxr@ece.lsu.edu
Abstract
Several parallel architectures such as GPUs and the Cell proces-
sor have fast explicitly managed on-chip memories, in addition to
slow off-chip memory. They also have very high computational
power with multiple levels of parallelism. A signiﬁcant challenge
in programming these architectures is to effectively exploit the par-
allelism available in the architecture and manage the fast memories
to maximize performance.
In this paper we develop an approach to effective automatic data
management for on-chip memories, including creation of buffers in
on-chip (local) memories for holding portions of data accessed in a
computational block, automatic determination of array access func-
tions of local buffer references, and generation of code that moves
data between slow off-chip memory and fast local memories. We
also address the problem of mapping computation in regular pro-
grams to multi-level parallel architectures using a multi-level tiling
approach, and study the impact of on-chip memory availability on
the selection of tile sizes at various levels. Experimental results on
a GPU demonstrate the effectiveness of the proposed approach.
Categories and Subject Descriptors D.3.4 [Programming Lan-
guages]: Processors—Code generation, Compilers, Optimization
General Terms Algorithms, Management, Performance
Keywords Scratchpad memory, Datamovement, Multi-level tiling,
Graphics Processor Unit
1. Introduction and Motivation
Modern high-performance computer architectures have increasing
numbers of processing elements on chip. Architects must ensure
that memory bandwidth and latency are also optimized to exploit
the full beneﬁts of the available computational resources. Introduc-
ing a cache hierarchy has been the traditional way to alleviate the
memory bottleneck. Caches are hardware-controlled, and it is dif-
ﬁcult to model their exact behavior and to predict program execu-
tion times. While using caches, useful data may be evicted from
the cache and replaced with other data without the programmer’s
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for proﬁt or commercial advantage and that copies bear this notice and the full citation
on the ﬁrst page. To copy otherwise, to republish, to post on servers or to redistribute
to lists, requires prior speciﬁc permission and/or a fee.
PPoPP’08, February 20–23, 2008, Salt Lake City, Utah, USA.
Copyright c   2008 ACM 978-1-59593-960-9/08/0002...$5.00
control. Due to this and other reasons concerning performance and
power, various modern parallel architectures have fast explicitly
managed on-chip (local) memories, often referred to as scratch-
pad memories, in addition to slower off-chip (global) memory in
the system. The scratchpad memories are software-managed and
hence software has complete control over the movement of data
into and out of such memories. The execution times of programs
using scratchpad memories can be more accurately predicted and
controlled. Scratchpad memories help to minimize memory load/-
store latency and maximize on-chip bandwidth by providing more
paths without any issues about coherency.
Numerous challenges arise for a compiler writer provided with
an architecture with explicitly managed memories. The compiler
has to make good decisions on what elements to move in and move
out of local memory, when to move them, and how to efﬁciently ac-
cess the elements brought into local memory, while ensuring pro-
gram correctness. Programmers using architectures with scratch-
pad memories shoulder the burden of orchestrating the movement
of data between global and local memory.
Another signiﬁcant problem to be addressed in modern high-
performance multi-level parallel architectures is the exploitation of
available parallelism. Parallelization of arbitrary regular programs
in these architectures is challenging as the architecture imposes a
number of constraints that have to be addressed in order to effec-
tively map the computation onto the parallel units of the architec-
ture. The computation mapping should utilize the beneﬁts of local
memories and hence the mapping is constrained by the amount of
local memory available.
In this paper we develop approaches to address the challenges
of effective automatic data management in scratchpad memories
and of effective mapping of computation in regular programs to
architectures with multiple levels of parallelism. To address the
ﬁrst problem, we develop a framework that automatically creates
buffers in local memory for storing portions of data that are ac-
cessed in ablock of computation, and determines array access func-
tions for local buffer references. The framework also automatically
generates code to move data between local memory and global
memory. To address the second problem, we perform multi-level
tiling to distribute computation on multiple levels of parallel units
in the architecture.
The rest of the paper is organized as follows. Section 2 intro-
duces the polyhedral model for representing programs and trans-
formations. Section 3 presents the framework to perform automatic
data allocation and code generation for data movement in scratch-
pad memories. Section 4 describes the multi-level tiling approach
for computation mapping on multi-level parallel architectures. Sec-
tion 5 gives an overview of the GPU architecture that is used as atestbed for the experiments presented in Section 6. We discuss re-
lated work in Section 7 and conclude in Section 8.
2. Overview of Polyhedral Model
This section provides some background information on the poly-
tope/polyhedral model, a powerful algebraic framework for rep-
resenting programs and transformations [29, 33]. The polyhedral
model is used by our framework to perform automatic allocation
and data movement in scratchpad memories (discussed in detail in
Section 3).
A hyperplane is an n−1 dimensional afﬁne subspace of an n-
dimensional space; a hyperplane can be represented by an afﬁne
equality. A halfspace consists of all points of an n-dimensional
space that lie on one side of a hyperplane (including the hyper-
plane); it can be represented by an afﬁne inequality. A polyhe-
dron is the intersection of ﬁnitely many halfspaces. A polytope is a
bounded polyhedron. A polytope is represented as
Mx+b ≥  0
where x is a vector of variables, b is a constant vector, and M is a
matrix in which every row describes a hyperplane that bounds the
polytope.
In the polyhedral model, a statement s surrounded by m loops
is represented by an m-dimensional polytope, referred to as an it-
eration space polytope. The coordinates of a point in the polytope
(referred to as the iteration vector   is) correspond to the values of
the loop indices of the surrounding loops, starting from the outer-
most one. In this work, we focus on regular programs where loop
bounds are afﬁne functions of outer loop indices and global param-
eters (e.g., problem sizes) and similarly, array access functions are
also afﬁne functions of loop indices and global parameters. Hence
the iteration space polytope Is can be deﬁned by a system of afﬁne
inequalities derived from the bounds of the loops surrounding s.
Using matrix representation in homogeneous form to express sys-
tems of afﬁne inequalities, the iteration space polytope can be rep-
resented as
Is.
⎛
⎝
  is
  p
1
⎞
⎠ ≥  0
where Is is a matrix representing loop bound constraints and   p is a
vector of global parameters. Each point of the polytope corresponds
to an instance of statement s in program execution.
Afﬁne array access functions can also be represented using
matrices. If a[Fras(  is)] is the rth reference to an array a in statement
s with a corresponding iteration vector  is, then
Fras(  is)=Fras.
⎛
⎝
  is
  p
1
⎞
⎠
where Fras is a matrix representing an afﬁne mapping from the
iteration space of statement s to the data space of array a. Each row
in the matrix deﬁnes a mapping corresponding to a dimension of
the data space. Hence the image of an iteration space polytope Is by
an afﬁne access function Fras deﬁnes the set of elements accessed
by the afﬁne reference, i.e., the data space accessed by the afﬁne
reference. In the subsequent discussion, we denote the image of Is
by Fras as FrasIs.
There has been a signiﬁcant body of work on dependence anal-
ysis in the polyhedral model [15, 34, 39]. We now discuss brieﬂy
the representation of dependences in the polyhedral model. An in-
stance of statement s (denoted by iteration vector   is) depends on
an instance of statement t (denoted by iteration vector  it)i f  is and
  it are valid points in the corresponding iteration space polytopes,
they access the same memory location, and  is is executed before
  it. Since array accesses are assumed to be afﬁne functions of loop
indices and global parameters, the constraint that deﬁnes conﬂict-
ing accesses of memory locations can be represented by an afﬁne
equality (obtained by equating the array access functions in source
and target statement instances). Hence all constraints to capture a
dependence can be represented as a system of afﬁne inequalities
with a corresponding polyhedron (referred to as a dependence poly-
hedron).
The technique of employing the polyhedral model to ﬁnd
(afﬁne) program transformations has been widely used for im-
provement of sequential programs (source-to-source transforma-
tion) [16, 17] as well as automatic parallelization of programs [28,
21, 18, 11]. An afﬁne transform of a statement s is deﬁned as an
afﬁne mapping that maps an instance of s in the original program
to an instance in the transformed program. The afﬁne mapping
function of a statement s is
φs(  is)=Cs.
⎛
⎝
  is
  p
1
⎞
⎠
WhenCs is a row vector, the afﬁne mapping φs isa one-dimensional
mapping. An m-dimensional mapping can be represented as a com-
bination of m (linearly independent) one-dimensional mappings, in
which case Cs is a matrix with m rows.
3. Automatic Data Management in Scratchpad
Memories
In this section we discuss in detail the proposed framework (based
on the polyhedral model discussed in Section 2) for effective auto-
matic data management in scratchpad memories. We discuss three
different aspects of the framework: (1) automatic allocation of stor-
age space (in the form of arrays) in scratchpad memories for hold-
ing portions of data accessed in a block of a program, (2) deter-
mination of access functions of references to arrays in scratchpad
memories, and (3) automatic generation of code for moving data
between scratchpad (local) memory and off-chip (global) mem-
ory. The local memory storage allocation is done independently
of whether the access functions of various array references are
uniformly generated or non-uniformly generated. We create local
memory storage for each non-overlapping region of the data space
of an array that is accessed in a program block. By default, data is
moved in to local memory at the start of the block and moved out
after the completion of the block. Data dependences are not vio-
lated because of the framework’s automatic storage allocation and
data movement.
In modern parallel architectures such as the Cell processor, any
data that is accessed in a computation has to be moved into scratch-
pad memory before access, as data cannot be accessed from global
memory during computation. But in architectures such as GPUs,
data can be accessed from both global memory and scratchpad
memory during computation. Forsuch architectures, theframework
optimally moves only data that have sufﬁcient reuse.
3.1 Details of the Framework
The framework takes as input the iteration spaces of all statements
in a program block as well as the access functions of all array
references. The procedure explained below is applied to all arrays
(one at a time) in the block.
For an array A, let S1,S2,...,Sq be the statements in the given
program block containing references to A, and let Ik represent the
iteration space of statement Sk (1 ≤ k ≤ q). Let F1
k ,F2
k ,...,F
p
k be
the matrices representing the afﬁne read reference functions of A in
statement Sk, and let Gk be the matrix representing the afﬁne write
reference function of A (if there is a write reference of A in Sk).Algorithm 1 Reuse beneﬁt calculation algorithm
Input Set of data spaces (D), Number of references in D (Nref),
Iteration space dimensionality of each reference (ISi,1 ≤ i ≤
Nref), Rank of afﬁne function of each reference (Ri,1 ≤ i ≤
Nref)
1: for i=1 to Nref do
2: if Ri < ISi then
3: Mark yes
4: end if
5: end for
6: if Not Marked yes then
7: if Volume(overlapped regions in D) > δ× Volume(D) then
8: Mark yes
9: end if
10: end if
Output Yes, if beneﬁcial reuse, No, otherwise
As discussed in Section 2, the data space accessed by a refer-
ence (represented by an access function matrix F) in a statement
represented by an iteration space I is given by FI. Hence the set of
all data spaces accessed by all read references of A in all statements
in the program block is
DSA
r = {Fl
kIk | 1 ≤ l ≤ p ∧ 1 ≤ k ≤ q}
Similarly, the set of all data spaces accessed by all write references
of A in the block is
DSA
w = {GkIk | 1 ≤ k ≤ q}
The set of all data spaces accessed by all references of A in the
block is
DSA
rw = DSA
r ∪DSA
w
We partition the set of all data spaces DSA
rw into maximal
disjoint sets DSA
rwd1,DSA
rwd2,...,DSA
rwdm such that each partition
has a subset of data spaces each of which is non-overlapping with
any data space in other partitions, i.e.,
∀d ∈DSA
rwdi,∀e ∈ DSA
rwdj s.t. 1 ≤ i, j ≤ m∧ j  = i : d ∩e = / 0
The problem of partitioning issolved by mapping itto an equivalent
problem of ﬁnding connected components of an undirected graph.
The undirected graph is created withvertices representing each data
space in the set DSA
rw; an edge exists between two vertices if the
intersection of the data spaces corresponding to the vertices is not
empty.
The convex union or the minimum convex polytope that en-
closes all data spaces in a partition DSA
rwdi is
CDSA
rwdi = ConvexHull(DSA
rwdi)
3.1.1 Determining Local Memory Storage
We now describe the procedure that automatically allocates storage
in local memory for portions of array A that are accessed in the
given program block. We generate one local memory array for each
partition DSA
rwdi.
Algorithm 1 explains a procedure to determine if a partition of
data spaces has sufﬁcient reuse in a program block. When the rank
of the access matrix of an array reference is less than the iteration
space dimensionality of the statement in which it is accessed, the
data elements of the array accessed in the reference are said to have
an order of magnitude (or non-constant) reuse. Thus, the condition
for non-constant reuse of a data space of an array a accessed in a
Algorithm 2 Automatic data allocation algorithm
Input Iteration space polyhedra, Afﬁne array access functions
1: for each array A do
2: for each reference of the array do
3: Find the data space accessed by the reference
4: end for
5: Partition the set of all data spaces into maximal disjoint sets
such that each partition has a subset of data spaces each
of which is non-overlapping with any data space in other
partitions
6: for each partition of data spaces do
7: Find the convex union of data spaces in the partition
8: Find the lower and upper bounds of each dimension of
the convex union, as an afﬁne function of program param-
eters. Let the number of dimensions be n. Let the dimen-
sion variables be i1,i2,...,in. Let lbk and ubk be the lower
and upper bounds of the dimension variable ik.
9: Deﬁne the local storage for a partition of accessed data
spaces of array A as an array of dimension n and size
(ub1 −lb1+1)×(ub2 −lb2+1)×···×(ubn −lbn+1)
10: end for
11: end for
Output Local memory storage for each non-overlapping accessed
region of original arrays
reference Fras(  is) is
rank(Fras) < dim(  is) (1)
If a given partition of data spaces has at least one reference whose
accessed data space has non-constant reuse (i.e., it satisﬁes Condi-
tion (1)), then the partition is marked as beneﬁcial to be copied to
scratchpad memory. Otherwise we check if the data spaces in the
partition have signiﬁcant constant reuse. Constant reuse in the set is
estimated by considering each pair of data spaces, determining the
volume of their intersection, and summing up these volumes. If the
sum of the volumes constitutes a signiﬁcant portion, determined by
a fraction δ, of the total volume of the set of all data spaces, then
the partition is marked as beneﬁcial. We have empirically ﬁxed a
value of 30% for δ.
The procedure explained below is applied to each partition of
data spaces DSA
rwdi to determine local memory storage space for
the partition. For architectures such as GPUs, only those partitions
that are marked as beneﬁcial by Algorithm 1 are considered. The
overall approach is summarized in Algorithm 2.
For each partition of data spaces DSA
rwdi we consider the cor-
responding convex hull CDSA
rwdi and ﬁnd the lower and upper
bounds of each dimension of this convex hull. The bounds are de-
termined in the form of an afﬁne function of parameters of the pro-
gram block, using the Parametric Integer Programming (PIP) soft-
ware [14]. These bounds determine the size of the local memory
array created for the partition. Let the dimensionality of the con-
vex hull be n. Let i1,i2,...,in represent the variables denoting each
dimension of the convex hull and let lbk and ubk be the lower and
upper bounds of dimension variable ik. Local memory storage cre-
ated for a partition of accessed data spaces of array A is an array
of dimension n with size (ub1 −lb1+1)×(ub2 −lb2+1)×···×
(ubn−lbn+1). The order of array dimensions of the local memory
array follow that of the global memory array.
3.1.2 Determining Access Functions of Local Memory Array
References
For each reference to the original array A in the given program
block, we ﬁnd the corresponding access function for the localmemory array reference. The dimensionality (say m) of the original
global memory array A might be greater than the dimensionality
(say n) of the local memory array created for a partition of accessed
data spaces of A. We use CLooG [10], an efﬁcient code generator
tool, to ﬁnd the bound expressions of each dimension of the convex
hull of the partition of accessed data spaces. The bound expressions
generated from CLooG are represented as a loop structure with n
nested loops. The dimensions of the original data space that do not
appear in the convex union polytope (convex hull) are represented
as afﬁne functions of dimensions that appear in the polytope, and
program parameters, in their respective positions within the loop
structure.
For any array access F (  y), each row of the access function ma-
trix F represents the array subscript of a dimension in the original
data space. Let F  be the matrix that is derived from F by removing
the rows (if any) corresponding to the dimensions in the original ar-
ray that do not appear in the local memory array. (Note that F  = F
when m = n).
For any reference A[F (  y)] in the original program block, the
corresponding local memory array access function is
F  (  y)−g, where g =( lb1,lb2,...,lbn)T
3.1.3 Generating Data Movement Code
This subsection describes the procedure to generate code for data
movement. This procedure is applied to each partition of data
spaces DSA
rwdi for which a local memory array is created.
From DSA
rwdi, we select the data spaces that are accessed by
read references. We generate the loop structure of the code that
moves data from global memory to local memory by scanning the
selected data spaces using CLooG. Similarly, from DSA
rwdi,w e
select the data spaces that are accessed by write references and
use CLooG to generate the loop structure of the code that moves
data from local memory to global memory by scanning the selected
data spaces. CLooG scans the data spaces in an efﬁcient way such
that the generated loop structure leads to single load/store of each
data element that is read/written even if the accessed data spaces of
references are overlapping.
Having generated the loop structures, the loop body of the
data move in and move out code is generated as follows. Let the
dimensionality of the original array and the local memory array
be m and n, respectively. Let H be a m×n matrix derived from
an n ×n identity matrix Idn by adding a row (in the respective
position) for each of the m−n dimensions that do not appear in
the convex union polytope CDSA
rwdi. The added row represents
the corresponding dimension as afﬁne function of dimensions that
appear in the polytope, and program parameters. (Note thatH =Idn
when m = n). Let  y be the iteration vector of the loop structure of
data move in (and data move out) code.
The loop body of the data move in code for each local array Li
created for A is
Li[Idn.  y−g]=A[H.  y]
and the loop body of the data move out code is
A[H.  y]=Li[Idn.  y−g]
where g =( lb1,lb2,...,lbn)T
To estimate an upper bound on the volume of data moved in to
the local memory array created for DSA
rwdi, we partition the set of
data spaces in DSA
rwdi, that are accessed by read references, into
maximal non-overlapping subsets of data spaces (as explained ear-
lier in the section), and ﬁnd the space required in the local memory
array for each such partition (using the procedure explained in Al-
gorithm 2). The upper bound on the volume of data moved in to the
local array is given by the total space needed in the local array for
all such partitions. Similarly, we estimate the upper bound on the
volume of data moved out of the local memory array by ﬁnding the
total space needed for all non-overlapping partitions of data spaces
in DSA
rwdi, that are accessed by write references.
We use the Polylib [32] tool to perform operations over poly-
topes, such as ﬁnding the image of iterationspace polytopes formed
by afﬁne functions, ﬁnding the union of data spaces, and ﬁnding the
convex union of data spaces.
Figure 1 shows an example that illustrates automatic data allo-
cation in local memory storage and code generation for data move-
ment.
3.1.4 Optimizing Data Movement
All data that are accessed by read references in a program block
need not be moved in to scratchpad memory; similarly, all data
that are accessed by write references need not be written out from
scratchpad memory. Only those data elements that are read by
statement instances in the block but whose values are written by
earlier statement instances outside the block, need to be moved in.
Of course, data elements corresponding to an input array (array that
is only read in the program but not written) also need to be brought
in. Similarly, data elements that are written by statement instances
inthe block butnot read by any statement instance outside theblock
need not be copied out to global memory from scratchpad memory,
unless the data elements belong to an output array (array that is
only written in the program but not read).
The optimal strategy for determining data elements that need to
be copied in and copied out requires data dependence information.
We are given the iteration spaces of all statements in a program
block and the set of all dependence polyhedra. For each true de-
pendence (deﬁned by a polyhedron), we identify each statement
instance in the block that is a target of the dependence but whose
corresponding source statement instance does not belong to the
block. The data accessed due to the array read references involved
in true dependences in the set of all identiﬁed statement instances
constitute the data that needs to be copied into local memory (in
addition to data belonging to input arrays accessed in the block) for
the computation in the block. Similarly, for each true dependence,
we identify each statement instance in the block that is a source of
the dependence but whose corresponding target statement instance
does not belong to the block. The data accessed due to the array
write references involved in true dependences in the set of all iden-
tiﬁed statement instances constitute the data that needs to be copied
out from local memory (in addition to data belonging to output ar-
rays written in the block) after the computation in the block.
The current implementation of the framework takes iteration
spaces of statements and afﬁne array access functions of references
as input, and creates local memory arrays and data movement code.
In future work we plan to implement the optimization outlined
above, based on data dependence information.
4. Tiling for Multiple Levels of Parallelism
In this section we present in detail the approach of mapping com-
putation in an afﬁne program to different levels of parallel units in
a multi-level parallel architecture that has explicitly managed on-
chip memories. We use a multi-level tiling approach that addresses
the constraints imposed on tile sizes by the availability of on-chip
memory.
4.1 Details of the Approach
The architecture we consider for further explanation has the fol-
lowing components: (1) a slow global memory, (2) a set of parallel
units at an outer level that communicate with each other throughOriginal Code:
A[200][200]; B[200][200];
for (i=10;i<=14;i++) {
for (j=10;j<=14;j++) {
A[i][j+1] = A[i+j][ j+1]∗3;
for (k=11;k<=20;k++) {
B[i][ j+k] = A[i][k] + B[i+j][k];
}
}
}
Modiﬁed Code:
LA[19][10]; LB[19][24];
for (i=10;i<=14;i++) {
for (j=10;j<=14;j++) {
LA[i−10][j+1−11] = LA[i+j−10][j+1−11]∗3;
for (k=11;k<=20;k++) {
LB[i−10][j+k−1 1 ]=L A [ i −10][k−11] + LB[i+j−10][k−11];
}
}
}
/∗ Array A∗/
/∗ Local Memory Storage
lb(i) = 10; ub(i) = 28
lb(j) = 11; ub(j) = 20
offset (i) = 10; offset (j) = 11;
∗/
LA[19][10];
/∗ Data Move in code ∗/
for (i=10;i<=14;i++) {
for (j=11;j<=20;j++)
LA[i−10][j−11] = A[i][j] ;
}
for (i=20;i<=28;i++) {
for (j=max(i−13,11);j<=min(15,i−9);j++)
LA[i−10][j−11] = A[i][j] ;
}
/∗ Data Move out code ∗/
for (i=10;i<=14;i++) {
for (j=11;j<=15;j++)
A[i][j] = LA[i−10][j−11];
}
/∗ Array B∗/
/∗ Local Memory Storage
lb(i) = 10; ub(i) = 28
lb(j) = 11; ub(j) = 34
offset (i) = 10; offset (j) = 11;
∗/
LB[19][24];
/∗ Data Move in code ∗/
for (i=20;i<=28;i++) {
for (j=11;j<=20;j++)
LB[i−10][j−11] = B[i][j ];
}
/∗ Data Move out code ∗/
for (i=10;i<=14;i++) {
for (j=21;j<=34;j++)
B[i][j] = LB[i−10][j−11];
}
Figure 1. Example of data allocation and movement.
the global memory space, (3) a set of parallel units within each
outer-level parallel unit, and (4) a local fast explicitly managed
scratchpad memory within each outer-level parallel unit shared by
the inner-level parallel units. The number of software parallel pro-
cesses or threads executed in the system can be higher than the
number ofparallel processors inthe hardware. Each outer-level par-
allel process can be thought of as logical grouping of a set of inner-
level parallel processes. When more than one process is launched
on an outer-level parallel unit, these processes divide the scratchpad
memory among themselves. In this case, the inner-level processes
that are part of one outer-level process have available to them only
a portion of the local scratchpad memory of the outer-level unit;
these inner-level processes share this portion among themselves.
Given any input program, the ﬁrst step is to ﬁnd the parallelism
available in the computation. Our approach uses the framework de-
veloped by Bondhugula et al. [7] for this purpose. Given any afﬁne
input program, the framework ﬁnds an optimal set of afﬁne trans-
formations (or, equivalently, tiling hyperplanes) for each statement
to minimize the volume of communication between tiles and also
to improve data reuse in each tile. The framework ﬁnds bands of
permutable loops that can be tiled and also identiﬁes points in exe-
cution where synchronization is required. A band can have a single
sequential loop, or it can have multiple loops found in the increas-
ing order of communication volume induced due to the loop. In our
approach, we consider the outermost band that has multiple per-
mutable loops, and treat the communication-free loops in the band,
if any, as space loops. If there are no communication-free loops in
the outermost band, we treat all but the last loop as space loops, in
order to achieve pipeline parallelism. Having identiﬁed the bands
of permutable loops and the space and time loops, we proceed to
perform multi-level tiling of the space loops to distribute the avail-
able parallelism across the various levels of parallel units in the
system. By default, we do as many levels of tiling as the number of
levels of parallel units. But due to constraints imposed by memory
availability at different levels (as explained later), we perform ad-
ditional levels of tiling when needed. For these additional levels of
tiling,which are done within tilesthat are distributed across parallel
processes, we tile all permutable loops.
In the two-level parallel architecture considered, for programs
that require synchronization across outer-level parallel processes,
all processes involved in synchronization need to be launched and
active on the outer-level hardware parallel processors. In such
cases, for a speciﬁc total number of outer-level parallel processes
that are launched on the same outer-level processor, there is an up-
per limit on the amount of local memory that is available to each
process. This limit is given by the total capacity of local memory in
the processor, divided by the number of processes assigned to this
processor. However, for programs that do not require synchroniza-
tion across outer-level parallel processes, all these processes need
not be active at the same time on the outer-level hardware paral-FORALL i = 1, Ni
FORALL j = 1, Nj
FOR k = 1, WS
FOR l = 1, WS
S1
END FOR
END FOR
END FOR
END FOR
Figure 2. Example of Multi-level Tiling - Original Code
lel processors. This allows some of them to be launched after the
completion of others, which could increase the amount of available
local memory per process. In such cases, the upper limit on the
amount of local memory that is available to each process, that is
launched on an outer-level processor, is ideally the total capacity of
local memory in the processor.
The procedure to perform multi-level tiling for the two-level
parallel architecture considered, is as follows. We ﬁx the number
of parallel processes at outer and inner levels to be a multiple of
the number of physical parallel processors at the level. We ﬁrst
perform an outer level of tiling of the space loops that equally
distributes tiles across outer-level parallel processes. If the tile in
an outer-level process is large enough such that it requires more
local memory than the available amount, it becomes necessary to
introduce one more level of tiling, in order to limit the amount of
needed local memory. In this case we split the tile in an outer-level
parallel process into sub-tiles (that are executed sequentially within
the outer-level tile) such that each sub-tile requires an amount of
local memory that isno higher than theﬁxed upper limit.Weﬁnd an
optimal set of tile sizes that deﬁnes an atomic unit of computation
in an outer-level tile under the constraint of limited local memory
availability, using the algorithm described in Section 4.3. Once the
outer level tiling is done, we perform an inner level of tiling of the
space loops that equally distributes the iterations of an atomic unit
of computation executed in an outer-level parallel process among
the inner-level parallel processes.
Figure 2 and Figure 3 illustrate an example in which multiple
levels of tiling are done to exploit various levels of parallelism
available in the system.
4.2 Optimal Placement of Data Movement Code
With the tiling hyperplanes (that determine the shape of a tile
executed atomically in an outer-level parallel process) known, the
iteration spaces of statements in a tile, parameterized by tile sizes,
are determined. The iteration spaces of statements in a tile, along
with the afﬁne array access functions for each reference in the
tile are given as input to the framework described in Section 3 to
determine the local memory storage needed for data accessed in
the tile (as a function of tile sizes) and to generate code to move
data between local memory and global memory. By default, code
to move data in to local memory allocated for a tile is placed
in the program structure at the beginning of execution of the tile
(computational block) and code to move modiﬁed data out to global
memory isplaced at the end of execution of the tile.The tiling loops
form a loop nest over the data movement code and tile code.
When an array access function does not depend on a loop
iterator in the iteration space, then the loop is a redundant loop for
the reference. If all references of a local memory array have one
or more common redundant loops in the loop nest of tiling loops,
then the data movement code of the array is hoisted and placed at a
level in the loop nest of tiling loops such that any tiling loop that is
below the level in the loop nest is a redundant loop identiﬁed for the
// Tiling to distribute at the outer level
FORALL iT = 1, Ni, Ti
FORALL jT = 1, Nj, Tj
// Tiling to satisfy local memory limit
FOR i’ = iT, min(iT+Ti−1,Ni), ti ’
FOR j’ = jT, min(jT+Tj−1,Nj), tj ’
FOR k’ = 1, WS, tk’
FOR l’= 1, WS, tl’
<Data move in Code>
// Tiling to distribute at the inner level
FORALL it = i’, min(i’+ti’−1,Ni), ti
FORALL jt = j’, min(j’+tj’−1,Nj), tj
FOR i = it , min(it+ti−1,Ni)
FOR j = jt , min(jt+tj−1,Nj)
FOR k = k’, min(k’+tk’−1,WS)
FOR l = l ’, min(l’+tl’−1,WS)
S1
END FOR
END FOR
END FOR
END FOR
END FORALL
END FORALL
<Data move out Code>
END FOR
END FOR
END FOR
END FOR
END FORALL
END FORALL
Figure 3. Example of Multi-level Tiling - Multi-level Tiled Code
array. By doing such hoisting, the data in the local memory array
is reused, if possible, across various computational blocks. Placing
data movement code outside redundant loops helps in the selection
of optimal tile sizes (for tiles that are sequentially executed in an
outer-level parallel process).
4.3 Tile Size Search Algorithm
The goal of this algorithm is to ﬁnd optimal sizes for tiles that
determine the atomic units of computation executed in an outer-
level parallel process, under the constraint that the active local
memory used by the process does not exceed a given upper limit
Mup. The algorithm tries to minimize data movement cost between
local memory and global memory.
The data movement cost is directly affected by the volume
of data that is being moved and the number of occurrences of
data movement. The movement is done in parallel by the inner-
level processes and there has to be a synchronization among these
processes every time the data is moved. Hence the data movement
cost, C, is modeled as
C = N ×((P×S)+
V ×L
P
)
where N is the number of occurrences of data movement, P is
the number of inner-level processes, S is the synchronization cost
per process per occurrence of data movement, V is the volume
of data moved each time, and L is the cost of transferring a data
element. There is usually an architecture-dependent constraint on
the minimum number of inner-level processes to be chosen, below
which the resources are under-utilized and overlap of computation
and load/store into local memory is very poor. Let this lower limit
be Plow and P ≥ Plow.
Consider tiling of a loop nest with maximum depth m. Let the
index range of the ith loop in the nest be Ni and let ti be the optimal
tile size to be found for the ith loop in the nest. Let the number oflocal memory arrays created in a tile be nl and rj be the optimal
position in the loop nest where data movement code of jth local
memory array is placed. Using Algorithm 2, we determine the size
of each local memory array created in a tile, as a function of the tile
sizes. Let M1,M2,...,Mnl represent the size of each local memory
array. Let Vin
j and Vout
j be the upper bounds on the volume of data
moved in to and moved out of jth local memory array, respectively,
determined as a function of tile sizes, as explained in Section 3.1.3.
Let I denote the set of arrays to which data is moved in and O
denote the set of arrays from which data is moved out. The tile
size search algorithm is phrased as an optimization problem that
minimizes the cost of data movement between local memory and
global memory. The constraints ensure that (1) all tile sizes are
greater than zero but lesser than or equal to the corresponding loop
index range, (2) total memory required by a tile is within the given
upper limitMup,and (3) tilesizes are large enough to keep all inner-
level processes busy.
The optimization problem is formulated below.
Variables:
t1,t2,...,tm
Constraints:
∀i :1≤ i ≤ m,ti > 0
∀i :1≤ i ≤ m,ti ≤ Ni
nl
∑
i=1
Mi ≤ Mup
t1×t2×···×tm ≥ P
Objective function:
minimize ∑
k∈I
(
rk
∏
i=1
Ni
ti
×((P×S)+
Vin
k ×L
P
))+
∑
k∈O
(
rk
∏
i=1
Ni
ti
×((P×S)+
Vout
k ×L
P
))
The above optimization problem is a nonlinear constrained op-
timization problem that can be solved by a technique such as se-
quential quadratic programming (SQP). Note that we need to re-
lax the integer constraints on (t1,t2,...,tm) to lie in Rm instead of
Zm, solve the optimization problem, then round off the result to the
closest integral vector. A detailed discussion on SQP is presented
in [4].
5. Overview of GPU Architecture
GPU architectures are representative of the class of multi-level
parallel architectures with explicitly managed memories. We have
used GPUs as the experimental testbed architecture to implement
and illustrate the beneﬁts of our work discussed in Section 3 and
Section 4. This section presents a brief overview of the GPU archi-
tecture.
The GPU architecture has a set of multiprocessors (MIMD
units) and within each multiprocessor has a set of processor cores
executing in a SIMD fashion (referred to as SIMD units in fur-
ther discussion). Hence the architecture exhibits a two-level par-
allelism, namely, parallelism across the multiprocessors (MIMD
units) and parallelism across the various SIMD units within a mul-
tiprocessor. The multiprocessors communicate through an off-chip
DRAM memory, which has very high access latency. The SIMD
units within a multiprocessor communicate through a fast local
scratchpad memory (referred to as shared memory) that resides in
the multiprocessor. Programming GPUs for general-purpose appli-
cations is enabled through the Compute Uniﬁed Device Architec-
ture (CUDA) programming model [30]. The CUDA programming
model abstracts the multiprocessors as a grid of virtual processors
called thread blocks, and abstracts the SIMD units within a multi-
processor as a grid of virtual processors called threads. The execu-
tion model of a GPU device is such that a grid of thread blocks is
executed by running one or more blocks on each MIMD unit, and
each block is split into SIMD groups of threads called warps. The
size of a warp in NVIDIA GeForce 8800 GTX is 32. Hence the
lower limit on the number of threads used, Plow (as deﬁned in Sec-
tion 4.3), is ﬁxed at 32 for experiments on NVIDIA GeForce 8800
GTX.
Multi-level tiling, as described in Section 4, is performed at an
outer level across the thread blocks mapped to the MIMD units,
and at an inner level across the threads mapped to the SIMD units.
In general, given a set of parallel or space loops for a program,
the space loops are tiled and the tiles are allocated/scheduled to
execute on the thread blocks. The parallel iterations within a tile
are executed in a SIMDfashion across the threads of a thread block.
Based on the availability of shared memory, the tile executing in a
thread block is divided into sub-tiles that are executed sequentially
in a thread block.
In GPUs the number of concurrent thread blocks at any point of
time is determined by the amount of shared memory that is needed
by any thread block. If the shared memory required by a thread
block is M bytes and if the total available shared memory in the
device is X bytes, then the maximum number of concurrent thread
blocks in the device cannot exceed X
M. The shared memory that is
available in a MIMD unit in a NVIDIA GeForce 8800 GTX is 16
KB and there are 16 MIMD units; hence, the maximum number of
concurrent thread blocks cannot exceed 218
M .
6. Experimental Results
The experiments were conducted on a NVIDIA GeForce 8800
GTX GPU device. The device has 768 MB of DRAM and has 16
multiprocessors (MIMD units) at 675 MHz. Each multiprocessor
has 8 SIMD units running at twice the clock frequency of the
multiprocessor and has 16 KB of scratchpad memory (referred to as
shared memory in the context of CUDA). The CUDA kernels are
compiled using the NVIDIA CUDA Compiler (nvcc) to generate
the device code that is launched from the CPU (host). The CPU is
an IntelCore2 Duo processor at2.13 GHz with2 MB L2cache. The
GPU device is connected to the CPU through a 16-x PCI Express
bus. The host programs are compiled using the gcc compiler with
-O3 optimization ﬂag.
We broadly conducted our experimental study on two kernels,
namely Mpeg4 Motion Estimation (ME) kernel and 1-D Jacobi ker-
nel, the former requiring no synchronization across thread blocks
and the later requiring synchronization across thread blocks. For
efﬁcient execution of the kernels, we performed multi-level tiling
on the kernels and effectively managed data in scratchpad memory
using our automatic data management framework.
We ﬁrst present the experimental results that show the ben-
eﬁts of using scratchpad memory to efﬁciently access data for
computation. Fig. 4 and Fig. 5 illustrate the beneﬁts of efﬁcient
data access using scratchpad memory and also exemplify the high
speedup achieved in running the kernel in GPU in contrast to run-
ning in CPU. The speedup of the implementation utilizing scratch-
pad memory is 8x for Mpeg4 ME kernel and 10x for Jacobi kernel
over that using only GPU DRAM. The speedup over CPU perfor-
mance is over 100x for Mpeg4 ME kernel and 15x for Jacobi ker-
nel.
The original and parallel tiled code structure of the Mpeg4 ME
kernel are shown in Fig. 2 and Fig. 3. For various problem sizes, we
conducted experiments to analyze the performance of the Mpeg4
ME kernel. The results are shown in Fig. 6. The number of thread 0
 20000
 40000
 60000
 80000
 100000
 120000
 140000
 160000
64M 16M 9M 4M 2M 1M 256k
E
x
e
c
u
t
i
o
n
 
T
i
m
e
 
(
m
s
)
Problem Size
GPU Perf w/o Scratchpad memory
GPU Perf with Scratchpad memory
CPU Perf
Figure 4. Execution time of Mpeg4 ME for various problem sizes
 0
 1000
 2000
 3000
 4000
 5000
 6000
512k 256k 128k 64k 32k 16k 8k
E
x
e
c
u
t
i
o
n
 
T
i
m
e
 
(
m
s
)
Problem Size
GPU Perf w/o Scratchpad memory
GPU Perf with Scratchpad memory
CPU Perf
Figure 5. Execution time of 1-D Jacobi for various problem sizes
blocks was chosen as 32 and the number of threads as 256. The i
and j loops in Fig. 2 are the space loops that are tiled across thread
blocks and threads, and k and l loops are the time loops. The sizes
of tiles that distribute computation across thread blocks are set by
dividing the problem size equally (except for boundary) among
the thread blocks. The time loops k and l in this kernel iterate
over very small index range (16 in our experiment) and hence the
data accessed in the iteration space of the loops do not occupy
much space in the scratchpad memory. The tile search algorithm
described in Section 4.3 found tile sizes of 32, 16, 16 and 16 for
i, j, k and l loops to be optimal as these tile sizes lead to lesser
data movement cost by reducing the number of data movements,
given the scratchpad memory constraint. From the results shown in
Fig. 6, it is clear that the tile sizes chosen by the algorithm gave
better performance than other tile sizes for various problem sizes.
For the implementation of Jacobi-1D kernel that has a space
loop surrounded by a time loop, we used the framework discussed
in [27] to modify the tiled code to enable concurrent start of ex-
ecution in all processes, and performed multi-level tiling over the
modiﬁed code. We ran experiments on the kernel for various prob-
lem sizes that can completely ﬁt in the total scratchpad memory
available in the GPU device and obtained results that are depicted
in Fig. 7. The number of time iterations was chosen as 4096 and
 400
 600
 800
 1000
 1200
 1400
 1600
 1800
 2000
64M 32M 16M 8M
E
x
e
c
u
t
i
o
n
 
T
i
m
e
 
(
m
s
)
Problem Size
Tile Size = 8,8,16,16
Tile Size = 16,8,16,16
Tile Size = 16,16,16,16
Tile Size = 32,16,16,16
Tile Size = 32,32,16,16
Tile Size = 64,16,16,16
Figure6. Execution time ofMpeg4 ME kernel forvarying tilesizes
time tile size as 32. The problem size was equally divided (except
for boundary) by the number of thread blocks used, to set the size
of space tile executed per thread block. The number of threads used
was 64. For problem sizes that completely ﬁt in the total scratch-
pad memory available in the device, the number of thread blocks
has no constraint. However, since the kernel requires synchroniza-
tion across all thread blocks, for very high number of thread blocks,
the computation done by a thread block is too small to hide the
synchronization cost incurred. The same behavior is illustrated by
Fig. 7. Performance enhances as the number of thread blocks in-
creases, till a point, and then decreases when the synchronization
cost dominates over the amount of computation done in a thread
block.
 20
 30
 40
 50
 60
 70
 80
 90
 100
 50  100  150  200  250
E
x
e
c
u
t
i
o
n
 
T
i
m
e
 
(
m
s
)
Number of Thread Blocks
N = 32k
N = 16k
N = 8k
Figure 7. Execution time of 1-D Jacobi for smaller problem sizes
for varying thread blocks
For larger problem sizes that have to do be tiled to ﬁt in the
scratchpad memory, we ﬁxed the number of thread blocks to be
128, ﬁxed empirically from the experimental results shown in
Fig. 7, to allow better concurrency and incur less synchronization
cost. The active scratchpad memory used by a thread block was
hence limited to 211 bytes or 29 words. The number of threads used
was ﬁxed at 64. The tile search algorithm described in Section 4.3
gave a space tile size of 256 and time tile size of 32 to be optimalfor minimizing the data movement cost between scratchpad mem-
ory and global memory. The experiments conﬁrmed the same, as
indicated by the results in Fig. 8 that show the performance of the
kernel for various tile sizes for different problem sizes.
 200
 400
 600
 800
 1000
512k 256k 128k 64k
E
x
e
c
u
t
i
o
n
 
T
i
m
e
 
(
m
s
)
Problem Size
Tile Size = 32,64
Tile Size = 32,128
Tile Size = 16,256
Tile Size = 32,256
Tile Size = 64,256
Figure 8. Execution time of 1-D Jacobi for larger problem sizes for
varying tile sizes
7. Related Work
In this section we discuss prior work that has addressed compiler
issues in multi-level parallel architectures and architectures with
explicitly managed scratchpad memories.
Schreiber and Cronquist [38] have proposed an approach to do
near-optimal allocation of scratchpad memories and near-optimal
reindexing of array elements in scratchpad memories. Their ap-
proach generates separate storage efﬁcient local arrays for each
equivalent class of uniformly generated references. Hence if data
accessed due to two references, belonging to different classes,
are overlapping, then two different local arrays would be created
to hold the overlapping accessed data spaces. Anantharaman and
Pande [1] perform data partitioning on arrays into portions to be
kept in local memory and global memory. They compute a bound-
ing box for each equivalent group of uniformly generated refer-
ences as in the case of [38]. Eisenbeis et al. [12] consider elements
to move to local memory from a view of individual iteration of a
loop nest instead of an atomic unit of computation of the program.
Kandemir et al. [25] propose an approach for dynamically manag-
ing scratchpad memories, but they handle only uniformly generated
afﬁne references.
The idea of estimation of the number of references to an array in
order to predict cache effectiveness has been discussed by Ferrante
et al. [19] and Gallivan et al. [20]. The idea of ﬁnding image of
the iteration space onto the array space to optimize global transfers
has been discussed in [20]; but only a framework for estimating
bounds for the number of elements accessed was given. Ferrante et
al. gave exact values for uniformly generated references but did not
consider multiple references. Also, for non-uniformly generated
references, arbitrary correction factors were given for arriving at
lower and upper bounds for the number of distinct references.
Clauss[9] and Pugh [35] have presented more expensive but exact
techniques to count the number of distinct accesses.
There has been a signiﬁcant amount of research on memory
reduction and optimization of data locality for embedded single-
processor-on-chip (SOC)systems. In the case of memory optimiza-
tions, Panda et al., Balasa et al., and the IMEC group have derived
several transformations forimproving memory performance on em-
bedded systems [3, 8, 31, 36, 40]. Their work is a collection of
techniques that form a custom memory management methodology
referred to as data transfer and storage exploration (DTSE). There
is a large body of work on estimating the memory requirements of
loops [3, 8, 36, 40] (and references therein). Most of these works
assume the given sequential execution order and ﬁnd the memory
requirements. A number of works have addressed scratchpad mem-
ory management [22, 24, 25] (to name a few).
Multi-level tiling approach has been employed in various con-
textssuch as tilingforvarious levels ofmemory hierarchy [2,6,13],
and tiling for parallelism and locality [37, 5]. Multi-level tiling has
become a key technique for high-performance computation. There
has been work on generating efﬁcient multi-level tiled code for
polyhedral iteration spaces that handle tile sizes at compile time
[23] and that handle tile sizes as symbolic parameters [26].
8. Conclusions
In this paper, we have developed approaches to address two main
challenges in modern high-performance multi-level parallel archi-
tectures with explicitly managed scratchpad memories, namely, ef-
fective data management in scratchpad memories, and effective
mapping of computation in regular programs on to multiple levels
of parallel units on the architecture. We have developed a frame-
work, to address the ﬁrst problem, that automatically performs al-
location of storage space in scratchpad memory to hold portions
of data accessed in a computational block, determination of access
functions of local memory array references, and generation of code
for moving relevant portions of data resident in slow off-chip mem-
ory to fast scratchpad memory and vice versa. We have employed
a multi-level tiling strategy, to address the second problem, that ef-
fectively handles the impact of on-chip memory availability on tile
sizes at various levels. We have shown the effectiveness of our ap-
proach through experiments on GPUs which are representatives of
high performance multi-level parallel architectures with explicitly
managed scratchpad memories.
Acknowledgments This work is supported in part by the U.S.
National Science Foundation through awards 0121676, 0121706,
0403342, 0508245, 0509442, 0509467 and 0541409.
References
[1] S. Anantharaman and S. Pande. Compiler optimizations for real time
execution of loops on limited memory embedded systems. In IEEE
Real-Time Systems Symposium, pages 154–164, 1998.
[2] Automatically Tuned Linear Algebra Software (ATLAS). http://math-
atlas.sourceforge.net/.
[3] F. Balasa, P. Kjeldsberg, M. Palkovic, A. Vandecappelle, and
F. Catthoor. Loop transformation methodologies for array-oriented
memory management. In 17th IEEE International Conference
on Application-Speciﬁc Systems, Architectures and Processors
(ASAP’06), pages 205–212, 2006.
[4] D. P. Bertsekas. Nonlinear Programming: 2nd Edition. Athena
Scientiﬁc. ISBN 1-886529-00-0.
[5] G. Bikshandi, J. Guo, D. Hoeﬂinger, G. Almasi, B. B. Fraguela, M. J.
Garzaran, D. Padua, and C. von Praun. Programming for parallelism
and locality with hierarchically tiled arrays. In PPoPP, pages 48–57,
2006.
[6] J. Bilmes, K. Asanovic, C. Chin, and J. Demmel. Optimizing matrix
multiply using PHiPAC. In Proc. ACM International Conference on
Supercomputing, pages 340–347, 1997.
[7] U. Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam,
A. Rountev, and P. Sadayappan. Afﬁne transformations for
communication minimal parallelization and locality optimization
of arbitrarily nested loop sequences. Technical Report OSU-CISRC-
5/07-TR43, Ohio State University, May 2007.
[8] F. Catthoor, K. Danckaert, C. Kulkarni, E.Brockmeyer, P.Kjeldsberg,
T. V. Achteren, and T. Omnes. Data Access and Storage Managementfor Embedded Programmable Processors. Kluwer Academic
Publishers, 2002.
[9] P. Clauss. Counting solutions to linear and nonlinear constraints
through ehrhart polynomials: applications to analyze and transform
scientiﬁc programs. In ICS ’96: Proceedings of the 10th international
conference on Supercomputing, pages 278–285, 1996.
[10] CLooG: The Chunky Loop Generator. http://www.cloog.org.
[11] A. Darte and F. Vivien. Optimal ﬁne and medium grain parallelism
detection in polyhedral reduced dependence graphs. IJPP,25(6):447–
496, Dec. 1997.
[12] C. Eisenbeis, W. Jalby, D. Windheiser, and F. Bodin. A strategy for
array management in local memory. In Advances in Languages and
Compilers for Parallel Computing, 1990 Workshop, pages 130–151,
Irvine, Calif., 1990. Cambridge, Mass.: MIT Press.
[13] K. Fatahalian, T. J. Knight, M. Houston, M. Erez, D. R. Horn,
L. Leem, J. Y. Park, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan.
Sequoia: Programming the memory hierarchy. In Proceedings of the
2006 ACM/IEEE Conference on Supercomputing, 2006.
[14] P. Feautrier. Parametric integer programming. Operationnelle/Oper-
ations Research, 22(3):243–268, 1988.
[15] P. Feautrier. Dataﬂow analysis of array and scalar references. IJPP,
20(1):23–53, 1991.
[16] P.Feautrier. Some efﬁcient solutions to the afﬁne scheduling problem:
I. one-dimensional time. IJPP, 21(5):313–348, 1992.
[17] P.Feautrier. Some efﬁcient solutions to the afﬁne scheduling problem.
part II. multidimensional time. IJPP, 21(6):389–420, 1992.
[18] P. Feautrier. Automatic parallelization in the polytope model. In The
Data Parallel Programming Model, pages 79–103, 1996.
[19] J. Ferrante, V. Sarkar, and W. Thrash. On estimating and enhancing
cache effectiveness. In Proceedings of the Fourth International
Workshop on Languages and Compilers for Parallel Computing,
pages 328–343, London, UK, 1992. Springer-Verlag.
[20] D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and
local memory management by global program transformation. In
Proceedings of the 1st International Conference on Supercomputing,
pages 229–254, New York, NY, USA, 1988. Springer-Verlag New
York, Inc.
[21] M. Griebl. Automatic Parallelization of Loop Programs for
Distributed Memory Architectures. FMI, University of Passau, 2004.
Habilitation Thesis.
[22] I. Issenin, E. Brockmeyer, B. Durinck, and N. Dutt. Multiprocessor
system-on-chip data reuse analysis for exploring customized memory
hierarchies. In DAC ’06: Proceedings of the 43rd annual conference
on Design automation, pages 49–52, 2006.
[23] M. Jimnez, J. M. Llabera, and A. Fernndez. A cost-effective
implementation of multilevel tiling. IEEE Trans. Parallel Distrib.
Syst., 14(10):1006–1020, 2003.
[24] M. Kandemir, I. Kadayif, A. Choudhary, J. Ramanujam, and I. Kolcu.
Compiler-directed scratch pad memory optimization for embedded
multiprocessors. IEEE Transactions on VLSI (TVLSI), 12(3):281–
287, 2004.
[25] M. Kandemir, J. Ramanujam, M. Irwin, V. Narayanan, I. Kadayif,
and A. Parikh. A compiler based approach for dynamically managing
scratch-pad memories in embedded systems. IEEE Transactions on
Computer-Aided Design, 23(2):243–260, 2004.
[26] D. Kim, L. Renganarayana, D. Rostron, S. Rajopadhye, and M. M.
Strout. Multi-level tiling: M for the price of one. In SC, November
2007.
[27] S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam,
A. Rountev, and P. Sadayappan. Effective Automatic Parallelization
of Stencil Computations. In ACM SIGPLAN PLDI 2007, July 2007.
[28] A. Lim. Improving Parallelism And Data Locality With Afﬁne
Partitioning. PhD thesis, Stanford University, Aug. 2001.
[29] A. W. Lim and M. S. Lam. Maximizing parallelism and minimizing
synchronization with afﬁne transforms. In POPL’97, pages 201–214,
1997.
[30] NVIDIA CUDA.
http://developer.nvidia.com/object/cuda.html.
[31] P. R. Panda, F. Catthoor, N. D. Dutt, K. Danckaert, E. Brockmeyer,
C. Kulkarni, A. Vandecappelle, and P. G. Kjeldsberg. Data and
memory optimization techniques for embedded systems. ACM Trans.
Design Autom. Electr. Syst., 6(2):149–206, 2001.
[32] PolyLib - A library of polyhedral functions.
http://icps.u-strasbg.fr/polylib/.
[33] L.-N. Pouchet, C. Bastoul, A. Cohen, and N. Vasilache. Iterative
Optimization in the Polyhedral Model: Part I, One-Dimensional
Time. In CGO ’07, pages 144–156, 2007.
[34] W. Pugh. The omega test: a fast and practical integer programming
algorithm for dependence analysis. Communications of the ACM,
8:102–114, Aug. 1992.
[35] W. Pugh. Counting solutions to presburger formulas: how and why.
In PLDI ’94: Proceedings of the ACM SIGPLAN 1994 conference on
Programming language design and implementation, pages 121–134,
1994.
[36] J. Ramanujam, J. Hong, M. Kandemir, and A. Narayan. Reducing
memory requirements of nested loops for embedded systems. In DAC
’01: Proceedings of the 38th conference on Design automation, pages
359–364, 2001.
[37] L. Renganarayanan, M. Harthikote-Matha, R. Dewri, and S. V. Ra-
jopadhye. Towards optimal multi-level tiling for stencil computations.
In IPDPS, pages 1–10. IEEE, 2007.
[38] R. Schreiber and D. C. Cronquist. Near-Optimal Allocation of Local
Memory Arrays. Technical Report HPL-2004-24, HP Laboratories
Palo Alto, 2004.
[39] N. Vasilache, C. Bastoul, S. Girbal, and A. Cohen. Violated
dependence analysis. In ACM ICS, June 2006.
[40] Y. Zhao and S. Malik. Exact memory size estimation for array
computations without loop unrolling. In DAC ’99: Proceedings of the
36th ACM/IEEE conference on Design automation, pages 811–816,
1999.