Tiling optimizations for stencil computations by Zhou, Xing
TILING OPTIMIZATIONS FOR STENCIL COMPUTATIONS
BY
XING ZHOU
DISSERTATION
Submitted in partial fulllment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2013
Urbana, Illinois
Doctoral Committee:
Professor David Padua, Chair, Co-Director of Research
Research Assistant Professor Mara J. Garzaran, Co-Director of Research
Professor William Gropp
Professor Wen-Mei Hwu
Doctor Robert Kuhn, Intel
Abstract
This thesis studies the techniques of tiling optimizations for stencil programs. Traditionally,
research on tiling optimizations mainly focuses on tessellating tiling, atomic tiles and regular
tile shapes. This thesis studies several novel tiling techniques which are out of the scope of
traditional research. In order to represent a general tiling scheme uniformly, a unied tiling
representation framework is introduced. With the unied tiling representation, three tiling
techniques are studied. The rst tiling technique is Hierarchical Overlapped Tiling, based
on the idea of reducing communication overhead by introducing redundant computations.
Hierarchical Overlapped Tiling also applies the idea of hierarchical tiling to take advantage of
hardware hierarchy, so that the additional overhead introduced by redundant computations
can be minimized. The second tiling technique is called Conjugate-Trapezoid Tiling, which
schedules the computations and communications within a tile in an interleaving way in order
to overlap the computation time and communication latency. Conjugate-Trapezoid Tiling
forms a pipeline of computations and communications, hence the communication latency can
be hidden. Third, this thesis studies the tile shape selection problem for hierarchical tiling.
It is concluded that optimal tile shape selection for hierarchical tiling is a multidimensional,
nonlinear, bi-level programming problem. Experimental results show that the irregular tile
shapes selected by solving the optimization problem have the potential to outperform intu-
itive tiling shapes.
ii
Acknowledgements
The thesis dissertation marks the end of a long and eventful journey. I would like to ac-
knowledge all the friends and family for their support along the way. This work would not
have been possible without their support.
First, I would like to express my heartfelt gratitude to my advisors, Professor David Padua
and Professor Mara Garzaran, for their patience, encouragement, and immense knowledge.
Their guidance helped me in all the time of research and writing of this thesis. This thesis
would certainly not have existed without their support. I would also like to thank committee
members Professor William Gropp, Professor Wen-Mei Hwu, and Doctor Robert Kuhn for
their insightful comments and suggestions.
I am very grateful to my fellow group members and other friends in Computer Science
Department, for the stimulating discussions, enjoyable experience of working together, and
all the fun we have had in the last four years.
I thank my parents, who gave me the best education and built my personality.
Last but not the least, I would like to thank my wife, who always stands with me during
times of stress and joy.
iii
Table of Contents
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Stencil Computations . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Loop Tiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3 Hierarchical Tiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Chapter 2 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Unied Tiling Representation Framework . . . . . . . . . . . . . . . . . . . . 7
2.1.1 Iteration Space Representation . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Tiling Representation with the Polyhedral Model . . . . . . . . . . . 11
2.1.3 Unied Tiling Representation Framework . . . . . . . . . . . . . . . . 13
2.1.4 Dependences after Tiling . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.5 Schedule of Tiles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1.6 Tiling Transformation Representation . . . . . . . . . . . . . . . . . . 19
2.1.7 Hierarchical Tiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Implementation Using the Omega Library . . . . . . . . . . . . . . . . . . . 20
2.2.1 Set and Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 Code Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Chapter 3 Hierarchical Overlapped Tiling . . . . . . . . . . . . . . . . . . . 23
3.1 Introduce Redundant Computation Through Overlapped Tiling . . . . . . . 23
3.2 Hierarchical Overlapped Tiling . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Analytical Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Overlapped Tiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.2 Hierarchical Overlapped Tiling . . . . . . . . . . . . . . . . . . . . . 32
3.4 Unied Tiling Representation . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
iv
3.5.2 Environment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5.3 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5.5 Compilation Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Chapter 4 Hiding Communication Latency with Conjugate-Trapezoid
Tiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.1 Hiding Communication Latency . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Conjugate-Trapezoid Tiling . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.1 Denitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.2 1-dimensional Stencil . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2.3 2-dimensional Stencil . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.4 Higher Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3 Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.1 Denitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.3.2 Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4 Unied Tiling Representation . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.5.1 Target Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.5.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.5.3 Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.5.4 Experimental Result . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Chapter 5 Tile Shape Selection for Hierarchical Tiling . . . . . . . . . . . 83
5.1 Problem Denition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2 Constraints on Tile Shape Selection . . . . . . . . . . . . . . . . . . . . . . . 87
5.3 Execution Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.4 Calculate the Longest Dependent Path . . . . . . . . . . . . . . . . . . . . . 92
5.4.1 Calculate L(Tk; 1n) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4.2 Calculate L(El; 1n) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.5 Automatic Tile Shape Selection . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.6 Unied Tiling Representation . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.7.1 Environment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.7.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.7.3 Tile Shape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.7.4 Model Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
v
Chapter 6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Chapter 7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
vi
List of Figures
1.1 Hierarchical tiling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 n-depth loop nest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 The edge vectors (thick arrows) of iteration spaces. . . . . . . . . . . . . . . 9
2.3 Valid and invalid tile shapes for tessellating tiling with atomic tiles. . . . . . 11
2.4 The eect of tiling transformation. . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Overlapped Tiling. Each tile is trapezoid-shaped. The grey triangles are the
overlapped areas between tiles. . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Split Tiling. There are two shapes of tiles: triangle and trapezoid. The
neighboring tiles of dierent shapes consist a "super" tile (in the dashed box)
which is the basic repetition unit. . . . . . . . . . . . . . . . . . . . . . . . . 17
2.7 Generated loop nest code for the integer tuple set S1 . . . . . . . . . . . . . 22
2.8 Parallel code with OpenMP and MPI . . . . . . . . . . . . . . . . . . . . . . 22
3.1 A simple tiling example for parallel loops . . . . . . . . . . . . . . . . . . . . 24
3.2 Overlapped tiling of K loops . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Comparison of overlapped tiling and hierarchical overlapped tiling on a 4-
way/8-core multicore system, where each processor contains 2 cores on chip
that share the last level cache. . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 A sequence of K parallel loops . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.5 Equivalent loop nest to the K consecutive parallel loops . . . . . . . . . . . 35
3.6 Framework of the experimental system. . . . . . . . . . . . . . . . . . . . . . 39
3.7 Speedup of traditional tiling, overlapped tiling and hierarchical overlapped
tiling over the original OpenCL code. . . . . . . . . . . . . . . . . . . . . . 42
3.8 Evaluating the performance overlapped tiling and hierarchical overlapped
tiling by scaling fusion depth K. . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.9 Fusion depth K 0 for the second level of hierarchical overlapped tiling. . . . . 46
3.10 Speedup of hierarchical overlapped tiling over plain overlapped tiling with
dierent input sizes. The horizontal axis is the input size for each benchmark. 47
3.11 Compile time for overlapped tiling and hierarchical overlapped tiling, normal-
ized to the compile time of the original OpenCL code. . . . . . . . . . . . . 48
4.1 1-dimensional Jacobi code with MPI. . . . . . . . . . . . . . . . . . . . . . . 50
4.2 An ISL with a d-dimensional stencil. The total depth of the loop nest is d+ 1. 51
4.3 Dependence vectors of a 5-point stencil . . . . . . . . . . . . . . . . . . . . . 52
vii
4.4 Tiling 1-dimensional Jacobi code. Thin arrows represent the dependence vec-
tors. Thick arrows show the dependence between tiles. w and h are the
parameters decided by users to determine the size of each tile. . . . . . . . . 53
4.5 Subtiling and delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.6 Sequential 2-dimensional Jacobi code. . . . . . . . . . . . . . . . . . . . . . . 56
4.7 Tiling with 2-dimensional stencils. The thick arrows in (b) represent the
dependence between tiles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.8 Subtiling with 2-dimensional stencils. . . . . . . . . . . . . . . . . . . . . . . 60
4.9 Schedule of subtiles for each time step.8j, Sj and Rj are the corresponding
send and receive pair. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.10 Projections of subtiles with 2-dimensional stencils. . . . . . . . . . . . . . . . 61
4.11 Slice of a 4-dimensional hyper-parallelepiped tile. . . . . . . . . . . . . . . . 62
4.12 Number of messages to be sent for 3-dimensional Jacobi. Each arrow stands
for a message to be sent to an neighboring tile. The arrows with the same
direction means that the destination of the messages are the same. . . . . . . 67
4.13 Merge communication operations. The send or receive operations with in
the same dashed box can be merged. Compared to Figure 4.9-(b), the send
or receive operations denoted by the dashed arrows are moved (delayed or
brought forward) to the operations denoted by corresponding thick arrows. . 69
4.14 Balanced multi-threading with four processors. The white bars stand for com-
putation time, and the grey bars stand for communication operation overhead. 70
4.15 Speedup with dierent values of h0. The baseline is the performance achieved
by tiling with no communication overlap. The gure shows the performance
with h0 up to 6 for 2-dimensional stencils and h0 up to 4 for 3-dimensional sten-
cils, because the trend shows that there would be no additional performance
gains with a larger h0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.16 Communication overhead with dierent h0. . . . . . . . . . . . . . . . . . . . 80
4.17 Input size sensitivity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.1 Dierent tile shape choices for hierarchical tiling. Arrows are the dependences
between tiles. Dashed lines represent the tiles that can be executed in parallel
in a wavefront schedule. The numbers stand for the order of the wavefronts . 84
5.2 Constraints of tile shape imposed by dependences. . . . . . . . . . . . . . . . 87
5.3 Inter-tile dependences (thicker arraows) generated by original dependence vec-
tor ~dk. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4 The dependent pathes (P and P 0) within an iteration space. The length of a
dependent path is the number of iterations on the path. If any path contains
~d2 (P
0), it is always possible to replace ~d2 with ~d1 and ~d0 on the same position,
which resulting a longer path. This means the longest path (P ) must only
contain ~d0 and ~d1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.5 Determine the length of the longest dependent path within the iteration space
of tile Tk. Each ~p1 stands for the coordinate of the point. . . . . . . . . . . . 96
viii
5.6 Construct path P according to a given path P 0. The lengthes of P and P 0 are
approximately equal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.7 The dependence vectors in the iteration space of stencil computation programs.104
5.8 Hierarchical tiling performance for 2-dimensional iteration space on GPU and
cluster platforms. The horizontal axis is the size of the iteration space. The
vertical axis is the speedup over Wavefront or Diamond. . . . . . . . . . . . 107
5.9 Hierarchical tiling performance for 3-dimensional iteration space on GPU and
cluster platforms. The horizontal axis is the size of the iteration space. The
vertical axis is the speedup over Wavefront. . . . . . . . . . . . . . . . . . . 108
5.10 Comparison between the real execution time and the ideal execution time for
1-D Gauss-Seidel. The left axis is for real execution time and the axis on the
right is for ideal execution time. . . . . . . . . . . . . . . . . . . . . . . . . . 109
ix
List of Tables
3.1 Representation of Overlapped Tiling . . . . . . . . . . . . . . . . . . . . . . 38
3.2 ISL Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Loop Fusion depth K used in experiments. . . . . . . . . . . . . . . . . . . . 45
4.1 Representation of Conjugate-Trapezoid Tiling . . . . . . . . . . . . . . . . . 75
4.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3 Dierent stencils of the benchmarks . . . . . . . . . . . . . . . . . . . . . . . 77
5.1 The Unied Tiling Representation . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2 Common tiling schemes: Wavefront, Diamond and Skewing. . . . . . . . . . 105
5.3 Tile shape selection for 1-D Gauss-Seidel and Jacobi. . . . . . . . . . . . . . 106
5.4 Tile shape selection for 2-D Gauss-Seidel on GPU (192019201920). . . . 108
x
Chapter 1
Introduction
1.1 Background
1.1.1 Stencil Computations
Iterative stencil loops (ISLs) are an important kernel of many computations. For example,
for certain regular matrices, the kernel the widely used Jacobi method is a stencil loop; and
image processing lters are also implemented as stencil loops. Iterations of the outermost
loop of ISLs are called time steps, as programs with ISLs are often used to simulate the
evolution of physical systems over time. At each time step, the inner loop nest of the ISL
operates on a multidimensional array, and computes each element of an array as a function
of neighboring locations in the array. The choice of the neighboring elements follows a xed
pattern or stencil.
To achieve high performance, the inner loop nests of ISLs are usually tiled and each tile
is assigned to a processing node for parallel execution. Data locality and communication are
the main issues when optimizing ISLs. Data locality inuences the execution time within
each computing node. When running on distributed memory systems, the performance of
these ISLs may be hindered by inter-node communication. Limited by the performance of
1
the interconnection network, the cost of communication grows with the scale of the system
increases. If the ISL is parallelized for a shared memory system, the inter-tile communication
is implicit, usually controlled with synchronization operations. Even though the communi-
cation overhead of shared memory systems is usually much lower than that of distributed
memory systems, it may still cause performance degradation.
1.1.2 Loop Tiling
Loop tiling is an eective optimization to improve performance of multiply-nested loops,
which are usually the most time-consuming parts of computationally intensive programs.
Numerous techniques for the tiling of iteration spaces have been proposed. The goal of
tiling is to improve data locality [39, 42, 1, 2, 36, 14, 33, 38, 41, 45, 48], or contribute to
the scheduling of parallel computation [40, 43, 6, 18, 44, 7, 47]. The performance resulting
from the use of a tiling scheme is a function of (1) locality, (2) the amount of parallelism
exposed, and (3) the communication/synchronization overhead. The size and shape of the
tiles, and the scheduling strategy are the main factors that impact locality, parallelism,
and communication cost of the code after tiling. Research on tiling techniques usually try
to improve one or more factors that impact performance. Locality was the rst focus of
research in these studies of tiling, but as the importance of parallel computing increased,
mechanisms to optimize parallelism organization and communication with tiling techniques
gained attention.
Existing tiling transformations mainly relies on tessellating tiling and scheduling oper-
ations in tiles in atomic. In tessellating tiling there is no overlap between tiles. When
scheduling operations of tiles atomically, inter-tile communication or synchronization only
happen before computation starts or after all the computation within a tile completes. The
above two restrictions simplies the study of loop tiling techniques. Elegant representation
frameworks, such as the polyhedral model [5], have been built to facilitate research on tiling
2
schemes under these restrictions. However, by ignoring such restrictions it is possible to
taka advantage of the optimization opportunities of general tiling schemes. For example,
because of data dependences, inter-tile communication is inevitable with tessellating tiling;
and, with atomic tiles execution, schedule of tiles is restricted and often results in load im-
balances. In order to exploit the optimization opportunities of general tiling schemes, a new
representation framework is necessary.
1.1.3 Hierarchical Tiling
Most massively parallel systems today are organized hierarchically. Dierent levels in the
hierarchy may have dierent organizations and follow dierent memory models. Consider a
computer cluster consisting of nodes connected by a network. This forms the rst level of
the hierarchy. At the next level, each node is typically an SMP machine. Modern accelerator
devices are also organized hierarchically. An NVIDIA GPGPU contains a number of Multi-
Processors (MPs). The MPs have uniform access to a global memory. Each MP consists of
several Stream-Processors (SPs), which share the MP-private shared memory.
Hierarchy-aware optimizations are usually necessary to unleash the potential of hierar-
chically organized systems. In order to make better use of hardware hierarchies, loop nests
should also be tiled hierarchically to t the organization of the target machine. Figure 1.1
gives an example of hierarchical tiling for a cluster. In top-down order, (1) the iteration
space of the loop nest is tiled, and each tile is mapped to a node of the cluster. (2) On each
node, the iteration space is further partitioned into smaller tiles, each of which is assigned
to a processor in the node. Hierarchical tiling can also be done bottom-up by partitioning
the original iteration space into tiles that are assigned to the lowest level of hardware and
then grouping tiles into larger ones for the higher levels of hardware.
Hierarchical tiling increases the complexity of the study on tiling transformation, because
each level of tiling has its own choice tile size/shape and scheduling strategy, and the decisions
3
Iteration 
Space
Node
ProcessorLevel 0 Tile
Level 1 Tile
Figure 1.1: Hierarchical tiling.
of tiling scheme at dierent levels interfere with each other. In order to achieve global
optimality for hierarchical tiling, dierent levels of tiling must be considered in an integral
model.
1.2 Contributions
This thesis makes several contributions in the area of tiling for stencil computations. First,
it introduces a unied representation framework for tiling transformations, which is able to
describe tiling techniques that cannot be represented by the traditional polyhedral model,
such as non-tessellating tiling and scheduling with non-atomic execution of tiles, can be
dened uniformly with this new representation framework.
Second, this thesis proposes two novel tiling schemes. The rst is Hierarchical Overlapped
Tiling. This is an extension of the existing idea of overlapped tiling [26], which introduces
redundant computation to eliminate inter-tile communication. Overlapped tiling is a non-
tessellating tiling scheme. It is a useful transformation to reduce communication overhead,
but it may also introduce a signicant amount of redundant computations. Based on this
observation, Hierarchical Overlapped Tiling is designed to solve this problem. Hierarchical
Overlapped Tiling trades o redundant computation for reduced communication overhead
in a more balanced manner, and thus has the potential to provide higher performance. An
analytic model is built to analyze the performance of both Overlapped Tiling and Hierarchical
4
Overlapped Tiling. The second tiling scheme is called Conjugate-Trapezoid Tiling. This tiling
scheme also aims at reducing the inter-tile communication. Instead of introducing redundant
computation to eliminate inter-tile communication, Conjugate-Trapezoid Tiling schedules
the computations and communications within a tile in an interleaving order to overlap the
computation and communication. Conjugate-Trapezoid Tiling pipelines computation and
communication. The operations in tiles of Conjugate-Trapezoid Tiling are not scheduled
assuming that they are atomic; instead, computation and communication within a tile are
interleaved with each other. A performance model is used to determine the optimal pipeline
parameters of Conjugate-Trapezoid Tiling. The eciency of both tiling schemes is evaluated
using several codes which implement stencil computations.
Third, this thesis studies the tile shape selection problem for hierarchical tiling. It builds
an analytic model to analyze the execution time of the tiled loop nest as a function of the
tile shape at each level of a hierarchy. The model is not tied to any specic scheduling
scheme of tiles, but only focuses on the essence eect on parallelism exposure for all possible
tile shape choices. It is concluded that optimal tile shape selection for hierarchical tiling is
a multidimensional, nonlinear, bi-level programming problem. An experimental automatic
system is implemented, which uses a simulated annealing algorithm to nd a near-optimal
tile shape choice according the analytic model. Experimental results show that the tiling
scheme with automatically chosen tile shapes has the potential to outperform intuitive tiling
shapes.
1.3 Thesis Organization
The rest of the thesis is organized as follows: Chapter 2 introduces a new tiling repre-
sentation framework for general schemes and a code generator implementation based on
this framework.In Chapter 3 Hierarchical Overlapped Tiling is introduced building upon
5
the existing idea of overlapped tiling, and an analytic model is built to study the trade o
between communication and redundant computation. In Chapter 4 a novel tiling scheme
called Conjugate-Trapezoid Tiling is designed, and a performance model is used to deter-
mine the optimal pipeline parameters. Chapter 5 studies the tile shape selection problem for
hierarchical tiling. Chapter 6 discusses related work, and Chapter 7 concludes this thesis.
6
Chapter 2
Prerequisites
2.1 Unied Tiling Representation Framework
This chapter rst introduces the polyhedral model [5], which has been used in the past to
represent of tiling transformations. Based on the observation that the polyhedral model
is only able to represent traditional tiling techniques, a more general tiling representation
framework is presented. This representation accommodates several non-traditional tiling
schemes including non-tessellating tiling and scheduling strategies that do not assume atomic
tile operations.
2.1.1 Iteration Space Representation
Consider the loop nest in Figure 2.1-(a). whose vector form is shown in Figure 2.1-(b).
Iteration space I is a nite set of points in the n-dimensional space Zn. The loop bounds
lb0; lb1; :::lbn 1 and ub0; ub1; :::ubn 1 are functions of the index variables of outer loops as
7
follows:
lbk = lbk(i0; i1; :::; ik 1);
ubk = ubk(i0; i1; :::; ik 1); k = 0; 1; ::n  1:
The set of iterations in I is dened by lb0; lb1; :::lbn 1 and ub0; ub1; :::ubn 1:
I = f~i = (i0; i1; :::; in 1)jlbk(i0; i1; :::; ik 1)  ik < ubk(i0; i1; :::; ik 1)g:
In the rest of this thesis, lbk(i0; i1; :::; ik 1) and ubk(i0; i1; :::; ik 1) are restricted to be ane
functions of i0; i1; :::; ik 1.
1 for ( int i0 = lb0() ; i0 < ub0() ; ++i0 )
2 for ( int i1 = lb1(i0) ; i1 < ub1(i0) ; ++i1 )
3 . . .
4 for ( int in 1 = lbn 1(i0; i1; :::; in 2) ; in 1 < ubn 1(i0; i1; :::; in 2) ; ++in 1 )
5 << loop body >>
(a) Loop nest with scalar induction variables
1 for (~i = (i0; i1; :::; in 1) 2 I )
2 << loop body >>
(b) The representation with vector induction variables
Figure 2.1: n-depth loop nest
In practice, the shape of I is usually an n-dimensional hyper-parallelepiped. Any set of
n edges of the hyper-parallelepiped which share a common vertex can be used as the basis of
the n-dimensional iteration space; the common vertex of these edges is the origin. Figure 2.2
gives two examples of the edge vectors that dene 2 dimensional and 3 dimensional iteration
spaces.
Assume that ~ek = (ek;0; ek;1; :::; ek;n 1), k = 0; 1; :::; n   1 are the n edge vectors used as
8
i1
i0


(0,0)
i1
i2
i0


(0,0,0)
(a) 2 dimensional space (b) 3 dimensional space
Figure 2.2: The edge vectors (thick arrows) of iteration spaces.
the basis of the hyper-parallelepiped shaped iteration space I. The basis matrix of I is:
E =
0BBBBBBB@
~e0
~e1
:::
~en 1
1CCCCCCCA
:
Dene the function span(E) as follows:
span(E) = f~i : ~i 2 Zn; 9~a = (a0; a1; :::; an 1);
~0n = (0; 0; :::; 0)  ~a  (1; 1; :::; 1) = ~1n; ~i = ~a  Eg:
Clearly I = span(E), and the total number iterations in I is jIj = jdet(E)j (because jdet(E)j
is the volume of the parallelepiped).
Dependences are the partial order that must be enforced between iterations to obtain
correct results. If there are no direct or indirect dependences between two iterations, these
two iterations can be scheduled either concurrently or in any sequential order. Otherwise, the
iteration at the source of the dependence must nish before the iteration at the destination
of the dependence starts.
Dependences can be represented by dependence vectors. A dependence vector ~d =
9
(d0; d1; :::; dn 1) indicates that any iteration ~i must nish before iteration ~i + ~d. A valid
dependence vector d must satisfy the following condition:
if 8k = 0; 1; :::; j; dk = 0; and dj+1 6= 0; then
dj+1 > 0 (2.1)
Assume that a loop with n levels of nesting has m dependence vectors ~d0; ~d1; :::; ~dm 1,
which dene the dependence matrix :
D =
0BBBBBBB@
~d0
~d1
:::
~dm 1
1CCCCCCCA
:
Without loss of generality, this thesis assumes that m  n, where m is the number of depen-
dence vectors, and n is the number of loops in the loop nest. It is also assumed that there
are n dependence vectors which are linearly independent. Otherwise, it must be possible
to make at least one loop fully permutable through a sequence of ane transformations, so
it is possible to move the permutable loop either to outmost or innermost, and then it is
only needed to focus on the other n   1 loops and the iteration space can be studied as a
lower-dimensional one.
The dependences impose constraints on possible tiling transformations. In order to be
valid, the new loop nest after tiling must preserve the dependences of the original loop
nest. If the computation operating in a tile is "atomic" - which means inter-tile communica-
tion (inter-tile dependence) only happens before inner-tile computation starts and/or after
inner-tile computation is nished - a valid tiling scheme requires that it must be possible
to topologically sort all the tiles. Otherwise, the neighboring tiles would have a cycle of
10
t
1
t
0
d
0 d1
e
0
e
1
t
1
t
0
d
0
d
1
e
0
e
1
(a) Valid tile shape (b) Invalid tile shape
Figure 2.3: Valid and invalid tile shapes for tessellating tiling with atomic tiles.
dependences and it would be impossible to schedule the computations.
Figure 2.3 gives two examples of the constraints on tiling imposed by dependences for
tessellating tiling with atomic tiles. The tile shape shown in Figure 2.3-(a) is valid, because
dependence vectors only cross the hyper-plane in one direction, which means that at each of
its boundaries, the tile has either in-bound dependences or out-bound dependences. However,
the tile shape in Figure 2.3-(b) is invalid, since there are both in-bound and out-bound
dependences crossing the boundary hyper-planes of tiles. This means that a tile and one of
its neighboring tile along the horizontal direction are in a dependence cycle. On the other
hand, if not restricted to tessellating tiles or scheduling assuming atomic operations in tiles,
there would be more freedom in selecting the tiling schemes. One example is that overlapped
tiling, in which the tiles are not tessellating, is able to eliminate inter-tile dependences by
introducing redundant computation.
2.1.2 Tiling Representation with the Polyhedral Model
The polyhedral model is the traditional representation of tiling transformations used in the
past. The polyhedral model uses a set of hyperplanes that partition the iteration space to
represent the tiling transformation. These hyperplanes are dened by a matrix H, where
11
each row represents the normal vector of one of the tiling hyperplanes:
H =
0BBBBBBB@
~h0
~h1
:::
~hn 1
1CCCCCCCA
:
~hk is the normal vector of not just of one, but a set of tiling hyperplanes. The direction ~hk
determines an innite set of parallel hyperplanes. In addition, this thesis assumes the length
of each ~hk is the distance between tiling hyperplanes. So H can fully determine the shape of
tiles under the assumption that the hyperplane that intersects the rst iteration is the rst
hyperplane.
As discussed in Section 2.1.1, dependences constraint tiling transformations. A tiling
scheme is valid if there exists a valid total ordering of the tiles. If tiles are required to be
scheduled assuming atomic operations in tiles, this constraint is equivalent to saying that no
any pair of tiles are dependent on each other. The polyhedral model represents the validity
condition as follows [20]:
H DT  0 (2.2)
The polyhedral model provides elegant representations for traditional tiling schemes.
However, the polyhedral model has several shortcomings if it is used to study more general
tiling techniques. First, using a set of tiling hyperplanes to represent tile shapes implies
tessellating tiling, which means that there must be no overlap between tiles. The polyhe-
dral model cannot represent non-tessellating tiling. Second, Equation 2.2 assumes that the
scheduling of tiles must follow the order dened by the directions of the normal vectors of
tiling hyperplanes. This means that with the polyhedral model, the tile shape and schedul-
12
ing strategy are determined by H at the same time. So the polyhedral model lacks the
exibility to design dierent schedule constraints in order to optimize parallelism exposure.
Finally, the polyhedral model is not good for studying hierarchical tiling techniques, because
the iteration space within each tile does not have a unied representation of the original
iteration space so that recursive analysis cannot be simply applied.
2.1.3 Unied Tiling Representation Framework
Based on the observation that the traditional polyhedral model is not good for studying
general tiling techniques, a unied representation framework of tiling transformations is
introduced in this section.
In this unied representation framework, a base tile T is a set of points in Zn. Without
lost of generality, this thesis assumes that the size of the base tile T is smaller than the size
of the iteration space I that the tiling is applied to: jTj < jIj. Since the iteration space,
say I, is also a set of points in Zn, tile T is also a smaller iteration space. If the tiles are
n-dimensional hyper-parallelepiped where ~tk = (tk;0; tk;1; :::; tk;n 1), k = 0; 1; :::; n  1 are the
n "basis" edges of the hyper-parallelepiped tile, then the base tile T can be described by the
tiling matrix T :
T =
0BBBBBBB@
~t0
~t1
:::
~tn 1
1CCCCCCCA
:
For each non-boundary tile, the set of the iterations with in the tile is span(T ). And the
total number of iterations is jdet(T )j.
Tiling is a map from Zn to Z2n: I ! [I 0 : J ] = f(~i0 : ~j)g, in which ~i0 = (i00; i01; :::; i0n 1) is
an index vector for each tile, and ~j = (j0; j1; :::; jn 1) is an iteration index that falls within
tile ~i0 (notice that ~j is a global iteration index instead of local index within the tile). If every
13
tile has the same shape as T, given an iteration space I, tiling is done by replicating tile T
in a repeated pattern until all the points (iterations) in I are covered by the replications of
T.
Let R denote the repetition operator applied to the index vector of an iteration ~i. In the
rest of the thesis, repetition patterns are restricted to be translation transformations in the
n-dimensional space. Let R denotes the following translation matrix:
R =
0BBBBBBB@
~r0
~r1
:::
~rn 1
1CCCCCCCA
:
Then a repetition pattern R can be dened as follows:
R(~i;~i0) =~i+~i0 
0BBBBBBB@
~r0
~r1
:::
~rn 1
1CCCCCCCA
=~i+~i0 R (2.3)
Since the matrix R represents the repetition pattern of tiling, R is called the repetition
matrix. Given a base tile T, each ~i0 generates a repetition of the tile T though R, denoted
as T(~i0):
T(~i0) = fR(~i;~i0)j~i 2 Tg (2.4)
The tiling can be formally dened as follows:
I = f~ig ! f(~i0 : ~j) j 8 ~i0;~j T(~i0) \ I 6= ; ^ ~j 2 T(~i0) ^ ~j 2 Ig (2.5)
14
The rst constraint in Equation 2.5, T(~i0) \ I 6= ;, denes another iteration space I 0, which
is a set of ~i0, as follows:
I 0 = f~i0 j T(~i0) \ I 6= ;g (2.6)
And the original iteration space I must be covered by the union of all repetitions of the tile:
I 
[
8~i02I0
T(~i0) (2.7)
Note that the above denition of tiling does not require that the tiles partition the
iteration space. If 9~i01; ~i02;T(~i01) \ T(~i02) 6= ;, the iterations in the intersection set are
redundant computations introduced by tiling compared to the original loop nest. Tiling
with redundant computations can still be legal if the loop nest after tiling produces the same
result. Overlapped tiling [26, 30, 35, 49] is an example of tiling schemes with redundant
computations which has been studied before. Figure 2.5 gives an example of overlapped
tiling. In most tiling schemes studied in existing research are tessellating tiling, which
means the tile repetitions are disjoint:
8~i01; ~i02 ~i01 6= ~i02; T(~i01) \ T(~i02) = ;
For tessellating tiling, for each ~i 2 I, there is a unique (~i0 : ~j) after tiling corresponding to
it, and vice versa. Then I 0 = f~i0g forms an iteration space, where each T(~i0) is an iteration.
Figure 2.4 shows the eect of tiling transformation.
The representation can handle the case where there are several but nite number of tile
shapes, as long as there is a single "super" tile which groups neighboring tiles of dierent
shapes, and the whole iteration space is covered by repetitions of the "super" tile. Therefore
the tiling denition above can still be used to analyze the tiling scheme with nite number
15
  
 
 
     
 
 
 
 
                                     
 
                                                                        
i
1
 
i
0
 
i'
0
 
i'
1
 
t1 
t0 
e0 
e1 
Figure 2.4: The eect of tiling transformation.
tile
Figure 2.5: Overlapped Tiling. Each tile is trapezoid-shaped. The grey triangles are the
overlapped areas between tiles.
of tile shapes, but tile T must represent the "super" tile. Split tiling [26, 37] is an example of
tiling schemes with more than one tile shapes. As shown in Figure 2.6, there are two shapes
of tiles: triangle and trapezoid. The neighboring tiles of dierent shapes consist a "super"
tile (in the dashed box) which is the basic repetition unit.
2.1.4 Dependences after Tiling
As mentioned above, I 0 = f~i0g is an iteration space with each iteration containing all the
iterations in T(~i0). The dependences in the original iteration space I impose new dependences
in I 0. Let ~d00, ~d01, ..., ~d0m0 1 are the m0 resulting dependences vectors in I 0 and D0 is the
16
super tile
tile 1 tile 2
Figure 2.6: Split Tiling. There are two shapes of tiles: triangle and trapezoid. The neigh-
boring tiles of dierent shapes consist a "super" tile (in the dashed box) which is the basic
repetition unit.
dependence matrix:
D0 =
0BBBBBBB@
~d00
~d01
:::
~d0m 1
1CCCCCCCA
:
Note that the execution order of tiles is not necessarily the lexicographical order of ~i0.
The execution order can be changed by dierent scheduling schemes as discussed in Section
2.1.5, a dependence vector ~d0j (0  j < m) does not have to satisfy the condition dened in
Equation 2.1.
In order to simplify the problem, the rest of the thesis assumes that the tile chosen for
any tiling scheme is always large enough, so that for an n-dimensional iteration space, each
tile only has up to 3n   1 neighboring tiles, and inter-tile dependences only exist between
neighboring tiles. This means that each component in any dependence vector ~d0j, 0  j < m0,
can only be 0, 1, or  1:
~d0j = (d0j;0; d
0
j;1; :::; d
0
j;n 1); d
0
j;k = 0; 1; 1; 0  k < n
17
2.1.5 Schedule of Tiles
Tiles must be scheduled in a way that the inter-tile dependences are preserved. The schedul-
ing of tiles is a reorder of the T(~i0)s. The schedule can be formally dened as a mapping S
from each ~i0 to ~io:
~io = S(~i0)
After tiling, each tile T(~i0) will be executed in a lexicographical order of ~io. In order
to be a legal scheduling, ~io must lexicographically conform to all inter-tile dependences: if
S(~i01) = ~io1 and S(~i02) = ~io2, and ~io1  ~io2, there must be no direct or indirect inter-tile
dependence from the tile T(~i02) to tile T(~i01). Note that S is an n-to-1 mapping instead
of 1-to-1 mapping. If 9~i01; ~i02, ~i01 6= ~i02, S(~i01) = S(~i02), this means that there is no direct
or indirect dependence between T(~i01) and T(~i02), and these two tiles can be scheduled in
parallel. Dene S 1 as follows:
S 1(~io) = f~i0jS(~i0) =~iog
Then jS 1(~io)j is the amount of parallelism on each step exposed by the scheduling S.
If the scheduling mapping is ane, S can be described by an n s (s  n) matrix S as
follows:
~io = S(~i0) = ~i0  S
S is called the scheduling matrix. Whether S or S is a valid scheduling mapping depends on
the dependence matrix D0 in the tiled iteration space I 0.
Valid Schedule
The lexicographical order of ~io = S(~i0) denes the execution order of tile ~i0. The execution
order must not violate the dependences between tiles. Given ~i01 and ~i02 = ~i01 + ~d0j, T(~i01)
18
must be executed before T(~i02). If S(~i01) =~i1o and S(~i02) =~i2o, there must be
~i2
o ~i1o; or ~i2o  ~i1o = S(~i01 + ~d0j)  S(~i01)  ~0
Assume S is an ane transformation, then S(~i01 + ~d0j)  S(~i01) = S(~d0j). So the condition of
valid schedule can be dened as follows:
8~d0j; S(~d0j)  ~0 (2.8)
If scheduling matrix S exists, ~doj = S(~d0j) = ~d0j  S must satisfy the condition in Equation
2.8.
In addition, because for any ~io, all tiles ~i0 2 S 1(~io) can be scheduled in parallel, there
must be no dependence among them:
8~i0 2 S 1(~io);
8~d0j; S(~i0 + ~d0j) =2 S 1(~io): (2.9)
Equation 2.8 and 2.9 are the two conditions that a valid schedule S must satisfy.
2.1.6 Tiling Transformation Representation
To sum up the discussion above, given an iteration space I and its dependence matrix D, a
specic tiling transformation is determined by tile shape T (or tiling matrix T ), repetition
pattern R (or repetition matrix R) and scheduling mapping S (or scheduling matrix S). The
choices of T(T ), R(R) and S(S) must conform the constraints imposed by D.
19
2.1.7 Hierarchical Tiling
After tiling, either each tile T or I 0 can be considered as a new iteration space and can be
applied tiling transformation recursively. This means tiling transformations can be done
hierarchically. Each level of tiling can be dened and analyzed with corresponding T(T ),
R(R) and S(S) as a single level of tiling independently.
There are two approaches to do hierarchical tiling. First, do tiling for I and then apply
tiling the each tile T; in this way the whole process would produce higher level tiles rst
then produce lower level tiles. This is called the top-down approach. On the other hand,
if the iteration space I 0 is tiled after tiling the original I, the whole process would produce
lower level tiles rst then higher level tiles, which is called the bottom-up approach.
2.2 Implementation Using the Omega Library
In order to automate the process of tiling, an experimental implementation using Presburger
formulas [25] has been developed. The manipulation of Presburger formulas is done by the
Omega Library [22] automatically.
2.2.1 Set and Mapping
The key concepts of tiling transformation are sets and mappings in the n-dimensional integer
space. As discussed in last section, for a given tiling transformation, the iteration space I
and tile shape T are sets of integer vectors, and the repetition pattern R and scheduling
mapping S are mappings in iteration space. In the experimental implementation, sets and
mappings in the n-dimensional integer space are represented by integer tuple sets and integer
tuple relations, respectively. These two are the objects used in the Omega Library.
In the Omega Library, an n-dimensional integer tuple ~x = [x0; x1; :::; xn 1] is a vector of
n integers in Zn. An integer tuple set is a set of integer tuples. Constraints in the form of
20
equations and inequalities can be used to describe sets of integer tuples. For example, the
set S0 = f[1; 1]; [1; 2]; :::; [1; N ]g can be represented as f[i; j] : i = 1 ^ 1  j  Ng.
In this thesis it is assumed that the arithmetic expressions in the equations and inequal-
ities are ane and the terms are integers. Logical operators :, ^ and _, and the existential
and universal quantiers 9 and 8 are also needed to describe the set of integer tuples. The
representations are known as Presburger formulas. In the experimental implementation,
these expressions of Presburger formulas are manipulated using the Omega Library.
Integer tuple relations with rules are described by Presburger formulas, too. For example
the relation R = f[i; j] ! [x; y] : i   1  x  i ^ j   1  y  j + 1g when applied to S0
yields the following set:
R(S0) = f[x; y] : 0  x  1 ^ 0  y  N + 1g
Note that given a relation R the size and shape of the original set S and that of R(S)
for a relation R can be dierent. The union [ of two relations R1 and R2 is dened as:
R1 [R2 = R; if 8S;R1(S) [R2(S) = R(S)
2.2.2 Code Generation
The Omega Library provides the functionality of generating loop nest code in C for a given
integer tuple set, which can be used in the automated implementation of tiling transformation
schemes. For example, for integer tuple set S1 = f[i; j] : 0  i < 100 ^ 0  j < 1000g, the
generated code loop nest is as follows:
The loop iterations of the generated loop nest code are exactly the integer tuples in
the given integer tuple set in lexicographic order. The loop nest generated by the Omega
Library is sequential code. In order to express parallelism, the experimental implementation
21
1 for ( int i = 0 ; i < 100 ; ++i )
2 for ( int j = 0 ; j < 1000 ; ++j )
3 << loop body >>
Figure 2.7: Generated loop nest code for the integer tuple set S1
provides the functionality to generate parallel loops for dierent programming environment.
Figure 2.8-(a) and (b) show the generated OpenMP and MPI codes if the outmost loop is
parallelized.
1 #pragma omp paral le l for
2 for ( int i = 0 ; i < 100 ; ++i )
3 for ( int j = 0 ; j < 1000 ; ++j )
4 << loop body >>
(a) OpenMP code
1 MPI Comm rank(MPICOMMWORLD, &rank ) ;
2 int i = rank ; // f o r ( i n t i = 0 ; i < 100; ++i )
3 for ( int j = 0 ; j < 1000 ; ++j )
4 << loop body >>
(b) MPI code
Figure 2.8: Parallel code with OpenMP and MPI
22
Chapter 3
Hierarchical Overlapped Tiling
3.1 Introduce Redundant Computation Through Over-
lapped Tiling
Consider the code in Figure 3.1-(a) where two parallel loops are executed in a shared mem-
ory machine. Although this gure shows a natural representation of the computation, the
pair of loops may cause unnecessary cache misses, depending on how they are scheduled. If
the loops are scheduled naively, e.g., dynamic scheduling, the second loop will likely incur
frequent cache misses. To increase locality, and also coarsen the granularity of the parallel
tasks, the programmer can tile and fuse the loops, as shown in Figure 3.1-(b). The resulting
code requires an explicit barrier to guarantee correctness, because of the data dependences
between neighboring tiles of iterations (during iteration t of the outer loop, j consumes data
produced by adjacent tiles of loop i, namely tiles t  1 and t+ 1 or just one of them at the
boundaries). Notice that locality would improve if the same task executes the corresponding
i and j tiles in the code of Figure 3.1-(b). However, good locality is only possible if array
A can be kept in cache memory when the execution moves from the rst to the second
loop. If, however, the array A is larger than the total cache of the processors executing
23
the loops, the traditional loop fusion and tiling transformation applied in Figure 3.1-(b) will
not benet from locality, because all the iterations of the i loop must complete before the j
loop executes. Besides the diculties for achieving locality of naive tiling, the parallelization
transformation may do a suboptimal job because of the barrier introduced. On some ar-
chitectures, barriers are expensive synchronization operations, and could additionally cause
load imbalance. Furthermore, because of the barrier the transformation from Figure 3.1-(a)
to Figure 3.1-(b) is not possible in some languages, such as OpenMP and OpenCL [23],
which do not allow global barriers inside data parallel constructs.
1 paral le l for ( int i = 0 : N 1)
2 A[ i ] = . . . ;
3 paral le l for ( int j = 0 : N 1)
4 . . . = A[ j  1] + A[ j ] + A[ j +1] ;
(a) Original loops
1 paral le l for ( int t = 0 : N/T)
2 for ( int i = t T; i < min(N, ( t+1)T; i++)
3 A[ i ] = . . . ;
4 BARRIER;
5 for ( int j = t T; j < min(N, ( t+1)T; j++)
6 . . . = A[ j  1] + A[ j ] + A[ j +1] ;
7 g
(b) Traditional tiling and fusion
1 paral le l for ( int t = 0 : N/T) f
2 for ( int i = max(0 , t T 1) ;
3 i < min(N, ( t+1)T+1) ; i++)
4 A[ i ] = . . . ;
5 for ( int j = t T; j<min(N, ( t+1)T) ; j++)
6 . . . = A[ j  1] + A[ j ] + A[ j +1] ;
7 g
(c) Overlapped tiling
Figure 3.1: A simple tiling example for parallel loops
To remove the synchronization and enhance locality, the code can be transformed into
the form shown in Figure 3.1-(c). In this case, each iteration of the outer loop t produces
all the data it needs so that its iterations (which correspond to tiles) are independent from
24
each other. This is achieved because each iteration performs redundant computation. The
result is a code without the BARRIER and with increased locality.
In the example in Figure 3.1-(c) loop i produces A[max(0; tT  1) : min(N; (t+1)T )]
in each iteration of the outer loop, that is, T + 2 elements, 2 more than the number of
elements of A computed by loop i in Figure 3.1-(b). In total the N=T executions of loop i
in Figure 3.1-(c) produce T+2
T
N elements, so this loop performs T+2
T
N  N = 2N
T
more
iterations than the corresponding loop of Figure 3.1-(b). The transformation leading to a
loop of the form of Figure 3.1-(c) is called overlapped tiling.
Figure 3.2-(a) shows a code snippet which represents a typical stencil computation. Pairs
of consecutive executions of the inner loop form a pattern similar to that of the two inner
loops in Figure 3.1-(a). If applying overlapped tiling repetitively and fuse all K executions
of the inner loop, it is possible to execute the outer loop without using any barrier. The
number of consecutive loops fused is the depth of the transformation. In this example, the
fusion depth is K. The total amount of redundant computation usually grows with the value
of depth. Figure 3.2-(b) shows the area of overlap. The triangles that bracket each tile
represent the redundant computation.
3.2 Hierarchical Overlapped Tiling
The basic idea of overlapped tiling is the trade-o between redundant computation and syn-
chronization overhead. The protability of overlapped tiling transformation over traditional
tiling transformation depends on that the amount of redundant computation introduced
should be smaller than total synchronization overhead. However, synchronization overhead
is highly related to the performance of the communication channels on the specic target
platform, e.g., global memory, bus and shared cache. This means that the eciency of over-
lapped tiling is sensitive to the hardware layer which tiles are mapped to. While overlapped
25
1 for ( int k = 0 ; k < K; k++) f
2 paral le l for ( int i = 0 : N 1)
3 B[ i ] = A[ i  1] + A[ i ] + A[ i +1] ;
4 swap (A, B) ;
5 g
(a) Code snippet of K consecutive loops in a stencil code
…
Tile t Tile t+1
i
i-1
i+1
Loop K-2
Loop 0
Loop K-1
(b) Overlapped tiling
Figure 3.2: Overlapped tiling of K loops
tiling removes synchronization and enhances locality, it does not take into account the hi-
erarchical organization of today's machines. Furthermore, it could suer from substantial
amount of redundant computation. As the fusion depth increases, more synchronizations
can be removed, but there is also an increase in the total amount of redundant computation
(the triangle-shaped areas in Figure 3.2-(b)). Hence, the use of hierarchical overlapped tiling
is proposed to balance communication overhead and redundant computation.
Figure 3.3 contrasts overlapped tiling with hierarchical overlapped tiling. The example
in the gure assumes 8 consecutive loops executing on a 4-way/8-core multicore system
where the two cores on each processor share the last level cache. Figure 3.3-(a) illustrates
overlapped tiling across the eight cores while Figure 3.3-(b) illustrates hierarchical overlapped
tiling. In Figure 3.3-(b), overlapped tiling is rst applied across the four processors. Within
each processor, pairs of consecutive loops are fused. This forces a local barrier between each
pair of loops (loop0 and loop1, loop2 and loop3, and so on). This barrier, however, only
synchronizes the two cores on each processor and therefore its cost should be relatively low.
26
Loop
0
1
2
3
4
5
6
7
Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7
K=8
(a) Overlapped tiling
Local
Barriers
Loop
0
1
2
3
4
5
6
7
Processor 0 Processor 1 Processor 2 Processor 3
Core 2 Core 3
Processor 1
K=8
K’=2
(b) 2-level hierarchical overlapped tiling
Figure 3.3: Comparison of overlapped tiling and hierarchical overlapped tiling on a 4-way/8-
core multicore system, where each processor contains 2 cores on chip that share the last level
cache.
Within each processor, overlapped tiling is applied to enable the parallel execution of pairs
of loops across the two cores without the need for a barrier. Compared to overlapped tiling,
the total amount of redundant computation (the shadowed triangle areas between dierent
processors and the smaller shadowed triangles between neighboring cores) caused by the 2
levels of tiling is much smaller. This reduction of the redundant computation is the main
source of the performance benet of hierarchical overlapped tiling over plain overlapped
tiling.
27
3.3 Analytical Modeling
This section gives a quantitative analysis of overlapped tiling and hierarchical overlapped
tiling.
Consider K consecutive parallel loops loop0, loop1, ..., loopK 1. Assume that the k-th
loop loopk (0  k < K) has the form shown Figure 3.4.
1 paral le l for ( int ~i = [i0; i1; :::; in 1] 2 I0 ) f // loop0
2 . . .
3 g
4 paral le l for ( int ~i = [i0; i1; :::; in 1] 2 I1 ) f // loop1
5 . . .
6 g
7 . . .
8 paral le l for ( int ~i = [i0; i1; :::; in 1] 2 Ik ) f // loopk
9 . . . = A0k[
~f0k (
~i)] ;
10 . . . = A1k[
~f1k (
~i)] ;
11 . . .
12 . . . = ALk 1k [~f
Lk 1
k (
~i)] ;
13 B0k[~g
0
k(
~i)] = . . . ;
14 B1k[~g
1
k(
~i)] = . . . ;
15 . . .
16 BMk 1k [~g
Mk 1
k (
~i)] = . . . ;
17 g
18 . . .
19 paral le l for ( int ~i = [i0; i1; :::; in 1] 2 IK 1 ) f// loopK 1
20 . . .
21 g
Figure 3.4: A sequence of K parallel loops
Without loss of generality, this thesis assumes that the body of loopk reads from Lk n
0-
dimensional arrays A0k, A
1
k, ..., A
Lk 1
k and writes toM arrays B
0
k, B
1
k,...,B
Mk 1
k which are also
n0-dimensional. This thesis also assumes that no A array overlaps with a B array. To simplify
discussion it is assumed that A0k = A
1
k = ::: = A
Lk 1
k = Ak and B
0
k = B
1
k = ::: = B
Mk 1
k = Bk.
There are Lk references to Ak on the RHS of the rst Lk assignment statements in the body
of the loop: Ak[~f
0
k (~i)], Ak[
~f 1k (~i)], ..., Ak[
~fLk 1k (~i)], with ~f
l
k : Zn ! Zn0 ; 0  l < Lk. There are
Mk references to elements of Bk on the LHS of the last Mk statements: Bk[~g
0
k(~i)], Bk[~g
m
k (~i)],
28
..., Bk[~g
Mk 1
k (
~i)], with ~gmk : Zn ! Zn0 ; 0  m < Mk.
There are 3 sets that must be computed for loopk: Ik, the iteration space; Rk, the set of
subscripts of Ak; and Wk the set of subscripts of Bk:
Rk = f~rg =
Lk 1[
l=0
f[r0; r1; :::; rn0 1] : ~r = ~f lk(~i) ^~i 2 Ikg
Wk = f~wg =
Mk 1[
m=0
f[w0; w1; :::; wn0 1] : ~w = ~gmk (~i) ^~i 2 Ikg
Ck is dened the consuming relation from Ik to Rk, Ck(Ik) = Rk, and Pk is for the
producing relation, as the relation from Wk to Ik, Pk(Wk) = Ik. Ck and Pk represent
the access pattern of the loop body. Ck and Pk are used to describe the dierent tiling
transformations. Ik, Rk, Wk, Ck and Pk are related as follows:
Rk = Ck(Ik); Ik = Pk(Wk) or Wk = P
 1
k (Ik)
3.3.1 Overlapped Tiling
Performing loop fusion and tiling for a sequence of loops of the form shown in Figure 3.4
is equivalent to nding Q partitions (tiles) I0k , I
1
k , ..., I
Q 1
k of the iteration space Ik of each
loopk. Similar to the denition of Rk and Wk discussed in the last subsection, R
q
k is dened
as the set of array element indices of A for tile Iqk , and W
q
k denotes the set of array elements
indices of B for tile Iqk . So,
Rqk = Ck(I
q
k); I
q
k = Pk(W
q
k ) or W
q
k = P
 1
k (I
q
k)
Traditional loop fusion and tiling, such as that used in the code of Figure 3.1-(b), produce
29
tessellating tiles. Because the tiles are a partition of the iteration space:
Iq1k \ Iq2k = ; q1 6= q2
However, with overlapped tiling the sum of the size of the tiles Iqk can be larger than the
size Ik. The amount of redundant computation RCk performed by loopk is:
RCk = jI0k j+ jI1k j+ :::+ jIQ 1k j   jIkj  0
In traditional loop fusion and tiling, the tiles of the dierent loops can be unrelated
thanks to the barriers. However, in overlapped tiling the data read in an iteration tile must
be produced within the same tile to eliminate the need of synchronization. To simplify
the discussion, Assume that the data ow in the sequence of fused loops loop0, loop1, ...
loopK 1 forms a linear chain. This means that A
j
k+1 = B
j
k for all k. Under this assumption,
it is necessary that Rqk+1  W qk to avoid unnecessary work. Let Rqk+1 = W qk . Hence the
corresponding tiles Iqk+1 and I
q
k of neighboring loops are related as follows:
Rqk+1 = Ck+1(I
q
k+1) = P
 1
k (I
q
k) = W
q
k )
Iqk = Pk(W
q
k ) = Pk(R
q
k+1) = Pk(Ck+1(I
q
k+1)) (3.1)
Equation 3.1 states that the tiles of loopk are determined by the tiles of loopk+1. Equation
3.1 provides the procedure to perform overlapped tiling: given an arbitrary partition into
tiles of loop loopK 1 (the last loop in the sequence), the tiles of all previous loops loopk
(0  k < K) can be determined iteratively. Furthermore, all the corresponding tiles from
the dierent loops are fused to build the new loop body.
30
Consider the code in Figure 3.2-(a). After unrolling, there will be a sequence of K loops.
Since every loopk in Figure 3.3-(a) is the same,
I0 = I1 = ::: = IK 1 = I = f[i] : 0  i < Ng
C0 = C1 = ::: = CK 1 = C = f[i! x] : i  1  x  i+ 1g
P0 = P1 = ::: = PK 1 = P = f[x! i] : i = xg
Initially, assign the following tiling partition for the last loop loopK 1:
IqK 1 = f[i] : q N=Q  i < (q + 1)N=Qg
which evenly partitions the iteration space, IK 1, so that jIqK 1j = N=Q.
Next, the previous loops can be tiled using Equation 3.1 (to simplify the discussion, the
boundaries are ignored):
jIqK 2j = jPK 2(CK 1(IqK 1))j = jP (C(IqK 1))j
= jf[i] : q N=Q  1  i < (q + 1)N=Q+ 1gj
= N=Q+ 2 = jIqK 1j+ 2
jIqK 3j = jP (C(IqK 2))j = N=Q+ 4 = jIqK 2j+ 2
:::
jIq0 j = jP (C(Iq1))j = N=Q+ 2 (K   1) = jIq1 j+ 2 (3.2)
31
Therefore:
jIqk j = jP (C(Iqk+1))j = N=Q+ 2 (K   1  k)
RCk = (
Q 1X
q=0
jIqk j)  jIkj = 2Q (K   1  k)
The total amount of redundant computation RC is dened as the sum of the redundant
computation of each loopk:
RC =
K 1X
k=0
RCk = QK  (K   1) (3.3)
RC is a monotonic function as the number of tiles Q and the number of loops to fuse
K. According to Equation 3.3, for the example shown in Figure 3.2-(a), the trend is that
the amount of redundant computation RC increases with both Q and K. Although this
observation is derived from the specic example, it is easy to see that the trend is true in
general. Compared with traditional loop fusion and tiling, where synchronization is neces-
sary, overlapped tiling saves the overhead of K   1 barriers. Suppose the average overhead
of each barrier is ts, and the average computation time for each iteration in the original code
is tc, the overhead dierence between overlapped tiling over traditional tiling is:
Overhead = ts  (K   1)  tc RC=Q (3.4)
3.3.2 Hierarchical Overlapped Tiling
According to the analysis above, coarse grain tiles (and thus small number of tiles) reduce
the amount of redundant computation in overlapped tiling. However, too few tiles would
reduce the amount of parallelism. Similarly, reducing the number of fused loops also reduces
the amount of redundant computation, but at the expense of the additional synchronization
32
required between loops.
Hierarchical overlapped tiling should be done according to the memory hierarchy of the
target machine. Consider a multicore system with Np processors, where each processor has
Nc cores sharing the last level cache. It is expected that the average overhead of synchronizing
the processors sharing a cache (t0s) will be signicantly smaller than that of synchronizing
cores on dierent processors (ts) that need to communicate through main memory and/or
bus.
ts=t
0
s  1 (3.5)
Based on the above observation, it is possible to proceed as follows: rst, perform coarse-
grain tiling for processors; then perform overlapped tiling within the tile that is assigned
to each processor, and generate sub-tiles for each core of the processor. During this 2-level
tiling, the parameters for overlapped tiling are determined separately for each level.
For simplicity, assume that the total number of tiles, Q, generated by the plain overlapped
tiling discussed in the previous subsection is equal to the number of cores: Q = Np  Nc.
During the rst level tiling of the hierarchical overlapped tiling, only Np tiles are generated,
so the total amount of redundant computation introduced in this level is:
RC1 = RC(Np; K) = Np K  (K   1)
After the rst level tiling, each processor q is assigned a coarse-grain tile: Iq0 , I
q
1 , ..., I
q
K 1
(0  q < Np), where the tiles assigned to each processor is implemented as a sequence of
inner-loops. Then, overlapped tiling can be applied within each tile. However, since the
sub-tiles are to be mapped to cores of the same processor, it is expected that the overhead
of synchronization of cores within the same processor, t0s will be signicantly smaller than
the synchronization between cores across processors, ts. As a result, for each tile in the
33
second level of tiling the number of fused loops (fusion depth) can be smaller. Although
this increases the amount of synchronization, it also reduces the amount of redundant com-
putation. Suppose the second level of tiling only fuses K 0 consecutive loops each time and
K 0 < K, then the amount of redundant computation for each processor in the second level
tiling would be:
RC2 = RC(Nc; K
0) = Nc K 0  (K 0   1) (3.6)
Therefore, the total amount of redundant computation of 2-level overlapped tiling is:
RC 0 = RC1 +Np RC2
= Np K  (K   1) +Np Nc K 0  (K 0   1)
=
Q
Nc
K  (K   1) +QK 0  (K 0   1)
= QK  (K   1) ( 1
Nc
+
K 0  (K 0   1)
K  (K   1) )
 RC  (1=Nc + (K 0=K)2)
Furthermore, additional K=K 0 local synchronization operations are introduced during
the second level of tiling. This adds an extra latency of t0s  K=K 0. Hence the overhead
dierence between 2-level overlapped tiling and traditional tiling is:
Overhead0 = ts  (K   1)  t0s K=K 0   tc RC 0=Q
 ts  (K   1)  t0s 
K
K 0
  tc
Q
RC  ( 1
Nc
+ (
K 0
K
)2)
= ts  (K   1  t
0
s
ts
 K
K 0
)  tc
Q
RC  ( 1
Nc
+ (
K 0
K
)2)
As mentioned before, t0s is much smaller than ts (Equation 3.5). Thus, if K
0 is appro-
priately chosen, t
0
s
ts
 K
K0 can still be much smaller than 1. Then Overhead
0 can be made
34
larger than Overhead as dened in 3.4, which shows the potential benet of hierarchical
overlapped tiling.
3.4 Unied Tiling Representation
This section uses the unied tiling representation dened in Section 2.1 to describe the
overlapped tiling and hierarchical overlapped tiling discussed in this Chapter.
The model and analysis of overlapped tiling above in this chapter are for a sequence
of consecutive parallel loops, which is the paradigm of streaming applications. If every
parallel loop in Figure 3.4 has the same form, that is, I0=I1=:::=Ik=:::=IK 1 and 8k; j,
Lk = L = Mk = M , ~f
j
k =
~f j, ~gjk = ~g
j, the consecutive parallel loops in gure 3.4 become
equivalent to the following loop nest:
1 for ( int k = 0 ; k < K ; + + k ) f
2 paral le l for ( int ~i = [i0; i1; :::; in 1] 2 I0 ) f
3 . . . = A0[~f0(~i)] ;
4 . . . = A1[~f1(~i)] ;
5 . . .
6 . . . = AL 1[~fL 1(~i)] ;
7 B0[~g0(~i)] = . . . ;
8 B1[~g1(~i)] = . . . ;
9 . . .
10 BM 1[~gM 1(~i)] = . . . ;
11 g
12 swap (A , B ) ;
13 g
Figure 3.5: Equivalent loop nest to the K consecutive parallel loops
The loop nest in Figure 3.5 forms an (n + 1)-dimensional iteration space: [0 : K   1] 
I0, denoted as I. The iteration space does not necessarily have the shape of an (n + 1)-
dimensional parallelepiped, e.g., if the shape of I0 is irregular, so the corresponding basis
matrix E may not exist. Because of Line 12, each iteration of the outmost k loop depends
on the previous iterations. The dependence vectors should have the form of (1; :::). So the
35
tiling problem has the iteration space I and the following dependence matrix D (each row
of D is a dependence vector):
D =
0BBBBBBB@
~d0
~d1
:::
~dm 1
1CCCCCCCA
=
0BBBBBBB@
1; :::
1; :::
:::
1; :::
1CCCCCCCA
:
Each row in D determines a instance of producing or consuming relation.
Each tile is consisted by the Iq0 , I
q
1 , ..., I
q
K 2, I
q
K 1, I
q
k = P (C(I
q
k+1)), where K is the
parameter that aects performance. So tile shape T is:
T =
K 1[
k=0
f(k :~i) j ~i 2 Iqkg; Iqk = P (C(Iqk+1))
Because Iqk = P (C(I
q
k+1)), to determine T it is only necessary to choose the tile in the
last step. So IqK 1, which is the tile in the last step, determines the size and shape of the
tiles in other steps. Usually the shape of IqK 1 is chosen to be an n-dimensional rectangle
parallelepiped. For example, let IqK 1 = span(W ), in which
W =
0BBBBBBB@
w0; 0; 0; :::; 0
0; w1; 0; :::; 0
:::
0; 0; 0; :::; wn 1
1CCCCCCCA
: (3.7)
Then IqK 1 becomes an n-dimensional rectangle parallelepiped with w0, w1, ...,wn 1 being
the width in each dimension. If n = 1, the shape of T becomes a trapezoid with top width
w0 and height K. For n = 2, the shape usually becomes a bounded pyramid. Since T does
not have a hyper-parallelepiped shape, it is not possible to use a tiling matrix T to describe
36
the shape of tiles.
Overlapped tiling allows overlap between tile. According Equation 2.7, the union of all
the tiles must contain the original iteration space. For a trapezoid-shaped or pyramid shaped
tile, the top part IqK 1 is the narrowest. So the choice of repetition pattern must guarantee
that there is no gap between tiles even on the narrowest part of tiles. If IqK 1 = span(W ) as
dened in Equation 3.7, the repetition matrix is shown as follows:
R =
0BBBBBBBBBB@
K; 0; 0; 0; :::; 0
0; w0; 0; 0; :::; 0
0; 0; w1; 0; :::; 0
:::
0; 0; 0; 0; :::; wn 1
1CCCCCCCCCCA
=
0B@ K; 0
0;W
1CA :
In the tiled iteration space I 0, because of the redundant computation introduced, there
is no dependence between tiles in the same step (i.e., there is no "horizontal" dependence).
Therefore, every dependence vector between tiles must have the following form:
~d0 = (d0; d1; :::; dn 1); d0 = 1; dk = 0;1; k = 1; 2; :::; n  1
The dependence matrix in I 0 must have the following form:
D0 =
0BBBBBBB@
~d00
~d01
:::
~d0m 1
1CCCCCCCA
=
0BBBBBBB@
1; 
1; 
:::
1; 
1CCCCCCCA
:
Hence a scheduling mapping, S, that maximizes parallelism can expose n degrees of paral-
lelism. That is, all the tiles within the same step can be executed in parallel. Then the
37
scheduling matrix S is
S =
0BBBBBBB@
1
0
:::
0
1CCCCCCCA
;
S(~i0) = ~i0  S = (i00; i01; :::; i0n)  S = i00:
The eect of overlapped tiling makes the above schedule S valid. After overlapped tiling,
the rst component in each row of the dependence matrix D0 is still 1, which is the same
form as the original D. Therefore, the same degree of parallelism can be exposed for coarse
grain tiles.
To sum up, Table 3.1 shows the unied representation of overlapped tiling. Since hierar-
Input I;D =
0BB@
1; 
1; 
:::
1; 
1CCA
D0 D0 =
0BB@
1; 
1; 
:::
1; 
1CCA :
T W =
0BB@
w0; 0; 0; :::; 0
0; w1; 0; :::; 0
:::
0; 0; 0; :::; wn 1
1CCA, IqK 1 = span(W )
Iqk = P (C(I
q
k+1)), T =
SK 1
k=0 [k : I
q
k ].
R R =

K; 0
0;W

:
S S =
0BB@
1
0
:::
0
1CCA :
Table 3.1: Representation of Overlapped Tiling
38
chical overlapped tiling is the recursion of overlapped tiling in the top-down approach, the
representation of tiling in each level is the same as Table 3.1, except that the input iteration
space I is the tile T of the higher level tiling.
3.5 Evaluation
3.5.1 Implementation
An experimental system is implemented to evaluate the eciency of overlapped tiling and
hierarchical overlapped tiling. A diagram representing the experimental system is shown
in Figure 3.6. The dashed arrows represent oine data ow while solid arrows represent
runtime data ow. The system contains three major components: a delayed compilation
mechanism, an oine analyzer and the optimizer. The oine analyzer is implemented as a
pass of Cetus [21]. The optimizer is a source-to-source translator which reads the original
kernel code and generates OpenCL code, which is fed to the OpenCL runtime. Both, the
oine analyzer and the optimizer use the Omega library [22] to perform the integer tuple
space computations. In addition, the analyzer uses the Omega Library to generate loops
from the integer tuple sets.
Kernel
Code
Host
Code
Kernel
Summary
Computation
Device
Offline
Analyzer
Delayed
Compilation
Mechanism
Optimizer
Transformed
Kernel
Figure 3.6: Framework of the experimental system.
39
3.5.2 Environment Setup
The eciency of overlapped tiling and hierarchical overlapped tiling is evaluated on an SMP
workstation with 4 Intel Xeon L7555 processors running at 1.87GHz. Each processor has
8 cores, sharing a 24MB unied L3 cache on chip. Each core contains a 256KB private L2
cache and 32KB L1 D-cache. SMT is disabled for each core, so there is one hardware thread
per core.
3.5.3 Benchmarks
Eight benchmarks are evaluated: 1D/2D/3D-Jacobi, PathFinder, Poisson, Biharmonic, HotSpot
and Cell. The rst 3 benchmarks (1D/2D/3D-Jacobi) are Jacobi iterations for synthetic lin-
ear systems; PathFinder uses dynamic programming to nd a minimum weighted path;
Poisson is a numerical solver of the poisson equation, calculating the Laplace operator [4]
over a 2D grid with the 5-point stencil. Biharmonic is the numerical PDE solver calculating
the Biharmonic operator [4] over a 2D grid with a 13-point stencil. HotSpot implements a
chip temperature estimation model[19]. Cell [3] is a 3D game of life. For each application,
the operation in the body of the stencil loop is implemented as an OpenCL kernel. The
inputs of the benchmarks are listed in Table 4.2.
Data Problem Points
Dimension Size of Stencil
1D-Jacobi 1 64K 3
2D-Jacobi 2 256x256 9
3D-Jacobi 3 64x64x64 27
PathFinder 1 100K 3
Poisson 2 256x256 5
Biharmonic 2 256x256 13
HotSpot 2 512x512 9
Cell 3 60x60x60 27
Table 3.2: ISL Benchmarks
40
For some stencil programs, such as Jacobi, the number of steps of the outer loop de-
pends on a convergence test. This requires a synchronization between kernel code and host
code that prevents loop fusion. To enable the overlapped and hierarchical overlapped tiling
optimizations the code is modied so that the convergence test (and as result the synchro-
nization) only occurs every 1024 iterations. The total iterations number for the benchmarks
without convergence test is 16,384.
3.5.4 Experimental Results
Performance Overview
Figure 4.15 shows the performance speedup of traditional tiling, overlapped tiling and hier-
archical overlapped tiling relative to the original OpenCL code. Since OpenCL only supports
2 levels of work item organization, a 2-level hierarchical overlapped tiling is used for each
benchmark. For overlapped tiling and hierarchical overlapped tiling, the gure shows the
speedup for the best value of the loop fusion depth K, as shown in Table 3.3 (and discussed
in the next Section). In general, the performance of hierarchical overlapped tiling is always
better than that of plain overlapped tiling, and overlapped tiling is always better than that
of traditional tiling, with the exception of Cell and 3D-Jacobi. For Cell and 3D-Jacobi,
the performance curves of the three tiling transformations are very similar. The reason is
that their main data structure is 3-dimensional and the amount of redundant computation
grows quartically with the fusion depth K. Therefore, there are not many opportunities for
overlapped tiling and hierarchical overlapped tiling. In experiments, overlapped tiling and
the 2-level hierarchical overlapped tiling achieves an average speedup of 18% and 37% over
traditional tiling, respectively.
41
0
1
2
3
4
5
6
7
8
9
10
Traditional Tiling
Overlapped Tiling
2-level Hierarchical Overlapped Tiling
S
p
ee
d
u
p
 o
v
er
 o
ri
g
in
al
 O
p
en
C
L
 c
o
d
e
Figure 3.7: Speedup of traditional tiling, overlapped tiling and hierarchical overlapped tiling
over the original OpenCL code.
Parameter Sensitivity
Loop Fusion Depth for the First Level of Tiling. Figure 3.8 shows the performance of
overlapped and hierarchical overlapped tiling as the value of the loop fusion depthK changes.
For overlapped tiling, there is only one value of K; for 2-level hierarchical overlapped tiling,
K is the loop fusion depth for the rst level of tiling, whileK 0 is the value of loop fusion depth
for the second level of tiling (K 0 is kept constant at 2 for the experiments in Figure 3.8). As
discussed in Subsection 3.3.1 and 3.3.2, K, determines the amount of redundant computation
introduced by overlapped and hierarchical overlapped tiling, and thus the overall speedup
over traditional tiling.
In Figure 3.8, lines a, b and c show the speedup of traditional tiling, overlapped tiling, and
2-level hierarchical overlapped tiling over the original OpenCL code, respectively. In addi-
tion, by manually moding the overlapped tiling code to remove the redundant computation
(hence, there are race conditions and the results of the executed code are not guaranteed
to be correct), the performance is shown by Line d. OpenCL standard does not support
a global barrier across work item groups, which is required for traditional tiling (See Fig-
ure 3.1-(b)); thus, the barriers used for traditional tiling (Line a in Figure 3.8) are barriers
42
024681
1
Line a: Traditional Tiling
Line b: Overlapped Tiling
Line c: 2-level Hierarchical Overlapped Tiling
Line d: Tile & Fuse without Overlap
0
2
4
6
8
10
12
1 2 4 8
1
6
3
2
6
4
1
2
8
2
5
6
5
1
2
1
0
2
4
Fusion Depth (K)
S
p
ee
d
u
p
 o
v
er
 o
ri
g
in
al
 O
p
en
C
L
 c
o
d
e
0
1
2
3
4
5
6
7
8
9
10
1 2 4 8 16 32 64 128256
Fusion Depth (K)
S
p
ee
d
u
p
 o
v
er
 o
ri
g
in
al
 O
p
en
C
L
 c
o
d
e
(a) 1D-Jacobi (b) 2D-Jacobi
0
1
2
3
4
1 2 4 8 16
Fusion Depth (K)
S
p
ee
d
u
p
 o
v
er
 o
ri
g
in
al
 O
p
en
C
L
 c
o
d
e
2
3
4
5
1 2 4 8
1
6
3
2
6
4
1
2
8
2
5
6
5
1
2
1
0
2
4
Fusion Depth (K)
S
p
ee
d
u
p
 o
v
er
 o
ri
g
in
al
 O
p
en
C
L
 c
o
d
e
(c) 3D-Jacobi (d) PathFinder
0
2
4
6
8
10
12
14
1 2 4 8 16 32 64 128256
Fusion Depth (K)
S
p
ee
d
u
p
 o
v
er
 o
ri
g
in
al
 O
p
en
C
L
 c
o
d
e
0
1
2
3
4
5
6
7
1 2 4 8 16 32 64 128256
Fusion Depth (K)
S
p
ee
d
u
p
 o
v
er
 o
ri
g
in
al
 O
p
en
C
L
 c
o
d
e
(e) Poisson (f) Biharmonic
0
1
2
3
4
5
6
1 2 4 8 16 32 64 128256
Fusion Depth (K)
S
p
ee
d
u
p
 o
v
er
 o
ri
g
in
al
 O
p
en
C
L
 c
o
d
e
0
1
2
3
4
1 2 4 8 16
Fusion Depth (K)
S
p
ee
d
u
p
 o
v
er
 o
ri
g
in
al
 O
p
en
C
L
 c
o
d
e
(g) HotSpot (h) Cell
Figure 3.8: Evaluating the performance overlapped tiling and hierarchical overlapped tiling
by scaling fusion depth K.
43
that implemented with low level primitives. However, overlapped tiling and 2-level hierar-
chical overlapped tiling only require synchronization within the work item groups, which is
supported by the OpenCL standard. The discrepancy between overlapped tiling and Line
d shows the overhead of the redundant computation introduced by overlapped tiling; the
dierence between traditional tiling and Line d shows the synchronization overhead saved
by fusing the kernels. In Figure 3.8 it can be seen that Line d typically grows as the loop
fusion depth K increases. This is because the number of synchronization operations removed
grows with the depth. If do not count the cost of the redundant computation introduced,
the performance benet is always positive (if RC = 0, Overhead in Equation 3.4 always
increases when K increases). In fact, Line d denes the upper bound of performance for
overlapped and hierarchical overlapped tiling.
For the two 1D benchmarks (1D-Jacobi and PathFinder), both plain overlapped tiling
and hierarchical overlapped tiling can achieve signicant speedup over traditional tiling,
because the growth rate of redundant computation for overlapped tiling is low. For the 2D
benchmarks (2D-Jacobi, Poisson, Biharmonic and HotSpot) the growth rate of the redundant
computation is higher than for the 1D benchmarks. Thus, the interval of benet (the values
of fusion depth K for which lines b or c are above line a) of the 2D benchmarks is smaller
than that of the 1D benchmarks for both, overlapped and hierarchical overlapped tiling.
The gure also shows that the interval of benet of hierarchical overlapped tiling is always
bigger than that of plain overlapped tiling. Within the 2D benchmarks, Biharmonic shows
the least speedup with overlapped tiling over traditional tiling. This is because the stencil
of Biharmonic depends on 13 neighboring points versus 9 for 2D-Jacobi, 5 for Poisson,
and 5 for Hotspot (see Table 4.2). As a result, the redundant computation of Biharmonic
grows faster than that of the other 2D benchmarks. However, hierarchical overlapped tiling
still achieves speedup over traditional tiling. Since the input data of Cell and 3D-Jacobi
are 3-dimensional, the amount of redundant computation increases so fast that there is no
44
opportunity for overlapped tiling.
When comparing hierarchical overlapped tiling versus overlapped tiling the plots in Fig-
ure 3.8 show that hierarchical overlapped tiling performs better as the value of loop fusion
depth K increases (the only exception occurs with the 3D benchmarks where all 3 tiling
mechanisms behave the same), because the growth rate of redundant computation is lower
for hierarchical overlapped tiling than for plain overlapped tiling. Hierarchical overlapped
tiling is less sensitive to the value of K than overlapped tiling because, as mentioned above,
the benecial region of hierarchical tiling is larger than that of overlapped tiling.
The optimal value of fusion depth K value are determined by the value of ts=tc of the
target platform. The values of K used in experiments are shown in Table 3.3.
Hierarchical
Overlapped Tiling Overlapped Tiling
1D-Jacobi K = 32 K = 64
2D-Jacobi K = 8 K = 16
3D-Jacobi K = 2 K = 2
PathFinder K = 32 K = 64
Poisson K = 8 K = 16
Biharmonic K = 4 K = 8
HotSpot K = 4 K = 16
Cell K = 1 K = 2
Table 3.3: Loop Fusion depth K used in experiments.
Loop Fusion Depth for the Second Level of Tiling. Compared to the loop fusion
depth K for the rst level tiling of hierarchical overlapped tiling, the tuning of loop fusion
depth K 0 for the second level tiling is more involved. This is partially because the perfor-
mance of the second level tiling depends on the rst level of tiling. Figure 3.9 shows the
performance of hierarchical overlapped tiling for 1D-Jacobi and PathFinder with dierent
pairs of K and K 0. We can see that K 0 = 2 is the best choice for PathFinder; but there is
no obvious optimal value for 1D-Jacobi. Thus, manual tuning may be required to nd the
optimal fusion depth K 0 for the second level of hierarchical overlapped tiling. However, the
45
performance impact of the second level tiling is signicantly smaller than that of the rst
level tiling. For instance, when K 0 is set to 2, the average performance loss is less than 5%
compared to the best K 0.
02
46
810
12
1
6
4
K'=1 K'=2 K'=4 K'=8 K'=16
3
4
5
6
7
8
9
1 2 4 8
1
6
3
2
6
4
1
2
8
2
5
6
5
1
2
1
0
2
4
1st level Fusion Depth (K)
S
p
ee
d
u
p
 o
v
er
 o
ri
g
in
al
 O
p
en
C
L
 c
o
d
e
2
3
4
5
1 2 4 8
1
6
3
2
6
4
1
2
8
2
5
6
5
1
2
1
0
2
4
1st level Fusion Depth (K)
S
p
ee
d
u
p
 o
v
er
 o
ri
g
in
al
 O
p
en
C
L
 c
o
d
e
(a) 1D-Jacobi (b) PathFinder
Figure 3.9: Fusion depth K 0 for the second level of hierarchical overlapped tiling.
Input Size. Figure 3.10 shows the speedup of hierarchical overlapped tiling over plain
overlapped tiling for the dierent benchmarks, as the input size increases. The speedups
are computed using the best value of K. The gure shows that hierarchical overlapped
tiling performs better than overlapped tiling for the 2D-benchmarks. It also shows, that the
performance dierence between hierarchical overlapped tiling and plain overlapped tiling
becomes smaller as the input data size increases. This is because since the total number of
worker threads remains constant, the tile size is determined by the input data size. With
smaller tiles, the redundant computation has a higher impact; thus, overlapped tiling is
less ecient, while hierarchical overlapped has more opportunities to reduce the overheads
introduced by the redundant computation. With larger tiles, the redundant computation
has less impact, and as a result, there is less dierence between overlapped and hierarchical
overlapped tiling. For the 3D benchmarks 3D-Jacobi and Cell, since the amount of redundant
computation grows so fast, the optimal K for both plain overlapped tiling and hierarchical
overlapped tiling is less than 2, so there is no obvious performance discrepancies between
46
the two schemes.
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
32
K
64
K
12
8K
12
8x
12
8
25
6x
25
6
51
2x
51
2
12
8x
12
8
25
6x
25
6
51
2x
51
2
50
K
10
0K
20
0K
12
8x
12
8
25
6x
25
6
51
2x
51
2
12
8x
12
8
25
6x
25
6
51
2x
51
2
12
8x
12
8
25
6x
25
6
51
2x
51
2
12
8x
12
8
25
6x
25
6
51
2x
51
2
1D-Jacobi 2D-Jacobi 3D-Jacobi PathFinder Poisson Biharmonic HotSpot CellSp
ee
du
p
o
v
er
 
pl
ai
n
 
o
v
er
la
pp
ed
 
til
in
g
Figure 3.10: Speedup of hierarchical overlapped tiling over plain overlapped tiling with
dierent input sizes. The horizontal axis is the input size for each benchmark.
3.5.5 Compilation Overhead
Since the experimental system transforms OpenCL kernel code at runtime, the overheads
introduced need to be considered. The overhead consists of two parts: The OpenCL runtime
that transforms the kernel code and the execution of the Omega Library. Caching the existing
compilation result can help reduce both parts of the runtime overhead: if the sequence of
kernels selected for optimization are the same as a previous sequence, no compilation needs
to be done. For the benchmarks used in this paper, the OpenCL runtime compilation needs
to be invoked at most twice, one for the steady state and another for the epilog. Figure
3.11 shows the compilation times of overlapped tiling and hierarchical overlapped tiling,
normalized to the compilation time of the original program. On average, overlapped tiling
requires 37% more compilation time, while hierarchical overlapped tiling costs about 60%
more.
47
00.5
1
1.5
2
2.5
3
Overlapped Tiling
2-level Hierarchical Overlapped Tiling
C
o
m
p
il
e 
ti
m
e 
o
v
er
 o
ri
g
in
al
 p
ro
g
ra
m
Figure 3.11: Compile time for overlapped tiling and hierarchical overlapped tiling, normalized
to the compile time of the original OpenCL code.
3.6 Conclusion
In this chapter, a new transformation, hierarchical overlapped tiling, is proposed. By creating
hierarchical overlapping tiles, the new transformation can reduce communication overhead
among tiles while introducing smaller amount of redundant computation compared to plain
overlapped tiling. An experimental system is implemented to apply the proposed transfor-
mations to OpenCL programs. Experimental results show that on average overlapped tiling
and hierarchical overlapped tiling achieves 18% and 37% speedup over traditional tiling,
respectively.
48
Chapter 4
Hiding Communication Latency with
Conjugate-Trapezoid Tiling
4.1 Hiding Communication Latency
The Jacobi code in Figure 4.1-(a) is an example of a 1D ISL where for simplicity the bound-
ary checking code has been ommitted. Figure 4.1-(b) shows the correponding standard tiled
code for a distributed memory system with P nodes, where boundary elements must be ex-
plicitly sent to and received from neighboring tiles (Line 6-9) before the computation at each
time step starts. Since the communication latency in a distributed memory systems is not
negligible, if assuming that the computation time of Line 12 is tcomp, and the communication
cost is tcomm, the total execution time of the code in Figure 4.1 is T  (tcompN=P + tcomm),
where the communication overhead is T  tcomm. Thus, depending on the value of tcomm, the
performance penalty incurred due to the communication latency can be signicant.
49
1 for ( int t = 0 ; t < T ; t++) f
2 for ( int i = 0 ; i < N ; i++)
3 Y[ i ]=(X[ i  1]+X[ i]+X[ i+ 1 ] ) /3 ;
4 swap (X, Y) ;
5 g
(a) Sequential 1-dimensional Jacobi code
1 rank = . . . // the rank o f the node
2 int w=N /P ;
3 int lowerbound = rankw ;
4 int upperbound = ( rank+1)w ;
5 for ( int t = 0 ; t < T ; t++) f
6 MPI Isend (X[ lowerbound ] , . . . , rank 1, t , . . . ) ;
7 MPI Isend (X[ upperbound 1] , . . . , rank+1, t , . . . ) ;
8 MPI Recv (X[ lowerbound  1] , . . . , rank 1, t , . . . ) ;
9 MPI Recv (X[ upperbound ] , . . . , rank+1, t , . . . ) ;
10
11 for ( int i = lowerbound ; i < upperbound ; i++)
12 Y[ i ]=(X[ i  1]+X[ i]+X[ i+ 1 ] ) /3 ;
13 swap (X, Y) ;
14 g
(b) Standard tiled code for P nodes with no overlapping between communication and computation.
Figure 4.1: 1-dimensional Jacobi code with MPI.
4.2 Conjugate-Trapezoid Tiling
This section discusses the design of Conjugate-Trapezoid Tiling. First, Conjugate-Trapezoid
Tiling is introduced with a 1-dimensional stencil example, and then the idea is generalized
to stencils with higher dimensional stencils.
4.2.1 Denitions
The discussion in this chapter focuses on iterative stencil loops (ISL), which are iterative
computations in which the elements of an array are updated using the values of neighboring
elements, following a xed pattern or stencil. An example of an ISL is shown in Figure 4.2,
where in each time step element i0; i1; :::; id 1 of Y is computed in terms of a set of neighboring
elements in X. The outermost t loop determines the number of timesteps.
50
1 for ( int t = 0 ; t < T ; t++) f
2 for ( int i0 = 0 ; i0 < N0 ; i0++)
3 for ( int i1 = 0 ; i1 < N1 ; i1++)
4 . . .
5 for ( int id 1=0; id 1<Nd 1 ; id 1++) f
6 z0 = X[ ~f0(i0; i1; :::; id 1) ] ;
7 z1 = X[ ~f1(i0; i1; :::; id 1) ] ;
8 . . .
9 zm 1 = X[ ~fs 1(i0; i1; :::; id 1) ] ;
10 Y[ i0; i1; :::; id 1 ] = g(z0; z1; :::; zm 1) ;
11 g
12 swap (X, Y) ;
13 g
Figure 4.2: An ISL with a d-dimensional stencil. The total depth of the loop nest is d+ 1.
The ISL forms an (n + 1)-dimensional iteration space I. The shape of I is a (n + 1)-
dimensional hyper-cuboid:
I = span(E) = span(
0BBBBBBBBBB@
T; 0; 0; :::; 0
0; N0; 0; :::; 0
0; 0; N1; :::; 0
:::
0; 0; 0; :::; Nn 1
1CCCCCCCCCCA
):
Each ~fk is a Zd ! Zd mapping. ~f0, ~f1, ..., ~fm 1 denes m dependence vectors: ~dk =
(1; dk;0; dk;1; :::; dk;n 1); k = 0; 1; :::;m   1, where the rst component of each dependence
vector is 1. 1 These dependences are collected in a matrix:
D =
0BBBBBBB@
~d0
~d1
:::;
~dm 1
1CCCCCCCA
:
1Because the values computed in iteration t0 of the outermost loop are consumed in iteration t0 + 1.
51
Here n is the dimensionality of the array involved in the computation and m is the number
of point in the stencil.
The dependence vectors generate a cone in the (n + 1)-dimensional space, as shown in
Figure 4.3-(a) for a ve-point stencil in a 3D space < ~t;~i0;~i1 >. The projections of the cone
on < ~t;~i0 > and < ~t;~i1 > planes are triangles, as shown in Figure 4.3-(b). The boundaries of
the projected triangles are dened by the dependence vectors in the boundaries of the cone.
Vectors ~Bleft and ~Bright represent the boundarie edges of the projection of the dependence
cone on the plane< ~t;~ij > (j = 0; 1; :::; d  1):
~Bleft(~t;~ij) = (1;minfxk;jjk = 0; 1; :::;m  1g)
~Bright(~t;~ij) = (1;maxfxk;jjk = 0; 1; :::;m  1g)
i0
t
i1
(1,0,0)
(1,1,1)(1,-1,1)
(1,1,-1)
(1,-1,-1)
i0
t
(1,-1,±1) (1,1, ±1)(1,0,0)
i1
t
(1,±1,-1) (1,±1,1)(1,0,0)
(, 		
)= (1,-1) (, 		
)= (1,1)
(, 		)= (1,-1) (
, 		)= (1,1)
(a) Cone generated by the dependence vectors (b) Projections of the cone
Figure 4.3: Dependence vectors of a 5-point stencil
To facilitate later discussion, dene Proj<~t;~ij>(~u) as the projection of vector ~u = (u
0, u0,
u1, ..., uj, ..., un 1) on plane < ~t;~ij >:
Proj<~t;~ij>(~u) = (u
0; uj)
52
And dene Proj<~ij0 ;~ij1 ;:::;~ijk 1>
(~u) as the projection on the k-dimensional hyperplane <
~ij0 ;~ij1 ; :::;~ijk 1 >:
Proj<~ij0 ;~ij1 ;:::;~ijk 1>
(~u) = (uj0 ; uj1 ; :::; ujk 1)
4.2.2 1-dimensional Stencil
Consider the 1-dimensional (2-dimensional iteration space) Jacobi example shown in Figure
4.1 , whose dependence matrix is:
D =
0BBBB@
~d0
~d1
~d2
1CCCCA =
0BBBB@
1; 1
1; 0
1; 1
1CCCCA :
i
t
Node 0 Node 1 Node 2 Node 3
Node 2 Node 3 Node 0 Node 1
hw
T
N
Figure 4.4: Tiling 1-dimensional Jacobi code. Thin arrows represent the dependence vectors.
Thick arrows show the dependence between tiles. w and h are the parameters decided by
users to determine the size of each tile.
The Conjugate-Trapezoid Tiling is composed of two levels of tiling. At the outer level,
the iteration space is partitioned (tiled) into parallelogram tiles of width w and height h. For
each tile, two edges are perpendicular to the t axis, and the other two edges are parallel to
the dependence vector on the right boundary, ~Bright(~t;~i)=(1; 1). The tiling scheme is shown
53
in Figure 4.4, in which w and h are the parameters decided by users to determine the size
of each tile.
For a more precise description, we use the normal vector of each tiling hyper-plane of the
tiling scheme, and make the reciprocal of the length of the normal vector be the distance
between parallel tiling hyper-planes. Let ~h0 be the normal vector of the edge that is parallel
to ~Bright(~t;~i), ~h0? ~Bright(~t;~i), then ~h0 is:
t i
~h0 =
1
w
 ~Bright(~t;~i) 
0B@ 0 1
 1 0
1CA
=
1
w
 ( 1; 1)
Note that the matrix
0B@ 0 1
 1 0
1CA stands for a rotation transformation of 90 degrees in
clockwise. And 1j~h0j is the distance between the tiling hyper-planes parallel to
~Bright(~t;~i).
Similarly, let ~h1 be the normal vector of the edge that is perpendicular to ~t. Intuitively
~h1 = (1=h; 0) and
1
j~h1j = h. The paralleogram tiles in Figure 4.4 can be represented by
the following matrix H, where each row of H is the normal vector to the edges of the
parallelogram.
t i
H =
0B@ ~h0
~h1
1CA =
0B@   1w 1w
1
h
0
1CA
This tile shape has been chosen so that each tile only depends on its right neighbor tile.
Thus, if the tiles in a row are distributed onto dierent nodes, a node only needs to receive
data from its right neighbor node, as shown in Figure 4.4. This strategy also achieves data
54
locality and load balance: data is reused if tiles in each (oblique) column are assigned to the
same node; work load is balanced because all the tiles have the same area (tiles in the right
and left boundaries can be mapped to the same node).
In order to execute the tiles within the same horizontal level in parallel, ne-grain inter-
tile communication for every tile step must be done, as the example code in Figure 4.1-(b).
However, with this strategy each time step a tile needs to send boundary data to its left
neigbor, and receive data from its right neighbor, as shown in Figure 4.5-(a). To eliminate
the communication overhead from the total execution time, a second level of tiling is applied.
Each parallelogram is partitioned into two trapezoid subtiles (subtile A and B in Figure 4.5-
(a)). where the edge that divides subtile A and B is parallel to the left-boundary dependence
vector ~Bleft(~t;~i)=(1; 1), so that there is only dependence from subtile A to subtile B. We
call A and B conjugate trapezoid subtiles. Notice that these tiles can have exactly the
same shape (if the stencil is symmetric), by appropriately choosing where to partition the
outer tile. In the example in the Figure, this partition point is determined by w0, which is
computed as h+ w=2.
As a result of this subtiles, send operations are in subtile A, and all receive operations
are in subtile B. Subtile A can start execution right away, as it is independent of the B
subtile, while tile B is delayed by h0 time steps. Thus, each receive operation in subtile B is
delayed by h0 from its correponding send receive in subtile A. Thus, if the communication
latency is smaller than the computation time of the h0 time steps, the data sent can arrive
to its destination before the receive operation. Thus, the communication latency can be
totally hidden when using asynchronous send and receive operations. The dashed box in
Figure 4.5-(b) stands for the scope for each tile after subtile B has been delayed; and the
shaded stripe covers the amount of local computation between each pair of send and receive.
The requirement of this scheme is that the bandwidth of the system must be large enough
to allow h0 messages on the y during the computation time of h0 time steps. The parameter
55
it
send
send
send
send
receive
receive
receive
receive
h
w
w'
A B
(a) Subtile the parallelogram tile into 2 conjugate trapezoid
subtiles. There is a dependence from subtile A to subtile B.
send
A
receiveB
i
t
h'
h'
receive
send
A
B
(b) Delay subtile B by h0 time steps to hide communication latency
Figure 4.5: Subtiling and delay
h0 provides extra exibility to our tiling scheme, as h0 can be adjusted to t target platforms
with dierent congurations of bandwidth and communication latency.
4.2.3 2-dimensional Stencil
1 for ( int t = 0 ; t < T ; t++) f
2 for ( int i0 = 0 ; i0 < N0 ; i0++)
3 for ( int i1 = 0 ; i1 < N1 ; i1++)
4 Y[ i0 ] [ i1 ]=(X[ i0  1][i1+1]+X[ i0 ] [ i1+1]+X[ i0+1] [ i1+1]
5 +X[ i0  1][i1 ] +X[ i0 ] [ i1 ] +X[ i0+1] [ i1 ]
6 +X[ i0  1][i1 1]+X[ i0 ] [ i1 1]+X[ i0+1] [ i1 1]) /9 ;
7 swap (X, Y) ;
8 g
Figure 4.6: Sequential 2-dimensional Jacobi code.
Consider next the Jacobi code in Figure 4.6 as an example of an ISL with a 2-dimensional
stencil (with a 3-dimensional iteration space). The 9-point stencil is dened by the following
56
dependence matrix:
D =
0BBBBBBB@
~d0
~d1
:::
~d8
1CCCCCCCA
=
0BBBBBBBBBBBBBBBBBBBBBBBB@
1; 0; 0
1; 1; 0
1; 1; 0
1; 0; 1
1; 0; 1
1; 1; 1
1; 1; 1
1; 1; 1
1; 1; 1
1CCCCCCCCCCCCCCCCCCCCCCCCA
:
The projections of the dependence vectors on the < ~t;~i0 > and < ~t;~i1 > planes are:
~Bleft(~t;~i0) = (1; 1) ~Bright(~t;~i0) = (1; 1)
~Bleft(~t;~i1) = (1; 1) ~Bright(~t;~i1) = (1; 1)
As in the example in Figure 4.4, the outer tile can be computed using the right projection
of the dependence vector on the planes < ~t;~i0 > and < ~t;~i1 >. The projections of ~h0 and ~h1
57
are computed as follows:
t i0
Proj<~t;~i0>(
~h0) =
1
w0
 ~Bright(~t;~i0) 
0B@ 0 1
 1 0
1CA
=
1
w0
 ( 1; 1)
t i1
Proj<~t;~i1>(
~h1) =
1
w1
 ~Bright(~t;~i1) 
0B@ 0 1
 1 0
1CA
=
1
w1
 ( 1; 1)
The rest components in ~h0 and ~h1 are zeros.
1
j~h0j and
1
j~h1j stand for the distance between
corresponding tiling hyper-planes. Similarly ~h2 = (1=h; 0; 0). Thus, the resulting tile can be
represented by the following matrix:
t i0 i1
H =
0BBBB@
~h0
~h1
~h2
1CCCCA =
0BBBB@
  1
w0
1
w0
0
  1
w1
0 1
w1
0 0 1
h
1CCCCA
The resulting tiles are parallelepiped, as shown in Figure 4.7-(a). The top and bottom
faces of the parallelepiped tiles are rectangles, while the other four faces are parallelograms.
Figure 4.7-(b) shows the top-down view of tiles on the <~i0;~i1 > plane. Since the slopes of
side faces are determined by the right projections of the dependence vectors, each tile only
depends on the neighbors above (this is actually the parallelepiped behind if we consider the
58
3D space) and on the right, as shown in Figure 4.7-(b).
t
i
0
i
1
h
w
0
w
1
i
1
i
0
(a) Tile shape (b) Top-down view of tiles
Figure 4.7: Tiling with 2-dimensional stencils. The thick arrows in (b) represent the depen-
dence between tiles.
Next step is to do subtile the parallelepiped tiles, by applying the same subtiling strategy
in Figure 4.5-(a), so that the slope of the boundary planes between tiles is determined by
the left projection of the dependence vectors on the < ~t;~i0 > and < ~t;~i1 > planes. As a
result of this tiling strategy, four subtiles are generated: A,B0, B1 and C, as shown in Figure
4.8-(a). The dependence relations among subtiles are: A! B0, A! B1, B0 ! C, B1 ! C
and A ! C. Subtile A and C are two pyramids of the same shape, but B0 and B1 are not
pyramids. However, if w00 = h+ w0=2 and w
0
1 = h+ w1=2, the projections of the subtiles on
< ~t;~i0 > and < ~t;~i1 > planes are still conjugate trapezoids, as shown in Figure 4.8-(b).
Figure 4.9-(a) shows the slice of a tile for one time step, where all the tiles are rectangles.
Because of the dependences among subtiles, the computations in subtile Amust be done rst;
then B0 and B1 (the order of B0 and B1 is not important because there are no dependences
between them); and nally, C. Based on the dependences among tiles shown in Figure 4.7-
(b), each time step a tile only needs to send the data near the left and bottom boundaries
(the shadowed area in the Figure) to the corresponding neighbor nodes. So only subtiles
A, B0 and B1 send data, while subtile C, B0 and B1 receive data. There are ve separate
send operations: S0 (from A to the lower-left neighbor), S1 (from A to the left neighbor), S2
(from A to the neighbor below), S3 (from B0 to the neighbor below), and S4 (from B1 to the
59
AB
1
C
B
0
t
i
0
i
1
w'
0
w'
1
(a) Subtiling
i
0
t
A
B
0
i
1
t
A
B
1
(b) Projections of subtiles
Figure 4.8: Subtiling with 2-dimensional stencils.
neighbor below). R0, R1, R2, R3 and R4 are the corresponding receive operations. In total,
each time step there are 14 computation and communication operations within a tile. Figure
4.9-(b) shows the order of these operations, as enforced by the dependences. Because each
tile is assigned to a single node, we can schedule the above 14 operations in any sequential
order that satises the dependencies in Figure 4.9-(b).
In order to hide communication latency, it is necessary to delay the computation asso-
ciated with receive operations. The start of the computation of subtile B0 and B1 can be
delayed by h0 time steps from subtile A, so that the latency of transmitting the data from
S1 and S2 to R1 and R2, respectively, is hidden. However, in subtile C, the R3 and R4 need
to receive the data from S3 and S4 in tiles, B0 and B1, respectively. Hence execution of C
should be delayed by h0 time steps from subtile B0 and B1, or 2 h0 time steps from subtile
A. The projections of the subtiles after these delays are shown in Figure 4.10. The dashed
trapezoids are the projections of the subtiles behind the others.
60
i
1
i
0
A
B
1
B
0
C
S
0
S
1
S
2
S
4
S
3
R
2
R
1
R
0
R
3
R
4
(a) Slice of a tile when t = t0.
A
B
0
B
1
CS0(t0)
S1 (t0)
S2(t0)
R1(t0-h')
S4(t0-h')
R2(t0-h') S3(t0-h')
R4(t0-2hh')
R0(t0-2hh')
R3(t0-2hh')
stage 2 (delay h' ) stage 3 (delay 2hh' )stage 1
(b) Dependence among the operations within each time step.
Figure 4.9: Schedule of subtiles for each time step.8j, Sj and Rj are the corresponding send
and receive pair.
t
i
0
A
B
0
C
B
1
h'
2hh' t
i
1
A
B
1
C
B
0
h'
2hh'
Figure 4.10: Projections of subtiles with 2-dimensional stencils.
4.2.4 Higher Dimensions
For ISLs with higher dimensional stencils, Conjugate-Trapezoid Tiling is more complicated.
For example, the 3-dimensional Jacobi code is a 27-point stencil, where the iteration space
is 4-dimensional and requires a 3  3 cube of elements to compute each output element.
61
Although we cannot draw the shape of 4-dimensional tiles, the approach discussed in Section
4.2.3 for a 3-dimensional iteration space can be used to apply tiling to the 4-dimensional
iteration space. The 4-dimensional cone that contains all the dependence vectors is projected
on to < ~t;~i0 >, < ~t;~i1 > and < ~t;~i2 > planes. Based on the projection of the dependence
cone, we apply the same tiling strategy as in Figure 4.4 to produce parallelogram tiles on
each 2-dimensional plane. The resulting tile is a 4-dimensional hyper-parallelepiped. Subtiles
are computed as shown Figure 4.5-(a) on the projection of the hyper-parallelepiped tile on
each plane, so that the projections of subtiles on each plane are conjugate trapezoids. The
total number of subtiles within a tile is 23 = 8. For a given time step, the slice of a hyper-
parallelepiped tile is a cuboid, as shown in Figure 4.11, where subtile A send the boundary
data to 7 neighbor nodes, and subtile D receive data from other 7 neighbor nodes. Thus,
for each time step, the computation of subtile B0, B1 and B2 depends on subtiles A, and
subtiles C0, C1 and C2 depend on subtiles B0, B1 and B2, and subtile D depend subtiles C0,
C1 and C2. Hence, when delaying subtiles to hide communication latency, the subtiles are
divided into four stages. The rst stage includes subtile A, as well as the send and receive
operations associated to A; the second stage, includes subtiles B0, B1 and B2, which are
delayed by h0 time steps from subtile A; the third stage, includes subtiles C0, C1 and C2,
that are delayed by 2 h0 time steps; and the fourth stage includes the subtile D, which is
delayed 3 h0 time steps.
A
B
2
B
0
B
1
C
1
C
0
C
2
D
i
2
i
0
i
1
Figure 4.11: Slice of a 4-dimensional hyper-parallelepiped tile.
Based on the above discussion, the algorithm of Conjugate-Trapezoid Tiling for ISLs with
62
n-dimensional stencil ((n + 1)-dimensional iteration space) can be summarized as shown in
Algorithm 1.
Algorithm 1 Conjugate-Trapezoid Tiling for ISL with d-dimensional stencil ((d + 1)-
dimensional iteration space)
1: Compute ~Bleft(~t;~ij) and ~Bright(~t;~ij) of the cone of dependence vectors for each plane
< ~t;~ij > (j=0,1,...,d  1).
2: Compute H = (~hT0 ,
~hT1 , ...,
~hTd 1, ~h
T
d )
T . Each row of H is the normal vector of a tiling
hyperplane. The last row ~hd = (1=h; 0; 0; :::; 0). Other rows ~hj (j = 0; 1; :::; d   1) are
computed as follows:
Proj<~t;~ij>(
~hj) =
1
wj
 ~Bright(~t;~ij) 

0 1
 1 0

The rest of the components in ~hj are lled with zeros.
3: Tile the iteration space with the tiling hyperplanes dened by H. The height of each tile
is h, and the width of each tile on~ij dimension is wj. Distribute tiles on nodes organized
in a d-dimensional mesh <~i0, ~i1, ..., ~id 1 >.
4: Divide the projection of the tile on each < ~t;~ij > plane into two conjugate trapezoid
subtiles. The boundary of between subtiles is parallel to ~Bleft(~t;~ij). This results in 2
d
subtiles in each (d+ 1)-dimensional tile in total.
5: Partition the 2d subtiles into d + 1 stages according to the dependence among subtiles.
Schedule the subtiles in order of stage. Delay the k-th stage by k  h0 time steps
(k = 0; 1; ::; d).
6: In each subtile, receive necessary boundary data from neighbor nodes before computa-
tion, and send boundary data to neighbor nodes after computation.
4.3 Performance Model
This section introduces a performance model of Conjugate-Trapezoid Tiling which is used
to the explain the experimental results.
63
4.3.1 Denitions
At each time step, the total number of stencil points computed by each node is:
Ncomp =
d 1Y
j=0
wj
Ncomp stands for the volume of the d-dimensional tile slice at each time step. The order of
Ncomp is d. If the average computation time for each stencil point is tpoint, the computation
time at each time step would be:
tcomp = tpoint Ncomp = tpoint 
d 1Y
j=0
wj
Only the data points at the boundaries of the tile need to be communicated (to be sent or
received) with neighboring nodes. For 1-dimensional Jacobi, the total number of data points
to be sent at each time step is 2. For 2-dimensional Jacobi, the grey area in Figure 4.9-(a)
shows the data points to be sent. I assume that the messages sent to dierent destination
nodes do not cause congestion with each other, so I only need to consider the largest message,
which would cause the longest latency. The maximum number of data points to be sent at
each time step is 2  maxfw0; w1g. In general, for a n-dimensional stencil, the maximum
number of data points to be sent at each time step is:
Vsend = c maxf
n 2Y
k=0
wjk j8~j = (j0; j1; :::; jn 2)g
in which c is a constant determined by the stencil. For the Jacobi stencils of any dimension,
c = 2, which means Jacobi stencil only requires each points to exchange data with its direct
neighbor points. Intuitively, Vsend is proportional to the size of one "surface" of the n-
dimensional tile slice at each time step, so the order of Vsend is n  1. Because of symmetry,
64
for any non-boundary node, the volume of data to be sent is equal to the volume of data to
be received, so let Vsend = Vrecv = V .
For each send -receive pair, let tlatency denote the latency. In practice tlatency consists of two
partss: the xed startup overhead tstartup and the transmission latency which is determined
by the size of the data to be transmitted:
tlatency = tstartup + ttrans  V
For small messages, the total latency is dominated by the startup overhead, so tlatency 
tstartup. However, if the message is large, the total latency is proportional to the data size
being transmitted:
tlatency  ttrans  V  V
For each send or receive operation, even with asynchronous mode, there is some CPU
time that cannot be overlapped with computations. Let tsend and trecv denote the overhead of
asynchronous send and receive operations, respectively. Similarly, because of symmetry, for
any non-boundary node the total number of send operations must be the same to the total
number receive operations at each time step. Let Nsend = Nrecv = Nmsg. Nmsg is determined
by the number of neighbor points that needs to do data exchange, which is a constant for
a given stencil. Because the total number of neighbor points for a n-dimensional stencil is
3n  1, the upper bound of Nmsg is (3n  1)=2. For the Jacobi stencil, the values of Nmsg are
1, 3, 7 in the 1, 2, 3-dimensional cases, respectively.
65
4.3.2 Performance Model
If communication and computation are not overlapped, the total execution time at each time
step is:
texec = tcomp + (tsend + trecv) Nmsg + tlatency
= tpoint Ncomp + (tsend + trecv) Nmsg
+tstartup + ttrans  V
Consider the Conjugate-Trapezoid Tiling proposed in this paper. Let the total number
of points computed at each time step be N 0comp. At each time step in the steady state,
N 0comp  Ncomp. Because of subtiling, the size of the message sent by each subtile is smaller,
so the maximum number of data points to be sent V 0 is smaller than V : V 0  V . The cost
is that the total number of send or receive operations, N 0msg, becomes larger: N
0
msg  Nmsg.
For the Jacobi stencil, with subtiling the values of N 0msg are 1, 5, 19 in the 1, 2, 3-dimensional
cases respectively. The number of messages to be sent for 2-dimensional Jacobi. is shown
in Figure 4.9-(a). Figure 4.12 shows the number of messages to be sent for 3-dimensional
Jacobi. For standard tiling, 7 messages need to be send to dierent neighboring tile (1
message for the data on the vertex, 3 messages for 3 edges and another 3 for the 3 faces), as
shown in Figure 4.12-(a). After Conjugate-Trapezoid Tiling, some messages are divided into
2 (the original edge messages) or 4 messages (the original face messages) due to subtiling, as
shown in Figure 4.12-(b). However, those message that are divided from the same original
message are still to be sent to the same destination tile. The total number of messages
becomes 19. The increase of the number of the messages may cause performance problems,
which will be addressed in Section 4.3.3.
Because subtiles are delayed in Conjugate-Trapezoid Tiling, there are at least h0 time
steps between each send and receive pair. Let t0exec denote the total execution time at each
66
(a) Nmsg = 7. (b) N
0
msg = 19.
Figure 4.12: Number of messages to be sent for 3-dimensional Jacobi. Each arrow stands
for a message to be sent to an neighboring tile. The arrows with the same direction means
that the destination of the messages are the same.
time step with Conjugate-Trapezoid Tiling, and t0latency = tstartup+ ttrans V 0 is the latency of
the largest message. Then if t0latency is smaller than the total execution time of h
0 time steps,
h0  t0exec  t0latency = tstartup + ttrans  V 0
the communication latency can be fully overlapped:
t0exec = t
0
comp + (tsend + trecv) N 0msg
= tpoint N 0comp + (tsend + trecv) N 0msg (4.1)
According to above, the minimum number of delayed time step h0 to fully hide communication
latency is determined by:
h0  t
0
latency
t0exec
=
tstartup + ttrans  V 0
tpoint N 0comp + (tsend + trecv) N 0msg
(4.2)
67
The performance benet achieved by Conjugate-Trapezoid Tiling is:
 = texec   t0exec
= tpoint  (Ncomp  N 0comp)
+(tsend + trecv)  (Nmsg  N 0msg) + tlatency
 tlatency   (tsend + trecv)  (N 0msg  Nmsg) (4.3)
4.3.3 Optimization
Communication-Coalesce Optimization
With Conjugate-Trapezoid Tiling, the number of send and receive operations at each time
step, N 0msg, is greater than Nmsg, which can cause performance degradation according to
Equation 4.3. N 0msg  Nmsg is because the data to be exchanged with the same neighbor
node are divided into two or more messages by the delayed subtiles. For example, in Figure
4.9, the destination of message S1 and S4 is the same node. However, after schedule of
subtiles, within the same time step t0, S1 is provided by subtile A with the data stands
for the original time step of t0, but S4 is provided by subtile B0 with the data stands for
the original time step of t0   h0. This leads to S1 and S4 becoming two separate sending
operations, and so as S2 and S3. If postpone S1 until the computation in subtileB0 completes,
the send operations of S1 and S4 can be merged into one because the destination node is the
same. S2 and S3 can also be merged by postponing S2 until subtile B1 is complete. Similarly,
for receive operations if bring R4 forward before subtile B0 starts, R1 and R4 can be merged
because they have the same source node, and so as R2 and R3.
Figure 4.13 shows how communication operations are merged. Compared to Figure
4.9-(b), S1 and S2 are postponed while R3 and R4 are moved into an earlier stage. This
optimization of merging communication operations is always possible, because send opera-
68
AB
0
B
1
CS0(t0)
S1 (t0)
S2(t0)
R1(t0-h')
S4(t0-h')
R2(t0-h') S3(t0-h')
R4(t0-2hh')
R0(t0-2hh')
R3(t0-2hh')
stage 2 (delay h' ) stage 3 (delay 2hh' )stage 1
S1 (t0)
S2(t0)
R4(t0-2hh')
R3(t0-2hh')
Figure 4.13: Merge communication operations. The send or receive operations with in the
same dashed box can be merged. Compared to Figure 4.9-(b), the send or receive operations
denoted by the dashed arrows are moved (delayed or brought forward) to the operations
denoted by corresponding thick arrows.
tions only need to be postponed, and receive operations only need to be moved to an earlier
stage. This optimization still preserve the correctness of communication. Although S1 and
R4 are the corresponding send and receive operations, and S1 and R4 are moved into the
same stage as shown in Figure 4.13, there are still h0 time steps of execution between the
corresponding send and receive operations in the pipeline (when S1 sends out the data for
time step t0, R4 is receiving the data for time step t0   h0). Hence this optimization does
not aect the threshold of h0 calculated in Equation 4.2.
Because the send or receive operations with the same destination or source node are
merged into one operation, this optimization results in that N 0msg = Nmsg and V
0  V on
the steady state of the pipeline.
69
Balanced Multi-Threading
If nodes have more than one processors, the computation can be evenly distributed to every
processor because the stencil points at each time step are fully parallelizable. However, the
overhead of asynchronous send or receive operations can not be reduced through paralleliza-
tion. This optimization is to balance the overall load of computation and communication
operations. As shown in Figure 4.14, the white bars stand for computation time, and the grey
bars stand for communication operation overhead. When nodes work in sequential mode,
the total execution time of each time step is t0exec = t
0
comp+(tsend+ trecv) N 0msg. If nodes work
in parallel mode, assume each node has 4 processors, the computations can be divided into
partitions and distributed on to each processor. However, the overhead of communication
operation cannot be divided. As a result, instead of evenly dividing the computations in
into 4 partitions, the partition(s) which produces data for sending or depends on the data
to be received are designed to contain less computations. So the overall load distribution is
balanced.
time
Sequential 
Mode
Parallel 
Mode
Processor 0
Processor 1
Processor 2
Processor 3
t'exec
t'comp tsend ·N’op
t'exec /4
trecv ·N’op
Figure 4.14: Balanced multi-threading with four processors. The white bars stand for com-
putation time, and the grey bars stand for communication operation overhead.
In general, if each node has P processors, and maxftsend; trecvg  t0exec=P , ideally the
execution time in parallel mode can be t0exec=P if work load among threads are well balanced.
70
This is equivalent to that the amortized overhead of each communication operation is also
reduced by parallelization, even though that the communication operation it self is not
parallelized.
4.4 Unied Tiling Representation
This section uses the unied tiling representation framework dened in Section 2.1 to describe
the Conjugate-Trapezoid Tiling discussed in this Chapter.
The application of Conjugate-Trapezoid Tiling focuses on regular ISLs with the format
shown in Figure 4.2. For ISLs with n-dimensional stencils, the iteration space I is an (n+1)-
dimensional hyper-cuboid, so the basis matrix of I is a diagonal matrix:
I = span(E) = span(
0BBBBBBBBBB@
T; 0; 0; :::; 0
0; N0; 0; :::; 0
0; 0; N1; :::; 0
:::
0; 0; 0; :::; Nn 1
1CCCCCCCCCCA
):
The rst dimension of I is the time dimension, consisting of the time steps of the ISL. The
rest of the n dimensions are the space dimensions.
The stencil of an ISL may contain m points, which means that the computation of each
data points depends on m neighboring data points from the previous time step. This results
in m dependence vectors ~d0, ~d1, ..., ~dm 1 in the iteration space I. Because each time step
only needs the results from the previous time step, the rst component of each dependence
71
vector must be 1. So the dependence matrix D must have the following form:
D =
0BBBBBBB@
~d0
~d1
:::
~dm 1
1CCCCCCCA
=
0BBBBBBB@
1; 
1; 
:::
1; 
1CCCCCCCA
:
Because the tile shape T is an (n+1)-dimensional parallelepiped, there must exist a tiling
matrix T :
T =
0BBBBBBB@
~t0
~t1
:::
~tn
1CCCCCCCA
;
such that T = span(T ). Since the projection of T on <~i0;~i1; :::;~in 1 > is an n-dimensional
hyper-rectangle, 1 to n rows of T must be:
~t1 = (w0; 0; 0; :::; 0);
~t2 = (0; w1; 0; :::; 0);
:::
~tn = (0; 0; 0; :::; wn 1):
For the rst row of T , ~t0 is determined by the height of the parallelepiped tile h and the
dependence vectors. More specically, the projections on ~t0 on each < ~t0;~ij > is parallel to
~Bright(~t;~ij):
Proj<~t;~ij>(
~t0) = cj  ~Bright(~t;~ij):
72
And because the height of the parallelepiped tile is h, the rst component of Proj<~t;~ij>(
~t0)
must be h:
Proj<~t;~ij>(
~t0) = (h; xj) = cj  ~Bright(~t;~ij)
According to above equations:
xj =
~Bright(~t;~ij)  (0; 1)T
~Bright(~t;~ij)  (1; 0)T
 h; j = 0; 1; :::; n  1
~t0 = (h; x0; x1; :::; xj; :::; xn 1):
Since Conjugate-Trapezoid Tiling is tessellating tiling, the repetition matrix R = T .
After tiling, there still exists "horizontal" dependences (dependence vectors that are perpen-
dicular to the dimension of ~t). The dependence matrix between tiles are as follows:
D0 =
0BBBBBBBBBBBBBBBBBBBB@
1; 
1; 
:::
1; 
0; 
0; 
:::
0; 
1CCCCCCCCCCCCCCCCCCCCA
:
The rows in D0 with the rst component 0 are the dependence vectors perpendicular to
the ~t dimension. Because of these "horizontal" dependences, not enough parallelism can be
exposed among tiles if each tile are executed atomically (as shown in Figure 4.4). However,
after subtiling and delaying dependent subtiles, tiles within the same level can be executed
73
in parallel. As a result, the following schedule of tiles become valid:
S =
0BBBBBBB@
1
0
:::
0
1CCCCCCCA
;
S(~i0) = ~i0  S = (i00; i01; :::; i0n)  S = i00:
Similar to overlapped tiling, the purpose of the technique of subtiling and delaying dependent
subtiles is to make the above schedule S valid and expose enough parallelism for coarse grain
tiles.
To sum up, Table 4.1 shows the unied representation of Conjugate-Trapezoid Tiling.
4.5 Evaluation
In this section, Conjugate-Trapezoid Tiling is evaluated on a distributed memory machine
and the experimental results are presented.
4.5.1 Target Platform
The measurement of the performance is done with a distributed memory cluster of 32 nodes.
Each node has an Intel Xeon quad-core X3430 CPU at 2.4GHz. The nodes are interconnected
with a gigabit Ethernet. The MPI environment is MPICH2. The compiler is GCC 4.7.2,
with the option -O3.
74
Input I;D =
0BB@
1; 
1; 
:::
1; 
1CCA
D0 D0 =
0BBBBBBBBBB@
1; 
1; 
:::
1; 
0; 
0; 
:::
0; 
1CCCCCCCCCCA
, after subtiling: D0 =
0BB@
1; 
1; 
:::
1; 
1CCA :
T T =
0BB@
~t0
~t1
:::
~tn
1CCA =
0BBBB@
h; x0; x1; :::; xj; :::; xn 1
w0; 0; 0; :::; 0
0; w1; 0; :::; 0
:::
0; 0; 0; :::; wn 1
1CCCCA
xj =
~Bright(~t;~ij)(0;1)T
~Bright(~t;~ij)(1;0)T  h; j = 0; 1; :::; n  1
R R = T:
S S =
0BB@
1
0
:::
0
1CCA :
Table 4.1: Representation of Conjugate-Trapezoid Tiling
75
Stencil Input Points
Dimension Size of Stencil
1-D Jacobi 1 512K 3
2-D Jacobi 2 1024 x 512 9
3-D Jacobi 3 64 x 64 x 64 27
PathFinder 1 512K 3
Poisson 2 1024 x 512 5
Biharmonic 2 1024 x 512 13
HotSpot 2 1024 x 512 9
Cell 3 64 x 64 x 64 27
Table 4.2: Benchmarks
4.5.2 Implementation
The Conjugate-Trapezoid tiling discussed in this chapter is implemented in the Cetus com-
piler [21], which is a source-to-source C compiler. The inputs to the compiler are the se-
quential C codes of the ISLs. The dependence analysis result of Cetus provides the stencils
denitions of the ISLs, which is the input to the tiling algorithm. The implementation use
the Omega library [22] to perform tiling operations on the iteration space polyhedron. The
code generation of Cetus is modied to generate parallel code with MPI APIs. In order to
implement the optimization discussed in Section 4.3.3, the computation within each node is
parallelized with Pthreads instead of OpenMP.
4.5.3 Benchmark
The same 8 ISL programs are evaluated as benchmarks as in Chapter 3: 1,2,3-dimensional
Jacobi, PathFinder, Poisson, Biharmonic, HotSpot and Cell. These benchmarks contains 5
dierent stencils, from 1-dimensional to 3 dimensional. The input size for each benchmarks
is listed in Table 4.2. Table 4.3 shows dierent stencils and the corresponding dependence
vectors.
76
Stencil Dependence Vectors Benchmark
1-D Jacobi,
(1; 1); (1; 0); (1; 1) PathFinder
(1; 0; 0),
(1;1;1); 2-D Jacobi,
(1; 0;1); HotSpot
(1;1; 0)
(1; 0; 0),
(1; 0;1); Poisson
(1;1; 0)
(1; 0; 0);
(1;1;1);
(1; 0;1); (1;1; 0); Biharmonic
(1; 0;2); (1;2; 0)
8x =  1; 0; 1; 8y =  1; 0; 1; 3-D Jacobi,
8z =  1; 0; 1; (1; x; y; z) CELL
Table 4.3: Dierent stencils of the benchmarks
4.5.4 Experimental Result
Performance Overview
Figure 4.15-(a) shows the speedup achieved by Conjugate-Trapezoid Tiling with dierent
values of h0. The baseline is the performance achieved by tiling with no communication
overlap. On average, Conjugate-Trapezoid Tiling achieves 1:9X speedup when h0 = 4.
Among all the benchmarks, the common trend is that, as the value of h0 becomes larger
than 4, higher performance is achieved. If communication latency is longer than the com-
putation time within a tile, increasing h0 will delay the dependent subtiles, which are the
receivers of messages, so that larger portion of communication latency can be hidden. This
trend exactly shows the motivation of Conjugate-Trapezoid Tiling. However, It can also be
seen that after exceeding a certain threshold, there is very little or no performance gain by
keeping increase h0. This is because if h0 is already large enough to hide the entire latency,
77
00.5
1
1.5
2
2.5
3
No Overlap
h'=1
h'=2
h'=3
h'=4
h'=5
h'=6
h'=7
h'=8
S
p
eed
u
p
 O
v
er N
o
 O
v
erlap
1-D
Stencils
2-D
Stencils
3-D
Stencils
(a) Speedup with the Communication-Coalesce optimization.
0
0.5
1
1.5
2
2.5
3
No Overlap
h'=1
h'=2
h'=3
h'=4
h'=5
h'=6
S
p
eed
u
p
 O
v
er N
o
 O
v
erlap
S
p
eed
u
p
 O
v
er N
o
 O
v
erlap
(b) Speedup without the Communication-Coalesce optimization.
Figure 4.15: Speedup with dierent values of h0. The baseline is the performance achieved
by tiling with no communication overlap. The gure shows the performance with h0 up to 6
for 2-dimensional stencils and h0 up to 4 for 3-dimensional stencils, because the trend shows
that there would be no additional performance gains with a larger h0.
78
there will be no extra benet with an even larger h0. Equation 4.2 in Section 4.3.2 discussed
the threshold of h0. In Figure 4.15, for the benchmarks with 1-dimensional stencils (1-D Ja-
cobi and PathFinder), the threshold is 6 or 7, while for the benchmarks with 2-dimensional
stencils (1-D Jacobi, Poisson, Biharmonic and HotSpot) the threshold is 3 or 4. What is
more, for 3-D Jacobi and Cell, which are the benchmarks with 3-dimensional stencils, peak
performance is achieved when h0 = 1. This can be explained by Equation 4.2 as follows. The
input sizes for each benchmarks shown in Table 4.2 are chosen to make the number of stencil
points allocated to each node be the same across all benchmarks. So Ncomp is the same each
benchmarks, and so is N 0comp because Ncomp  N 0comp. However the amount of computation
of higher dimensional stencils is usually larger than that of lower dimensional stencils, such
as 1-D Jacobi (3 points) versus 2-D Jacobi (9 points). For higher dimensional stencils the
amount of computation of each stencil point is larger, and more time consuming, too. This
means that tpoint is larger for higher dimensional stencils, which contributes one factor that
leads to a smaller threshold of h0 according to Equation 4.2. The other factor is that for
higher dimensional stencils each tile has more neighboring tiles, so Nmsg and N
0
msg are also
larger. The above two factors makes a larger denominator in Equation 4.2, and a smaller
threshold of h0 as a result.
Communication-Coalesce Optimization
Figure 4.15-(b) shows the speedup achieved without the Communication-Coalesce optimiza-
tion introduced in Section 4.3.3. Since for 1-dimensional stencils Nmsg = N
0
msg = 1, there
is no dierence with or without the Communication-Coalesce optimization. So 1-D Jacobi
and PathFinder are not included in Figure 4.15-(b). Compared to Figure 4.15-(a), it can be
seen that 2-dimensional stencils benchmarks (2-D Jacobi, Poisson, Biharmonic and HotSpot)
achieve a lower speedup without the Communication-Coalesce optimization. Another obser-
vation is that the peak performance is reached with a smaller h0: without the optimization
79
020000
40000
60000
80000
100000
1-D Jacobi 2-D Jacobi 3-D Jacobi CELL
No Overlap
h'=1
h'=2
h'=3
h'=4
h'=5
h'=6
h'=7
h'=8
Cycles
110000 180000
Figure 4.16: Communication overhead with dierent h0.
the peak performance is reached around h0 = 3 while in Figure 4.15-(a) the peak performance
is reached around h0 = 4 for most 2-dimensional stencils benchmarks. Because without the
optimization, the additional overhead introduced by (N 0msg   Nmsg) more send and receive
operations, the execution time at each time step becomes longer. So fewer time steps are
needed to overlap with communication latency. This problem is more serious for bench-
marks with 3-dimensional stencils, since for those benchmarks, without the optimization
N 0msg = 19, which is much larger compared to Nmsg = 7. As shown in Figure 4.15-(b), 3-D
Jacobi and CELL can only achieve very small speedup with any h0. The overhead of addi-
tional communication operations almost kill the performance benet gained by overlapping
communication and computation.
Communication overhead
Figure 4.16 shows the communication overhead with dierent h0. The overhead is measured
by the average number of cycles taken by each blocking receive call MPI_Recv. So this over-
head shown in Figure 4.16 stands for the trecv plus the part of tlatency that is not overlapped
with computation. The bar of "No Overlap" corresponds to trecv + tlatency, which is the
upper bound of the overhead shown in the gure. It can be seen that from 1-D Jacobi to 3-D
Jacobi, as the dimension of the stencil increases, the value of trecv + tlatency also increases.
This is because according to the performance model dened in Section 4.3.2, if assuming
80
that trecv and tstartup are constant, the other part of tlatency is proportional to the number of
data points to be communicated (V , which is the volume of the data to be sent or received),
and the order of V is d  1. This means that because higher dimensional stencils have
larger communication volume, the communication overhead, if not hidden by overlapping
with computations, is higher. In Figure 4.16 the "No Overlap" bar of CELL is lower than
3-D Jacobi. CELL and 3-D Jacobi share the same stencil, which means the number of data
points to be communicated is the same. However, each data point of CELL only takes one
byte storage, while each data point of 3-D Jacobi is a oating point value. So the actual
communication volume V of CELL is smaller than that of 3-D Jacobi.
As the value of h0 increases, the communication overhead decreases for every benchmark
shown in Figure 4.16, because more portion of communication overhead is hidden by over-
lapping it with computations. This can explain the trend of performance shown in Figure
4.15. When h0 exceeds some threshold, communication overhead does not reduce any more
because tlatency is fully hidden and the lower bound trecv is reached. As in Figure 4.15,
benchmarks with higher dimensional stencils reach the lower bound sooner with a smaller h0
compared to benchmarks with lower dimensional stencils.
Input Size Sensitivity
Figure 4.17 shows the speedup with dierent input size for 1-D and 2-D Jacobi. For both
benchmarks, h0 reaches the threshold with a smaller value with larger input size. This is
because larger input size results in larger Nmsg and N
0
msg. This makes the denominator in
Equation 4.2 larger and thus a smaller threshold of h0. It also shows than for both benchmarks
as the input size getting larger, the peak speedup achieved by Conjugate-Trapezoid Tiling
is lower. This is because a larger input size leads to larger tile size if the total number of
computation nodes remains the same. When the large tile size is large enough the program
becomes CPU-bound instead of communication-bound. The optimization opportunity for
81
00.5
1
1.5
2
2.5
3
3.5
4
0 1 2 3 4 5 6 7 8
256K
512K
1024K
2048K
S
p
eed
u
p
 O
v
er N
o
 O
v
erlap
h' 0
0.5
1
1.5
2
2.5
0 1 2 3 4 5 6
768×384
1024×512
1440×720
2048×1024
S
p
eed
u
p
 O
v
er N
o
 O
v
erlap
h'
(a) 1-D Jacobi (b) 2-D Jacobi
Figure 4.17: Input size sensitivity.
hiding communication latency for a cpu-bound program is small.
4.6 Conclusion
This chapter discusses the design and implementation of a new tiling scheme called Conjugate-
Trapezoid Tiling for ISLs. The proposed tiling scheme divides each tile into subtiles, and
schedules subtiles within each tile to overlap computation and communication. This chap-
ter describes the tiling algorithm in polyhedron model, and compare Conjugate-Trapezoid
Tiling and standard tiling analytically. The proposed tiling scheme is implemented in Cetus
compiler to automatically generated code for d-dimensional stencils. Experimental results on
a cluster show that Conjugate-Trapezoid Tiling is able to eectively reduce the overhead of
communication latency, and achieves signicant speedup over traditional tiling strategy.gy.
82
Chapter 5
Tile Shape Selection for Hierarchical
Tiling
In this chapter, the tile shape selection for hierarchical tiling is studied. In hierarchical tiling,
tile shapes at dierent levels interact with each other and together determine execution time.
This means that, in order to minimize execution time, simply selecting tile shape for each
level separately will not necessarily lead to the optimal choice; optimal tile shape selection
can only be achieved by tackling hierarchical tiling in a global manner. Figure 5.1 shows
an example of dierent tile shape choices. Arrows represent the dependences between tiles.
Dashed lines show the tiles that can be executed in parallel in a wavefront schedule. The
numbers stand for the order of the execution the wavefronts. Suppose the iteration space of
the loop nest has two dependences: ~d0 = (1; 0) and ~d1 = (0; 1). There are dierent possible
tile shapes for each level of tiling. The most common choice is the square. In Figure 5.1-
(a), both levels use square tiles. The tiles along the same diagonal line can be executed in
parallel. Assume that the computation time of each level 0 tile is one unit time, the total
execution time of schedule of in Figure 5.1-(a) is 77 = 49 units of time. There are, however,
better tile shapes. If the level 1 tile shape is changed to a parallelogram as shown Figure
83
1  2  3   4
7
6
5 
1      2     3      4
5 
6
7 
1
2 
3 
4 
1  2  3  4  5  6 7  8 
9 
10 
11 
4 
5 
6 
7 
3 
2 
1 
1  2  3  4  5  6 7  8 
9 
10 
11 
Level 0
Tile Shape
Level 1
Tile Shape
Iteration Space Execution time
7×7=49
4×11=44
7×11=77
(a)
(b)
(c)
Figure 5.1: Dierent tile shape choices for hierarchical tiling. Arrows are the dependences
between tiles. Dashed lines represent the tiles that can be executed in parallel in a wavefront
schedule. The numbers stand for the order of the wavefronts
5.1-(b), the total number of time unit of execution time becomes 411 = 44 < 49. However,
parallelogram shaped tile is not always the best choice. If we also choose parallelogram tile
for the level 0 tiling, as shown in Figure 5.1-(c), the total execution time would increase to
7  11 = 77 units of time (here it is assumed that the parallelogram tile at level 0 takes
the same time to execute as the square tile.). A quantitative model can be used to nd the
optimal tile shape choice combination for each level of tiling to achieve minimal execution
time.
84
5.1 Problem Denition
Given the iteration space I with the shape of n-dimensional hyper-parallelepiped, assume
~ek = (ek;0; ek;1; :::; ek;n 1), k = 0; 1; :::; n 1 are the n edge vectors of the hyper-parallelepiped,
then the following matrix E is the basis matrix of I:
E =
0BBBBBBB@
~e0
~e1
:::
~en 1
1CCCCCCCA
:
And I = span(E).
Assume there are m dependences in I. The m dependence vectors ~d0; ~d1; :::; ~dm 1, forms
an m n dependence matrix D as follows:
D =
0BBBBBBB@
~d0
~d1
:::
~dm 1
1CCCCCCCA
:
Without loss of generality, I assume that m  n and there are n dependence vectors that
are linearly independent 1.
Then select a tile shape T with n-dimensional hyper-parallelepiped shape, and ~tk =
(tk;0; tk;1; :::; tk;n 1), k = 0; 1; :::; n  1 are the n edges of the hyper-parallelepiped tile, so the
1Otherwise, it must be possible to make at least one loop full permutable through a sequence of ane
transformations, and then it can be simplied into a lower dimensional iteration space.
85
following matrix T is the tiling matrix (the basis matrix of space T):
T =
0BBBBBBB@
~t0
~t1
:::
~tn 1
1CCCCCCCA
:
As discussed in previous chapters, if using tile shape T(T) to tile the iteration space
I, the tiled space I 0 is another n-dimensional iteration space. It is possible to apply tiling
transformation for I 0 again. To describe an l-level hierarchical tiling, dene a sequence of
iteration spaces I0, I1,..., Il:
I0 = I; I1 = I
0
and 8k = 1; 2; :::; l, Ik is the tiled space of Ik 1. This follows the bottom-up approach of
hierarchical tiling. Since each Ik must have the shape of n-dimensional parallelepiped, dene
Ek is the basis matrix for Ik:
8k = 0; 1; :::; l; Ik = span(Ek):
At each level of tiling, the tile shape is Tk. In order to simplify the problem, tile shapes
are restricted to n-dimensional parallelepiped, too. Let Tk be the tiling matrix at each level,
8k = 0; 1; :::; l   1; Tk = span(Tk):
The tiles at dierent levels are scheduled independently. And each tile is executed atomically,
which means there is no communication with other tiles at the same level before computation
within it is completed.
With the denitions above, the problem is how to select tile shapes T0, T1, ..., Tl 1 for
86
t
1
t
0
d
0
d
1
d
2
Figure 5.2: Constraints of tile shape imposed by dependences.
each level of tiling, so that with any valid scheduling of tiles on each level, the total execution
time of the nal loop nest is minimized.
5.2 Constraints on Tile Shape Selection
In the discussion of this chapter, tile shapes are restricted with hyper-parallelepiped. And
the tiles on each tiling level are scheduled atomically. Besides, tiles are tessellating, which
means there is no overlapped between tiles introduced to eliminate dependences. With the
assumptions above, Figure 5.2 describes the constraints of tile shape selection imposed by
dependences in general. In order to produce a valid tile shape, the innite cone spanned
by extending the tile edges ~t0, ~t1, ..., ~tn 1 must contain every dependence vector. For the
example in Figure 5.2, ~t0 and ~t1 must not be in the shaded area. To describe the above
observation formally, each dependence vector ~dk must be covered by the cone spanned by
the extension cords of ~t0, ~t1, ..., ~tn 1,
8~dk; 9~a = (a0; a1; :::; an 1)  ~0n = (0; 0; :::; 0) (5.1)
~dk = a0  ~t0 + a1  ~t1 + :::+ an 1  ~tn 1 = ~a  T
Assume that every tile is always large enough so that inter-tile dependences only exist
between adjacent tiles (including diagonal adjacent). This requirement can be expressed as
87
follows:
8~dk; 9~a = (a0; a1; :::; an 1)  ~1n = (1; 1; :::; 1) (5.2)
~dk = a0  ~t0 + a1  ~t1 + :::+ an 1  ~tn 1 = ~a  T
The above two requirements can be merged into the following:
8~dk; 9~a; ~0  ~a  ~1; ~dk = ~a  T (5.3)
which is the constraint imposed by dependences when choosing the tile shape.
When the original iteration space I is tiled by matrix T and produces the tiled space I 0,
The relation among I, I 0 and T is studied as follows. First, assume the tiling transformation is
an ane transformation, the shape of I 0 should still be an n-dimensional hyper-parallelepiped
under any ane transformation. Suppose E 0 is the basis matrix of I 0 and I 0 = span(E 0),
which is the space of tiles. There must be some n n matrix A such that:
E 0 = E  A
Next, consider the eect of tiling transformation represented by A. As shown in Figure 2.4,
after tiling, under the new coordinates system shown as axis labels i00 and i
0
1, vector ~t0 and
~t1 become (1; 0) and (0; 1), respectively. More generally, we have the following equation (1n
denotes the n n identity matrix):
T  A =
0BBBBBBB@
~t0
~t1
:::
~tn 1
1CCCCCCCA
 A =
0BBBBBBB@
1; 0; :::; 0
0; 1; :::; 0
:::
0; 0; :::; 1
1CCCCCCCA
= 1n
88
e0
e1
dk=(2,2)
(1,0)
(0,1)
t1=(0,8)
t0=(4,0)
Figure 5.3: Inter-tile dependences (thicker arraows) generated by original dependence vector
~dk.
It can be concluded that A = T 1. So the relation among I, I 0 and T is: E 0 = E  T 1. So
T 1 represents the ane transformation of the tiling transformation with tile shape T .
When the original iteration space I is tiled by matrix T , each dependence vector ~dk is
also transformed to ~d0k by the ane transformation represented by T 1. T must satisfy the
constraints of Equation (5.3),
~d0k = ~dk  T 1 = ~a  T  T 1 = ~a; ~0n  ~a  ~1n (5.4)
As shown in Figure 5.3, a dependence vector (2; 2) would result in three dependence
vectors after tiling: (1; 0), (0; 1), and (1; 1). In general, a single dependence vector ~dk in
original iteration space I generates one or more dependences between tiles in the tiled itera-
tion space I 0. The rule is that, for ~d0k = ~a = (a0; a1; :::; an 1), if aj > 0, there is a dependence
(0; :::; 1; :::; 0) imposed on the j-th dimension in I 0. In addition, if there are n dependence
vectors ~dk0 ,
~dk1 , ...,
~dkn 1 which are linearly independent, the inter-tile dependences in I
0
generated by those n dependence vectors must include n orthonormal unit vectors.
Considering the above observation in the context of hierarchical tiling dened in last
section, if at the k-th level of tiling, the dependence matrix in iteration space Ik is Dk, Dk
89
must have the following form:
D0 = D;
Dk =
0BBBBBBB@
~dk;0
~dk;1
:::
~dk;m 1
1CCCCCCCA
=
0BBBBBBBBBB@
1; 0; :::; 0
0; 1; :::; 0
:::
0; 0; :::; 1
:::
1CCCCCCCCCCA
=
0B@ 1n

1CA ;
8~dk;j; k  1; 0  j < m; ~0  ~dk;j  ~1: (5.5)
5.3 Execution Model
Assume that the sequential execution time only depends on amount of computation, and
that the execution time of each iteration is always the same. So, if the computations of
iteration space I = span(E) is not parallelized, the total execution time is proportional to
the number of iterations within it:
Times(I) = jIj = jdet(E)j: (5.6)
The execution time of a parallelized iteration space depends on the schedule of iterations
within the iteration space. In order not to be tied to any specic scheduling scheme, the
concept of ideal execution time is used in analysis. Ideal execution time is the minimal
execution time of an iteration space I that can be achieved by any valid schedule of iterations.
If each iteration in I is viewed as an activity and D represents the dependences between
activities, the ideal execution time is determined by the critical path in I. It is assumed
that there is innite hardware parallelism available at each level. In practice, the above
assumption means that the amount of hardware parallelism should be always larger than the
90
maximum number of iterations that can be executed in parallel. Under these assumptions,
the critical path in I is the longest path of dependences between iterations.
Let L(E;D) denote the length of the longest path of dependent iterations in the iteration
space I with basis matrix E and dependence matrix D, the ideal execution time of I can be
calculated as follows:
Time(I) = L(E;D) (5.7)
Section 5.4 discusses how to calculate L(E;D). In general, L(E;D) depends on every
~dk in D. However, if the dependence vectors are not linearly independent, we can ignore
some rows in D when calculating L(E;D). For example, the iteration space shown in Figure
5.4 has three dependence vectors ~d0, ~d1 and ~d2, with ~d2 = ~d0 + ~d1. P and P
0 are two
dependence paths within the iteration space. If there is any dependence path containing a
pair of neighboring iterations with the dependence of ~d2, such as P
0, it is always possible
to replace ~d2 on the path with a combination of ~d0 and ~d1, as shown in Figure 5.4. This
will resulting a longer dependence path. As a result, it can be concluded that the longest
dependence path must only contain ~d0 and ~d1 (P ), and other dependence vectors can be
ignored 2 when calculating L(E;D).
In particular, if there are n dependence vectors which are orthonormal unit vectors, it
is safe to ignore other rows in D and only consider those n dependence vectors, because
any other single dependence vector can be replaced by the combination of one or more
orthonormal unit vectors, which will result in a longer dependence path. Based on above
observation,
L(E;
0B@ 1n

1CA) = L(E; 1n) (5.8)
2Assume that the iteration space is much larger compared to each dependence vector.
91









′
Figure 5.4: The dependent pathes (P and P 0) within an iteration space. The length of a
dependent path is the number of iterations on the path. If any path contains ~d2 (P
0), it is
always possible to replace ~d2 with ~d1 and ~d0 on the same position, which resulting a longer
path. This means the longest path (P ) must only contain ~d0 and ~d1.
According to Equation 5.5, under the context of hierarchical tiling the dependence matrix
each at each level Dk =
0B@ 1n

1CA, 8k  1. As a result, Equation (5.8) can be used to simplify
further analysis.
5.4 Calculate the Longest Dependent Path
When applying tiling transformations recursively to a given iteration space I, hierarchical
tiling contains a tiling transformation at each level. Following a bottom-up approach for
hierarchical tiling, rst tiling matrix T0 is used to tile the original iteration space I0 = I,
producing a new iteration space I1. More generally, on the k-th level of tiling, tiling matrix
Tk is applied to tile the iteration space Ik and generated the new iteration space Ik+1. Assume
Ek is the basis matrix of Ik,
Ik = span(Ek); Ek+1 = Ek  T 1k = E0 
kY
j=0
T 1j
The iterations in the bottom level tiles (the nest grain) are executed in sequential mode,
92
so the per-tile execution time can be calculated by Equation 5.6:
Time(T0) = Times(T0) = jdet(T0)j
For upper level tiles (Tk; 1  k < l), each tile at the level immediately below is considered as
a single iteration.Let Dk denote the dependences at the k-th level with D0 = D, the per-tile
execution time is calculated using Equation 5.7:
Time(Tk) = L(Tk; Dk) = Time(Tk 1)  L(Tk; Dk)
Then the total execution time after tiling is:
Time(I) = Time(El) = Time(Tl 1)  L(El; Dl)
= Time(Tl 2)  L(Tl 1; Dl 1)  L(El; Dl)
= jdet(T0)j  (
l 1Y
k=1
L(Tk; Dk))  L(El; Dl)
= jdet(T0)j  (
l 1Y
k=1
L(Tk; Dk))  L(E0 
l 1Y
k=0
T 1k ; Dl)
According to Equation 5.5 and 5.8, the above equation can be simplied to the following
form:
Time(I) = jdet(T0)j  (
l 1Y
k=1
L(Tk; 1n))  L(El; 1n) (5.9)
So, the problem of optimal tile shape selection for hierarchical tiling is equivalent to selecting
a sequence of tiling matrices T0, T1, ..., Tl 1 to minimize the above equation. Equation 5.9
does not include D; however, according to the discussion in Section 5.2 the shape of T0 must
conform the constraints (Equation 5.1 and 5.2) imposed by each dependence vector in D.
93
5.4.1 Calculate L(Tk; 1n)
The meaning of L(Tk; 1n) is the length of the longest dependent path with dependence matrix
1n in tile Tk. The iteration space in tile Tk can be normalized to an n-dimensional hyper
cube by applying the ane transformation 3 represented by T 1k . Thus,
L(Tk; 1n) = L(Tk  T 1k ; 1n  T 1k ) = L(1n; T 1k ):
Consider a row in 1n, ~dj. Since Dk =
0B@ 1n

1CA, ~dj is also a row in Dk. Because Tk
represents the tile shape at the k-th level of tiling, Tk must conform to the constraints
imposed by each dependence vector in Dk. According to Equation 5.1 and 5.2:
9~aj; ~0n  ~aj  ~1n; ~dj = ~aj  Tk
Then, let D0 denotes the dependences within the tile (normalized tile so that the tile is
dened by tessellating edge vectors):
D0 =
0BBBBBBB@
~d00
~d01
:::
~d0n 1
1CCCCCCCA
= 1n  T 1k =
0BBBBBBB@
~d0
~d1
:::
~dn 1
1CCCCCCCA
 T 1k =
0BBBBBBB@
~a0
~a1
:::
~an 1
1CCCCCCCA
8~dj; ~0n  ~d0j = ~aj  ~1n
For the 2-dimensional cases, the intuitive explanation of the above observation is that the
direction of each ~dj must be in the upward, right or upper right direction, as shown in Figure
3In order to keep the problem in the integer domain, the ane transformation can be revised to T 1k 
jdet(Tk)j, and the discussion in this section would be still valid.
94
5.5 (a) and (b). Since the normalized iteration space is a hyper-cube, the longest path must
start from the bottom left corner, which is the base point (0; 0; :::; 0).
The problem of computing L(Tk; 1n) is that nding the longest path P , which is a se-
quence of points ~p0, ~p1, ..., ~pL 1 (Figure 5.5-(c), each ~p1 stands for the coordinate of the
point) such that:
~p0 = ~0 = (0; 0; :::; 0);
8j = 0; 1; :::; L  1;
~0 = (0; 0; :::; 0)  ~pj  (1; 1; :::; 1) = ~1;
9~d0rj ; ~pj+1 = ~pj + ~d0rj (5.10)
L is the length of the dependent path found. Then L(Tk; 1n) = maxfLg. We have that the
last point ~pL 1 is dened as:
~pL 1 = ~pL 2 + ~d0rL 2 =
L 2X
j=0
~d0rj 0  kj < n
= c0  ~d00 + c1  ~d01 + :::+ cn 1  ~d0n 1
= ~c D0 ~c = (c0; c1; :::; cn 1)  ~0; ~c 2 Zn
= ~c  T 1k
Because ~0n  ~d0j  ~1n, the condition ~0  ~pj  ~1 is satised if ~0  ~pL 1  ~1. And
because
Pn 1
j=0 cj is the total number of steps of the path, so maxfLg can be solved through
the following linear programming problem LP (T 1k ;~0;~1):
~0  ~c  T 1k  ~1; ~c  ~0; maxf
n 1X
j=0
cjg: (5.11)
Therefore, we conclude that L(Tk; 1n) = LP (T
 1
k ;~0;~1).
95
,
,


∙ 	


′
′
(1,0)
(0,1)
(1,0)
(0,1)




(a) Original tile (b) Normalized tile (c) Dependent pathes
Figure 5.5: Determine the length of the longest dependent path within the iteration space
of tile Tk. Each ~p1 stands for the coordinate of the point.
5.4.2 Calculate L(El; 1n)
The meaning of L(El; 1n) is the length of the longest dependent path in iteration space of
the highest level tiling. Similarly, it is possible to apply the ane transformation represented
by E 1l to normalize the iteration space:
L(El; 1n) = L(El  E 1l ; 1n  E 1l ) = L(1n; E 1l ):
Let D0 denote the transformed dependence matrix:
D0 =
0BBBBBBB@
~d00
~d01
:::
~d0m 1
1CCCCCCCA
= 1n  E 1l = E 1l
However, unlike the case when calculating L(Tk; 1n), there is no guarantee that 8~d0j; ~0 
~dj  ~1. More intuitively, each dependence vector ~d0j can point to any direction. So L(El; 1n)
cannot be directly solved by the linear programming problem introduced in Equation (5.11).
Since dependence vectors ~d0j can point to any direction, the longest dependent path does
not necessarily start from base point (0; 0; :::; 0) of the hyper-cube iteration space. Thus the
96
dependent path P in the iteration space El is dened the same as in Equation 5.10 except
that ~p0 does not necessarily equal to ~0. To simplify the analysis, we dene a related problem
that is graphically depicted in Figure 5.6: nd the longest path P 0, which is a sequence of
points ~p00, ~p01, ..., ~p0L0 1 such that:
~p00 = (0; 0; :::; 0) = ~0;
~ 1 = ( 1; 1; :::; 1)  ~p0L0 1  (1; 1; :::; 1) = ~1;
8j = 0; 1; :::; L  2; 9~d0rj ; ~p0j+1 = ~p0j + ~d0rj (5.12)
L0 is the length of path P 0. P 0 starts from that base point, and only requires that the
coordinate of the end point ~p0L0 1 is within in the hype-cube surrounding the base point.
~p0L0 1 can be calculated as follows:
~p0L0 1 = ~p0L0 2 + ~d0rL0 2 =
L 2X
j=0
~d0rj 0  kj < n
= c0  ~d00 + c1  ~d01 + :::+ cn 1  ~d0n 1
= ~c D0 ~c = (c0; c1; :::; cn 1)  ~0; ~c 2 Zn
= ~c  E 1l
SomaxfL0g can be solved through the following linear programming problem LP (E 1l ; ~ 1;~1):
~ 1  ~c  E 1l  ~1; ~c  ~0; maxf
n 1X
j=0
cjg: (5.13)
Next step is to prove thatmaxfL0g is approximately equal tomaxfLg so that LP (E 1l ; ~ 1;~1)
is a good approximation of L(El; 1n). First, given any path P , it is always possible to con-
struct a path P 0 by letting ~p0j = ~pj   ~p0. So maxfLg  maxfL0g is easily proven.
On the other hand, given a path P 0, the start point is ~p00 = ~0, and the end point is
97
~p0L0 1 = (p
0
0; p
0
1; :::; p
0
n 1). In order to move path P
0 into the hyper-cubic space span(1n),
Construct ~s = (s0; s1; :::; sn 1) as follows:
8j = 0; 1; :::; n  1 sj =
(
1; pj < 0,
0; else.
By shifting the oset of ~s, ~p00 and ~p0L0 1 are moved into the hyper-cubic space span(1n):
~p0s = ~p00 + ~s; ~0  ~p0s  ~1; ~p0s 2 span(1n)
~p0e = ~p0L0 1 + ~s; ~0  ~p0e  ~1; ~p0e 2 span(1n)
~p0e = ~p0s + ~c D0 (5.15)
Assume that the iteration space is much larger compared to the length of each dependence
vector: 8~d0j; j~d0jj << 1. Then it is always possible to nd two point ~ps and ~pe in a small
surrounding space around ~p0s and ~p0e respectively, such that:
j~ps   ~p0sj < "; ~ps 2 span(1n);
j~pe   ~p0ej < "; ~pe 2 span(1n);
~pe   ~ps = b  ~c0 D0; b 2 Z; ~c0 2 Zn: (5.16)
Then construct new path P within the hyper-cubic space span(1n) as follows:
C =
nX
j=0
c0j; L = b  C
~p0 = ~ps;
~pjC+k = j  ~c0 D0 + ~d0jk ; 0  j < b; c0jk 6= 0
~pL 1 = ~pe:
98
′
′
(0,1)(0,-1)
(-1,0)
(1,0)
′ 

′
′	

′
′
 = 
	 = 

Figure 5.6: Construct path P according to a given path P 0. The lengthes of P and P 0 are
approximately equal.
The length of path P is L = b  C. Considering Equation 5.15 and 5.16, there is:
L > L0   d "
minfj~d0jjg
e = L0   "0;
maxfLg > maxfL0g   "0:
The idea of above analysis is, we try to break path P 0 into small repeated pieces, and
arrange the same number of repeated pieces in a line to construct P . Since the start and
end points of P and P 0 are very close, the length of these two pathes should also be close.
Figure 5.6 shows the intuition of above analysis.
Since maxfLg  maxfL0g is already proved, it can be concluded that maxfLg 
maxfL0g, with the error no larger than "0. As a result, the solution of the linear programming
problem LP (E 1l ; ~ 1;~1) can be used to estimate L(El; 1n):
L(El; 1n) = L(E0 
l 1Y
k=0
T 1k ; 1n)  LP (E 1l ; ~ 1;~1)
99
5.4.3 Summary
According to the discussion in last two sections, the problem of computing execution time
of hierarchical tiled loop nests can be transformed into the following form:
Time(I) = jdet(T0)j  (
l 1Y
k=1
L(Tk; 1n))  L(E 1l ; 1n)
 jdet(T0)j  (
l 1Y
k=1
LP (T 1k ;~0;~1))  LP (E 1l ; ~ 1;~1)
= f(E; T0; T1; :::; Tl 1)
Given the basis matrix E of the original iteration space I, the optimal tile shape for hierar-
chical tiling is the sequence of tiling matrices T0; T1; :::; Tl 1 that minimize f(E; T0; T1; :::; Tl 1).
If I is n-dimensional, each tiling matrix Tk contains n
2 elements. In total function f has
n2  l variables. Besides, f is a non-linear function. Finally, because f contains terms of the
solutions to linear programming problems, the problem of optimal tile shape selection is a
multidimensional, nonlinear, bi-level programming problem.
5.5 Automatic Tile Shape Selection
According to the discussion in last section, optimal tile shape selection for hierarchical tiling
is a multidimensional, nonlinear, bi-level programming problem, which is not easy to solve
analytically. As a result, simulated annealing is adopted to select tile shape automatically
according the analytical model introduced in last section. The sketch of simulated annealing
algorithm to select tile shape is shown in Algorithm 2.
Line 8 in Algorithm 2 valid(NewS) checks whether the randomly generated NewS is a
valid tiling solution, otherwise NewS will be generated again. This includes checking each
Tk in NewS satisfy the constraints imposed by dependences (Equation 5.1 and 5.2). This
100
Algorithm 2 Sketch of simulated annealing algorithm to select tile shape for hierarchical
tiling
1: Solution = Solution0 =< T0; T1; :::; Tn 1 >;
2: Time = f(E; Solution0) = f(E; T0; T1; :::; Tn 1);
3: Solutionbest = Solution;
4: Timebest = Time;
5: Step = 0;
6: while Step < Stepmax do
7: NewS = < T0; T1 ; :::; Tn 1 > =Solution+ random();
8: if valid(NewS) then
9: NewT = f(E;NewS);
10: Temp = temperature(Step; Stepmax);
11: if accept(NewT; T ime; Temp; random()) then
12: Solution = NewS;
13: Time = NewT ;
14: end if
15: if Time < Timebest then
16: Solutionbest = Solution;
17: Timebest = Time;
18: end if
19: Step = Step+ 1;
20: end if
21: end while
101
also includes checking that the maximum number of parallelizable tiles does not exceed the
hardware parallelism at each level. In addition, it is necessary to keep each tile large enough
to aromatize the overhead of tiling. So we add an additional requirement in valid(NewS)
such that the iterations within each tile Tk much be more than a certain threshold Thk at
each level of tiling:
8Tk; jdet(Tk)j  Thk
Note that upper bound is not set for the size of each tile Tk. This is because the simulated
annealing algorithm is self-adaptive and is able to converge to tiles with suitable size. The
solution with very large tiles will lead to high execution time and hence the probability to be
accepted in Line 11 of Algorithm 2 is low. For initial solution Solution0=< T0, T1,..., Tn 1 >,
just choose the smallest possible tile that is valid for each level of tiling: jdet(Tk)j = Thk.
5.6 Unied Tiling Representation
The tiling schemes studied in this chapter are tessellating, atomic tiling, and the shape of
iteration space and tile shapes are restricted to be n-dimensional hyper-parallelepipeds. As
a result, there exists an basis matrix E such that I = span(E), and tiling matrix T such
that T = span(T ). For tessellating tiling where the tile shapes are n-dimensional hyper-
parallelepipeds, the repetition matrix R = T . This chapter studies performance of the best
possible schedule of tiles in stead of any particular scheduling scheme. Table 5.1 shows the
unied representation of a single level of tiling. Because hierarchical tiling is done in the
bottom-up approach in this chapter, the input for the kth level of tiling is: Ik = Ik 1  T 1k ,
Dk = D
0
k 1.
102
Input I = span(E); D
D0 D0 =
0BBBBBB@
1; 0; 0; :::; 0
0; 1; 0; :::; 0
0; 0; 1; :::; 0
:::
0; 0; 0; :::; 1

1CCCCCCA
T the selected tile shape T
R R = T:
S Best possible S that minimizes the ideal execution time
Table 5.1: The Unied Tiling Representation
5.7 Evaluation
5.7.1 Environment Setup
Two platforms with hierarchical parallelism are step up for evaluation. One is the GPU
platform consisting of an NVIDIA GeForce GTX 480 graphic card with 15 MPs and 480
SPs in total. The other platform is a 32-node cluster; each node has 4 Quad-Core AMD
Opteron processors, or 16 cores. Because both platforms only have a 2-levels hierarchy, only
2-level hierarchical tiling is evaluated. For the GPU, the hardware parallelism at level 1 is
15, which is the number of MPs. However, although the number of SPs within each MP
is 32, we consider the hardware parallelism at level 0 as 512 for the experiments. This is
because NVIDIA GPUs require a certain number of concurrent hardware threads for each
MP to achieve high throughput. For the cluster, the hardware parallelisms at two levels are
16 and 32, which are the number of cores per node and the total number of nodes.
Two stencil computation programs are used as benchmarks: Gauss-Seidel and Jacobi. 1-
D stencils use 1-dimensional input array, and result in 2-dimensional iteration spaces, while
2-D stencils result in 3-dimensional iteration spaces. The iteration space of 1-D Gauss-Seidel
contains two dependence vectors: (1; 0) and (0; 1), and the iteration space of 1-D Jacobi
contains three dependence vectors: (1; 1), (1; 0) and (1; 1). There are three dependence
103
(1,0)
(0,1)
(1,0)
(1,1)(1,-1) (1,0,0) (0,1,0)
(0,0,1)
(a) 1-D Gauss-Seidel (b) 1-D Jacobi (c) 2-D Gauss-Seidel
Figure 5.7: The dependence vectors in the iteration space of stencil computation programs.
vectors for 2-D Gauss-Seidel: (1; 0; 0), (0; 1; 0) and (0; 0; 1). Figure 5.7 shows the dependence
vectors in the iteration space of stencil computation programs.
An automatic system which uses simulated annealing is implemented to select tile shape
for hierarchical tiling. The lp solve[8] is used to solve the linear programming problems in the
model. The Omega Library[22] is used to do the tiling transformation and code generation.
On the GPU platform, we generate OpenCL[23] code. On the cluster platform we generate
hybrid MPI-OpenMP code: the higher level tiles are parallelized using MPI across nodes,
and the lower level tiles are parallelized with OpenMP across processor cores within each
node.
5.7.2 Performance
The performance of three common tiling schemes is compared with that of the code generated
by our system: Wavefront, Diamond and Skewing, as shown in Table 5.2. The optimal sizes
for the tiles of these three tiling schemes are chosen under the condition that the amount
parallelism exposed does not exceed the hardware parallelism at the corresponding level.
Figure 5.8 shows the speedup of the dierence hierarchical tiling schemes for 2-dimensional
the iteration spaces for both GPU and cluster platforms, and Figure 5.9 shows the perfor-
mance data for 3-dimensional iteration spaces. The horizontal axis is the size of iteration
space. The vertical axis is the speedup over Wavefront or Diamond. On average, the tiling
104
Wavefront Diamond Skewing
Tile Shape
Tiling Matrix

x; 0
0; x
 
x; x
x; x
 
x; 0
 x; x

Table 5.2: Common tiling schemes: Wavefront, Diamond and Skewing.
scheme with automatically selected tile shape can achieve over 90% speedup over Wave-
front for Gauss-Seidel stencil, and 20%-30% speedup over Diamond for Jacobi stencil. The
speedup achieved is consistent over dierent iteration space sizes. This means our automatic
system can always nd better tile shapes compared to common tiling schemes.
For 1-D Jacobi stencil shown in Figure 5.8-(c) and (d), it can be seen that on the GPU
platform Skewing shows a 10% speedup over Diamond, but no speedup is observed on the
cluster platform. It is shown that the ideal execution times of Diamond and Skewing are
the same for the Jacobi stencil. However, the amount of parallelism within each higher level
tile varies less with Skewing than Diamond. This reason is that, as mentioned before, the
execution model of NVIDIA GPU requires a certain number of concurrent hardware threads
to achieve high throughput, Skewing shows better performance than Diamond on GPU.
5.7.3 Tile Shape
Table 5.3 shows the tile shape selection for 1-D Gauss-Seidel and Jacobi on GPU platform of
each tiling scheme corresponding to Figure 5.8. Table 5.4 shows the tile shape selection for
2-D Gauss-Seidel on GPU. T0 and T1 are the tiling matrix at each level, and T0 T1 represents
the tile shape in the original iteration space. In Table 5.3, under each matrix the shape of
the tile is shown graphically. It can be seen that the common tiling schemes of Wavefront,
Diamond and Skewing are combinations of regular tile shapes, including squares, diamonds
105
Wavefront Skewing Auto-Selected
T0

32; 0
0; 32
 
32; 0
0; 32
 
2; 30
0; 12

T1

256; 0
0; 256
 
256; 0
 256; 256
 
256; 250
0; 56

T1  T0

8192; 0
0; 8192
 
8192; 0
 8192; 8192
 
512; 10680
0; 672

(a) 1-D Gauss-Seidel on GPU (128K128K)
Diamond Skewing Auto-Selected
T0

18; 18
18; 18
 
18; 18
18; 18
   2; 18
8; 8

T1

256; 0
0; 256
 
256; 0
 256; 256
 
64; 0
 512; 512

T1  T0

4608; 4608
4608; 4608
 
4608; 4608
0; 9216
   128; 1152
5120; 13312

(b) 1-D Jacobi on GPU (128K128K)
Table 5.3: Tile shape selection for 1-D Gauss-Seidel and Jacobi.
106
00.5
1
1.5
2
2.5
Wavefront Skewing Auto-Selected
S
p
eed
u
p
S
p
eed
u
p
0
0.5
1
1.5
2
2.5
Wavefront Skewing Auto-Selected
S
p
eed
u
p
(a) 1-D Gauss-Seidel on GPU (b) 1-D Gauss-Seidel on cluster
0
0.2
0.4
0.6
0.8
1
1.2
1.4
Diamond Skewing Auto-Selected
S
p
eed
u
p
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
Diamond Skewing Auto-Selected
S
p
eed
u
p
(c) 1-D Jacobi on GPU (d) 1-D Jacobi on cluster
Figure 5.8: Hierarchical tiling performance for 2-dimensional iteration space on GPU and
cluster platforms. The horizontal axis is the size of the iteration space. The vertical axis is
the speedup over Wavefront or Diamond.
and regular parallelograms. However, the automatically selected tile shapes are irregular
parallelograms. Since it is nonintuitive to gure out the performance impact of a particular
tiling scheme with irregular tile shapes (especially for higher dimensional iteration spaces),
it is necessary to build an analytic model to evaluate tile shape selection quantitatively, and
select tile shape automatically.
107
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
Wavefront Auto-Selected
S
p
eed
u
p
0
0.5
1
1.5
2
2.5
Wavefront Auto-Selected
S
p
eed
u
p
(a) 2-D Gauss-Seidel on GPU (b) 2-D Gauss-Seidel on cluster
Figure 5.9: Hierarchical tiling performance for 3-dimensional iteration space on GPU and
cluster platforms. The horizontal axis is the size of the iteration space. The vertical axis is
the speedup over Wavefront.
Wavefront Auto-Selected
T0
0@ 16; 0; 00; 16; 0
0; 0; 16
1A 0@ 16; 48; 00; 16; 0
0; 48; 16
1A
T1
0@ 16; 0; 00; 16; 0
0; 0; 16
1A 0@ 8; 8; 00; 8; 0
0; 16; 16
1A
T1  T0
0@ 256; 0; 00; 256; 0
0; 0; 256
1A 0@ 128; 512; 00; 128; 0
0; 1024; 256
1A
Table 5.4: Tile shape selection for 2-D Gauss-Seidel on GPU (192019201920).
5.7.4 Model Accuracy
Figure 5.10 compares the real execution time and the ideal execution time. The data is for
128K  128K 1-D Gauss-Seidel running on GPU, with tile shape choices listed below:
T0 =
0B@ 32; 0
0; 32
1CA ; T1 =
0B@ 256; 0
x; 256
1CA :
The value of x in T0 is scaled from 0 to  320. The shape of resulting T0 changes from a
square to a parallelogram. It is shown that the trends of the real execution time and the
108
0.0E+00
5.0E+06
1.0E+07
1.5E+07
2.0E+07
2.5E+07
0.0E+00
5.0E+03
1.0E+04
1.5E+04
2.0E+04
2.5E+04
3.0E+04
3.5E+04
0 -32 -64 -96 -128 -160 -192 -224 -256 -288 -320
Real Execution Time (Milliseconds)
Ideal Execution Time (Time Units)
M
illis
e
c
o
n
d
s
T
im
e
 U
n
its
Figure 5.10: Comparison between the real execution time and the ideal execution time for
1-D Gauss-Seidel. The left axis is for real execution time and the axis on the right is for
ideal execution time.
ideal execution time are exactly the same when tile shape varies. The result shows that
ideal execution time commutated by the model in this paper can be used to direct tile shape
selection.
5.8 Conclusion
In this chapter an analytical model is built to analyze the relation between the tile shape at
each level of hierarchical tiling and the ideal execution time of the tiled loop nest. It is shown
that the problem of optimal tile shape selection for hierarchical tiling is a multidimensional,
nonlinear, bi-level programming problem. An automatic system which uses simulated an-
nealing together with our analytical model is implemented to automatically select tile shape
for hierarchical tiling. The tile shape selected by the automatic system can be quite dierent
from the intuitive regular shapes. The experimental results show that irregular tile shapes
may have the potential to achieve higher performance over regular tiling schemes.
Currently the model focuses on the interaction between tile shape and parallelism ex-
posure, and uses ideal execution time as the optimizing goal. The model could be more
accurate by considering other factors aecting locality and communication. Besides, given
109
the complexity of the multi-dimensional nonlinear optimization problem in our model, at
present simulated annealing is sued to nd the near-optimal solution instead of the guaran-
teed optimal one. In future, it is possible to use the state-of-the-art optimization problem
solving algorithms or libraries to solve the tile shape selection problem directly.
110
Chapter 6
Related Work
Loop tiling is a traditional but eective optimization for programs whose execution time is
dominated by loops, such as programs with stencil computations. Numerous optimizations
based on the tiling of iteration spaces have been proposed for improving data locality [39, 42,
1, 2, 36, 14, 33, 38, 41, 45, 48], or exploiting parallelism [40, 43, 6, 18, 44, 7, 47]. Most of these
works use polyhedral model as the representation of the tiling transformation. Bondhugula et
al. design and implement Pluto [11], which can automatically transform loops for parallelism
and locality based on the polyhedral model. Their transformation integrates traditional
tiling techniques. However, existing research on tiling mainly focuses on tessellating tiling
and scheduling tiles atomically. There have been ideas of taking advantage of the hierarchy
of the hardware [29, 27, 32, 24, 10, 16, 9]. Hierarchical tiling as an eective optimization
to exploit hardware hierarchy, has been proposed for better usage of memory hierarchy and
enhancing parallel schedule [12, 17, 13].
For existing work on non-tessellating tiling, the closest work to the Overlapped Tiling
discussed in this thesis is that of Krishnamoorthy et al.[26]. Their approach allows over-
lap between neighboring tiles to achieve balanced schedule of parallel tiles. Other works
such as Ripeanu et al. [35] and Meng et al. [30] describe performance models to predict the
111
optimal amount of redundant computation for stencil computations in a grid environment
with message passing or for GPUs, respectively. However, in those paper only single level of
overlapped tiling is considered, so that as more communication/synchronization overhead is
eliminated, the overhead introduced by redundant computations also increases in a higher
order, which would kill the performance benet of overlapped tiling. The most important
dierence between this thesis and existing work is the Hierarchical Overlapped Tiling trans-
formation. Based on the observation that the overhead of redundant computation is the
main drawback of the overlapped tiling approach, Hierarchical Overlapped Tiling applies
overlapped tiling hierarchically. In this way, Hierarchical Overlapped Tiling is able to take
advantage of the hierarchy of the hardware and provide more room of balance between the
additional overhead introduced by redundant computation and the communication/synchro-
nization overhead eliminated.
The purpose of tiling with non-atomic tiles is usually to achieve balanced schedule when
parallelizing tiles. When parallelizing ISLs, a typical strategy is to do loop skewing and
wavefronting. However, wavefronting strategy would lead to an imbalanced schedule [26]. In
order to achieve balanced schedule when tiling ISLs, Krishnamoorthy et al. [26] propose split
tiling. The idea of split tiling is similar to Conjugate-Trapezoid Tiling: each tile is divided
into sub-regions, and the sub-region without dependence is processed rst, followed by the
sub-regions with dependences. Since inter-tile communication needs to be done between
the computation of sub-regions within a tile, split tiling is a tiling scheme with non-atomic
tiles. On the other hand, split tiling can be also viewed as a tiling scheme with more
than one tile shapes as discussed in Section 2.1.3, and the set of neighboring tiles with
dierent shapes forms a hyper-parallelepiped-shaped super tile. Tang et al. [37] implement
the Pochoir compiler, which automates a trapezoidal decomposition for stencil code. Their
decomposition algorithm is very similar to split tiling. So it can be viewed as another example
of tiling with non-atomic tiles. However, their work is based on shared memory machines,
112
and the inter-tile communication overhead is not studied.
For split tiling and the Pochoir compiler, inter-tile communication is still aggregated at
the beginning or end of the execution of tiles or subtiles/sub-regions; inter-tile communica-
tion is not interleaved with computations. So although the tiles/super-tiles are non-atomic,
subtiles and sub-regions are still execution atomically. Because communication latency can-
not be overlapped with computation time, the overall performance could still suer from
the overhead caused by communication latency. This might be tolerable for shared memory
machines, but could cause performance problems for distributed machines, which is expected
to have higher inter-node communication latency. Demmel et al. [15] proposed techniques
to reduce communications for sparse matrix computations under distributed memory en-
vironment. Their technique allows communication during the execution within a tile, and
schedule computation and communications to reduce the penalty of latency. However, their
results show that speedups are only achieved for machines with very high communication la-
tency; no speedup is achieved by their parallel algorithm on machines with fast network. On
contrast, the Conjugate-Trapezoid Tiling introduced in this thesis can change the number
delayed steps h0 to t target platforms with dierent communication latency.
It is well known that tile shape, as well as tile size, can have a signicant impact on
locality, inter-tile communication and parallelism [31, 46, 28]. However, most of the existing
research on tiling mainly focus on choosing tile size assuming those of regular shapes (such as
rectangles, or regular parallelograms) that can be produced with the help of simple auxiliary
transformations such as loop skewing. This is because the performance impact of irregular
tiles shapes is usually nonintuitive and it is not easy to develop a strategy to nd a good tile
shape among irregular shapes. Besides, it is also dicult to write code with irregular tiles
manually. Xue [46] presents an approach to nd the parallelogram or hyper-parallelepiped
tile shape that is able to minimize the amount of communication between tiles. This work
does not consider the impact of tile shape on parallel scheduling. Hogstedt et al. [18] intro-
113
duce a model to select the tile shape that minimizes the execution time. Since their work
only consider a single level of tiling, their model shows that the complexity to determine of
execution time when the tiles are parallelograms is equivalent to the complexity of linear
programming problem. None of these works discussed above studies the tile shape selection
problem in the context of hierarchical tiling. Renganarayana and Rajopadhye [34] have the
closest work of using the model of linear programming problem to optimize tiling schemes.
Their work presents a model to estimate the overall execution time of loop nests. Their
framework determines the optimal tile sizes for hierarchical tiling by solving a convex opti-
mization problem. However, their work only considers hyper-rectangle tiles, as well as the
shape of the iteration space. Although it is more dicult to analyze the performance impact
of hierarchical tiling with the general tile shapes, experimental results given in this thesis
show that irregular tile shapes have the potential to achieve higher performance over regular
shapes.
114
Chapter 7
Conclusions
This thesis studies the application of tiling techniques for stencil compucations. previous
studies of tiling optimizations mainly focused on a single level tiling, tessellating tiling, reg-
ular shape tile, and executing tiles atomically. This thesis discusses several novel tiling tech-
niques, including hierarchical tiling, non-tessellating tiling, executing tiles non-atomically,
irregular tile shapes, and combinations of these techniques.
Contributions of this thesis include:
 The introduction of a unied tiling representation framework to represent general tiling
schemes, including non-tessellating tiling and executing tiles non-atomically. An au-
tomatic code generator is developed to facilitate this tiling representation framework,
which was used for the evaluation of the proposed tiling schemes.
 Overlapped Tiling and Hierarchical Overlapped Tiling are introduced in this thesis.
They belong to the category of non-tessellating tiling. Both of these tiling schemes
aim at eliminating communication/synchronization overhead by introducing redundant
computation. The second scheme also takes advantage of hardware hierarchy to reduce
the amount of redundant computation.
115
 The introduction of the design and evaluation of Conjugate-Trapezoid Tiling. Conjugate-
Trapezoid Tiling pipelines intra-tile computation and inter-tile communication, so that
the communication latency can be hidden by overlapping with computation time.
 A novel approach to the tile shape selection problem for hierarchical tiling. It is
concluded that optimal tile shape selection for hierarchical tiling is a multidimensional,
nonlinear, bi-level programming problem. Experimental results show that the irregular
tile shapes have the potential to outperform intuitive tiling shapes.
116
Bibliography
[1] W. Abu-Sufah, D. J. Kuck, and D. H. Lawrie. On the performance enhancement of
paging systems through program analysis and transformations. IEEE Transactions on
Computers, 30(5):341{356, May 1981.
[2] N. Ahmed, N. Mateev, and K. Pingali. Tiling imperfectly-nested loop nests. In Proceed-
ings of the 2000 ACM/IEEE conference on Supercomputing (CDROM), Supercomputing
'00, Washington, DC, USA, 2000. IEEE Computer Society.
[3] M. Alpert. Not just Fun and Games. Scientic American, 4, 1999.
[4] W. F. Ames. Numerical Methods for Partial Dierential Equations. Academic, San
Diego, CA, sencond edition, 1977.
[5] C. Ancourt and F. Irigoin. Scanning polyhedra with DO loops. In Proceedings of the
third ACM SIGPLAN symposium on Principles and practice of parallel programming,
PPOPP '91, pages 39{50, New York, NY, USA, 1991. ACM.
[6] R. Andonov, S. Balev, S. Rajopadhye, and N. Yanev. Optimal semi-oblique tiling.
In Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and
architectures, SPAA '01, pages 153{162, New York, NY, USA, 2001. ACM.
[7] R. Andonov and S. Rajopadhye. Optimal orthogonal tiling of 2-D iterations. Journal
of Parallel and Distributed Computing, 45(2):159 { 165, 1997.
117
[8] M. Berkelaar, K. Eikland, and P. Notebaert. Lpsolve.
http://lpsolve.sourceforge.net/5.5/.
[9] G. Bikshandi. Parallel Programming with Hierarchically Tiled Arrays. PhD thesis,
UIUC, 2007.
[10] G. Bikshandi, J. Guo, D. Hoeinger, G. Almasi, B. B. Fraguela, M. J. Garzaran,
D. Padua, and C. von Praun. Programming for parallelism and locality with hier-
archically tiled arrays. In Proceedings of the eleventh ACM SIGPLAN symposium on
Principles and practice of parallel programming, PPoPP '06, pages 48{57, New York,
NY, USA, 2006. ACM.
[11] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical auto-
matic polyhedral parallelizer and locality optimizer. In Proceedings of the 2008 ACM
SIGPLAN conference on Programming language design and implementation, PLDI '08,
pages 101{113, New York, NY, USA, 2008. ACM.
[12] L. Carter, J. Ferrante, and S. F. Hummel. Hierarchical tiling for improved superscalar
performance. In Proceedings of the 9th International Symposium on Parallel Processing,
IPPS '95, pages 239{245, Washington, DC, USA, 1995. IEEE Computer Society.
[13] L. Carter, J. Ferrante, S. F. Hummel, B. Alpern, and K.-S. Gatlin. Hierarchical tiling:
A methodology for high performance. Technical report, 1996.
[14] S. Coleman and K. S. McKinley. Tile size selection using cache organization and data
layout. In Proceedings of the ACM SIGPLAN 1995 conference on Programming language
design and implementation, PLDI '95, pages 279{290, New York, NY, USA, 1995. ACM.
[15] J. Demmel, M. Hoemmen, M. Mohiyuddin, and K. Yelick. Avoiding communication in
sparse matrix computations. In Parallel and Distributed Processing, 2008. IPDPS 2008.
IEEE International Symposium on, pages 1{12, April.
118
[16] B. B. Fraguela, J. Guo, G. Bikshandi, M. J. Garzaran, G. Almasi, J. Moreira, and
D. Padua. The hierarchically tiled arrays programming approach. In Proceedings of the
7th workshop on Workshop on languages, compilers, and run-time support for scalable
systems, LCR '04, pages 1{12, New York, NY, USA, 2004. ACM.
[17] A. Hartono, M. M. Baskaran, C. Bastoul, A. Cohen, S. Krishnamoorthy, B. Norris,
J. Ramanujam, and P. Sadayappan. Parametric multi-level tiling of imperfectly nested
loops. In Proceedings of the 23rd international conference on Supercomputing, ICS '09,
pages 147{157, New York, NY, USA, 2009. ACM.
[18] K. Hogstedt, L. Carter, and J. Ferrante. Selecting tile shape for minimal execution
time. In Proceedings of the eleventh annual ACM symposium on Parallel algorithms
and architectures, SPAA '99, pages 201{211, New York, NY, USA, 1999. ACM.
[19] W. Huang, M. R. Stan, K. Skadron, K. Sankaranarayanan, S. Ghosh, and S. Velusam.
Compact thermal modeling for temperature-aware design. In Proceedings of the 41st
annual Design Automation Conference, DAC '04, pages 878{883, New York, NY, USA,
2004. ACM.
[20] F. Irigoin and R. Triolet. Supernode partitioning. In Proceedings of the 15th ACM
SIGPLAN-SIGACT symposium on Principles of programming languages, POPL '88,
pages 319{329, New York, NY, USA, 1988. ACM.
[21] T. A. Johnson, S.-I. Lee, L. Fei, A. Basumallik, G. Upadhyaya, R. Eigenmann, and S. P.
Midki. Experiences in using cetus for source-to-source transformations. In Proceedings
of the 17th international conference on Languages and Compilers for High Performance
Computing, LCPC'04, pages 1{14, Berlin, Heidelberg, 2005. Springer-Verlag.
[22] W. Kelly, V. Maslov, W. Pugh, E. Rosser, T. Shpeisman, and D. Wonnacott. The
Omega Library interface guide. Technical report, 1995.
119
[23] Khronos OpenCL Working Group. The OpenCL Specication, version 1.0.29, 8 Decem-
ber 2008.
[24] I. Kodukula, K. Pingali, R. Cox, and D. Maydan. An experimental evaluation of tiling
and shackling for memory hierarchy management. In Proceedings of the 13th interna-
tional conference on Supercomputing, ICS '99, pages 482{491, New York, NY, USA,
1999. ACM.
[25] G. Kreisel and J.-L. Krivine. Elements of mathematical logic. North-Holland Pub. Co.,
1967.
[26] S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev, and
P. Sadayappan. Eective automatic parallelization of stencil computations. In Pro-
ceedings of the 2007 ACM SIGPLAN conference on Programming language design and
implementation, PLDI '07, pages 235{244, New York, NY, USA, 2007. ACM.
[27] A. Leung, N. Vasilache, B. Meister, M. Baskaran, D. Wohlford, C. Bastoul, and
R. Lethin. A mapping path for multi-gpgpu accelerated computers from a portable
high level programming abstraction. In Proceedings of the 3rd Workshop on General-
Purpose Computation on Graphics Processing Units, GPGPU '10, pages 51{61, New
York, NY, USA, 2010. ACM.
[28] A. W. Lim, G. I. Cheong, and M. S. Lam. An ane partitioning algorithm to maxi-
mize parallelism and minimize communication. In Proceedings of the 13th international
conference on Supercomputing, ICS '99, pages 228{237, New York, NY, USA, 1999.
ACM.
120
[29] J. Liu, Y. Zhang, W. Ding, and M. Kandemir. On-chip cache hierarchy-aware tile
scheduling for multicore machines. In Proceedings of the 9th Annual IEEE/ACM Inter-
national Symposium on Code Generation and Optimization, CGO '11, pages 161{170,
Washington, DC, USA, 2011. IEEE Computer Society.
[30] J. Meng and K. Skadron. Performance modeling and automatic ghost zone optimization
for iterative stencil loops on GPUs. In Proceedings of the 23rd international conference
on Supercomputing, ICS '09, pages 256{265, New York, NY, USA, 2009. ACM.
[31] H. Ohta, Y. Saito, M. Kainaga, and H. Ono. Optimal tile size adjustment in compiling
general DOACROSS loop nests. In Proceedings of the 9th international conference on
Supercomputing, ICS '95, pages 270{279, New York, NY, USA, 1995. ACM.
[32] N. Park, B. Hong, and V. Prasanna. Tiling, block data layout, and memory hierarchy
performance. IEEE Transactions on Parallel and Distributed Systems, 14(7):640{654,
July.
[33] J. Ramanujam and P. Sadayappan. Tiling multidimensional iteration spaces for non-
shared memory machines. In Proceedings of the 1991 ACM/IEEE conference on Super-
computing, Supercomputing '91, pages 111{120, New York, NY, USA, 1991. ACM.
[34] L. Renganarayana and S. Rajopadhye. A geometric programming framework for optimal
multi-level tiling. In Proceedings of the 2004 ACM/IEEE conference on Supercomputing,
SC '04, pages 18{, Washington, DC, USA, 2004. IEEE Computer Society.
[35] M. Ripeanu, A. Iamnitchi, and I. T. Foster. Cactus application: Performance predictions
in grid environments. In Proceedings of the 7th International Euro-Par Conference
Manchester on Parallel Processing, Euro-Par '01, pages 807{816, London, UK, UK,
2001. Springer-Verlag.
121
[36] Y. Song and Z. Li. New tiling techniques to improve cache temporal locality. In Pro-
ceedings of the ACM SIGPLAN 1999 conference on Programming language design and
implementation, PLDI '99, pages 215{228, New York, NY, USA, 1999. ACM.
[37] Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The
pochoir stencil compiler. In Proceedings of the 23rd ACM symposium on Parallelism
in algorithms and architectures, SPAA '11, pages 117{128, New York, NY, USA, 2011.
ACM.
[38] M. Wolf, D. Maydan, and D.-K. Chen. Combining loop transformations considering
caches and scheduling. International Journal of Parallel Programming, 26:479{503,
1998.
[39] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the
ACM SIGPLAN 1991 conference on Programming language design and implementation,
PLDI '91, pages 30{44, New York, NY, USA, 1991. ACM.
[40] M. E. Wolf and M. S. Lam. A loop transformation theory and an algorithm to maximize
parallelism. IEEE Transactions on Parallel and Distributed Systems, 2:452{471, October
1991.
[41] W. Wolf and M. Kandemir. Memory system optimization of embedded software. Pro-
ceedings of the IEEE, 91(1):165{182, Jan.
[42] M. Wolfe. Iteration space tiling for memory hierarchies. In Proceedings of the Third
SIAM Conference on Parallel Processing for Scientic Computing, pages 357{361,
Philadelphia, PA, USA, 1989. Society for Industrial and Applied Mathematics.
[43] M. Wolfe. More iteration space tiling. In Proceedings of the 1989 ACM/IEEE conference
on Supercomputing, Supercomputing '89, pages 655{664, New York, NY, USA, 1989.
ACM.
122
[44] D. Wonnacott. Time skewing for parallel computers. In Proceedings of the 12th In-
ternational Workshop on Languages and Compilers for Parallel Computing, LCPC '99,
pages 477{480, London, UK, UK, 2000. Springer-Verlag.
[45] D. Wonnacott. Achieving scalable locality with time skewing. International Journal of
Parallel Programming, 30:181{221, 2002.
[46] J. Xue. Communication-minimal tiling of uniform dependence loops. In Languages
and Compilers for Parallel Computing, volume 1239, pages 330{349. Springer Berlin /
Heidelberg, 1997.
[47] J. Xue. Loop tiling for parallelism. Kluwer Academic Publishers, Norwell, MA, USA,
2000.
[48] J. Xue and C.-H. Huang. Reuse-driven tiling for improving data locality. International
Journal of Parallel Programming, 26:671{696, 1998.
[49] X. Zhou, J.-P. Giacalone, M. J. Garzaran, R. H. Kuhn, Y. Ni, and D. Padua. Hierar-
chical overlapped tiling. In Proceedings of the Tenth International Symposium on Code
Generation and Optimization, CGO '12, pages 207{218, New York, NY, USA, 2012.
ACM.
123
