Compiler Techniques for Optimizing Communication and Data Distribution for Distributed-Memory Computers by Palermo, Daniel Joseph
May 1996 UILU-ENG-96-2215
CRHC-96-Q9
University of Illinois at Urbana-Champaign
Compiler Techniques for Optimizing 
Communication and Data Distribution 
for Distributed-Memory Computers
Daniel Joseph Palermo
Coordinated Science Laboratory
1308 West Main Street, Urbana, IL 61801
: . i L U
SECURirY (Tl a s STf ic a t io n  of
REPORT DOCUMENTATION PAGE
* la. REPORT SECURITY CLASSIFICATION 
U n c l a s s i f i e d
lb. RESTRICTIVE MARKINGS 
N on e
2a. SECURITY CLASSIFICATION AUTHORITY
1 ............. .......  . .....
3. DISTRIBUTION/AVAILABILITY OF REPORT 
A p p r o v e d  f o r  p u b l i c  r e l e a s e ;  
d i s t r i b u t i o n  u n l i m i t e d2b. DECLASSIFICATION/DOWNGRADING SCHEDULE |
l
| 4. PERFORMING ORGANIZATION REPORT NUMBER(S)
UILU-ENG-96-2215 (CRHC-96-09)
5. MONITORING ORGANIZATION REPORT NUMBER(S)
6a. NAME OF PERFORMING ORGANIZATION 
! C o o r d in a t e d  S c ie n c e  L a b  
U n i v e r s i t y  o f  I l l i n o i s
6b. OFFICE SYMBOL 
(If applicable)
N/A
7a. NAME OF MONITORING ORGANIZATION
ARPA
6c ADDRESS (City, State, and ZIP Code)
1 ‘ 1308 W. Main S t .  
U rbana, IL 61801
7b. ADDRESS (City, State, and ZIP Code)
3701 N. F a i r f a x  D rive  
A r l in g to n ,  VA 22203-1714
8a. NAME OF FUNDING/SPONSORING  
ORGANIZATION
7a.
8b. OFFICE SYMBOL 
(If applicable)
9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER
10. SOURCE OF FUNDING NUMBERS
7b. PROGRAM PROJECT TASK
ELEMENT NO. NO. NO.
WORK UNIT 
ACCESSION NO.
11. TITLE (Include Security Classification)
Compiler Techniques f o r  O ptim izing  Communication and Data D i s t r i b u t i o n  
f o r  D is tribu ted-M em ory  M ulticom puters
12. PERSONAL AUTHOR(S)
D anie l Joseph Palermo
13a. TYPE OF REPORT 13b. TIME COVERED 14. DATE OF REPORT (fear, Month, Day) 15. PAGE COUNT
T e c h n ic a l  . FROM TO June 1996 145
16. SUPPLEMENTARY NOTATION
17. COSATi CODES
FIELD GROUP SUB-GROUP
18. SUBJECT TERMS {Continue on reverse if Recessary and identify by block number)
p a r a l l e l i z i n g  c o m p ile rs ,  d i s t r i b u t e d  memory m u lt ic o m p u te rs ,
communication o p t im iz a t io n ,  au to m a tic  d a ta  p a r t i t i o n i n g ,  
d a ta  r e d i s t r i b u t i o n
19. a b s t r a c t  (Cont/m Distributed-memory multicomputers, such as the Intel iPSC/860, the Intel Paragon, the IBM SP-
l/SP-2, the NCUBE/2, and the Thinking Machines CM-5, offer significant advantages over shared-memory 
multiprocessors in terms o f cost and scalability. However, lacking a global address space, they present a 
very difficult programming model in which the user must specify how data and computation are to be 
distributed across processors and determine these sections o f data to be communicated to specific processors. 
To overcome this difficulty, significant research effort has been aimed at source-to-source parallelizing 
compilers that relieve the programmer from the task of program partitioning and communication 
generation, while the specification of data distributions has remained a responsibility of the programmer.
The quality o f the data distribution for a given application is crucial to obtaining high performance 
on distributed-memory multi computers. By selecting an appropriate data distribution, the amount of 
communication required by an application can be dramatically reduced. The resulting performance using a 
given data distribution therefore depends on how well the compiler can optimize the remaining 
communication. In this thesis, I present and analyze several techniques to optimize communication and, 
based on the performance of these optimizations, automatically select the best data partitioning for a given 
application.
20. DISTRIBUTION/AVAILABILITY OF ABSTRACT 
0UNCLASSIFIED/UNLIMITED □  SAME AS RPT. □  OTIC USERS
21. ABSTRACT SECURITY CLASSIFICATION 
U n c l a s s i f i e d
22a. NAM E OF RESPONSIBLE INDIVIDUAL 22b. TELEPHONE ( Include Area Code) 22c. OFFICE SYMBOL
DD FORM 1 4 7 3 ,84 MAR SECURITY CLASSIFICATION OF THIS PAGE83 APR edition may be used until exhausted. 
All other editions are obsolete.
UNCLASSIFIED
U N C L A S S I F I E D _______
SECURITY CLASSIFICATION OR THIS RACE
UNCLASSIFIED
S E C U R I T Y  C L A S S I F I C A T I O N  O F  T H I S  P A r . F
COMPILER TECHNIQUES FOR 
OPTIMIZING COMMUNICATION AND DATA DISTRIBUTION 
FOR DISTRIBUTED-MEMORY MULTICOMPUTERS
BY
DANIEL JOSEPH PALERMO
B.S., Purdue University, 1990 
M.S., University of Southern California, 1991
THESIS
Submitted in partial fulfillment of the requirements 
for the degree of Doctor of Philosophy in Electrical Engineering 
in the Graduate College of the 
University of Illinois at Urbana-Champaign, 1996
Urbana, Illinois
©  Copyright by Daniel Joseph Palermo, 1996
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
THE GRADUATE COLLEGE
MAY 1996
WE HEREBY RECOMMEND THAT THE THESIS BY 
DANIEL JOSEPH PALERMO
COMPILER TECHNIQUES FOR OPTIMIZING COMMUNICATION AND 
ENTITLED____________________ __________________________________________________
DATA DISTRIBUTION FOR DISTRIBUTED-MEMORY MULTICOMPUTERS_____________
BE ACCEPTED IN PARTIAL FULFILLM ENT OF THE REQUIREMENTS FOR
t  Required for doctor's degree but not for master’s.
0 -5 17
COMPILER TECHNIQUES FOR 
OPTIMIZING COMMUNICATION AND DATA DISTRIBUTION 
FOR DISTRIBUTED-MEMORY MULTICOMPUTERS
Daniel Joseph Palermo, Ph.D.
Department of Electrical and Computer Engineering 
University of Illinois at Urbana-Champaign, 1996 
Prithviraj Banerjee, Advisor
Distributed-memory multicomputers, such as the Intel iPSC/860, the Intel Paragon, the IBM 
SP-1 /SP-2, the NCUBE/2, and the Thinking Machines CM-5, offer significant advantages over 
shared-memory multiprocessors in terms of cost and scalability. However, lacking a global ad­
dress space, they present a very difficult programming model in which the user must specify 
how data and computation are to be distributed across processors and determine those sections 
of data to be communicated to specific processors. To overcome this difficulty, significant re­
search effort has been aimed at source-to-source parallelizing compilers for multicomputers that 
relieve the programmer from the task of program partitioning and communication generation, 
while the specification of data distributions has remained a responsibility of the programmer.
The quality of the data distribution for a given application is crucial to obtaining high per­
formance on distributed-memory multicomputers. By selecting an appropriate data distribution, 
the amount of communication required by an application can be dramatically reduced. The re­
sulting performance using a given data distribution therefore depends on how well the com­
piler can optimize the remaining communication. In this thesis, we present and analyze several 
techniques to optimize communication and, based on the performance of these optimizations, 
automatically select the best data partitioning for a given application.
Previous work in the area of optimizing data distribution used constraints based on perfor­
mance estimates (which model the communication optimizations) to select high quality data 
distributions which remain in effect for the entire execution of an application. For complex pro­
grams, however, such static data distributions may be insufficient to obtain acceptable perfor­
mance. The selection of distributions that dynamically change over the course of a program’s 
execution (taking into account the added overhead of performing redistribution) adds another di­
mension to the data partitioning problem. In this thesis, we present a technique that extends the
iii

To my wife and my parents, 
without whose love and support 
none of this would have been possible.
ACKNOWLEDGMENTS
I would like to thank my advisor, Professor Prithviraj Banerjee, for his constant support and 
advice throughout the course of this research. I would also like to thank the members of my 
committee, Professors Wen-Mei Hwu, David Padua, and Constantine Polychronopoulos, for 
their valuable comments and suggestions on my work and this thesis.
I would like to thank all of the members of both the PARADIGM and ProperCAD projects 
for their technical expertise as well as their camaraderie. I would especially like to thank John 
Chandy, Eugene Hodges, and Ernesto Su, who have all made significant contributions during the 
course of this research, Amber Roy-Chowdhury and Antonio Lain for their constant feedback 
on the internal design of various aspects of the compiler, as well as Shankar Ramaswamy for 
teaching us all that “talk is research.” Additionally, I would also like to thank my good friend, 
Eugene Chen from the Solid State Devices Laboratory, for making me feel at home in Urbana- 
Champaign from the first moment I arrived.
For their support throughout the course of this research, I would like to thank the National 
Aeronautics and Space Administration and the Advanced Research Projects Agency. I would 
also like to thank the National Center for Supercomputing Applications, the San Diego Super­
computing Center, and the Argonne National Laboratory for providing access to their comput­
ing facilities.
Most of all I would like to thank my entire family for all of their love, encouragement, pa­
tience, and support throughout my studies. Most importantly, I also give thanks to God for the 
abilities and inspiration which He has given me.
vi
TABLE OF CONTENTS
CHAPTER PAGE
1 INTRODUCTION. 1
1.1 Overview of High Performance F o r t r a n ................................................................ 2
1.2 Overview of the PARADIGM P r o je c t .................................................. ... 5
1.2.1 Program analysis . ..................................... ..............................................  5
1.2.2 Automatic data partitioning .................................. ....................................  6
1.2.3 Compiling regular computations . .............................................................  6
1.2.4 Generic library interface . ...........................................................................  2
1.3 Thesis C ontributions.................................................................................................  8
1.4 Thesis O rgan ization .................................................................................................  y
2 RELATED W O RK ........................  11
2.1 Optimizing Communication ............................................... ....................................  H
2.2 Optimizing Data Distribution ................. ' ............................... 13
2.2.1 Static partitioning ....................  13
2.2.2 Dynamic p a rtit io n in g .....................................................    14
2.3 Optimizing Data R ed istribu tion ........................................   i8
2.4 Summary ............................................................    19
3 OPTIMIZING COMMUNICATION............  20
3.1 Message C oalesc ing ...........................• ■ ............................................................... 21
3.2 Message Vectorization .................................. ............................. .......................... • 22
3.3 Message A g g reg a tio n ................      24
3.4 Message P ip e lin in g ........................................................     26
3.4.1 Pipeline performance analysis ..............................    29
3.4.2 Serial computation cost estimation ..........................    33
3.5 Sum m ary.....................................................................................................................  34
4 OPTIMIZING DATA DISTRIBUTION..............................................  35
4.1 Static Performance E stim ation ....................... - .................................... ... 35
4.2 Static Distribution S e lec tio n ....................................................................................  37
4.2.1 A lignm en t..................................................................................................... 38
4.2.2 Block/cyclic distribution.............................................................................  38
4.2.3 Block size se le c tio n ....................................................................................  38
vii
4.2.4 Mesh co n fig u ra tio n .....................................................................................  39
4.3 Dynamic Distribution S e le c tio n .............................................................................  39
4.3.1 Motivation for dynamic d is tr ib u tio n s ...................................................... 39
4.3.2 Overview of the dynamic distribution ap p ro ach .....................................  41
4.3.3 Phase decomposition ................................................................................  43
4.3.4 Phase and phase transition selec tion .........................................................  49
4.3.5 Hierarchical Component Affinity Graph (H C A G ).................................. 55
4.4 Sum m ary.....................................................................................................................  58
5 OPTIMIZING DATA REDISTRIBUTION..........................................................  59
5.1 Data Redistribution A nalysis..............................   61
5.1.1 Computing reaching d istribu tions............................................................. 63
5.1.2 Constructing the distribution flow graph ...................................................  65
5.1.3 Representing distribution s e t s ..............................   66
5.2 Interprocedural Data Redistribution A n a ly s is ......................................................  68
5.2.1 Distribution sy n th e s is .......................    70
5.2.2 Redistribution synthesis . ..............................    73
5.2.3 Static Distribution Assignment (S D A )..............................   79
5.3 Implementation ...........................................    83
5.3.1 Source level data f lo w ....................   84
5.3.2 Virtual c lo n e s ..............................................................................................  85
5.3.3 Array re sh ap in g ....................   85
5.3.4 Examples ...........................................    86
5.4 Sum m ary................        91
6 EXPERIMENTAL RESULTS  ....................................................................... 92
6.1 Optimizing C om m unication....................................................................................  92
6.1.1 Results of run-time reso lu tion .................................    95
6.1.2 Results of message coalescing and loop bounds reduction.....................  96
6.1.3 Results of message vectorization ............................................................. 96
6.1.4 Results of message agg regation ................................................................ 97
6.1.5 Results of all non-pipelined optim izations...............................................  98
6.1.6 Results of message p ip e lin in g ......................................................   100
6.2 Optimizing Data Distribution and Redistribution ...............................................  107
6.2.1 2-D Alternating Direction Implicit (ADI2D) iterative m e th o d .............. 107
6.2.2 Shallow water weather prediction benchmark .........................................  110
6.3 Sum m ary.....................................................................................................................  114
7 CONCLUSIONS.....................................................................................................  115
7.1 Summary of Contributions.......................................................................................  115
7.2 Future W o r k ............................................................................................................... 117
7.2.1 Optimizing com m unication ......................................................................  118
7.2.2 Optimizing data distribution .......................................................................  118
7.2.3 Optimizing data redistribution...................................................................  120
viii
7.2.4 Distributed-shared memory architectures.....................................  120
REFERENCES ................................................................
.......................................................... ...  X ZiZ*
v , , v  ........................................................................................................................  133
IX
LIST OF TABLES
TABLE PAGE
3.1 Communication model parameters................ 2i
4.1 Communication estimate costs m odels......... ^
4.2 Communication prim itives........................................... ...............................  46
4.3 Detected phases and estimated execution times (sec) for ADI2D 52
6.1 Data partitioning for initial t e s t s ......................  93
6.2 Program efficiencies vs. op tim iza tion ............  99
6.3 Mesh configurations................ 101
6.4 Data partitioning for pipeline tests . ...........................................  2Q2
6.5 Comparison of the granularity estimates to the optimal granularity.................... 103
6.6 Empirically estimated time (ms) to transpose a 1-D partitioned m a t r i x ...............  110
x
LIST OF FIGURES
FIGURE PAGE
1.1 The HPF data mapping m o d e l ....................................................................................... 3
1.2 Example data distributions.............................................................................................  4
1.3 PARADIGM Compiler O v e rv ie w ................................................................................  5
3.1 Message vectorization..................................................   23
3.2 Message vectorization algorithm ................    23
3.3 Message aggregation........................................................................................................  24
3.4 Communication placement a lg o r ith m .......................................................................... 25
3.5 Code example requiring message pipelin ing................................................................  27
3.6 Examples of message pipelin ing ........................................................    28
3.7 Estimation framework for coarse grain p ip e lin in g ......................................................  30
4.1 Two-dimensional Fast Fourier T ran s fo rm ..................................................................  40
4.2 Overview of the dynamic distribution ap p ro ach .........................................................  41
4.3 2-D Alternating Direction Implicit iterative method (ADI2D) ..................................  42
4.4 Phase decomposition ........................................................................................................  43
4.5 Communication graph and some example initial edge costs for A D I2 D .................  46
4.6 Example graph illustrating the computation of a cut . . .  ..........................................  47
4.7 Partitioned communication graph for A D I2 D ............................................................  48
4.8 Phase transition graph for A D I2 D ..........................      50
4.9 Pseudo-code for the partitioning algorithm ................................................................ 53
4.10 Hierarchical Component Affinity Graph (HCAG) for A D I2 D .................................. 57
5.1 Overview of the array redistribution data-flow analysis fra m e w o rk ................. ... . 61
5.2 Splitting CFG nodes to obtain DFG nodes  ........................... ... . 65
5.3 Distribution set using bit vectors....................................................................................  67
5.4 Example call graph and depth-first traversal o rd e r ......................................................  68
5.5 Distribution s y n th e s is ..........................    71
5.6 Interprocedural analysis for distribution synthesis .  ...................................  72
5.7 Redistribution sy n th e s is ........................................     74
5.8 Interprocedural analysis for redistribution sy n thesis ..................................................  76
5.9 Initialization for redistribution synthesis....................................................................... 77
5.10 Algorithms for detecting non-conforming HPF p ro g ra m s ........................................  80
5.11 Example of static distribution ass ig n m en t..................................................................  82
xi
5.12 Example loop shown with flow transitions...................................................................  84
5.13 Synthetic example for interprocedural redistribution optimization . ........................  87
5.14 2-D ADI with interprocedural redistribution optim ization ........................................  89
5.15 Example of loop invariant redistribution....................................................................... 91
6.1 Run-time re so lu tio n ........................................................................................................  94
6.2 Reduced loop bounds and coalescing ..........................................................................  94
6.3 Message vectorization..................................................................................................... 94
6.4 Message aggregation..................................................   94
6.5 Comparison of combined optimizations .......................................................................  99
6.6 Performance com parison..............................................    101
6.7 Performance comparison of pipelining vs. granularity.....................................  105
6.8 Traces from SOR with pipelining on an Intel P a rag o n .......................    106
6.9 Modes of parallel execution for ADI2D .......................................................................  108
6.10 Performance of A D I 2 D .......................      109
6.11 Phase transition graph and solution for S h a llo w .........................................................  112
6.12 Performance of S h a llo w ....................    113
xii
CHAPTER 1
INTRODUCTION
Distributed-memory massively parallel multicomputers can provide the high levels of per­
formance required to solve the Grand Challenge computational science problems [1]. Multi­
computers, such as the Intel iPSC/860, the Intel Paragon, the IBM SP-1 /SP-2, the NCUBE/2, 
and the Thinking Machines CM-5, offer significant advantages over shared-memory multipro­
cessors in terms of both cost and scalability. Unfortunately, extracting all of the computational 
power from these machines requires users to write efficient software for them, which is a labo­
rious process. One major reason for this difficulty is the absence of a global address space. As a 
result, the programmer has to manually distribute computations and data across processors and 
manage communication explicitly.
To overcome this difficulty, significant research effort has been aimed at developing source- 
to-source parallelizing compilers for multicomputers that relieve the programmer from the task 
of program partitioning and communication generation. As a result, High Performance For­
tran [2] (HPF) has been developed in a collaborative effort between researchers in industry and 
academia to help standardize the specification of data distribution for such a compiler.
An HPF compiler must partition the computation and generate any required communication, 
but the actual specification of data distributions still remains a responsibility of the programmer. 
The quality of the data distribution for a given application is crucial to obtaining high perfor­
mance on distributed-memory multicomputers. By selecting an appropriate data distribution, 
the amount of communication required by an application can be dramatically reduced. The re­
sulting performance using a given data distribution therefore depends on how well the compiler
1
can optimize the remaining communication. In this thesis, we present several techniques to op­
timize communication and, based on the performance of these optimizations, automatically se­
lect the best data partitioning for a given application to relieve the programmer of the burden of 
selecting the data distribution.
The work in this thesis has been implemented as part of the PARADIGM (PARAllelizing 
compiler for Distributed memory General-purpose Multicomputers) project [3] in order to de­
velop an automated means to parallelize and optimize sequential programs for efficient execu­
tion on distributed-memory multicomputers.
1.1 Overview of High Performance Fortran
High Performance Fortran (HPF) was designed to provide a machine-independent program­
ming model for compiling (initially regular, dense-matrix) scientific programs for efficient exe­
cution on distributed-memory multicomputers. By providing directives to describe the distribu­
tion of data across the processors of a machine, computation is implicitly distributed (typically 
using the owner-computes rule, which states that the processor owning a data item performs 
all computations for that item). Any non-local data required in a given computation must be 
obtained through interprocessor communication. Thus, the programmer is still responsible for 
choosing partitionings that perform well but is freed from having to write code for the communi­
cation of data in the program. As compared to other efforts to standardize distributed-memory 
programming using message passing, such as the Message Passing Interface [4] (MPI), HPF 
provides a higher-level programming model using the philosophy that the complexity (of code 
partitioning and communication generation) can be shifted into the compiler.
In the HPF data mapping model, shown in Figure 1.1, dimensions of distributed arrays are 
first aligned with one another, and then distributed onto a rectilinear arrangement (or mesh) of 
abstract processors. The abstract processor mesh is then mapped to physical processors. Each 
array is either aligned to another, explicitly distributed, or distributed according to the com­
piler’s choice.
2
Abstract
Arrays or Group of processor Physical
other objects aligned objects arrangements processors
or or dependent
REALIGN REDISTRIBUTE directives
(optional)
Figure 1.1: The HPF data mapping model
In Figure 1.2, several examples of HPF data distributions are shown for a two-dimensional 
array distributed onto four processors. This illustrates how each dimension of an array can be 
given a different distribution. Blocked and cyclic distributions are actually two extremes of a 
general distribution commonly referred to as block-cyclic (or cyclic(A;) where k is the block 
size). A block-cyclic distribution on a dimension of an array assigns a fixed size block of that 
dimension to each processor in round-robin fashion until the entire dimension is distributed. A 
block distribution is equivalent to a block-cyclic distribution in which the block size is the size 
of the original array dimension (N ) divided by the number of processors (P ), cyclic(^). A 
cyclic distribution is simply a block-cyclic distribution with a block size of one, cyclic(l).
In HPF, dynamic distributions can be described, explicitly using executable redistribution 
directives (REDISTRIBUTE or REALIGN, which specify where new distributions become active) 
or implicitly by calling functions1 (which require different data distributions than the calling 
function). In order to actually compile an HPF program into an efficient form, however, both 
the redistribution1 2 operations as well as the possible distributions for the individual blocks of 
code must be known. Since the HPF REDISTRIBUTE and REALIGN directives only specify redis­
tribution information, both the intra- and inter-procedural control flow through a program must 
be examined to decide exactly which redistribution operations were last performed in order to 
determine which distributions are active at any given point in the program.
1For the purposes of simplifying this discussion, the word function will be used throughout this thesis to refer 
to either a Fortran program or subprogram (function or subroutine) and should be considered to be interchangeable 
with all of these terms.
2 In the HPF standard [5], the word remapping is used to refer to redistribution in order to avoid associating it 
with only the HPF REDISTRIBUTE directive. For this thesis, the word redistribution will be used equivalently to 
refer to remapping due to either the HPF REDISTRIBUTE or REALIGN directive.
3
(a) (BLOCK, * )
(c) (BLOCK, BLOCK)
(b) (CYCLIC(As), * )
(d) (CYCLIC(fci) ,  CYCLIC(k2))
Figure 1.2: E xam ple data distributions
4
PARADIGM: PARAIIelizing compiler for Distributed-memory 
General-purpose Multicomputers
Figure 1.3: PARADIGM Compiler Overview
1.2 Overview of the PARADIGM Project
Figure 1.3 shows a functional illustration of how we envision the complete PARADIGM 
compilation system [3]. The compiler front-end currently accepts either sequential Fortran 77 [6] 
or High Performance Fortran [7] and produces an optimized explicit message-passing version 
(in the form of a Fortran 77 program with calls to the selected message-passing library and the 
PARADIGM run-time library). The following are brief descriptions of the major phases that are 
most closely related to the communication and data distribution optimizations presented in this 
thesis. The remaining phases in the compiler are described in the overview of the PARADIGM 
system [3].
1.2.1 Program analysis
Parafrase-2 [6] is used as a preprocessing platform to parse the sequential program into an 
intermediate representation and to analyze the code to generate flow, dependence, and func­
tion call information that is used throughout the rest of the compiler. A number of machine-
5
independent code simplifying transformations, such as constant propagation and induction vari­
able substitution, are also performed at this stage.
1.2.2 Automatic data partitioning
The compiler can currently select a static distribution of data using a constraint-based algo­
rithm [8], which determines both the best configuration of an abstract multidimensional mesh 
topology along with how program data should be distributed on the mesh. As the interactions 
between the selection of data distributions and the use of available communication optimiza­
tions are tightly linked, estimates of the time spent in computation and communication drive 
the selection of the data distribution. Available communication optimizations performed by the 
compiler are reflected in the estimates in order to correctly determine the best distribution.
For complex programs, static data distributions may be insufficient to obtain acceptable per­
formance on distributed-memory multicomputers. By allowing the data distribution to dynami­
cally change over the course of a program’s execution, this problem can be alleviated by match­
ing the data distribution more closely to the different computations performed throughout the 
program. Such dynamic partitionings can yield higher performance than a static partitioning 
when the redistribution is more efficient than the communication pattern required by the stati­
cally partitioned computation. The selection of distributions that dynamically change over the 
course of a program’s execution adds another dimension to the data partitioning problem. As 
part of this thesis, we present a technique that extends the static data partitioning algorithm to 
automatically determine those partitionings most beneficial over specific sections of a program 
while taking into account the added overhead of performing redistribution.
1.2.3 Compiling regular computations
Using the owner-computes rule, the compiler divides computation across processors accord­
ing to the selected data distribution and generates inter-processor communication for required 
non-local data. A direct application of this rule without further optimizations leads to run-time 
resolution [9], which results in code that computes the ownership and communication for each
6
reference at run time. To avoid the overhead of computing ownership at run time, static anal­
ysis is used to partition loops at compile time (known as loop bounds reduction [10]) allowing 
processors to execute only those iterations that have assignments that write to local memory.
PARADIGM applies the owner-computes rule to distribute computation across processors 
by statically partitioning loops. Given a program’s data distribution information, which can be 
either manually specified with HPF directives or automatically selected by the compiler, the 
ACCESS sets of all references enclosed in loops are first computed. The ACCESS set of a ref­
erence with respect to a particular processor is the set of loop iterations for which that reference 
accesses data which is locally owned by that processor. According to the owner-computes rule, 
a processor only performs the computations in which the left-hand side (LHS) of an assignment 
statement references data which is owned by that processor. Loops are partitioned by letting 
each processor execute only those iterations which are in the union of the ACCESS sets of the 
LHS references in the loop body (also potentially masking each statement in the body by its 
own LHS ACCESS set). This union forms the reduced iteration set of the loop which is used 
to compute the reduced loop bounds [10-12].
•The references in assignment statements are also analyzed to detect the need for communica­
tion. Communication descriptors are constructed to summarize the iterations requiring commu­
nication of non-local data, the processors involved, and the exact regions of the arrays to be sent 
or received. To determine which iterations require interprocessor communication, PARADIGM 
computes the set difference between the left-hand side and right-hand side ACCESS sets [11, 
12]. Once the communication descriptors have been computed for individual references, a num­
ber of communication optimizations (which are presented as part of this thesis) can be per­
formed to reduce the overhead of communication [13].
1.2.4 Generic library interface
Support for specific communication libraries is provided through a generic library interface. 
Since more information is carried internally than is required by any specific implementation, 
the mapping is obtained by selecting a subset of the available parameters. For each supported
7
library, abstract functions are mapped to corresponding library-specific code generators at com­
pile time.
Library interfaces have been implemented for Thinking Machines CMMD, Parasoft Ex­
press, MPI [4], Intel NX, PVM [14], and PICL [15]. Execution tracing, as well as support for 
multiple platforms, is also provided in Express, PVM, and PICL. The portability of this inter­
face allows the compiler to easily generate code for a wide variety of machines.
1.3 Thesis Contributions
The contributions made in this thesis are focused in three main areas of performance opti­
mization compilation techniques for distributed-memory multicomputers, each of which builds 
upon the previous one.
(1) Optimizing communication,
(2) Optimizing data distribution using the available communication optimizations, and
(3) Optimizing redistribution for dynamic data distributions.
Since the resulting performance using a given data distribution depends heavily on how well 
the compiler can optimize the remaining communication, we first implemented several exist­
ing optimization techniques in the prototype PARADIGM compiler and evaluated their perfor­
mance on several distributed-memory architectures. As a result of this effort we have also devel­
oped a more robust message vectorization algorithm based on the dependence direction vector 
as well as a more comprehensive estimation framework for balancing the available parallelism 
with the overhead of communication in pipelined computations.
In the area of optimizing data distribution, we have developed a dynamic data partitioning 
algorithm as a layer built on top of the existing static partitioner [8]. A branch-and-bound tech­
nique is used to recursively decompose a program into a number of phases (each with a selected 
static distribution) after which redistribution costs are taken into account to determine which 
phases to use. Our data partitioning techniques make good use of performance estimates of
8
communication and computation in order to determine where to partition a program into sub­
phases as well as to determine which phase sequence will result in the best performance.
To support the dynamic partitioning techniques, we have extended the static data partitioned 
implemented in PARADIGM as part of a previous thesis [8], to function as a reentrant module 
in order to obtain static partitionings for individual regions of a program. We have extended the 
underlying communication and computation cost model estimation framework, using a virtual 
function interface similar to that used in the generic library interface, to make the cost estimation 
framework more architecture independent as well as easily extensible. We have also introduced 
the idea of a Hierarchical version of the Component Affinity Graph [16] (HCAG), which pro­
vides a representation for efficiently computing alignments of subregions of a program as well 
as performing interprocedural alignment analysis.
Finally, we have developed an interprocedural data-flow framework which provides a means 
to convert between a distribution-based representation (as specified by the data partitioner and 
required for compilation) and a redistribution-based form (as specified by HPF) while optimiz­
ing the amount of redistribution that must be performed. In addition to supporting the data parti­
tioned this framework also allows us to optimize dynamic HPF programs (using REDISTRIBUTE 
and REALIGN directives) as well as convert such programs into equivalent static versions through 
a process we refer to as Static Distribution Assignment (SDA) providing dynamic HPF support 
for existing subset HPF compilers.
We have evaluated the implementation of these optimizations in the PARADIGM compiler 
framework using several small kernels and example programs and measured their effect on the 
performance of the programs using an Intel Paragon and Thinking Machines CM-5.
1.4 Thesis Organization
The remainder of this thesis is organized as follows. Chapter 2 provides a summary of re­
lated work that has been performed in the area of optimizing communication and data distri­
bution. Chapter 3 presents several existing techniques for optimizing communication as well 
as the improvements we have made. Chapter 4 presents the technique currently used to obtain
9
static data distributions as well as the new techniques we have developed which select dynamic 
data distributions to further improve performance for applications which require different dis­
tributions over different phases of execution. Chapter 5 describes the interprocedural data-flow 
framework which converts distribution-based representations into an equivalent redistribution- 
based form while optimizing the amount of redistribution that must be performed. Chapter 6 
presents the evaluation of these optimizations in the PARADIGM compiler framework using 
several small kernels and example programs and measuring their effect on the performance of 
the programs using an Intel Paragon and Thinking Machines CM-5. Finally, Chapter 7 presents 
the conclusions on this work, and discusses directions for future work in this area.
10
CHAPTER 2
RELATED WORK
In this chapter we will describe the related work in each of the three areas of focus in this 
thesis:
(1) Optimizing communication,
(2) Optimizing data distribution using the available communication optimizations, and
(3) Optimizing redistribution for dynamic data distributions.
2.1 Optimizing Communication
As the issues related to optimizing communication for distributed-memory multicomputers 
are very similar to the issues which arise when optimizing accesses to memory hierarchies in 
general, a considerable range of research has been performed in areas directly or indirectly re­
lated to the communication optimizations described in this thesis. A number of techniques have 
been developed in the areas of blocking [17-20] and tiling [21,22] algorithms for improved 
locality cache performance, software controlled prefetching to reduce miss penalties for both 
uniprocessor [23-25] and multiprocessor systems [19,26-28], and combining communication 
operations required to obtain non-local data in distributed-memory systems [13,19,29-36] to 
reduce the communication overhead.
In cases in which loops contain cross-iteration dependencies due to recurrences within the 
innermost loops, parallelism can be extracted by executing the loop in a pipelined fashion. It
11
is possible to automatically extract pipeline parallelism from such recurrences through the use 
of general techniques such as the the hyperplane method1 [37-43] and unimodular transforma­
tions [22,44,45]. Such techniques have been widely used in the design of VLSI systolic ar­
rays [42,46,47], the extraction of doacross parallelism [48,49], as well as the development of 
software pipelining for distributed-memory machines [29,32,36,50,51].
Specifically for distributed-memory machines, Gallivan, Jalby, and Gannon [19] discussed 
some of the problems, in general terms, associated with automatically restructuring data and 
generating communication in order to improve performance for distributed-memory machines 
while still maintaining the correctness of the original program. Rogers and Pingali [29,36] ex­
amined several optimizations that address the issues of overhead and synchronization in mes­
sage transmission by comparing the results of the automatic techniques with hand-coded ver­
sions of several programs [29, 36]. Tseng [50,51] described several transformations used in 
the AL compiler that relate to software pipelining for distributed-memory machines. In [30], 
Gemdt described the message coalescing and vectorization optimizations in detail as imple­
mented as part of the SUPERB project [52]. Balasundaram, Fox, Kennedy, and Kremer [31] 
presented a message vectorization technique in relation to a performance estimation framework 
based on training sets. In [32], Hiranandani, Kennedy, and Tseng have evaluated a number of 
communication optimizations performed by the Fortran D [32] compiler and proposed a mea­
sure for controlling the granularity of a software pipeline. Amarasinghe and Lam [34] used data 
flow analysis techniques to determine where data was last written while optimizing communi­
cation in the SUIF compiler. Gong, Gupta, and Melhem [33] as well as Gupta, Schonberg, and 
Srinivasan [35] have also examined the use of data flow frameworks in order to eliminate re­
dundant messages between separate loop nests in addition to optimizing communication within 
a single loop nest.
In this thesis we present our implementation of a number of previously proposed commu­
nication optimizations in the PARADIGM compiler, a more robust message vectorization algo­
rithm based on the dependence direction vector, a more comprehensive estimation framework 
for balancing the available parallelism with the overhead of communication in pipelined com­
1 Which has also been referred to as the wavefront method [37].
12
putations, as well as an evaluation of the presented optimizations on different distributed archi­
tectures.
2.2 Optimizing Data Distribution
2.2.1 Static partitioning
As previously stated, for distributed-memory multicomputers, the quality of the data parti­
tioning for a given application is crucial to obtaining high performance. This task has tradition­
ally been the user’s responsibility, but in recent years much effort has been directed to automat­
ing the selection of data partitioning schemes. Several researchers have proposed static data 
partitioning systems that are able to select data distributions that remain in effect for the entire 
execution of an application. The two main problems which arise in selecting static distributions 
are determining how different arrays are aligned to each other and how the aligned arrays are 
distributed across the processors.
In [53], Mace examined the problem of determining the best memory storage patterns (shapes) 
for interleaved memories in vector processors showing that the optimal data partitioning prob­
lem is NP-complete. Knobe, Lukas, and Steele [54] developed alignment techniques for SIMD 
machines by constructing preferences between data references. A greedy heuristic is used to 
choose the highest weighted preferences within cycles of the preference graph. In [16], Li and 
Chen showed that the problem of array alignment is NP-complete and presented a heuristic so­
lution based on bipartite graph matching. They also were the first to describe how to synthesize 
collective communication operations based on a pattern-matching approach on data references 
in the source program. Ramanujam.and Sadayappan [55] developed a technique which selects 
data distributions in which a communication-free partitioning exists. They used a linear alge­
bra framework to describe array access patterns within a parallel loop and determine when a 
communication-free data distribution exists. As a communication-free distribution usually does 
not exist for an entire program, their techniques mainly apply to individual loop nests. In the 
description of the AL compiler, Tseng [50,51] presents a technique for determining the best
13
data distribution given which dimension of each array is to be distributed. Wholey [56] showed 
how performance estimation is a key in selecting good data distributions. As a hill climbing 
technique was used to select the best number of processors and dimensions to distribute, one 
main drawback of this technique is that it could get caught in local minima. Gupta and Baner- 
jee [57] described a constraint-based algorithm to determine both the best configuration of an 
abstract multidimensional mesh topology along with how program data should be distributed on 
the mesh. Their alignment analysis is based on earlier work on multidimensional array align­
ment [16] extending their approach with estimates of the communication cost penalty for mis­
aligned array dimensions. In [58], Chatterjee, Gilbert, Schreiber and Teng present a frame­
work for determining the alignment of arrays in data-parallel languages such as Fortran 90. 
They decomposed the alignment problem into axis, stride, and offset subproblems and were 
able to obtain alignments more general than those provided by the owner-computes rule. Ander­
son and Lam [59] developed an iterative technique using a linear algebra framework, based on 
both computation and data decompositions. Their technique iterated until the relationship be­
tween the data and computation decompositions is valid. Bau, Koduklula, Kotlyar, Pingali, and 
Stodghill [60] have also developed a technique for determining the alignment of arrays based 
on a linear algebra framework. They solved the resulting alignment equations by reducing the 
problem to determining a basis for the null space of a matrix, but heuristics still must be used 
to determine which constraints to leave unsatisfied when the system is overconstrained.
2.2.2 Dynamic partitioning
The selection of distributions that dynamically change over the course of a program’s execu­
tion adds another dimension to the data partitioning problem. As all of the difficulties involved 
in automatically selecting high quality static data distributions still exist in the selection of dy­
namic data distributions, many of the techniques used in the following approaches were based 
on algorithms or representations described previously in Section 2.2.1.
Hudak and Abraham have proposed a method for selecting redistribution points for Non- 
Uniform Memory Access (NUMA) shared-memory multiprocessors [61,62]. Their technique
14
is based on locating significant control flow changes in the program and inserting remapping 
points [62]. Remapping points are inserted between loop nests with different nesting levels, 
around control loops, and between a loop that requires nearest neighbor communication (which 
is assigned a block partitioning) and a loop that is linearly varying (which is given a cyclic par­
titioning). Once remapping points have been added to the program structure, a merge phase 
is performed to remove any unnecessary data redistribution. Remapping points are removed 
between two phases with identical distributions as well as when one phase has an unspecified 
distribution (in which case it is assigned a distribution from an adjacent phase). This heuristic 
was shown to improve the performance of several programs as compared to a strict data-parallel 
implementation. For a program consisting of a matrix multiplication followed by an LU factor­
ization, they observed a 26% improvement by applying this technique.
Chapman, Fahringer, and Zima have noted that there are many known good data distribu­
tions for important numerical problems that are frequently used [63]. They describe the design 
of a distribution tool that makes use of performance prediction methods when possible, but also 
heuristically uses empirical performance data when available. They have also developed cost 
estimates which model the work distribution, communication and data locality of a program 
while taking into account the features of the target architecture [64]. The combination of estima­
tion techniques along with profile information will be used to guide their system in potentially 
selecting lists of distributions for different arrays that appear in the program. Currently, the per­
formance prediction system has been integrated into a compiler based on Vienna Fortran [65], 
and research is under way on their partitioning system.
Anderson and Lam [59] have also proven that the dynamic decomposition problem is NP- 
hard by transforming the colored multiway cut problem (which is known to be NR-hard) into 
a subproblem of dynamic decomposition. They address the dynamic distribution problem by 
using a communication graph in which nodes are loop nests and edges represent the time for 
communication if the two loop nests involved in an edge are given different distributions. A 
greedy heuristic is used to combine nodes in such a way that the largest potential communica­
tion costs are eliminated first while maintaining sufficient parallelism. Their technique works 
from the bottom-up, examining highest cost edges first, combining two nodes if the cost of ex­
15
ecuting the combined nodes is less than the the sum of the original costs of the two nodes and 
the redistribution cost.
Work by Bixby, Kennedy and Kremer [66], by Kremer [67], and more recently by Garcia, 
Ayguade, and Labarta [68] formulates the data partitioning problem in the form of a 0-1 integer 
programming problem. For each phase, a number of candidate partial data layouts are enumer­
ated along with the estimated execution costs. The costs of the possible transitions among the 
candidate layouts are then computed forming a data layout graph. A static performance estima­
tor is used to obtain the node and edge weights of the graph. Since each phase only specifies 
a partial data distribution, redistribution constraints can cross over multiple phases, thereby re­
quiring the use of 0-1 integer programming. The number of possible data layouts for a given 
phase is exponential in the number of partitioned array dimensions resulting in potentially large 
search spaces. To benefit from advanced techniques in the field of integer programming, the re­
sulting formulation is processed by a commercial integer programming tool. In some previous 
work [69], Kennedy and Kremer also examined a dynamic programming technique that could 
determine dynamic distributions in polynomial time if each candidate layout specified a map­
ping for every array in the program. The main drawback with both of these techniques is that 
the selection'of phases must be made a priori to the formulation of the problem. This means that 
the size of the phases should be as small as possible so that the correct solution is found and that 
as many candidate layouts as possible should be specified in order to find the best dynamic data 
distribution. These factors contribute to the increase in the size of the search space.
Bixby, Kennedy, and Kremer [66] have also described an operational definition of a phase, 
which is defined as the outermost loop of a loop nest such that the corresponding iteration vari­
able is used in a subscript expression of an array reference in the loop body. Even though this 
definition restricts phase boundaries to loop structures and does not allow for overlapping or 
nesting of phases, it can be seen that for the example in Section 4.3.1 this definition is sufficient 
to describe the two distinct phases of the computation.
More recently, Sheffler, Schreiber, Gilbert, and Pugh [70] have applied graph contraction 
methods to the dynamic alignment problem [71] to reduce the size of the problem space that 
must be examined. Several localized graph transformations are employed to reduce the size of
16
the dynamic alignment problem significantly while still preserving the optimal solution. The 
alignment-distribution graph (ADG) is based on a single static assignment data-flow form of 
a data-parallel language. Nodes in the ADG represent data-parallel computations while edges 
represent the flow of data. The ADG itself is formed in a divide-and-conquer approach using 
heuristics to approximately solve a combinatorial minimization problem at each step, taking 
into account both redistribution costs as well as all candidate distributions, to determine where to 
partition the program into subphases [72]. Candidates are formed by first identifying the extents, 
or iteration space, of all objects in the program resulting in a number of clusters. These clusters 
appear to result in groups very similar to that formed by the operational definition of a phase. A 
set of representative extents is constructed by selecting one vector from each cluster. This set is 
then used to determine the preferred distributions of each dimension typically resulting in a few 
tens of candidate distributions. When the alignments of the candidates for two connected nodes 
are different, communication costs are modeled by the worst-case data transfer between any two 
candidate distributions for nodes involved. After the graph contraction transformations have 
been applied to the ADG, the final data distribution is selected using several different heuristics. 
By. first reducing the size of the problem, they are able to apply more expensive heuristics to 
obtain better overall solutions.
In this thesis, we present a dynamic data partitioning technique which we have developed 
as a layer built on top of an existing static partitioner [8]. A branch-and-bound technique is 
used to recursively decompose a program into a number of phases (each with a selected static 
distribution) by only considering the current communication constraints within a phase under 
examination. After decomposing the program into a hierarchy of phases, redistribution costs 
are then taken into account to determine which phases to use. Performance estimates of com­
munication and computation are used to determine where to partition a program into subphases 
as well as to determine which phase sequence will result in the best performance.
17
2.3 Optimizing Data Redistribution
Interprocedural analysis is a useful program analysis technique that has been applied to many 
different applications [73]. In the area of parallel processing, it has been widely used for depen­
dence analysis and program parallelization in the presence of function calls [74-77], detecting 
references to stale data for improving the performance of cache coherent architectures [78], as 
well as for compiling static Fortran D programs [79].
The work by Hall, Hiranandani, Kennedy, and Tseng [79] defined the term reaching de­
compositions for the Fortran D decompositions (or distributions, in HPF terminology) which 
reach a function call site. Their work describes extensions to the Fortran D compilation strat­
egy [80] using the reaching decompositions for a given call site to compile Fortran D programs 
that contain function calls as well as to optimize the resulting implicit redistribution. Since ex­
plicit redistribution did not exist in the Fortran D language, their techniques are limited to only 
optimizing implicit redistribution performed at function boundaries and do not address all the 
situations which can arise in HPF.
More specifically, in the area of array redistribution, there has been much recent work in 
generating efficient communication to actually perform the redistribution [81-87]. Techniques 
that have been applied to this problem range from resolving the array element-to-element map­
ping at run time [85] to various approaches for analyzing the communication patterns before 
performing the communication to minimize the total number of messages sent [81,82,84] to 
strip-mining redistribution operations in order to overlap the communication with the last com­
putation, which is generating the data to be redistributed [86,87].
The work by Coelho and Ancourt [81] also describes an optimization for removing useless 
remappings specified by a programmer through explicit realign and redistribute operations. In 
comparison to the work in the Fortran D project, they are also concerned only with determining 
which distributions are generated from a set of redistributions, but instead focus only on explicit 
redistribution. They define a new representation called a redistribution graph in which nodes 
represent redistribution operations and edges represent the statements executed between redis­
tribution operations. This representation, although correct in its formulation, does not seem to
18
fit well with any existing analysis already performed by optimizing compilers and also requires 
first summarizing all variables used or defined along an edge (all possible paths between succes­
sive redistribution operations) in order to optimize redistribution. As will be seen in Section 5.1, 
this representation is actually the dual of our representation which suffers from neither of these 
problems. Even though their approach currently only performs their analysis within a single 
function, they do suggest the possibility of an extension to their techniques which would al­
low them to also handle implicit remapping operations at function calls but do not describe an 
approach.
The definition of reaching distributions, however, is still a useful concept. We extend this 
definition to also include distributions which reach any point within a function in order to en­
compass both implicit and explicit redistributions thereby forming the basis of the work pre­
sented in this thesis. In addition to determining those distributions generated from a set of redis­
tribution operations, this extended definition allows us to address a number of other applications 
in a unified framework.
2.4 Summary
In this chapter we have described the related work in the areas of optimizing communication, 
data distribution, and data redistribution for distributed-memory multicomputers. In Chapter 3, 
we will present several existing techniques for optimizing communication as well as the im­
provements we have made. In Chapter 4, we will describe the techniques we have developed 
for optimizing data distributions, and in Chapter 5, we will describe the interprocedural data­
flow framework we have developed to optimize redistribution.
19
CHAPTER 3
OPTIMIZING COMMUNICATION
The start-up latency, or overhead, of communication for distributed-memory multicomput­
ers tends to .be two to three orders of magnitude greater than the per-byte transmission rate. For 
this reason, if multiple communication operations can be performed together, thereby reducing 
the frequency of communication operations in exchange for increasing the length of the mes­
sage communicated, the overhead can be amortized over multiple messages reducing the overall 
cost of communication. This property forms the basis for all of the communication optimiza­
tions that will described later in this chapter.
The point-to-point transfer cost of a message of m  bytes can be modeled as a linear func­
tion of the message length parameterized by the overhead and rate (see Table 3.1) for a specific 
architecture
transfer(m) =  ovhd + rate - m  (3.1)
In this cost model, we assume that any added contribution due to distance (or hops) between 
communicating processors is negligible in comparison to both the overhead and total transmis­
sion time. Since most modem interconnection networks utilize techniques such as wormhole 
routing over packet-switched interconnection networks with relatively low switch costs to min­
imize this effect, this is a valid assumption. For a machine in which the parameters depend on 
the length of the message, the model takes on different values for different ranges of message
20
Table 3.1: Communication model parameters
Machine Message size ovhd (fis) rate (ns) o vh dr a te
Intel Paragon any 50 0.018 2777
Intel iPSC/860 [88]
m  <  100 60 0.50 120
m > 100 160 0.36 444
Intel iPSC/2 [88]
m < 100 350 0.20 1750
m  > 100 700 0.36 1944
IBM SP-2 any 58 0.032 1812
TMC CM-5 [88] any 86 0.12 716
length. The model for a machine such as the iPSC/860 would be
transfer (m) = <
60 +  0.50m (if m  < 100) 
160 +  0.36m ( ifm > 1 0 0 )
To reduce the total amount of communication overhead, the communication optimizations 
examined in this chapter combine messages between different communication operations of un­
modified overlapping array regions (message coalescing, Section 3.1); across a dimension of an 
array traversed over different iterations of a loop nest (message vectorization, Section 3.2); of 
different arrays communicated between the same source and destination processors (message 
aggregation, Section 3.3); and of successive pipelined messages which arise in the presence of 
inner loop recurrences (message pipelining, Section 3.4).
3.1 Message Coalescing
Separate communication for different references to the same data is unnecessary if the data 
has not been modified between uses [30]. When statically analyzing the access patterns, these 
redundant communication operations are detected and coalesced into a single message, allowing 
the data to be reused rather than communicated for every reference. For sections of arrays which 
are not disjoint, unions of their overlapping communication descriptors [11,12] ensure that each
21
unmodified data element is communicated only once. Such software-based caching is always 
beneficial since entire communication operations can be eliminated.
3.2 Message Vectorization
Non-local elements of an array that are indexed within a loop nest can also be vectorized into 
a single larger message instead of being communicated individually (see Figure 3.1) [19,30]. 
The “itemwise” messages are combined, or vectorized, as they are lifted out of the enclosing 
loop nests to a selected level. Vectorization reduces the total number of communication opera­
tions, but at the cost of increasing the message length.1 For this reason, vectorization should per­
form well on machines with high communication overheads. The vectorized data may also have 
to be packed into a contiguous buffer before communication (unless the array section to com­
municate is already contiguous [89]). For all of the machines we have examined (in Table 3.1), 
the combined overhead of all of the non-vectorized (single-element, inner loop) communication 
operations greatly outweighs the cost of packing/unpacking the data at the vectorization loop 
level.
Dependence analysis is used to determine the outermost loop at which the combining can 
be applied. The algorithm Dep-UpLevel shown in Figure 3.2 examines the direction vector 
of an incoming dependence arc for a given reference and determines the level which carries the 
dependence. It then returns this location as a number of levels up from the current nesting level.* 2 
The vectorization level is the lowest loop level which carries all of the the flow dependencies 
for a given reference which is computed by Vect-UpLevel as the minimum up level of all 
incoming flow dependencies for that reference.
The use of the vectorization level is shown in the communication placement algorithm in 
Figure 3.4 for completeness, but it is actually computed when first constructing the communica­
tion descriptors [11,12]. Both the construction and placement of the communication descriptors 
should use the same level. Because Dep-UpLevel examines the dependence direction vector
Management of available memory may require that large regions of data be only partially vectorized [32].
2The “nesting level” is the number of loops which contain the reference while the “up level” is the number of 
loops to move up from the current level to reach the new nesting level.
22
Before
Figure 3.1: Message vectorization
Vect-UpLevel (re f )
1 > Start vectlevel at topmost level
2 vect-uplevel = NESTLEVEL(STMT (ref))
3 >  Minimum dependence “up level” determines vect-uplevel
4 for each edge in DEPENLIST(re/)
5 do if D EPTY PE(e^e) =  FLOW
6 then vectjuplevel 4- M in  (vect .uplevel, DEP-UPLEVEL(edge))
7 re tu rn  (vect-uplevel)
Dep-UpLevel (edge)
1 nest-level 4 -  NESTLEVEL(STMT (DEPSINK(edge)))
2 curr leve l  =  — 1 D> Dependence level not yet found
3 \> Direction vectors stored from outer to inner loop order
4 for i <- 0 to DEPDIRLEN(edge)
5 do if [<] € DE?DJR(edge)[{\
6 then currJevel = i >  Possible level with a “< ” direction
7 if ([<] € DE?DIR(edge)\i}) and ([=] $ DE?DIR(edge)[i\)
8 then curr-level — i >  Definite level with a strict “< ” direction
9 break
10 if curr-level < 0
11 then curr leve l  =  i
12 if (currlevel < DEPDIRLEN(ec?£e))
13 and ([<] £ DEPDIR(edge)[currlevel])
14 then currlevel++  >  Carried within loops with a “< ” direction
15 uplevel nest leve l  — curr leve l
16 retu rn  (uplevel)
Figure 3.2: Message vectorization algorithm 
Macros used to access fields in structures are shown in CAPS. 
Functions are indicated by S m a l l C a p s .
23
After
Figure 3.3: Message aggregation
(for both < and <  dependencies) to determine the level which carries the dependence, this is 
more robust than other dependence-based message vectorization algorithms [30,31,36] and is 
even useful for selecting vectorization levels for partially vectorizable forward references in the 
presence of recurrences (as will be discussed in Section 3.4).
3.3 Message Aggregation
Multiple messages (corresponding to several array sections) to be communicated between 
the same source and destination can also be aggregated into a single larger message [30] as 
shown in Figure 3.3. The communication operations at a given point in a program are first 
sorted by their source and destination before generating the communication operations. This 
is also shown in the PLACE-COMM algorithm in Figure 3.4, but in the actual implementation 
the sorting is only performed once for each list of communication descriptors before code gen­
eration instead of as each operation is placed. Messages with identical destinations can then be 
collected into a single communication operation. The gain from aggregation is similar to vec­
torization in that multiple communication operations can be eliminated at the cost of increasing 
the message length.
Aggregation can be performed between communication operations of individual data refer­
ences as well as vectorized communication operations (both of which will be examined in Sec­
tion 6.1). With respect to buffering, if the data is already packed for the individual messages, 
then there is no extra time spent packing the aggregated messages. Packing multiple messages
24
PLACE-COMM (comm, re f)
1 level 4- NESTLEVEL(STMT(re/))
2 >  Select the vectorization level for the reference
3 if  comm-vect
4 then level <- level -  VECT-UpLEVEL(re/)
5 >  Add comm  to communication descriptor list
6 commprev 4— NIL
7 commdesc 4— COMMLlST(level)
8 while commdesc and
9 (not comm-coal or DISJOINT {comm, commdesc))
10 do commprev 4- commdesc
11 commdesc 4- commdesc-^next
12 >  Coalesce overlapping communication operations
13 if  comm-coal and commdesc ^  NIL
14 then UNION(commdesc, comm)
15 else i f  commprev =/=■ NIL
16 then commprev-^next 4- comm
17 else COMMLIST(/eue.Z) 4- comm
18 co m m ^n ex t  4— commdesc
19 O If aggregating communication, sort by source and destination
20 i f  comm jag gr
21 then SORT(COMMLIST(/er;eZ), COMPARE(src, dest))
com m .coal = communication coalescing flag 
com m juect =  communication vectorization flag 
com m jaggr  =  communication aggregation flag
Figure 3.4: Communication placement algorithm
25
into the same buffer takes the same amount of time as it does to pack each individually into dif­
ferent buffers. Also, since it is likely that vectorized messages will require packing, the buffer­
ing associated with aggregating already vectorized communication is usually not a concern.
3.4 Message Pipelining
In loops in which there are no cross-iteration dependencies, parallelism is extracted by in­
dependently executing groups of iterations on separate processors. However, in cases in which 
there are cross-iteration dependencies due to recurrences within the innermost loops, it is not 
possible to immediately execute every iteration. When such recurrences are present in the pro­
gram, the message vectorization algorithm (in Figure 3.2) will not be able to move all of the 
communication operations (more specifically, those associated with references involved in the 
inner loop recurrence dependencies) out of the innermost loops. The communication operations 
that remain within the innermost loops give rise to a message ordering which is pipelined across 
the processors which are partitioned across the recurrence [32,36].
To illustrate how such situations can arise, a small code example that requires pipelining to 
'extract parallelism is shown in Figure 3.5. Assuming that the data is distributed by rows, par­
titioning the code in Figure 3.5(a) will cause the first processor to perform the computation on 
every row it owns before sending the border row to the waiting processor, thereby serializing 
execution of the overall computation (tL..3 indicate the processing order). In Figure 3.5(b), de­
pendencies allow the loops to be interchanged resulting in the first processor to compute instead 
one partitioned column of elements before sending the border element of that column to the next 
processor. The next processor can now begin its computation much earlier (upon receipt of the 
border element). As this process continues, it gives rise to a pipelined execution ordering for the 
recurrence. As previously described in Chapter 2, such pipelining techniques have been widely 
used to solve recurrences on high-performance systems.
Ideally, if communication has zero overhead, such fine grain pipelining exposes the most 
parallelism in such computations since no processor will wait unnecessarily. Unfortunately, for 
distributed-memory systems, this is not completely true. By again considering the overhead of
26
do i = 1 , Y  
doj = l, X
a(i, j) = a(i, j) * (1 - w) +
(a(i-l, j) + a(i+l, j) + 
a(i, j-1) + a(i, j+1)) * (w/4)
end do 
end do
(a) Before transformation
Dependencies
do j = 1, X  
do i = 1, Y
a(i,j) = a(i,j) * (1 - w) +
(a(i-1, j) + a(i+l, j) + 
a(i, j-1) + a(i, j+1)) * (w/4)
end do 
end do
(b) After loop interchange
Figure 3.5: Code example requiring message pipelining
communication (in Table 3.1), the cost of performing numerous single element communications 
can be quite expensive. To address this problem, the total cost of communication can be reduced 
by increasing the amount of computation performed before communicating thereby balancing 
the resulting reduction in parallelism due to partial serialization with the gain achieved through 
amortizing the communication overhead. Reducing the total cost of communication by control­
ling the pipeline granularity has become known as coarse grain pipelining [32].
An example of the code that would be generated for message pipelining is shown in Fig­
ure 3.6. For fine grain pipelining, Figure 3.6(a), an element is communicated after each column 
of computation. In Figure 3.6(b) the serial j  loop has been strip-mined to allow the computation 
of s columns before communicating. As the selection of an appropriate value of s is key to the 
performance of pipelined loops, this will be the focus of the remainder of this section.
27
do 1 = 1 ,L
if (my$p> 0) send (my$p-l, oc X ) 
if (my$p< p — 1) recv (my$p+l, ocX ) 
do j = 1, X
if (my$p> 0) recv (my$p-l, a  1) 
d o i=  1, m inC y-fy/p^m ySp, \Y /p 1)
computation 
end do
if (my$p< p-1) send (my$p+l, a  1) 
end do 
end do
(a) Fine grain pipelining
do 1 = 1,L
if (my$p> 0) send (my$p-l, oc X )  
if (my$p< p-1) recv (my$p+l, oc X )  
do jj = 1, X , s
jjmax = min(X, jj + s - 1) 
jjblk = jjmax - jj + 1 
• if (my$p> 0) recv (my$p-l, a  jjblk)
d° j = j j J j max
do i = 1, m in(y-[y/p]*m y$p, \Y /p \)  
computation 
end do 
end do
if (my$p< p-1) send (my$p+l, oc jjblk) 
end do 
end do
(b) Coarse grain pipelining
my$p = processor coordinate jj = stripmining loop
p = total number of processors jjmax = upper bound of stripmined loop
s = strip size jjblk = actual loop blocking factor
Figure 3.6: Examples of message pipelining
28
3.4.1 Pipeline performance analysis
Note that in the code examples in Figure 3.6 an outer loop (/) has been added which encloses 
the pipelined loop nest. This type of pipeline is called a two-level pipelined loop as opposed to a 
one-level pipelined loop (previously shown in Figure 3.5). A two-level pipeline allows consec­
utive pipelines to be overlapped with each other, or chained, which can be seen more clearly in 
the model in Figure 3.7 between the completion of the first pipeline and the start of the second. 
The reason for using a two-level model is that a one-level pipeline model would be too conser­
vative in that it would assume that there are barriers between successive pipelines (no chaining). 
The two-level model can capture the effect of a one-level model by setting the parameter L (the 
number of iterations of the enclosing loop) to one.
The generalized two-level pipeline model in Figure 3.7 corresponds directly to the code in 
Figure 3.6(b). In this model, the pipeline granularity is controlled by strip-mining the outer 
loop of the inner two-dimensional pipelined loop nest while chaining occurs between consecu­
tive pipelines at the outermost enclosing loop. Definitions of all variables used in the following 
analysis are also provided in Figure 3.7.
Since the amount of available parallelism is reduced as the granularity, or strip size (s), is 
increased, this value must be carefully selected. If s is one, the pipeline is fine grained expos­
ing the maximum parallelism. If s is equal to the bounds of the serial j  loop (X), the pipeline 
is serialized (although chaining will still occur at the outermost level). Somewhere in between 
lies an optimal s that balances the cost of communication with the available parallelism. As 
s decreases, the amount of communication increases while the start-up cost decreases. As s 
increases, the amount of communication decreases while the start-up cost increases. To inves­
tigate this tradeoff more closely, an execution time estimate is developed from the framework 
in Figure 3.7 in order to analytically determine a strip size that minimizes both effects.
The first major phase of execution is the start-up time required to fill the pipeline. This is 
related to the number of processors as well as the strip size. From Figure 3.7, the start-up cost 
is equal to
startup =  (p — l) ( s  • comp +  comm)
29
pr
oc
es
so
rs
Strip Computation |  Send Overhead (Sovhd) ---- ► Forward Reference Sync Transfer Time
|  Receive Overhead (Rovhd) -----*- Strip Transfer Time
Sovhd(m) + Rovhd(m) = ovhd(m)
comp = computation time for one column = \Y/p] c 
comm = time for transfers) (strip of a row) 
scomm = time for transfer(X) (entire row) 
transfer(m) = communication time of m  elements 
ovhd(m) = communication overhead for m  elements
L = outer loop iterations 
X , Y  = number of columns and rows 
p = total number of processors 
s = strip size
c = cost of inner loop instructions
Figure 3.7: Estimation framework for coarse grain pipelining
The next portion of execution is the time spent in the pipeline. Ideally, with zero commu­
nication costs, this time should be equal to the amount of computation (X  • comp). However, 
because of the presence of communication, the time for each message communicated in the 
pipeline must also be taken into account. From Figure 3.7, the number of communication op­
erations within the pipeline is ( 1J resulting in
pipeline =  X  ■ comp 4-
X
s
-  1 ] ovhd(s)
For loops with flow-dependencies caused by forward references (such as the a(i+l,j) and 
a(i,j+l) terms in the example in Figure 3.5), the execution of a sequence of pipelined loop nests 
will also generate communication incurring some additional synchronization costs to obtain the 
required data.3 As the forward references access data that was computed in a previous pipeline, 
they can be partially vectorized and moved outside of the innermost loops. The vectorization 
level of the forward reference communication operations is located just inside the outermost 
enclosing loop in Figure 3.6 as determined by V e c t -U p L e v e l .4
The “scomm” term is used to represent the amount of synchronizing communication. (If an 
entire row is communicated, then this cost is transfer (X)). Note that this will also be present in 
the start-up synchronization for the loop nest (see Figure 3.7.) As most parallel machines only 
support a single channel for memory transfer operations through the communication network, 
and since the pipeline will stall one stage waiting for a forward reference, this adds some extra 
delay to the “sync” term: Figure 3.7.
sync =  (scomm 4- 2 • comm — ovhd(s)) 4- (s • comp 4- comm)
Since L  pipelines are chained together, the total execution time is therefore
total cost =  scomm 4- startup 4- L • pipeline 4- {L — 1) sync
3 In some applications there are no forward references and, therefore, no need to synchronize between outer 
loop iterations. In these cases each processor can proceed without any further synchronization.
4Note that the inner loop pipelined messages are placed one level above the level computed by V ect-U pLevel 
since the pipelining itself will actually preserve the dependencies carried by the true vectorization level.
31
=  [LX + s{p + L -  2)] c 4- (p 4- 3L — 4) transfer(s)
Y _ '
P
[r ( V \  1+ s
— 2J +  1 ovhd(s) + L * transfer (X) (3.2)
By substituting the communication cost model previously presented in Equation (3.1) (with 
constant values for ovhd and rate for now) and 6 for the number of bytes that are communicated 
for each unit of the strip, the total cost becomes
total cost =  [LX + s(p 4- L — 2)]
+ L
X_
s
Y  
P
- 2 } +1
c +  (p 4- 3L — A)(ovhd 4- s • b • rate) 
ovhd 4- L(ovhd 4- X  • b ■ rate)
The total cost can then be minimized with respect to the strip size to select the optimal gran­
ularity
Q
— total cost =  (p 4- L  — 2)
os
Y
P
X
+  (p +  3L — 4)6 • rate — L • ovhd — = 0
Verification of the second derivative will show that this value of s is indeed a minimum. 
Solving for s results in
s =
X L *  ovhd
\  (p +  L — 2) (6 • rate +  [ j ]  c) +  26(L — 1 )rate
(3.3)
Since ovhd and rate are actually functions of s, Equation (3.3) is evaluated for each of the 
possible message size ranges in the communication model to obtain a value for s. If the program 
does not contain any flow dependencies caused by forward references, the “sync” and “scomm”
32
terms used in Equation (3.2) are zero resulting in the following expression for the minimum
X  • L • ovhd
(3.4)s
In comparison to this estimate, a simpler one-level strip size estimate developed in the For­
tran D project [32] yields an estimate of s of the following form
This estimate assumes that communication has a constant cost and that the array dimensions 
are square. Since it is based on a one-level pipeline model, it also assumes that chaining does not 
occur between consecutive pipelines. It is interesting to note that our two-level estimate (Equa­
tion (3.3)) can be reduced to the one-level estimate (Equation (3.5)) when several simplifying
communication cost (rate =  0), and no chaining (L =  1).
3.4.2 Serial computation cost estimation
Since the cost, c, of the inner loop computations appears in all of the final estimate expres­
sions, it is necessary to estimate the computation costs. Performance estimation, for even serial 
codes, is a fairly hard problem that ideally should model the effects of the memory system, in­
ternal hardware instruction scheduling, as well as low-level compiler optimizations and is still 
an active area of research [90, 91]. To simplify the computation estimation problem, a basic 
instruction counting technique supporting simple control flow will be used [8].
Instruction counting can be performed at either the source level or assembly level. For each 
method, the costs of the basic operations for a given machine are expressed in terms of clock 
cycles. In the case of source-level estimation, the actual source code is examined and a determi­
nation is made on a line by line basis of the instructions needed to compile the code. These costs
(3.5)
assumptions are made: square arrays (X  =  Y), approximate block sizes ( [ j ]  =  constant
33
must take into account any support instructions (address computation, register loads, stores, 
loop increment/decrement, etc.) performed for the actual computation. Assembly-level estima­
tion is of course more accurate since there is no need to guess at the operations that the compiler 
may generate or the low-level optimizations that it might perform. However, it does require be­
ing able to perform pre-compilation of the source under examination.
In both cases, estimation of the cost of a fixed block of code (which may contain loops and 
other control flow structures) requires knowledge of the timing costs (cycle times obtained from 
microprocessor hardware manuals) of individual machine instructions in order to compute a 
dynamic cycle count. The extent of loop bounds (constant or variable) and the shape of the 
iteration space (due to functions appearing in the loop bounds) can be more easily determined 
at the source level while more exact information can be obtained at the assembly level. For our 
purposes, source level cost estimation will be used here.
3.5 Summary
In this chapter we have described several communication optimizations: message coalesc­
ing, message vectorization, message aggregation, and message pipelining. Experimental results 
of applying these optimizations will be presented in Chapter 6.
34
CHAPTER 4
OPTIMIZING DATA DISTRIBUTION
Optimizing the data distribution for a given application is a difficult task requiring careful 
examination of numerous tradeoffs. Since communication tends to be more expensive relative 
to local computation, a distribution should be selected to maintain high data locality for each 
processor. Excessive communication can easily offset any gains made through the use of avail­
able parallelism in the program. At the same time, the distribution should also evenly distribute 
the workload among the processors, making full use of the parallelism present in the computa­
tion. Since the programmer may not be (and should not have to be) aware of all the interactions 
‘between distribution decisions and communication optimizations which affect the overall per­
formance of the program, this chapter addresses several techniques which we have developed 
to automate this process.
As many of the subproblems which must be solved to optimize the data distribution for a 
given program have been proven to be NP-complete [53,67,92], in the following sections we 
describe several effective heuristics employed to prune a considerable part of the search space 
while still reaching a good overall solution.
4.1 Static Performance Estimation
In order to optimize the data distribution for a program, it is necessary to first be able to 
estimate the performance for a given distribution. As the PARADIGM compiler has been de­
signed to be largely machine independent, the performance of a program is estimated in terms
35
Table 4.1: Communication estimate costs models
Base
cost
Interconnection network topology
Communication
primitive 9
/ &
9rv> 9
/
9s J 09
Transfer(m) ovhd + rate * m
Shift(m,p) Transfer(m) * 2 2( p - l )
Multicast(m, p) Transfer(m) * riog2p] 3 ( r ^ i - i )  2 ( r , / p i - i ) (p - i ) r(p —1) /2 ] 1
AllMulticast(m,p) (p—1) *Shift(m,p)
Reduce(m,p) Transfer(m) * Hog2pi 3 ( f ^  —1) 2 ( ^ - 1 ) (p - i ) r (p—1) / 2]  ( p - i )
AllReduce(m,p) Reduce(m,p) +  Multicast(m,p)
Scatter(m,p) (p—1) * Transfer(m)
Gather(m, p) (p —1) *Transfer(m)
of parameterized cost models [8,93]. For each target machine, a set of architectural parameters 
interfaces with the cost models effectively separating the static performance estimation frame­
work from any one specific distributed-memory architecture.
As already mentioned in Section 3.4.2, the computation time is determined by first estimat­
ing the time for sequential execution based on a source-level dynamic count of the basic opera­
tions performed (memory load, store, index, integer and floating-point addition, multiplication, 
•division, as well as integer modulo operation, and function calls). Parallel computation esti­
mates are then obtained by further parameterizing the execution cost of a loop which is parallel 
by the number of processors over which it is partitioned.
The communication required in the program is then analyzed, taking into account several of 
the optimizations1 presented in the previous chapter, to determine both the placement of com­
munication as well as the size of the resulting message to be communicated. To estimate the 
resulting cost of the selected communication, the linear point-to-point transfer model, as previ­
ously given in Equation (3.1) on page 20, is used as a basis for the communication model.
The static performance estimator can also currently detect a number of different access pat­
terns which give rise to higher level communication operations [8,93] as shown in Table 4.1. 
In comparison to a basic transfer cost model, the cost of a high-level (or collective) commu- *
xThe static performance estimation framework currently does not model the overlap of communication and 
computation for pipelined loops resulting in conservative communication estimates for loops containing cross­
iteration dependencies.
36
nication operation can be dependent on both the interconnect topology as well as the number 
of processors involved. To accurately represent the costs for each of these high-level commu­
nication operations, their cost models are ultimately based on the transfer cost model. As this 
approach was originally developed for hypercube based networks, we have extended it using a 
virtual function interface, similar to that used in the generic library interface, to make the cost 
estimation framework more architecture independent as well as easily extensible. Libraries such 
as MPI, which standardize distributed-memory programming using message passing, provide 
many of these common forms of collective communication. In the implementation of such a 
library, for a given parallel architecture, each communication operation must be written to effi­
ciently utilize the topology in order to provide the performance shown in Table 4.1.
Note that the functions shown in Table 4.1 serve as best-case approximations of the perfor­
mance of these communication operations. For architectures such as the 2-D (and 3-D) mesh, 
the costs also assume that the geometry of the physically allocated processors has an aspect ratio 
of 1:1(: 1) independent of the logical arrangement of the abstract mesh. If more detailed model­
ing is required [94], the functions could be extended to accept the number of processors in each 
dimension of the abstract mesh to take into account the effect of the physical processor map­
ping. For multidimensional architectures in which there is more than one processor in a given 
dimension (i.e., a 2-D or 3-D mesh) this would improve the accuracy of the costs when processor 
subsets are involved in an operation, e.g., a broadcast along one dimension of a 2-D data distri­
bution mapped onto a 2-D mesh is actually (p— 1) * Transfer (m) not 2( [y/p] - 1 )  * Transfer (m) 
whereas it is always [logp] * Transfer(m) when mapped onto a hypercube.
With the exception of the architecture-specific costs (the basic computation operation costs 
and the communication parameters ovhd and rate), these cost models allow the partitioning 
techniques to be relatively machine independent.
4.2 Static Distribution Selection
In the PARADIGM compiler, static data partitioning decisions are made in a number of dis­
tinct phases [8,57]. Often, there is a tradeoff between minimizing interprocessor communica­
37
tion and exploiting all available parallelism; the communication and the computational costs 
imposed by the underlying architecture must both be taken into account. Below are brief de­
scriptions of each of the major phases performed during the static data partitioning pass.
4.2.1 Alignment
The alignment pass identifies the array dimensions that should be mapped to the same pro­
cessor mesh dimension. The alignment preferences between two arrays can be between differ­
ent pairings of dimensions (interdimensional or axis alignment) as well as by an offset or stride 
within a given pair of dimensions (intradimensional or stride alignment). Currently, only inter­
dimensional alignment is performed in the partitioning pass.
4.2.2 Block/cyclic distribution
Once array alignment has been performed, the distribution pass determines whether each 
array dimension should be distributed in a blocked or cyclic manner. Array dimensions are first 
classified by their communication requirements. If the communication in a mesh dimension is 
recognized as-a nearest-neighbor pattern, it indicates the need for a blocked distribution. For 
dimensions that are only partially traversed (less than a certain threshold), a cyclic distribu­
tion may be more desirable for load balancing. Using alignment information from the previ­
ous phase, the array dimensions that cross-reference each other are assigned the same kind of 
partitioning to ensure the intended alignment.
4.2.3 Block size selection
When a cyclic distribution is chosen, the compiler is able to make further adjustments on 
the block size giving rise to block-cyclic partitionings. Since only a cyclic distribution is cho­
sen in the previous phase, in order to improve the load balance for partially traversed array di­
mensions, a closer examination of the communication costs must be performed. This analysis 
is sometimes needed when arrays are used to simulate recordlike structures (not supported di­
38
rectly in FORTRAN 77) or when lower-dimensional arrays play the role of higher-dimensional 
arrays.
4.2.4 Mesh configuration
After all the distribution parameters have been determined, the cost estimates are functions 
of only the number of processors in each mesh dimension. For each set of aligned array dimen­
sions, the compiler determines if there are any parallel operations performed. If no parallelism 
exists in a given dimension, it is collapsed onto a single processor. If there is only one dimen­
sion that has not been collapsed, all processors are assigned to this dimension. In the case of 
multiple dimensions of parallelism, the compiler determines the best arrangement of processors 
by evaluating the cost expression to estimate execution time for each feasible configuration.
4.3 Dynamic Distribution Selection
For complex programs we have seen that static data distributions may be insufficient to ob­
tain acceptable performance. Static distributions suffer in that they cannot reflect changes in 
a program’s data access behavior. When conflicting data requirements are present, static parti­
tionings tend to be compromises between a number of preferred distributions. Instead of requir­
ing a single data distribution for the entire execution, program data could also be redistributed 
dynamically for different phases2 of the program. Such dynamic partitionings can yield higher 
performance than for a static partitioning when the redistribution is more efficient than the com­
munication pattern required by the statically partitioned computation.
4.3.1 Motivation for dynamic distributions
Figure 4.1 shows the basic computation performed in a two-dimensional Fast Fourier Trans­
form (FFT). To execute this program in parallel on a machine with distributed memory, the main
2 A phase can be described simply as a sequence of statements in a program over which a given distribution is 
unchanged.
39
Figure 4.1: Two-dimensional Fast Fourier Transform
data array, Image, is partitioned across the available processors. By examining the data accesses 
that will occur during execution, it can be seen that, for the first half of the program, data is ma­
nipulated along the rows of the array. For the rest of the execution, data is manipulated along 
the columns. Depending on how data is distributed among the processors, several different pat­
terns of communication could be generated. The goal of automatic data partitioning is to select 
the distribution that will result in the highest level of performance.
If the array were distributed by rows, every processor could independently compute the FFTs 
for each row that involved local data. After the rows had been processed, the processors would 
now have to communicate to perform the column FFTs as the columns have been partitioned 
across the processors. Conversely, if a column distribution were selected, communication would 
be required to compute the row FFTs while the column FFTs could be computed independently. 
Such static partitionings, as shown in Figure 4.1(a), suffer in that they cannot reflect changes in 
a program’s data access behavior. When conflicting data requirements are present, static parti­
tionings tend to be compromises between a number of preferred distributions.
Instead of requiring a single data distribution for the entire execution, program data could 
also be redistributed dynamically for different phases of the program. For this example, as­
sume the program is split into two separate phases; a row distribution is selected for the first 
phase and a column distribution for the second (as shown in Figure 4.1(b)). By redistributing 
the data between the two phases, none of the one-dimensional FFT operations would require
40
Sequential Fortran 77 program 
no  d is tr ib u tio n  o r  red is trib u tion
Internal AST representation
e xp lic it d is tr ib u tio n
Figure 4.2: Overview of the dynamic distribution approach
communication. Such dynamic partitionings can yield higher performance than a static parti­
tioning when the redistribution is more efficient than the communication pattern required by the 
statically partitioned computation.
4.3.2 Overview of the dynamic distribution approach
The approach we have developed to automatically select dynamic distributions, shown in 
Figure 4.2, consists of two main steps which can be preceded by an initial analysis of array 
alignments. First, in Section 4.3.3, we will describe how to recursively decompose the program 
into a hierarchy of candidate phases obtained using existing static distribution techniques. Then, 
in Section 4.3.4 we will describe how to select the most efficient sequence of phases and phase 
transitions taking into account the cost of redistributing the data between the different phases. 
To avoid performing redundant work within the static partitioner, it would also be possible to 
decouple the analysis of array alignment preferences and the selection of the actual alignments 
by using an appropriate data structure which will described in Section 4.3.5. This approach 
allows us to build upon the static partitioning techniques previously described in Section 4.2.
41
p r o gram ADI2d op. pt ase
double precision u(N,N), uh(N,N), b(N,N), alpha do j = 2, N - 1 31
integer i, j, k uh(N - l.j) = uh(N - 1,j) / b(N - l.j) 32 VI
enddo 33
*** Initial value for u op. phase do j = 2, N - 1 34
do j = 1, N 1 do i = N - 2. 2, -1 35
do i = 1, N 2 uh(i,j) = (uh(i,j) + uh(i + l.j)) VII
u(i, j) = 0.0 3 k / b(i.j) 36
enddo 4 I enddo 37
u(l.j) = 30.0 5 enddo 38 J
u(n,j) = 30.0 6
enddo 7 *** Forward and backward sweeps along rows
do j = 2, N - 1 39
*** Initialize uh do i = 2, N - 1 40
do j = 1, N 8 b(i,j) = (2 + alpha) 41
do i = 1, N 9 u(i,j) = (alpha - 2) * uh(i,j) VIII
uh(i,j) = u(i,j) 10 11 k + uh(i + l.j) + uh(i - l.j) 42
enddo 11 enddo 43
enddo 12 enddo 44
do i = 2, N - 1 45
alpha = 4 * (2.0 / N) 13 u(i,2) - u(i,2) + uh(i,l) 46
do k = 1, maxiter 14 u(i,N - 1) = u(i,N - 1) + uh(i.N) 47
*** Forward and backward sweeps along columns enddo 48 J
do j = 2, N - 1 15 "1
do i = 2, N - 1 16 do j = 3, N - 1 49
b(i,j) = (2 + alpha) 17 do i = 2. N - 1 50
uh(i,j) * (alpha - 2) * u(i,j) III b(i,j) = b(i,j) - 1 / b(i,j - 1) 51 xk + u(i,j + 1) + u(i,j 1) 18 u(i,j) = u(i,j)
enddo 19 k + u(i,j - 1) / b ( i ,j - 1) 52
enddo 20 enddo 53
do j = 2, N - 1 21 enddo 54 _
uh(2,j) = uh(2,j) + u(l,j) 22 IV do i = 2. N - 1 55
uh(N - l.j) = uh(N - l,j) + u ( N ,j) 23 u(i,N - 1) = u ( i ,N - 1) / b(i,N - 1) 56 XI
enddo 24 _ enddo 57.
do j = N - 2, 2. -1 58
do j = 2, N - 1 25 do i = 2, N  - 1 59
do i = 3, N - 1 26 u(i.j) * (u(i,j) + u(i,j + 1)) XII
b(i,j) = b(i,j) - 1 / b(i - l.j) 27 v fe / b(i,j) 60u h ( i ,j) = u h (i ,j ) enddo 61
k  + uh(i - l.j) / b(i - l.j) 28 enddo 62.
enddo 29 enddo 63
enddo 30. end 64
Figure 4.3: 2-D Alternating Direction Implicit iterative method (ADI2D)
(shown with operational phases)
To help illustrate the dynamic partitioning technique, an example program will be used. 
In Figure 4.3, a two-dimensional Alternating Direction Implicit iterative method3 (ADI2D) is 
shown, which computes the solution of an elliptic partial differential equation known as Pois­
son’s equation [95]. Poisson’s equation can be used to describe the dissipation of heat away 
from a surface with a fixed temperature as well as to compute the free-space potential created 
by a surface with an electrical charge.
3To simplify later analysis of performance measurements, the program shown performs an arbitrary number of 
iterations as opposed to periodically checking for convergence of the solution.
I
Phase 1
Figure 4.4: Phase decomposition
For the program in Figure 4.3, a static data distribution will incur a significant amount of 
communication for over half of the program’s execution. For illustrative purposes only, the 
operational definition of phases previously described in Section 2.2 identifies twelve different 
“phases” in the program. These phases exposed by the operational definition need not be known 
for our technique (and, in general, are potentially too restrictive) but they will be used here for 
comparison as well as to facilitate the discussion.
4.3.3 Phase decomposition
Initially, the entire program is viewed as a single phase for which a static distribution is de­
termined. At this point, the immediate goal is to determine if and where it would be beneficial 
to split the program into two separate phases such that the sum of the execution times of the 
resulting phases is less than the original (as illustrated in Figure 4.4). Using the selected distri­
bution, a communication graph is constructed to examine the cost of communication in relation 
to the flow of data within the program.
We define a communication graph as the flow information from the dependence graph (gen­
erated by Parafrase-2 [6]) weighted by the cost of communication. The nodes of the communi­
cation graph correspond to individual statements while the edges correspond to flow dependen-
43
cies that exist between the statements. As a heuristic, the cost of communication performed for 
a given reference in a statement is initially assigned to (reflected back along) every incoming 
dependence edge corresponding to the reference involved. Since flow information is used to 
construct the communication graph, the weights on the edges serve to expose communication 
costs that exist between producer/consumer relationships within a program. Also since we re­
strict the granularity of phase partitioning to the statement level, single node cycles in the flow 
dependence graph are not included in the communication graph.
After the initial costs have been assigned to the communication graph, they are scaled ac­
cording to the number of outgoing edges to which each reference was assigned. The scaling 
conserves the total cost of a communication operation for a given reference, re /, at the con­
sumer, j ,  by assigning portions to each producer, i, proportional to the dynamic execution count 
of the given producer, i, divided by the dynamic execution counts of all producers. Note that 
the scaling factors are computed separately for producers which are predecessor or successors 
of the consumer as shown in Equation (4.2). Once the individual edge costs have been scaled to 
conserve the total communication cost, they are propagated back toward the start of the program 
(through all edges to producers which are predecessors) while still conserving the propagated 
cost as shown in Equation (4.3). Also, to further differentiate between producers at different 
nesting levels, all scaling factors are also scaled by the ratio of the nesting levels as shown in 
Equation (4.1).
Scaling initial costs:
• lratio(i,j) • comm(i, ref)
(4.1)
(4.2)
i<j  P(zin-pred(j,ref) 
i> j  PEinsucc(j,ref)
Propagating costs:
W(i , j , re f )  = W( i , j ,  ref) +
dyncount(P)
Pein-pred(j,*)
dyncount{i)
■lratio{i,j) J2  (4.3)
PGout(j)
44
In the actual implementation, this is accomplished in two passes over the communication 
graph after assigning the initial communication costs: the first to compute all the scaling factors 
and the second to propagate costs back toward the start of the program.
In Figure 4.5, the communication graph is shown for ADI2D with some of the edges labeled 
with the expressions automatically generated by the static cost estimator (using a problem size 
of 512 x 512 atid max i t e r  set to 100). For reference, the communication models for an Intel 
Paragon and a Thinking Machines CM-5, corresponding to the communication primitives used 
in the cost expressions, are shown in Table 4.2. Conditionals appearing in the cost expressions 
represent costs that will be incurred based on specific distribution decisions (e.g., P2 >  1 is true 
if the second mesh dimension is assigned more than one processor).
Once the communication graph has been constructed, a split point is determined by comput­
ing a maximal cut of the communication graph. The maximal cut removes the largest commu­
nication constraints from a given phase to potentially allow better individual distributions to be 
selected for the two resulting split phases. Since we also want to ensure that the cut divides the 
program at exactly one point to ensure only two subphases are generated for the recursion, only 
cuts between two successive statements will be considered. Since the ordering of the nodes is 
delated to the- linear ordering of statements in a program, this guarantees that the nodes on one 
side of the cut will always all precede or all follow the node most closely involved in the cut. 
The following algorithm is used to determine which cut to use to split a given phase.
For simplicity of this discussion, assume for now that there is at most only one edge between 
any two nodes. For multiple references to the same array, the edge weight can be considered to 
be the sum of all communication operations for that array. Also, to better describe the algorithm, 
view the communication graph G =  (V, E) in the form of an adjacency matrix (with source 
vertices on rows and destination vertices on columns).
(1) For each statement Si {i € [1,(|V’| — 1)]} compute the cut of the graph between state­
ments Si and Si+i by summing all the edges in the submatrices specified by [Si, Si\ x
*^1 v |] and [Sf+ij *S,|vr|] x [*$)., Si\
(2) While computing the cost of each cut also keep track of the current maximum cut.
45
Table 4.2: Communication primitives
Intel Paragon TMC CM-5
Transfer(m) 50 4- 0.018m
23+ 0 .12m m <  16 
86+0.12m m >  16
________ __________
Shift(m) 2 * Transfer(m)
46
(3) If there is more than one cut with the same maximum value, choose from this set the cut 
that separates the statements at the highest nesting level. If there is more than one cut 
with the same highest nesting level, record the earliest and latest maximum cuts with that 
nesting level (forming a cut window).
(4) Split the phase using the selected cut.
1
2
4
5
(a) Adjacency matrix
destination
1 2 3 4 5
9 4
15
7 7
12
0
computed cuts
1 7 2 = 25 
2 /3  = 31 
13/4 = 411 
4 /5  = 19
(b) Actual representation
destination
1 2 3 4 5
9 T
15
( " ) ( T T l )
1 01 A . ,
(—^ 
+
0
(c) Efficient computation
Figure 4.6: Example graph illustrating the computation of a cut
In Figure 4.6, the computation of the maximal cut on a smaller example graph with arbitrary 
weights is shown. The maximal cut is found to be between vertices 3 and 4 with a cost of 41. 
This is shown both in the form of the sum of the two adjacency submatrices in Figure 4.6(a), 
and graphically as a cut on the actual representation in Figure 4.6(b).
In Figure 4.6(c), the cut is again illustrated using an adjacency matrix, but the computation 
is shown using a more efficient implementation which only adds and subtracts the differences 
between two successive cuts using a running cut total while searching for the maximum cut 
in sequence. This implementation also provides much better locality than the full submatrix 
summary when analyzing the actual sparse representation since the differences between two 
successive cuts can be easily obtained by traversing the incoming and outgoing edge lists (which 
correspond to columns and rows in the adjacency matrix respectively) of the node immediately 
preceding the cut. This takes 0 ( E)  time on the actual representation, only visiting each edge 
twice -  once to add it and once to subtract it.
A new distribution is selected for each of the resulting phases inheriting any unspecified 
distributions (due to an array not appearing in a subphase) from the parent phase. The process is
47
Figure 4.7: Partitioned communication graph for ADI2D 
(Statement numbers correspond to Figure 4.3.)
then continued recursively using the costs from the newly selected distributions corresponding 
to each subphase. As shown in Figure 4.4, each level of the recursion is carried out in branch 
and bound fashion such that a phase is split only if the sum of the estimated execution times of 
the two resulting phases shows an improvement over the original.4 In Figure 4.7, the partitioned 
communication graph is shown for ADI2D after the phase decomposition is completed.
As mentioned in the cut algorithm, it is also possible to find several cuts which all have the 
same maximum value and nesting level forming a window over which the cut can be performed. 
This can occur since not all statements generate communication resulting in either edges with 
zero cost or regions over which the propagated costs conserve edge flow, both of which will 
maintain a constant cut value. To handle cut windows, the phase should be split into two sub­
phases such that the lower subphase uses the earliest cut point and the upper subphase uses the 
latest, resulting in overlapping phases. After new distributions are selected for each overlapping 
subphase, the total cost of executing the overlapped region in each subphase is examined. The 
overlap is then assigned to the subphase that resulted in the lowest execution time for this region. 
If they are equivalent, the overlapping region can be equivalently assigned to either subphase.
4 A further optimization can also be applied to bound the size of the smallest phase that can be split by requiring 
its estimated execution time to be greater than a “minimum cost” of redistribution.
48
Currently, this technique is not yet implemented for cut windows. We instead always select the 
earliest cut point in a window for the partitioning.
To be able to bound the depth of the recursion without ignoring important phases and dis­
tributions, the static partitioner must also obey the following property. A partitioning technique 
is said to be monotonic if it selects the best available partition for a segment of code such that 
(aside from the cost of redistribution) the time to execute a code segment with a selected distri­
bution is less than or equal to the time to execute the same segment with a distribution that is 
selected after another code segment is appended to the first. In practice, this condition is satis­
fied by the static partitioning algorithm that we are using. This can be attributed to the fact that 
conflicts between distribution preferences are not broken arbitrarily, but are resolved based on 
the costs imposed by the target architecture [57].
It is also interesting to note that if a cut occurs within a loop body, and loop distribution 
can be performed, the amount of redistribution can be greatly reduced by lifting it out of the 
distributed loop body and performing it in between the two sections of the loop. Also, if depen­
dencies allow statements to be reordered, statements may be able to move across a cut boundary 
without affecting the cost of the cut while possibly reducing the amount of data to be redis­
tributed. Both of these optimizations can be used to reduce the cost of redistribution but neither 
will be examined in this thesis.
4.3.4 Phase and phase transition selection
After the program has been recursively decomposed into a hierarchy of phases, a Phase 
Transition Graph (PTG) is constructed. Nodes in the PTG are phases resulting from the decom­
position while edges represent possible redistribution between phases as shown in Figure 4.8(a). 
Since it is possible that using lower level phases may require transitioning through distributions 
found at higher levels (to keep the overall redistribution costs to a minimum), the phase tran­
sition graph is first sectioned across phases at the granularity of the lowest level of the phase
49
Start
LevelO ' Leviel 1 " Level 2
m 1xP
00
m Px1 □
0 0 0
lli jil 0—
iv.vfrr:;
\k 1xP
p a L .
IV
0 h 0
VI VI VI
0 H 0
0 0 0
\k Px1 A
0 s
XI XI
0 H|
Stop
(a) Initial phase transition graph
(b) After performing loop peeling 
Figure 4.8: Phase transition graph for ADI2D
50
decomposition5 Redistribution costs are then estimated [84] for each edge and are weighted by 
the execution count of the surrounding code.
If a redistribution edge occurs within a loop structure, additional redistribution may be in­
duced due to the control flow of the loop. To account for a potential “reverse” redistribution 
which can occur on the back edge of the iteration, the phase transition graph is partitioned around 
the loop body and the first iteration of loop containing the phase transition is peeled off and the 
phases of the first iteration of the body re-inserted in the phase transition graph as shown in Fig­
ure 4.8(b). Redistribution within the peeled iteration is only executed once while that within the 
remaining loop iterations is now executed (N  -  1) times, where N  is the number of iterations 
in the loop. The redistribution, which may occur between the first peeled iteration and the re­
maining iterations, is also multipled by (N  — 1) in order to model when the back edge causes 
redistribution (i.e., when the last phase of the peeled iteration has a different distribution than 
the first phase of the remaining one).
Once costs have been assigned to all redistribution edges, the best sequence of phases and 
phase transitions is selected by computing the shortest path on the phase transition graph. This 
is accomplished in 0 { V 2) time (where V  is now the number of vertices in the phase transition 
graph) using Dijkstra’s single source shortest path algorithm [96].
After the shortest path has been computed, the loop peeling performed on the PTG can be 
seen to have been necessary to obtain the best solution if the peeled iteration has a different tran­
sition sequence than the remaining iterations. Even if the peeled iteration does have different 
transitions, not actually performing loop peeling on the actual code will only incur at most one 
additional redistribution stage upon entry to the loop nest. This will not overly affect perfor­
mance if the execution of the entire loop nest takes significantly longer than a single redistri­
bution operation, which is usually the case especially if the redistribution considered within the 
loop was actually accepted when computing the shortest path.
Using the cost models for an Intel Paragon and a Thinking Machines CM-5, the distribu­
tions and estimated execution times reported by the static partitioner for the resulting phases
5 Sectioned phases that have identical distributions within the same horizontal section of the PTG are actually 
now redundant and can be removed, if desire'd, without affecting the quality of the final solution.
51
Table 4.3: Detected phases and estimated execution times (sec) for ADI2D 
(Performance estimates correspond to 32 processors.)
Level 0 
Level 1
Level 2
described as ranges of operational phases are shown in Table 4.3. The performance parame­
ters of the two machines are similar enough that the static partitioning actually selects the same 
distribution at each phase for each machine. The times estimated for the static partitioning are 
slightly higher than those actually observed, resulting from a conservative assumption regarding 
pipelines6 made by the static cost estimator [8], but they still exhibit similar enough performance 
trends to be used as estimates. For both machines, the cost of performing redistribution is low 
enough in comparison to the estimated performance gains that a dynamic distribution scheme 
is selected, as shown by the shaded area in Figure 4.8(b).
Pseudo-code for the dynamic partitioning algorithm is presented in Figure 4.9 to briefly 
summarize both the phase decomposition and phase transition selection procedures as described. 
As distributions for a given phase are represented as a set7 of variables, each of which having 
an associated distribution, a masking union set operation is used to inherit unspecified distribu­
tions (dist |o) disti). A given variable’s distribution in the dist set will be replaced if it also has 
a distribution in the disti set thus allowing any unspecified distributions in subphase i (disti) to 
be inherited from its parent (dist).
Since the use of inheritance during the phase decomposition process implicitly maintains 
the coupling between individual array distributions, redistribution at any stage will only affect
6 Initially, a BLOCK, BLOCK distribution was selected by the static partitioner for (only) the first step of the phase 
decomposition. As the static performance estimation framework does not currently take into account any over­
lap between communication and computation for pipelined computations we decided that this decision was due 
to the conservative performance estimate. For the analysis presented for ADI2D, we bypassed this problem by 
temporarily restricting the partitioner to only consider 1-D distributions.
7As will be more rigorously defined later in Section 5.1.3.
Op. Phases(s) Distribution Intel Paragon TMC CM-5
I-XII * ,BLOCK 1 x 32 22.151461 39.496276
I-VIII
IX-XII
* ,BLOCK 1 x 32 
BLOCK,* 32 x 1
1.403644
0.602592
2.345815
0.941550
I-HI
iv -v m
BLOCK,* 32 x 1 
* ,BLOCK 1 x 32
0.376036
0.977952
0.587556
1.528050
52
Partition (p r o g r a m )
1 c u t l i s t  <— 0
2 d i s t  < -  Static-Partitioning ( p r o g r a m )
3 p h a s e s  <— DECOMPOSE-PHASE( p r o g r a m ,  d i s t , c u t l i s t )
4 p t g  < -  Select-Redistribution (phases, c u t l i s t )
5 Assign distributions based on shortest phase recorded in p t g
DECOMPOSE-PHASE ( p h a s e ,  d i s t , c u t l i s t )
1 Add p h a s e  to list of recognized phases
2 Construct the communication graph for the p h a s e
3 c u t  f -  Max-Cu t(p/mse)
4 if VALUE(cw^) =  0 0  No communication in p h a s e
5 then return
6 Relocate c u t  to highest nesting level of identical cuts
7 p h a s e  1, p h a s e 2 p h a s e
8 t> Note: if c u t  is a window, p h a s e  ^  and p h a s e 2 will overlap
9 d i s h  < -  Static-Partitioning ( p h a s e  A
10 dist2  < -  Static-Partitioning (phase2)
11 t> Inherit any unspecified distributions from parent
12 d i s t j  <— d i s t  |o) d i s t \
13 d i s t £  <— d i s t  |oj d i s t 2
14 if (cost(p/iasei) +  c o s t ( p h a s e 2 ) )  < cost ( p h a s e )
15 then t> If c u t  is a window, p h a s e  ^ and p h a s e 2 overlap
16 if LAST_STMTNUM(p/msei) > FIRST_STMTNUM(p/mse2)
17 then RESOLVE-OVERLAP(c«£,phasel ,phase2)
18 L ist-Insert(cui, c u t l i s t )
19 p h a s e ^ l e f t  =  DECOMPOSE-PHASE( p h a s e u  d i s t u  c u t l i s t )
20 p h a s e - b r i g h t  = DECOMPOSE-PHASE(p/iase9, d i s t 2 , c u t l i s t )
21 else p h a s e - b l e f t  =  NULL
22 p h a s e - b r i g h t  — NULL
23 return ( p h a s e )
Select-Redistribution  (jp h a s e s , c u t l i s t )
1 if c u t l i s t  =  0
2 then return
3 p t g  < -  CONSTRUCT-PTG( p h a s e s ,  c u t l i s t )
4 Divide p t g  horizontally at the recursion lowest level
5 for each l o o p  in p h a s e s
6  do if l o o p  contains a cut at its nesting level
7 then Divide p t g  at l o o p  boundaries
8 Peel ( l o o p ,  p t g )
9  Estimate the interphase redistribution costs for p t g
10 Compute the shortest phase transition path on p t g
11 return ( p t g )
Figure 4.9: Pseudo-code for the partitioning algorithm
53
the next stage. This can be contrasted to the technique proposed by Bixby, Kennedy, and Kre- 
mer [66] which first selects a number of partial candidate distributions for each phase specified 
by the operational definition. Since their phase boundaries are chosen in the absence of flow 
information, redistribution can affect stages at any distance from the current stage. This causes 
the redistribution costs to become binary functions depending on whether or not a specific path 
is taken, therefore, necessitating the need for 0-1 integer programming. In [67] they do agree, 
however, that 0-1 integer programming is not necessary when all phases specify complete dis­
tributions (such as true in our case). In their work, this occurs only as a special case in which 
they specify complete phases from the innermost to outermost levels of a loop nest. For this 
situation they show how the solution can be obtained using a hierarchy of single source shortest 
path problems in a bottom-up fashion (as opposed to solving only one shortest path problem 
after performing a top-down phase decomposition as in our approach).
Up until now, we have not described how to handle control flow other than for loop con­
structs. More general flow (caused by conditionals or branch operations) can be viewed as sep­
arate paths of execution with different frequencies of execution. The same techniques that have 
been used for scheduling assembly level instructions by selecting traces of interest [97] or form­
ing larger blocks from sequences of basic blocks [98] in order to optimize the most frequently 
taken paths can also be applied to the phase transition. Once a single trace has been selected 
(using profiling or other criteria) its phases are obtained using the phase selection algorithm 
previously described but ignoring all code off the main trace. Once phases have been selected, 
all off-trace paths can be optimized separately by first setting their stop and start nodes to the 
distributions of the phases selected for the points at which they exit and re-enter the main trace. 
Each off-trace path can then be assigned phases by applying the phase selection algorithm to 
each path individually. Although this specific technique is not currently implemented in the 
compiler, but will be addressed in future work, other researchers have also been considering it 
as a feasible solution for selecting phase transitions in the presence of general control flow [99].
54
4.3.5 Hierarchical Component Affinity Graph (HCAG)
To further extend the partitioning algorithms to perform interprocedural analysis as well as 
facilitate the computation of static alignments and communication cost estimates within a spec­
ified region of the program, we have also developed a hierarchical form of the extended compo­
nent affinity graph (CAG) [16] as currently used in the static partitioner [57]. Since alignment 
preferences are only affected by the relationship of the subscript expressions between the left- 
and right-hand sides of an assignment statement, this information does not change with respect 
to the current distribution under consideration. Rather than computing an entire affinity graph 
for each new phase revealed during the partitioning process, the HCAG can be computed once 
for the entire program and reused throughout the analysis. This is achieved by precomputing 
intermediate alignment graphs and storing them for later use within the levels created by com­
pound nodes (e.g., ‘loop’ or ‘if’ statements) in the program’s abstract syntax tree (AST).
The HCAG is constructed in a single recursive traversal of the AST performing these oper­
ations:
(1) pre-order: record individual alignment preferences for the statement
(2) in-order: store the current state of the affinity graph and then set it to NULL (for com­
pound nodes)
(3) post-order: merge the statement’s alignment preferences with the current state of the 
affinity graph.
The graph that is stored at the root of the AST corresponds to the CAG for the entire program 
whereas each compound statement header will contain the CAG summarizing all statements 
within the corresponding subtree. Now, all that is required to obtain the CAG for a given phase 
is to simply merge the HCAG components from only the topmost level of the region of the AST 
containing the given phase. This saves a considerable amount of time which would otherwise 
be required to obtain the CAG when examining each new phase.
Furthermore, by also recursively traversing on the call graph for function and subroutine 
calls encountered during the traversal, the HCAG can also be used to perform interprocedu­
55
ral alignment analysis. Upon encountering the first call site for a given function, the HCAG is 
computed for the function as previously described and stored retaining all parameters symbol­
ically in the alignment cost expressions. At all following call sites, the stored HCAG is simply 
retrieved and the current values of the parameters substituted in the cost expressions. This tech­
nique ignores any interprocedural side effects but, unless the access patterns are highly side- 
effect dependent, it will still accurately model access patterns across function calls.
In Figure 4.10, the HCAG is shown for ADI2D. For clarity, only a few of the alignment costs 
are labeled and only the statements corresponding to loop headers are shown. If all statements 
were shown in the AST, statements contained within a loop body would appear as a connected 
sequence of right children attached as the left child of the loop header. In the AST, the body of 
a compound statement is attached as the left child while the right child follows statements in se­
quence. In the figure, intermediate affinity graphs are illustrated using a small inset at each level 
of the HCAG while the different nesting levels of the HCAG are indicated by shaded outlines. 
The affinity graphs represent each dimension of an array as a separate node while alignment 
preferences (with an associated cost which is incurred if the alignment is not obeyed) connect 
different dimensions of the arrays.
One thing to notice for ADI2D is how most of the intermediate affinity graphs are identical 
at the two innermost levels of the HCAG. This is due to the fact that the innermost two loops in 
ADI2D are almost always perfectly nested causing the outer loop to not contribute any further 
alignment information than that already obtained from within the inner loop. The structures of 
the top two levels of this HCAG are also identical since the preferences of the two intermediate 
graphs merged to form the topmost level were already present in both subgraphs. It is important 
to note that even though the structure is the same for the top two levels, the alignment costs are 
actually different. Even though there aren’t any alignment conflicts in this example, one last 
observation to make is how useful the HCAG can be for analyzing code in which an alignment 
conflict may exist. If a conflict exists when considering a large portion of the program but dis­
appears when smaller regions are examined, appropriate alignments can be efficiently obtained 
for each of the competing regions by using the HCAG in combination with the techniques de­
scribed in the previous section.
56
(P2  P i
0 - 0
o— o
>  l ) T r a n s f e r ( ^ ^ ) ) / 2
Array: u uh b
M2))/2 dim 1: C k ^ O ^ O
dim 2: p 4 3 5 ^ Q _
: p r e f  i { u \ , u h i )
p r e f 8 ( u i , u h i )  +  p r e f l 5 ( u i , u h i )
pref15(uu uhi)
p r e f 16( u i , u h i )  4- p r e f  4 0 { u i , u h \ )  
+ p r e f  46 ( u i , u h i )
(100 • P 2 { P \  >  l)T ransfer(
- 1 0 0 - 2 ( P 2 >  l ) S h ; f t ( ^ ) ) / 2
pref 4&{u\,uh\)
( P 2 >  l ) T r a n s f e K ^ )
pref 4i,4o(ui ’ uhi)
■ (100 P2(Pi > 1 
| —100 • 2 ( P 2 >  1
Figure 4.10: Hierarchical Component Affinity Graph (HCAG) for ADI2D 
(Statement numbers correspond to Figure 4.3)
57
As there are still other implementation issues within the static partitioner which remain to 
be resolved before it can perform interprocedural partitioning, the HCAG has not been imple­
mented in the compiler. One possibility in particular, is to integrate this data structure with the 
Hierarchical Task Graph (HTG) [100] already constructed by Parafrase-2. As this would give us 
the opportunity to study techniques for automatically selecting both data and task distributions 
in a unified framework, this is beyond the scope of this thesis and will be the focus of future 
work in this area.
4.4 Summary
In this chapter we have described the techniques we have developed for optimizing data 
distributions for distributed-memory multicomputers. First, the program is recursively decom­
posed into a hierarchy of candidate phases. Then, taking into account the cost of redistributing 
the data between the different phases, the most efficient sequence of phases and phase transitions 
is selected. By basing these techniques upon the existing algorithms previously implemented 
in PARADIGM to obtain static data distributions, the synergy of the combined approach allows 
us to select either static or dynamic data distributions in unified framework. An experimental 
evaluation of the resulting data distributions will be presented later in Chapter 6.
58
CHAPTER 5
OPTIMIZING DATA REDISTRIBUTION
In this chapter, we present an interprocedural data-flow framework which provides a means 
to convert between a distribution-based representation (as specified by the data partitioner and 
required for compilation) and redistribution-based form (as specified by HPF) in order to output 
the selected distributions in the form of an HPF program. Initially, this was all that we set out 
to do, but quickly discovered that it was also possible to use the same framework in reverse to 
convert a redistribution-based representation (an HPF program) into a distribution-based form. 
Not only is this necessary step in order to compile dynamic HPF programs, but as will be seen in 
this chapter, it is also possible to optimize the amount of redistribution that must be performed 
during this transformation by removing unnecessary redistribution specified by the programmer 
as well as that due to the HPF semantics of function calls. As will described in more detail 
in Section 5.2.3, this framework also allows us to convert certain dynamic HPF programs into 
equivalent static versions through a process we refer to as Static Distribution Assignment (SDA) 
providing dynamic HPF support for existing subset HPF compilers.
In HPF (as previously mentioned in Chapter 1), dynamic distributions can be described ex­
plicitly using executable redistribution directives (REDISTRIBUTE or REALIGN, which specify 
where new distributions become active) or implicitly by calling functions (which require dif­
ferent data distributions than the calling function). To actually compile an HPF program into 
an efficient form, however, both the redistribution operations as well as the possible distributions 
for the individual blocks of code must be known. Since the HPF REDISTRIBUTE and REALIGN 
directives only specify redistribution information, both the intra- and interprocedural control
59
flow through a program must be examined to decide exactly which redistribution operations 
were last performed in order to determine which distributions are active at any given point in 
the program.
For interprocedural control flow (function calls), the distribution of the actual arguments 
which appear in a call to a function will not necessarily match the distribution of the dummy 
arguments within the function. In HPF, several different forms of distribution directives for 
dummy arguments are provided [2]. The distribution directive for a dummy argument may be:
prescriptive - The distribution prescribes the mapping of the dummy argument. Passing an 
actual argument with a different distribution will require redistributing the argument.
transcriptive - The distribution of the dummy argument is inherited or transcribed from the 
actual argument.
descriptive - The distribution describes the mapping of the dummy argument with the claim 
that no redistribution will take place. If an interface is present in the calling function, then 
this becomes a prescriptive directive.
Both prescriptive as well as descriptive arguments (with a provided interface) can have the 
same effect as explicit REDISTRIBUTE or REALIGN directives whereas transcriptive arguments 
provide a mechanism for writing functions which can accept any type of distribution. As purely 
descriptive directives (without an interface) do not require any additional support from an HPF 
compiler, they will not be discussed further in this thesis.
The basic motivation for the framework presented in this thesis is best summarized in the 
following excerpt from the HPF language specification [5]:
An overriding principle is that any mapping or remapping o f arguments is not vis­
ible to the caller This is true whether such remapping is implicit (in order to con­
form to prescriptive directives, which may themselves be explicit or implicit) or ex­
plicit (specified by REALIGN or REDISTRIBUTE directives). When the subprogram 
returns and the caller resumes execution, all objects accessible to the caller after 
the call are mapped exactly as they were before the call. It is not possible for a sub­
program to change the mapping of any object in a manner visible to its caller, not 
even by means o f REALIGN and REDISTRIBUTE.
60
Sequential Fortran 77 program Dynamic HPF program 
n o  d is tr ib u tio n  o r  re d is trib u tion  e xp lic it a n d  im p lic it re d is trib u tion
Optimized static HPF program Optimized dynamic HPF program 
e x p lic it d is tr ib u tio n  a n d  re d is trib u tion  e xp lic it re d is trib u tio n
Figure 5.1: Overview of the array redistribution data-flow analysis framework 
(shaded areas indicate components described in this chapter)
To provide the correct semantics in the presence of implicit and explicit redistributions (or 
remappings) and generate efficient code, it is necessary for the compiler to perform interproce­
dural array data-flow analysis. The compiler should also generate efficient code avoiding the 
use of either a simple copy-in/copy-out strategy or a complete redistribution of all arguments 
upon every entry and exit of a function. As will be shown, it is possible to optimize such inter­
procedural redistribution simultaneously with intraprocedural redistribution operations within 
the framework presented in this thesis.
5.1 Data Redistribution Analysis
An overall view of the array redistribution data-flow analysis framework we have developed 
is shown in Figure 5.1. In addition to serving as a back end to the automatic data partitioning 
system [101] (described in Chapter 4), the framework is also capable of analyzing (and opti­
61
mizing) existing HPF programs providing a mechanism to generate fully explicit dynamic HPF 
programs [102].
The intermediate form of a program within the framework is one in which the distribution of 
every array at every point in the program as well as the redistribution required to move from one 
point to the next are explicitly known. The different paths through the framework involve passes 
which process the available distribution information in order to obtain the missing information 
required to move from one representation to another.
In HPF, dynamic distributions are described by specifying the transitions between differ­
ent distributions (through explicit redistribution directives or implicit redistribution at function 
boundaries). After converting dynamic HPF programs (which are well behaved -  those in which 
every use or definition of an array only has one reaching redistribution) to a fully static version, 
through a process we call static distribution assignment (SDA), which will be explained later in 
this thesis, both the redistribution and distribution are explicitly specified. With this framework, 
it is also possible to convert arbitrary HPF programs into an optimized HPF program containing 
only explicit redistribution directives and descriptive function arguments. The data partitioner, 
on the other hand, explicitly assigns different distributions to individual blocks of code serv­
ing as an automated mechanism for converting sequential Fortran programs into efficient HPF 
programs. In this case, the framework can be used to synthesize explicit interprocedural redis­
tribution operations in order to preserve the meaning of what the data partitioner intended while 
using HPF semantics.
As will be shown in the remainder of this thesis, by analyzing the different program repre­
sentations using this framework, it is possible to automatically:
(1) determine which distributions hold over specific sections of a program when given redis­
tribution directives
(2) optimize both the inter- and intraprocedural transitions between dynamic distributions 
while still maintaining the original semantics of the HPF program when given distribu­
tions for each point in the program
62
(3) determine when the distribution pattern specified by an HPF program causes a given array 
to be assigned multiple distributions due to different redistribution operations on multiple 
paths within a function or as a result of parameter aliasing (resulting in a non-conforming 
HPF program)
(4) convert (well-behaved) dynamic HPF programs into equivalent static forms through a pro­
cess we refer to as static distribution assignment (SDA), which can be used to extend the 
capabilities of existing subset HPF compilers.
The core of this analysis is built upon two separate interprocedural data-flow problems which 
perform:
•  distribution synthesis (Section 5.2.1), and
•  redistribution synthesis (Section 5.2.2).
These two data-flow problems are both based upon the problem of determining both the 
inter- and intraprocedural reaching distributions for a program. Before giving further details 
of how all of these transformations are accomplished through the use of these two data-flow 
problems, we will first describe the idea of reaching distributions and the basic representations 
we use to perform this analysis.
5.1.1 Computing reaching distributions
The problem of determining which distributions reach any given point taking into account 
control flow in the program is very similar to the computation of reaching definitions (a forward 
data-flow problem):
A definition d is said to reach a point p if there is a path in the control flow graph 
(CFG) from the point immediately following d to p, such that d is not killed along 
that path by another definition [73].
In classic compilation theory a control flow graph consists of nodes (basic blocks) represent­
ing uninterrupted sequences of statements and edges representing the flow of control between 
basic blocks. For determining reaching distributions, an additional restriction must be added to
63
this definition. Not only should each block B  be viewed as a sequence of statements with flow 
only entering at the beginning and leaving at the end, but the data distribution for the arrays 
defined or used within the block is also not allowed to change. In comparison to the original 
definition of a basic block, this imposes tighter restrictions on the extents of a block. Using this 
definition of a block in place of a basic block results in what we refer to as the distribution flow 
graph (DFG).
Using this view of a block in a DFG and by viewing array distributions as definitions, the 
same data-flow framework used for computing reaching definitions [73] can now be used to 
obtain the reaching distributions by defining the following sets for each block in a function:
• DIST(B) - set of distributions present when executing block B
•  REDIST(-B) - set of redistributions performed upon entering block B
•  GEN(B) - set of distributions generated by executing block B
•  KILL(S) - set of distributions killed by executing block B
•  IN(F?) - set of distributions that exist upon entering block B
•  OUT(B) - set of distributions that exist upon leaving block B
•  DEF(B), USE(B) - set of variables defined or used in block B
It is important to note that GEN and KILL are specified as the distributions generated or 
killed by executing block B  as opposed to entering (redistribution at the head of the block) or 
exiting (redistribution at the tail of the block) in order to allow both forms of redistribution. 
GEN and KILL are initialized by DIST or REDIST (depending on the current applications as 
described in Section 5.2) and may be used to keep track of redistributions that occur on entry 
(e.g., HPF redistribute directives or functions with prescriptive distributions) or exit (e.g., calls 
to functions which internally change a distribution before returning). To perform interprocedu­
ral analysis, the function itself also has IN and OUT sets, which contain the distributions present 
upon entry and summarize the distributions for all possible exits.
Once the sets have been defined, the following data-flow equations are iteratively computed 
for each block until the solution OUT(B) converges for every block B  (where PRED(B) are
64
the nodes which immediately precede B  in the flow of the program):
IN(B) =  (J OUT(P) (5.1)
p e PRED(b)
OUT(B) =  GEN(B) [J (IN(B) -  KILL(B)) (5.2)
Since the confluence operator is a union, both IN and OUT never decrease in size and the 
algorithm will eventually halt. By processing the blocks in the flow graph in a depth-first order, 
the number of iterations performed will roughly correspond to the level of the deepest nested 
statement, which tends to be a fairly small number on real programs [73].
As can be seen from Eqs. (5.1) and (5.2), the DEF and USE sets are actually not used to 
compute reaching distributions, but will have other uses for optimizing redistribution which will 
be explained in more detail in Section 5.2.2).
5.1.2 Constructing the distribution flow graph
(a) Distribution split (b) Redistribution split
Figure 5.2: Splitting CFG nodes to obtain DFG nodes
Since the definition of the DFG is based upon the CFG, the CFG can be easily transformed 
into a DFG by splitting basic blocks at points at which a distribution changes as shown in Fig­
ure 5.2. This can be due to an explicit change in distribution, as specified by the automatic data 
partitioner, or by an actual HPF redistribution directive. If the change in distribution is due to 
a sequence redistribution directives, the overall effect is assigned to the block in which they
65
are contained; otherwise, a separate block is created whenever executable operations are inter­
spersed between the directives.
When splitting a basic block, the existing control flow information can be updated locally to 
avoid computing the entire distribution flow graph from scratch. For reasons which simplify its 
implementation, SPLIT-NODE (B, stmt) will actually create B 2 while modifying B  to become 
B i . The statements now included in B 2 still have pointers to the original flow graph node in 
which they are contained which must be updated, but nothing must be done to the statements 
included in B\. This avoids handling the special case of creating a new head when splitting the 
first basic block in a function (since the head never changes when B  is modified to become Bi).
A separate function Split-Node-After is also provided to split a DFG node after a given 
statement. Both versions Split-Node and Split-Node-After only split the node if there is 
actually a statement before or after (respectively) the statement of interest.
5.1.3 Representing distribution sets
To represent a distribution set in a manner that would provide efficient set and comparison 
operations, the bulk of the distribution information associated with a given variable is stored in 
its symbol table entry as a distribution table, and bit vectors are used within the sets to specify 
distributions which were currently active for a given variable. Since a separate symbol table en­
try is created for each variable within a given scope, this provides a clean interface for accepting 
distributions from the HPF front end [7] as any distribution present within a function due to re­
distribution can be added to the table corresponding to the current scope of the variable while 
inserting the directives into the abstract syntax tree.
As shown in Figure 5.3, the actual distribution sets are maintained as linked lists with a sep­
arate node representing each variable with a bit vector (corresponding to the entries in the distri­
bution table for that variable) to indicate which distributions are currently active for the variable. 
To maintain sufficient efficiency while still retaining the simplicity of a list, the list is always 
maintained in sorted order by the address of the variable’s entry in the symbol table in order to 
facilitate comparison and set operations between sets. This allows us to implement operations
66
B c A
010. . .011001 o^lo.. .oooioj. 000...010010n
Distribution Set
Figure 5.3: Distribution set using bit vectors
on two sets by merging them in only 0 (n )  comparison or combining operations (where n is 
the length of the list). The operation could also involve a comparison or combination of two 
constant length bit vectors1 to maintain the distribution information with 0 (1 ) complexity.
Since these sets are now actually sets of variables each containing a set representing their 
active distributions, SET„ar will be used to specify the variables present in a given distribution 
set SET. For example, the notation SETvar can be used to indicate the inverse of the distributions 
for each variable contained within the set as opposed to an inverse over the universe of all active 
variables (which would be indicated as SET).
In addition to providing full union, intersection, and difference operations (U, f)» —) which 
operate on both levels of the set representation (between the variable symbols in the sets as well 
as between the bit vectors of identical symbols) masking versions of these operations (|o), |o), 
are also provided which operate at only the symbol level. In the case of a masking union (alo)6), a 
union is performed at the symbol level such that any distributions for a variable appearing in set 
a will be replaced by distributions in set b. This allows new distributions in b to be added to a set 
while replacing any existing distributions in a. Masking intersections (a |o| b) and differences 
(a b) act somewhat differently in that the variables in set a are either selected or removed 
(respectively) by their appearance in set b. These two operations are useful for implementing 
existence operations (e.g., avar\bvar = a (o| b, avar\bvar = a & b).
1Bit vector manipulation routines can also be written to maintain bit vectors with a variable length [103], but 
to simplify the initial implementation they are fixed to 32 bits (therefore allowing 32 different distributions to be 
present for any given symbol).
67
5.2 Interprocedural Data Redistribution Analysis
Since the semantics of HPF require that all objects accessible to the caller after the call are 
distributed exactly as they were before the call, it is possible to first completely examine the 
context of a call before considering any distribution side effects due to the call. It may seem 
strange to say that there can be side effects when we just said that the semantics of HPF pre­
clude it. To clarify this statement, such side effects are allowed to exist, but only to the extent 
that they are not apparent outside of the call. As long as the view specified by the program­
mer is maintained, the compiler is allowed do whatever it can to optimize both the inter- and 
intraprocedural redistributions so long as the resulting distributions used at any given point in 
the program are not changed.
Referring back to Figure 5.1 the core of this analysis is built upon two separate interprocedu­
ral data-flow problems which perform distribution synthesis and redistribution synthesis. These 
two data-flow problems are based upon the problem of determining both the inter- and intrapro­
cedural reaching distributions for a program. An example call graph is shown in Figure 5.4 to 
help illustrate the flow of these two phases of the interprocedural analysis.
1
2
3
4
5
6
7
8 
9
10
(a) Top-down (pre-order) 
for calling context
1
2
3
4
5
6
7
8 
9
10
(b) Bottom-up (post-order) 
for side-effects
sub3 
| sub3* 
subl
sub3*
sub4
sub2
sub3*
sub4*
sub2*
main
Figure 5.4: Example call graph and depth-first traversal order
If distribution information is not present (i.e., HPF input), distribution synthesis is first per­
formed in a top-down manner over a program’s call graph to compute which distributions are
68
present at every point within a given function. By establishing the distributions that are present 
at each call site, the input distributions are obtained for each of the functions which it calls as 
well as causes a function to be cloned if a new set of input distributions is used. Redistribu­
tion synthesis is then applied in a bottom-up manner over the call graph to analyze where the 
distributions are actually used and synthesizing the redistribution required within a function.
Since this analysis is interested in the effects between an individual caller/callee pair, and 
not in summarizing the effects from all callers before examining a callee, it is not necessary to 
perform a topological traversal for the top-down and bottom-up passes over the call graph. In 
this case, it is actually more intuitive to perform a depth-first pre-order traversal of the call graph 
(shown in Figure 5.4(a)) to fully analyze a given function before proceeding to analyze any of 
the functions it calls and to perform a dept-first post-order traversal (shown in Figure 5.4(b)) to 
fully analyze all called functions before analyzing the caller.
One other point to emphasize is that these interprocedural techniques can be much more 
efficient than analyzing a fully inlined version of the same program since it is possible to prune 
the traversal at the point a previous solution is found for a function in the same calling context. In 
Figure 5.4, asterisks indicate points at which a function is being examined after having already 
been examined previously. If the calling context is the same as the one used previously, the 
traversal can be pruned at this point reusing information recorded from the previous context. 
Depending on how much reuse occurs, this factor can greatly reduce the amount of time the 
compiler spends analyzing a program in comparison to a fully inlined approach.
To summarize the process shown in Figure 5.1, the distribution synthesis pass
(1) determines the reaching distributions for every point in the function according to the in­
traprocedural control flow, and
(2) determines which input distributions will be present at each call site (calling context) 
cloning functions when necessary
while the redistribution synthesis pass
(1) restricts the distributions to those used at each point in the program, and
69
(2) optimizes redistributions by examining both the intraprocedural flow of distributions and 
interprocedural side effects of each function call in the caller’s context.
After both determining the reaching distributions and optimizing the required redistribution, the 
internal representation can then be converted into either a static or dynamic HPF program with 
fully explicit redistribution.
The algorithms for performing distribution synthesis will be presented in Section 5.2.1 while 
redistribution synthesis will be presented in Section 5.2.2. The static HPF conversion, known as 
static distribution assignment (SDA), is covered in detail in Section 5.2.3. Since the dynamic 
HPF conversion only entails generating redistribution directives based on the contents of the 
REDIST sets, it will not be discussed further in this section.
5.2.1 Distribution synthesis
When analyzing HPF programs, it is necessary to first perform distribution synthesis in order 
to determine which distributions are present at every point in a program. Since HPF semantics 
specify that any redistribution (implicit or explicit) due to a function call is not visible to the 
caller, each function can be examined independently of the functions it calls. Only the input 
distributions for a given function and the explicit redistribution it performs have to be considered 
to obtain the reaching distributions for a function.
Given an HPF program, nodes (or blocks) in its DFG are delimited by the redistribution 
operations which appear in the form of HPF REDISTRIBUTE or REALIGN directives. As shown 
in Figure 5.5, the redistribution operations assigned to a block B  represent the redistribution that 
will be performed when entering the block on any input path (indicated by the set REDIST(B)) 
as opposed to specifying the redistribution performed for each incoming path (REDIST(S, S i) 
or REDIST(S, B 2) in the figure).
If the set GEN(S) is viewed as the distributions which are generated and KILL(S) as the 
distributions which are killed upon entering the block, this problem can now be cast directly 
into the reaching distribution data-flow framework by making the following assignments:
70
Figure 5.5: Distribution synthesis 
(converting redistributions to distributions)
Data-flow initialization:
REDIST(B) =  from directives DIST(B) =  0
GEN(B) =  REDIST(B) KILL(B) =  REDIST„ar(B)
OUT(B) =  REDIST(B) IN(B) =  0
Data-flow solution:
DIST(B) =  OUT (B)
The full interprocedural algorithm for converting redistribution into distribution informa­
tion is shown in Figure 5.6. This algorithm recurses at line 20 in R2D-FUNC to perform the 
top-down pruning traversal previously shown in Figure 5.4(a) computing the data-flow solution 
for each function at each step in the traversal. Before examining calls made within a function, 
TN(function) is computed at line 19 in R2D-FUNC for both the parameters as well as all of the 
globals used within the function call.2 The distributions of all of the actual parameters are ini­
tially assumed to be inherited and are masked with any prescriptive distributions that may exist 
for the function. One final point to make is that even though functions are cloned during the 
top-down traversal, they are actually recorded at line 22 as each call site is examined after the 
traversal of the given function is complete.
2The sumf cn pass in Parafrase-2 [6] is first run to summarize all global references made within every function.
71
REDIST2DIST [ p r o g r a m )
1 f u n c t i o n  <— MAIN.FUNCTION( p r o g r a m )
2 R2D-FUNc ( f u n c t i o n ,  STATIC-DIST( f u n c t i o n ) )
R2D-FUNc ( f u n c t i o n ,  i n d i s t )
1 if a solution already exists for i n d i s t
2 then return f u n c t i o n
3 else f u n c t i o n  CLONE ( f u n c t i o n )
4 t> Obtain dataflow solution for f u n c t i o n
5 R2D-INIT( f u n c t i o n ,  i n d i s t )
6 repeat
7 for each n o d e  in FLOW-GRAPH (/imctaon)
8 do update IN/OUT data-flow equations for n o d e
9 until convergence
10 >  Record DIST solution
11 for each n o d e  in FLOW-GRAPH(/wnc/icm)
12 do DIST ( n o d e )  OUT( n o d e )
13 OUT ( f u n c t i o n )  i — DIST(STOP JSfODE ( f u n c t i o n ) )
14
15 O Perform top-down traversal over call graph
16 for each c a l l  in C ALL .LIST ( f u n c t i o n )
17 do i n d i s t  «- (actual parameters mapped to dummy parameters
18 +  required globals) from OUT(NODE(ca/0)
19 i n d i s t  <— i n d i s t  (oj PRESCRIBED _DIST(FUNCTION (c a l l ))
20 c l o n e  <— R2D-FUNC( c a l l ,  i n d i s t )
21 if FUNCTION(ca/Z) ^  c l o n e
22 then FUNCTION( c a l l )  -f- c l o n e  t> Record call to clone
23 return f u n c t i o n
R2D-INIT ( f u n c t i o n ,  i n d i s t )
1 reset IN/OUT ( f u n c t i o n )
2 IN ( f u n c t i o n )  4— i n d i s t
3 for each n o d e  in FLOW_GRAPH(/unc£«on)
4 do reset REDIST/DIST(node)
5 R2D-HPF-REDIST-INIT( f u n c t i o n )  > Add HPF redistribution
6 for each n o d e  in FLOW_GRAPH(/imc£ion)
7 doGEN(node) REDIST(node)
8 KILL(node) <- lNVERTVAR(GEN( n o d e ) )
9 IN(node) 0
10 OUT(node) <— GEN ( n o d e )
R2D-HPF-REDIST-INIT (f u n c t i o n )
1 for each HPF_DIR(sim£) in f u n c t i o n
2 do n o d e  <— FLOW-GRAPH ( s t m t )
3 if REDIST occurs after a normal statement in block
4 then n o d e  <— SPLIT-NODE ( n o d e ,  s t m t )
5 REDIST(node) «- REDIST(node) (gj (redist specified by HPF directive)
Figure 5.6: Interprocedural analysis for distribution synthesis
72
It is important to note that R2D-HPF-INIT processes executable HPF directives in the or­
der in which they appear in the program and adds them to any existing REDIST sets using a 
masking union operation. This captures the sequential ordering of the HPF directives, which is 
important if the programmer should (accidentally) redistribute the same array multiple times in 
a sequence of redistribution directives. Only the last distribution for a given sequence should be 
recorded in the REDIST set for the associated block. The overall effect of a sequence of consec­
utive redistribution directives is therefore assigned to the entry of the block. If the redistribution 
operation occurs after a normal statement in the block, however, a new DFG node is created.
According to the the HPF standard, a REALIGN operation only affects the array being re­
aligned while a REDISTRIBUTE operation should redistribute all arrays currently aligned to the 
given array being redistributed (in order to preserve any previous specified alignments). In the 
current implementation, R2D-HPF-INIT only records redistribution information for the array 
immediately involved in a REDISTRIBUTE operation. This results in only redistributing the ar­
ray involved in the directive and not all of the alignees of the target to which it is aligned. In 
the future, the implementation could be easily extended to support the full HPF interpretation 
of REDISTRIBUTE by simply recording the same redistribution information for all alignees for 
the target of the array involved in the operation.
5.2.2 Redistribution synthesis
After the distributions have been determined for each point in the program, the redistribution 
can be optimized. Instead of using either a simple copy-in/copy-out strategy or a complete re­
distribution of all arguments upon every entry and exit of a function, any implicit redistribution 
around function calls can be reduced to only that which is actually required to preserve the HPF 
semantics. Any unnecessary redistribution operations (implicitly specified by HPF semantics 
or explicitly specified by a programmer) that result in distributions which would not otherwise 
be used before another redistribution operation occurs are completely removed in this pass.
Blocks are now delimited by changes in the distribution set. As shown in Figure 5.7, the set 
of reaching distributions previously computed for a block B  represent the distributions which
73
Figure 5.7: Redistribution synthesis 
(converting distribution to redistribution)
are in effect when executing that block (indicated by the set DIST( ()£?)). By first restricting 
the D IST(£) sets to the variables defined or used within block B, a redistribution operation 
will only be performed between two blocks if there is an intervening definition or use of that 
variable before the next change in distribution.
This also has the effect of generating redistribution when it is needed (demand-driven, or 
lazy, redistribution), in order to ensure that the distribution is actually used in a different dis­
tribution before performing a redistribution. Although it will not be examined here it would 
also be possible to take a lazy redistribution solution and determine the earliest possible time 
that the redistribution could be performed (eager redistribution) in order to redistribute an ar­
ray when a distribution is no longer in use. The area between the eager and lazy redistribution 
points forms a window over which the operation can be performed to obtain the same effect. 
As will be shown later, it would be advantageous to position multiple redistribution operations 
in overlapping windows to the same point in the program in order to aggregate the commu­
nication thereby reducing the amount of communication overhead (as previously described in 
Section 3.3). As the lazy redistribution point is found using a forward data-flow (reaching dis­
tributions) problem, it would be possible to find the eager redistribution point by performing 
some additional bookkeeping to record the last use of a variable as the reaching distributions 
are propagated along the flow graph; however, such a technique is not currently implemented
74
in PARADIGM. In comparison to other approaches, interval analysis has also been used to de­
termine eager/lazy points for code placement, but at the expense of a somewhat more complex 
formulation [104].
If the set GEN(F?) is viewed as the distributions which are generated and KILL(B) as the 
distributions which are killed upon leaving block B, this problem can now be cast directly into 
the reaching distribution data-flow framework by making the following initializations:
data-flow initialization:
REDIST(B) =  0 DIST(B)
GEN(B) =  DIST(B) KILL(B)
IN(B) =  0 OUT(B)
DIST(B) fl (DEF(B) U USE(B)) 
DIST var(B)
DIST (B)
data-flow solution: 
R E D IS T (£ ,P ) =  
REDIST(B) =
DISTwar(£ ) | (OUTwar(P) -  DISTvar{B)) ^  0 (VP e  PRED(B))
U  REDIST(B, P) 
p e PRED(b)
DISTuar(£?) | (JNmr(B) -  DlSTvar(B)) + 0
We define GEN(B) and KILL(B) as the distributions which are generated or killed upon
leaving block B  due to the fact that we have chosen to use a caller redistributes model. As will be
seen later, this exposes many interprocedural optimization opportunities and is also necessary
to support function calls which may require redistribution on both their entry and exit. Since 
REDIST(B) is determined from both the DIST and IN sets, DIST(B) represents the distribu­
tions needed for executing block B, while the GEN, KILL sets will be used to represent the exit
distribution (which may or may not match DIST).
The full interprocedural algorithm for converting distribution to redistribution information 
is shown in Figure 5.8. It should be noted that this algorithm traverses the expanded call graph
that was formed by any cloning performed during distribution synthesis. This algorithm re­
curses at line 6 in D2R-FUNC to perform the bottom-up pruning traversal previously shown in 
Figure 5.4(b) computing the data-flow solution for each function at each step in the traversal. 
After all call sites have been examined for a function, both OUT ( c a l l )  and I N  ( c a l l )  are exam­
ined in D 2R-C a ll-In it  (shown in Figure 5.9) to determine if there was an implicit change in
75
DIST2REDIST [ p r o g r a m )
1 for each f u n c t i o n  in p r o g r a m  t> Initialize all functions as not yet visited
2 do VISITED ( f u n c t i o n )  4— FALSE
3 f u n c t i o n  4— MAIN3FUNCTION(progr a m )
4  D2R-FUNC ( f u n c t i o n )
D2R-FUNC ( f u n c t i o n )
1 if VISITED(/unc/ion) t> Check for previous DIST2REDIST solution
2 then return
3 else VISITED (/unction) TRUE
4 t> Perform bottom-up traversal over call graph
5 for each c a l l  in C ALL -LIST ( f u n c t i o n )
6 do D2R-FUNC( c a l l )
7
8 O Obtain dataflow solution for f u n c t i o n
9 D2R-INIT ( f u n c t i o n )
10 repeat
11 for each n o d e  in FLOW_GRAPH(/unciion)
12 do update IN/OUT data-flow equations for n o d e
13 until convergence
14 >  Record REDIST(node)
15 for each n o d e  in FLOW_GRAPH(/unction)
16 do REDIST(node) 4- DIST(node)|(lN(node) -  DIST(node)) ^  0
17 o u t d i s t  4- parameters and globals from STOP-NODE ( f u n c t i o n )
18 OUT ( f u n c t i o n )  4— o u t d i s t
D2R-INIT ( f u n c t i o n )
1 for each n o d e  in FLOW_GRAPH(/unction)
2 do reset REDIST,GEN/KILL/IN/OUT(node)
3 for each n o d e  in FLOW_GRAPH(/uncfion)
4 do D> Restrict distributions to DEF/USE for n o d e
5 DIST(node) 4- DIST(node) f | DEFUSE(node)
6 GEN ( n o d e )  4— DIST( node) t> Initialize GEN set
7 DlST-GROW-lNVARlANT ( f u n c t i o n )  t> Record invariants
8 D2R-INIT-CALL (f u n c t i o n ) 0  Record IN/OUT sets for each call
9 for each n o d e  in FLOW_GRAPH(/uncfion)
10 do KILL(node) 4- INVERTVar(GEN(node))
11 IN(node) 4— 0
12 OUT(node) 4- GEN(node)
Figure 5.8: Interprocedural analysis for redistribution synthesis
76
D2R-C ALL-INIT (f u n c t i o n )
1 for each c a l l  in CALL_LIST( f u n c t i o n )
2 do i n d i s t  (dummy parameters mapped to actuals
3 • +  globals) from IN(FUNCTION(c a l l ) )
4 o u t d i s t  <r- (dummy parameters mapped to actuals
5 +  globals) from OUT(FUNCTION(ca//))
6 n o d e  <- NODE ( c a l l )
7 s t m t  <— STMT ( c a l l )
8 if o u t d i s t  DIST(node)
9 then n e w n o d e  <— SPLIT-NODE-AFTER-STM T(node, s t m t )
10 GEN( node) GEN(node) (oj o u t d i s t
11 GEN [ n e w n o d e )  f-  DIST ( n e w n o d e )
12 i f  i n d i s t  $  DIST ( n o d e )
13 then if c a l l  occurs after a normal statement in block
14 then n o d e  <— SPLIT-NO DE(node, s t m t )
15 DIST(node) <- DIST(node) [oj i n d i s t
DlST-GROW-lNVARIANT ( f u n c t i o n )
1 for each n o d e  in f u n c t i o n
2  do TMPDIST( n o d e )  <- DIST( node)
3 for each s t m t  in f u n c t i o n
4  do D i s t - G r o w - S t m t ( s t m t )
5 for each n o d e  in f u n c t i o n
•6 do i n v d i s t  =  TMPDIST( node) | only one distribution per variable
7 if i n v d i s t  ^  0
8 then t> Add to existing DIST and GEN sets for node
9 DIST(node) DIST(node) (oj i n v d i s t
10 GEN(node) 4- GEN(node) |oJ DIST(node)
DIST-GROW-STMT ( s t m t )
1 n o d e  <— FTOWJjrRAPH(s£m£)
2 d i s t  <r- DIST ( n o d e )  1> distributions present at node
3 g e n d i s t  •*— GEN(node) D> distributions generated at node
4 if d i s t  7^  0 O Add distributions to parent statments
5 then while s t m t  PARENT(stmt)
6 do n o d e  «- FLOW_GRAPH( s t m t )
1  TMPDIST(node) <- TMPDISTf node) (J d i s t
8 TMPDIST(node) 4- TMPDIST( node) (J g e n
Figure 5.9: Initialization for redistribution synthesis
77
distribution at the call site. If so, the node containing the call site is appropriately split3 and the 
GEN and DIST sets for the node are masked by the OUT and IN distributions sets, respectively, 
as required for the call.
5.2.2.1 Optimizing invariant distributions
Besides performing redistribution only when necessary, it is also desirable to only perform 
necessary redistribution as infrequently as possible. An algorithm for growing semantically in­
variant distribution regions4 before synthesizing the redistribution operations is shown in Fig­
ure 5.9. This algorithm records all distributions that do not change within a nested statement 
(loop or if structures) on the parent statement (or header) for that structure. This has the effect 
of moving redistribution operations which result in the invariant distribution out of nested struc­
tures as far as possible. An example of this optimization will be illustrated later in Section 5.3.4.
As a side effect, loops which are considered to contain an invariant distribution no longer 
propagate previous distributions for the invariant arrays. Since redistribution is moved out of the 
loop, this means that for the (extremely rare) special case of a loop invariant distribution (which 
was not originally present outside of the loop) contained within a undetectable zero trip loop, 
only the invariant distribution from within the loop body is propagated even though the loop nest 
was never executed. As this is only due to the way invariant distributions are handled, the data­
flow handles non-invariant distributions as expected for zero trip loops (an extra redistribution 
check may be generated after the loop execution).
One last point to note is that the algorithm presented in Figure 5.9 is conservative since it 
only pulls out distributions that are completely invariant, not distributions that are invariant with 
respect to the start and end of an iteration but change distributions intermediately. Although, not
3If a node is split at a call that is contained within a larger expression and which refers to any array that changed 
distribution during the call, the compiler must split the expression into multiple statements at the call site using 
temporary variables to obtain the partial results and preserve the order of execution, or perform redistribution at 
the end of function before returning which may require more cloning based on the required return distribution.
invariant redistribution with respect to the semantics of HPF can technically become non-invariant when re­
turn distributions from function calls within a loop nest are allow to temporarily exist in the caller’s scope (and 
therefore are recorded before considering any interprocedural side effects due to function calls with different re­
turn distributions). Such regions can be still treated as invariant, with the proper support as will be addressed in 
Section 5.3.1, since this is the view HPF provides to the programmer.
78
shown here, this would require computing a separate “nested” DEF/USE set over the loop body 
for each loop header, performing a union of the (unrestricted) reaching distributions for only the 
entry and exit nodes of the loop, and then restricting this resulting set by the nested DEF/USE 
set.
5.2.2.2 Multiple active distributions
Even though it is not specifically stated as such in the HPF standard, we will consider an 
HPF program in which every use or definition of an array has only one active distribution to be 
well-behaved. Since PARADIGM cannot currently compile programs which contain references 
with multiple active distributions, this property is currently detected by examining the reaching 
distribution sets for every node (limited by DEF/USE) within a function. A warning is issued if 
any set contains multiple distributions for a given variable stating that the program is not well- 
behaved. An example of this situation will be illustrated later in Section 5.3.4.
In the presence of function calls, it is also possible to access an array through two or more 
more paths when parameter aliasing is present. If there is an attempt to redistribute one of the 
aliased symbols, the different aliases now have different distributions even though they actually 
refer to the same array. This form of multiple active distributions is actually considered to be 
non-conforming in HPF [2] as it can result in consistency problems if the same array were al­
lowed to occupy two different distributions. As it may be difficult for the programmers to make 
this determination, this can be automatically detected by determining if the reaching distribu­
tion set contains different distributions for any aliased arrays.5 An algorithm is presented in 
Figure 5.10 which examines the reaching distribution sets for every node within a function for 
this condition.
5.2.3 Static Distribution Assignment (SDA)
To utilize the available memory on a given parallel machine as efficiently as possible, only 
the distributions that are active at any given point in the program should actually be allocated
5The param _alias pass in Parafrase-2 [6] is first run to compute the alias sets for every function call.
79
I l l e g a l - A l i a s i n g - D i s t  ( d i s t s e t )
1 i l l e g a l  <— FALSE
2 s r c d i s t  d i s t s e t
3 s r c s y m  <— DYN.VAR( s r c d i s t )
4 while s r c d i s t
5 do t g t d i s t  <— DYN_NEXT(srcrfzsi)
6 t g t s y m  DYN.VAR( t g t d i s t )
7 while t g t d i s t
8 do D> Check for aliasing present between s r c s y m  and t g t s y m
9 if (s r c s y m  and t g t s y m  in same alias set) and
10 not E q u i v a l e n t - D i s t s  (s r c d i s t , t g t d i s t )
11 then i l l e g a l  TRUE
12 t g t d i s t  DYNJ4EXT(£<?£cfo£)
13 s r c d i s t  <— DYN_NEXT( s r c d i s t )
14 return i l l e g a l
E q u i v a l e n t - D i s t s  ( s r c d i s t ,  t g t d i s t )
1 e q u i v a l e n t  TRUE
2 s r c s y m  D Y N J V A R ( s r c d i s t )
3 t g t s y m  DYN.VAR
4 C> Both should have same number of unique distributions
5 if DYN_NUMDIST(s r c d i s t )  ^  DYNJMUMDIST( t g t d i s t )
6 then e q u i v a l e n t  «— FALSE
7 1> All distributions should be present for both symbols
8 for each dist in srcdist
9 do if S_DYN_DIST(srcsym,dist) £ S_DYNJDIST(tgtsym, *)
10 then e q u i v a l e n t  FALSE
11 return e q u i v a l e n t
Figure 5.10: Algorithms for detecting non-conforming HPF programs
80
space. It is interesting to note that as long as a given array is distributed among the same total 
number of processors, the actual space required to store one section of the partitioned array is 
the same no matter how many array dimensions are distributed.6 By using this observation, it 
is possible to statically allocate the minimum amount of memory by associating all possible 
distributions of a given array to the same area of memory.
Static Distribution Assignment (SDA) (inspired indirectly by the Static Single Assignment 
(SSA) form [105]) is a process we have developed in which the names of array variables are du­
plicated and renamed statically based on the active distributions represented in the correspond­
ing DIST sets. As names are generated, they are assigned a static distribution (by which we 
mean this new name will not change distribution during the course of the program) correspond­
ing to the currently active dynamic distribution for the original array. Redistribution now takes 
the form of moving data from a statically distributed source array to another statically distributed 
destination array (as opposed to rearranging the data within a single array).
To statically achieve the minimum amount of memory allocation required, all of the renamed 
duplicates of a given array are declared to be “equivalent.” The EQUIVALENCE statement in For­
tran 77 allows this to be performed at the source level in a somewhat similar manner as assigning 
two array pointers to the same allocated memory as is possible in C or Fortran 90, Redistribution 
directives are also now replaced with actual calls to a redistribution library.
Because the different static names for an array share the same memory, this implies that the 
communication operations used to implement the redistribution should read all of the source 
data before writing to the target. In the worst case, an entire copy of a partitioned array can be 
buffered at the destination processor before it is actually received and moved into the destination 
array. However, as soon as more than two different distributions are present for a given array, 
the EQUIVALENCE begins to pay off, even in the worst case, in comparison to separately allo­
cating each different distribution. If the performance of buffered communication is insufficient 
for a given machine (due to the extra buffer copy), non-buffered communication could be used
6 Taking into account distributions in which the number of processors allocated to a given array dimension does 
not evenly divide the size of the dimension, it can be equivalently said that there is a given amount of memory 
which can store all possible distributions with very little excess.
81
REAL A(N, N)
!HPF$ DISTRIBUTE (CYCLIC, *) : : A
REAL A$0(N, N), A$1(N,N)
!HPF$ DISTRIBUTE (CYCLIC, *) :: A$0 
!HPF$ DISTRIBUTE (BLOCK, BLOCK) :: A$1
A(i, j) = ...
!HPF$ REDISTRIBUTE (BLOCK, BLOCK) :: A
EQUIVALENCE (A$0, A$l) 
INTEGER A$cid 
A$cid = 0
=A(i, j)
A$0(i, j) = ...
CALL reconfig(A$l, 1, A$cid) 
... = A$1(i, j)
(a) Before SDA (b) After SDA
Figure 5.11: Example of static distribution assignment
instead thereby precluding the use of EQUIVALENCE (unless some form of explicit buffering is 
performed by the redistribution library itself).
In Figure 5.11, a small example is shown to illustrate this technique. In this example, a 
redistribution operation on A causes it to be referenced using two different distributions. A sep­
arate name is statically generated for each distribution of A, and the redistribution directive is 
replaced with a call to a run-time redistribution library [84]. The array accesses in the program 
can now be compiled by PARADIGM using techniques developed for programs which only 
contain static distributions [12,89] by simply ignoring the communication side effects of the 
redistribution call. Although it is not needed for this example, a separate ID variable, A$cid, is 
also maintained to indicate the current configuration of the corresponding array. When different 
distributions reach a redistribution point through different control-flow paths, this ID variable 
is necessary to determine the source distribution of an array. In this example, the ID variable 
is also assigned statically generated constants which correspond to the different distributions. 
It is also possible to use instead symbolic constants that index into a version of the compiler’s 
symbol table, containing the actual distribution for each configuration, made available at run 
time. Such a run-time symbol table, allows the compiler to generate interprocedural SDA code
82
by adding ID variables into a function’s parameter list for each distributed array to properly 
handle parameter mapping across function boundaries.7
If more than one distribution is active at any given point in the program, the program is con­
sidered to be not well-behaved, and the array variable can not be directly assigned a static dis­
tribution. Furthermore, in order to even generate code for multiple active distributions, it must 
be able to handle all combinations of active distributions for the referenced variables. If code 
specialization is performed (which is currently not performed by PARADIGM), it is possible to 
generate a large number (the product of the number of distributions for each array in the code 
block) of versions for a given block of code. It should be noted that this situation can only occur 
when accepting HPF programs that were written by hand using the REDISTRIBUTE or REALIGN 
directives. This problem will not arise when dealing with HPF programs that were automati­
cally generated by the automatic data partitioning techniques previously described. In this case, 
the data partitioner assigns full distribution descriptions to individual blocks of code as opposed 
to indirectly specifying them through redistribution operations.
In certain circumstances, however, it may be possible to perform code transformations to 
make an HPF program well-behaved. For instance, a loop that contained multiple active distri­
butions on the entry to its body due only to a distribution from the loop back edge (caused by 
redistribution within the loop) that wasn’t present on the loop entry would not be well-behaved. 
If the first iteration of that loop were peeled off, the entire loop body would now have a single 
active distribution for each variable and the initial redistribution into this state would be per­
formed outside of the loop. This and other code transformations which help reduce the number 
of distributions reaching any given node will be the focus of further work in this area.
5.3 Implementation
In this section we will discuss some of the implementation details related to the data-flow 
framework and present results of applying these techniques on several examples.
7We are currently in the process of developing the run-time system to fully support the interprocedural SDA 
code which the compiler already generates.
83
t
BLOCK A
\ (a)
1
(a) Source code (b) Flow graph
Figure 5.12: Example loop shown with flow transitions
5.3.1 Source level data flow
The data-flow equations (Eqs. (5.1) and (5.2)) are computed for all blocks in the same man­
ner except for one special consideration that must be made for DO loop headers. Since the result­
ing information will be used in a source-to-source platform, care must be taken to ensure that 
all flow paths are realizable at the source level. In Figure 5.12, the source code for a loop and its 
corresponding flow graph are shown illustrating where the transitions (a )-(e) between the basic 
blocks8 actually appear in the source code. As the transition along the loop backflow edge (d) 
does not have an explicit location in the source code, this edge requires special treatment. By 
directing the flow from edge (d) into the OUT set of the loop header, instead of its IN set, this 
backedge flow is not observed as an input to the DO loop header, but correctly appears at the en­
try of both the loop body, as well as at the entry of the code following the loop (e.g., block B). 
The points indicated as (b,d) and (d,e) in Figure 5 .12(a) therefore correspond to the locations at 
which the effects of the backflow edge will appear combined with the effects of edges (b) and 
(e) in the regenerated source code. Even though this example illustrates the flow for a DO loop, 
this approach extends to applying data-flow solutions at the source level for all types of iterative 
loops.9
8 The ENDDO statement is normally included as part of the loop body, but it is shown separately here to emphasize 
flow between the loop body and the end of the loop.
9In contrast, loops formed using unstructured constructs (i.e., GOTO operations that branch back to a label) do 
not require any special treatment since the target label and the associated statement can be separated forming a 
region in which the back transition (d) can be placed.
84
Invariant distributions recorded in the GEN on loop headers, which are no longer truly in­
variant due to optimizations removing intermediate redistributions around function calls within 
a loop body, raise a subtle point with respect to this treatment. As the distributions along the 
loop backflow must take precedence over loop invariants when iterating upon the loop body, 
this ordering is maintained by masking the GEN set by the loop backflow before adding it to 
the OUT set for the flow edge (b) from the header into the loop body. However, as loop invari­
ant distributions take precedence over loop backflow upon exiting the loop body, the reverse is 
done, masking the OUT set by the GEN set flow for the loop exit edge (e).
5.3.2 Virtual clones
Because there is considerable state associated with the body of a function, it is much more 
efficient to record the cloned data as a separate context for the function and delay the actual 
cloning until later, a process which we refer to as virtual cloning.
For this work, this requires maintaining cloned copies of only the REDIST, DIST, LN, OUT 
sets on each DFG node within a function as well as the IN, and OUT sets for the function itself. 
To support virtual clones in general, the specification of a function now becomes a pair (function 
symbol table entry, clone number). References to functions (i.e., call sites) also record which 
clone is to be used for the call in the current virtual clone context.
By creating a “virtual” clone, all of the other state associated with the function (e.g., control- 
flow and dependence graphs) can be reused during interprocedural analysis within each virtual 
clone’s context. When the actual source is to be regenerated, only the abstract syntax tree (AST) 
is actually cloned making specific modifications based on the information recorded as virtual 
clones.
5.3.3 Array reshaping
One area that is not fully addressed by the current implementation is array reshaping, which 
can occur through function parameters at call boundaries by linearizing a multi-dimensional ar­
ray or passing a slice of an array (with or without an offset). As array reshaping does not always
85
result in a valid HPF distribution in general, it might be necessary to label invalid distributions 
as irregular, and only map distributions for the cases in which reshaping results in a valid HPF 
distribution. In either case, array reshaping can be easily handled by the presented framework, 
but further work is still required to extend the current implementation to support distribution 
mapping when mapping the actual parameters to the dummy parameters while taking into ac­
count the possibility of array reshaping.
5.3.4 Examples
In Figure 5 .13(a), a synthetic HPF program is presented which performs a number of differ­
ent tests (described as comments in the input code) of the optimizations performed by the frame­
work. In this program, one array, x, is redistributed both explicitly using HPF directives and 
implicitly through function calls using several different interfaces. Two of the functions, f u n d  
and func2, have prescriptive interfaces which may or may not require redistribution (depend­
ing on the current configuration of the input array). The first function differs from the second 
in that it also redistributes the array such that it returns with a different distribution than which 
it was called. The last function, f  unc3, differs from the first two in that it has an (implicit) tran- 
scriptive interface. Calls to this function will cause it to inherit the current distribution of the 
actual parameters.
Several things can be noted when examining the optimized HPF shown10 in Figure 5 .13(b). 
First of all, the necessary redistribution operations required to perform the implicit redistribution 
at the function call boundaries have been made explicit in the program. Here, the interprocedu­
ral analysis has completely removed any redundant redistribution by relaxing the HPF seman­
tics allowing distributions caused by function side effects to exist so long as they do not affect 
the original meaning of the program. For the transcriptive function, f  unc3, the framework has 
generated two separate clones, func3$0 and func3$ l, corresponding to two different active 
distributions at a total of three different calling contexts.
10The HPF output, generated by PARADIGM, has been slightly simplified in the following figures to improve 
their clarity. Unnecessary alignment directives have been removed and consecutive redistribution operations have 
been collapsed into one.
86
PROGRAM test 
INTEGER x(10,10)
c *** For tests involving statement padding
INTEGER a
!HPF$ PROCESSORS :: square(2,2)
!HPF$ DYNAMIC, DISTRIBUTE (BLOCK, BLOCK) :: x 
c *** Use of initial distribution
x(l,l) = 1
c *** Testing loop invariant redistribution
DO i = 1,10 
DO j = 1,10 
a = 0
!HPF$ REDISTRIBUTE (BLOCK, CYCLIC) :: x
x(i,j) = 1 
a = Q 
ENDDO 
ENDDO 
a = 0
c *** Testing wmecessary redistribution
!HPF$ REDISTRIBUTE (BLOCK, CYCLIC) :: x 
if (x(i,j) .gt. 1) then
c *** Testing redistribution in a conditional
!HPF$ REDISTRIBUTE (BLOCK, BLOCK) :: x 
x(i,j) = 2 
call func3(x,n) 
else
x(i,j) = 3 
endif
c *** Uses with multiple reaching distributions
x (1 , 1 )  = 2 
call funcl(x,n)
DO i = 1,10 
DO j = 1,10 
x(j,i) = 2 
ENDDO 
ENDDO
!HPF$ REDISTRIBUTE (CYCLIC(3), CYCLIC) :: x 
c *** Testing chaining of function arguments
call funcl(x,n) 
call func2(x,n) 
call funcl(x,n)
c *** Testing loop invariant due to return
DO i » 1,10 
DO j = 1,10
c *** Test cloning for transcriptive functions
call func3(x,n)
ENDDO 
ENDDO 
a = 1
c *** Testing unused distribution
!HPF$ REDISTRIBUTE (BLOCK, CYCLIC) :: x 
a = 0
c *** Testing "semantically killed" distribution
!HPF$ REDISTRIBUTE (CYCLIC(3), CYCLIC) :: x 
call func3(x,n)
END
integer function fund (a,n)
c * * *  Prescriptive function with different return
integer n, a(n, n)
!HPF$ DYNAMIC, DISTRIBUTE (BLOCK, CYCLIC) :: a 
a(l,1) = 1
!HPF$ REDISTRIBUTE (BLOCK, CYCLIC) :: a 
a(l,2) = 1
!HPF$ REDISTRIBUTE (CYCLIC, CYCLIC) :: a 
a(l ,3) = 1 
end
integer function func2(y,n)
c *** Prescriptive function with identical return
integer n, y(n,n)
!HPF$ DYNAMIC, DISTRIBUTE (CYCLIC, CYCLIC) :: y
y(l.'l) = 2end
PROGRAM test 
INTEGER x(10,10)
INTEGER a, n
! HPF$ PROCESSORS :: square(2,2)
!HPF$ DYNAMIC, DISTRIBUTE (BLOCK, BLOCK) :: x 
x (1,1) = 1
!HPF$ REDISTRIBUTE (BLOCK, CYCLIC) ONTO square :: x 
DO i = 1,10 
DO j = 1,10
a = 0
x(i,j) = 1 
a = 0 
END DO 
END DO 
a = 0
IF (x(i,j) .GT. 1) THEN
!HPF$ REDISTRIBUTE (BLOCK, BLOCK) ONTO square :: x 
x(i,j) = 2 
CALL func3$0(x,n)
ELSE
x(i,j) = 3 
END IF
c *** WARNING: too many dists (2) for x
x(l,l) = 2
!HPF$ REDISTRIBUTE (BLOCK, CYCLIC) ONTO square :: x 
CALL funcl(x,n)
DO i = 1,10 
DO j - 1,10
c *** WARNING: too many dists (2) for x
x(j,i) = 2 
END DO 
END DO
!HPF$ REDISTRIBUTE (BLOCK, CYCLIC) ONTO square :: x 
CALL funcl(x,n)
CALL func2(x,n)
!HPF$ REDISTRIBUTE (BLOCK, CYCLIC) ONTO square :: x 
CALL funcl(x,n)
!HPF$ REDISTRIBUTE (CYCLIC(3), CYCLIC) ONTO square :: x 
DO i = 1,10 
DO j = 1,10
CALL func3$l(x,n)
END DO 
END DO 
a = 1 
a = 0
CALL func3$l(x,n)
END
INTEGER FUNCTION fund(a,n)
INTEGER n, a(n,n)
!HPF$ DYNAMIC, DISTRIBUTE (BLOCK, CYCLIC) :: a 
a(l,l) = 1 
a(l,2) = 1
!HPF$ REDISTRIBUTE (CYCLIC, CYCLIC) ONTO square :: a 
a(l,3) = 1 
END
INTEGER FUNCTION func2(y,n)
INTEGER n, y(n,n)
!HPF$ DYNAMIC, DISTRIBUTE(CYCLIC,CYCLIC) ONTO square :: y
y(l,l) = 2 
END
INTEGER FUNCTION func3$l(n,z)
INTEGER n, z(n,n)
!HPF$ DISTRIBUTE(CYCLIC(3),CYCLIC) ONTO square :: z 
z(l.l) = 3 
END
INTEGER FUNCTION func3$0(n,z)
INTEGER n, z(n,n)
!HPF$ DISTRIBUTE(BLOCK,BLOCK) ONTO square :: z 
z(l,l) = 3 
END
integer function func3(z,n) 
c *** (implicitly) Transcriptive function
integer n, z(n,n) 
z (1,1) = 3 
end
(a) Before optimization (b) After optimization
Figure 5.13: Synthetic example for interprocedural redistribution optimization
87
Two warnings were also generated by the compiler, inserted by hand as comments in Fig­
ure 5 .13(b), indicating that there were (semantically) multiple reaching distributions to two uses 
of x in the program. The first use actually does have two reaching distributions due to a con­
ditional with redistribution performed on only one path. The second use, however, occurs after 
a call to a prescriptive function, f u n d ,  which implicitly redistributes the array to conform to 
its interface. Because a redistribution operation can only generate a single distribution, x can 
actually have only one reaching distribution, but semantically it still has two -  hence the second 
warning.
As there are several other optimizations performed on this example, which have been de­
scribed previously, we will not describe them in more detail here, but the reader is directed to 
the comment descriptions in the code for further information.
An example of a more realistic program is the ADI2D program (previously used to illustrate 
the data distribution optimizations in Chapter 4) shown in Figure 5.14 both before and after an­
alyzing it using this framework. To make the program more challenging for the framework to 
analyze, the main computation in the program is broken into separate functions. In this program, 
two functions (c o l.so lv e , and row .solve) are called -  one using transcriptive arguments so 
that it will inherit their distributions from the calling routine and the other using prescriptive 
arguments. In addition to the implicit redistribution that occurs when calling row .solve, each 
function also performs redistribution during its execution, potentially changing the input distri­
bution and, if so, returning a different distribution than that with which it was called.
Examining the transformed output in Figure 5.14(b), the redistribution originally performed 
upon entering the loop in the main program has been lifted out of the nest as it was found to 
be invariant, being performed instead, only once outside the entire loop body. Another impor­
tant point to note is that even though all redistribution operations are now explicit, there are 
no redistribution operations performed around the two call sites in the main program. Upon 
first glance, this does not seem correct since the second function row_solve had a prescrip­
tive distribution that was different from the active distribution in the calling context of the main 
program. There is not any redistribution performed before calling row .solve because the first 
function c o l.so lv e  actually redistributed the arrays from their initial state, leaving them in the
88
program ADI2d PROGRAM adi2d
parameter (N = 512, maxiter = 100) PARAMETERS = 512, maxiter = 100)
double precision u(N,N), uh(N,N), b(N,N), alpha DOUBLE PRECISION u(n,n), uh(n,n), b(n,n)
!hpf$ processors :: linear(32) DOUBLE PRECISION alpha
!hpf$ distribute (*, block) onto linear :: u, uh, b !HPF$ PROCESSORS :: linear(32)
!HPF$ DISTRIBUTE (*, BLOCK) ONTO linear :: u, uh, b
c *** Initialization removed c *** Initialization removed
alpha = 4 * (2.0 / N) alpha = 4 * (2.0 / n)
do k = 1, maxiter !HPF$ REDISTRIBUTE (BLOCK, *) ONTO linear :: u, uh, b
!hpf$ redistribute (block, *) :: u, uh, b DO k = 1,maxiter I
call col_solve(u, uh, b, N, alpha) CALL col.solve(u,uh,b,n,alpha) i
call row_solve(u, uh, b, N, alpha) CALL row_solve(u,uh,b,n,alpha)
enddo END DO
end uh(l,l) = u(l,l)
END
subroutine col_solve(u, uh, b, N, alpha)
double precision u(N,N), uh(N,N), b(N,N), alpha SUBROUTINE col solve(u,uh,b,n,alpha)
!hpf$ inherit :: u, uh, b DOUBLE PRECISION u(n,n), uh(n,n), b(n,n) !
c *** forward and backward sweeps along columns DOUBLE PRECISION alpha
do j = 2, N - 1 !HPF$ PROCESSORS :: linear(32)
do i = 2, N - 1 !HPF$ DISTRIBUTE(BLOCK,*) ONTO linear :: u, uh, b 1
b(i,j) = (2 + alpha) DO j = 2,n - 1 *
uh(i,j) = (alpha - 2) * u(i,j) DO i = 2,n - 1
k + u(i,j+l) + u(i,j-l) b(i,j) = (2 + alpha)
enddo uh(i,j) = (alpha - 2) * u(i,j)
enddo k + u(i,j + 1) + u(i,j - 1)
!hpf$ redistribute (*, block) :: u, uh, b END DO
do j = 2, N - 1 END DO
uh(2,j) = uh(2,j) + u(l,j) !HPF$ REDISTRIBUTE (*, BLDCK) ONTO linear :: u, uh
uh(N-l,j) = uh(N~l,j) + u(N,j) DO j = 2,n - 1
enddo uh(2,j) = uh(2,j) + u(1,j)
uh(n-l,j)=uh(n-l,j)+u(n,j) j
do j = 2, N - 1 END DO i
do i = 3, N-l !HPF$ REDISTRIBUTE (*, BLOCK) ONTO linear :: b
b(i,j) = b(i,j) - 1 / b(i-l,j) DO j = 2,n - 1
uh(i,j) = uh(i,j) + uh(i-l,j) / b(i-l,j) DO i = 3,n - 1
enddo b(i,j) = b(i,j) - 1 / b(i - 1,j)
enddo uh(i,j) = uh(i,j) + uh(i - 1,j) / b(i - 1,j) .
do j = 2, N - 1 END DO
uh(N-l.j) = uh(N-l,j) / b(N-l,j) END DO
enddo DO j = 2,n - 1
do j = 2, N - 1 uh(n - 1,j) = uh(n - l,j) / b(n - 1,j)
do i = N-2, 2, -1 END DO
uh(i,j) = (uh(i,j) + uh(i+l,j)) / b(i,j) DO j =* 2,n - 1
enddo DO i = n - 2,2, -1 j
enddo uh(i,j) = (uh(i,j) + uh(i + 1,j)) / b(i,j) i
end END DO
END DO
subroutine row_solve(u, uh, b, N, alpha) END
double precision u(N,N), uh(N,N), b(N,N), alpha
!hpf$ distribute (*, block) :: u, uh, b SUBROUTINE row_solve(u,uh,b,n,alpha) !
c *** forward and backward sweeps along rows DOUBLE PRECISION u(n,n), uh(n,n), b(n,n) ,
do j = 2, N - 1 DOUBLE PRECISION alpha
do i » 2, K - 1 !HPF$ PROCESSORS :: linear(32) 1
b(i,j) = (2 + alpha) !HPF$ DISTRIBUTE (*, BLOCK) ONTO linear :: u, uh, b
u(i,j) = (alpha - 2) * uh(i,j) DO j = 2,n - 1
k + uh(i+l,j) + uh(i-l,j) DO i = 2,n - 1
enddo b(i,j) = (2 + alpha)
enddo u(i,j) = (alpha - 2) * uh(i,j)
!hpf$ redistribute (block, *) :: u, uh, b k + uh(i + l.j) + uh(i - 1,j)
do i = 2, N - l END DO
u(i,2) = u(i,2) + uh(i,l) END DO
u(i,N-l) = u(i,N-l) + uh(i,N) !HPF$ REDISTRIBUTE (BLOCK, *) ONTO linear :: u, uh
enddo DO i = 2,n - 1
u(i,2) = u(i,2) + uh(i,l)
do j = 3, N-l u(i,n - 1) = u(i,n - 1) + uh(i,n) ’
do i = 2, N - 1 END DO j
b(i,j) = b(i,j) - 1 / b(i,j-1) !HPF$ REDISTRIBUTE (BLOCK, *) ONTO linear :: b
u(i,j) = u(i,j) + u(i,j-1) / b(i,j-1) DO j = 3,a - 1
enddo DO i = 2,n - 1
enddo b(i,j) = b(i,j) - 1 / b(i,j - 1)
do i = 2, N - 1 u(i,j) = u(i,j) + u(i,j - 1) / b(i,j - 1)
u(i,N-1) = u(i,N-l) / b(i,N-1) END DO
enddo END DO
do j = N-2, 2, -1 DO i => 2,n - 1
do i = 2, N - l u(i,n - 1) = u(i,n - 1) / b(i,n - 1)
u(i,j) = (u(i,j) + u(i,j+l)) / b(i,j) END DO
enddo DO j = n - 2,2, -1
enddo DO i = 2,n - 1
end u(i,j) = (u(i,j) + u(i,j + 1)) / b(i,j) 1END DO
END DO
END
(a) Before optimization (b) After optimization
Figure 5.14: 2-D ADI with interprocedural redistribution optimization
89
state which the prescriptive interface for row_solve required for its arguments. As compared 
to a full copy-in/copy-out approach, the redistribution framework has maintained HPF seman­
tics while still completely eliminating a pair of redistribution operations -  redistributing from 
(* , BLOCK) to (BLOCK, * ) after returning from c o l.so lv e  and redistributing from (BLOCK, *) 
to (* , BLOCK) before calling row_solve.
It is also interesting to note the placement of the redistribution performed within the two 
functions co l_so lve  and row_solve. Even though the redistribution of three arrays was orig­
inally specified as occurring at a specific point in the function, the redistribution of two of the 
arrays, u and uh, is performed at the original point while the redistribution of b is performed 
following an intervening loop nest. Here we are observing how the data-flow framework is lo­
cating redistribution operations just before the array is used in the new configuration (just in 
time, or lazy, redistribution). As previously discussed in Section 5.2.2, there is actually a win­
dow between the eager and lazy redistribution points over which the operation can be performed 
to obtain the same effect. If the overlapping of the eager/lazy windows of these three operations 
were considered, they could instead be collected at a single point, allowing their communication 
to be aggregated. As there is a tradeoff between the amount of buffering required to perform the 
communication and the amount of communication overhead eliminated due to the aggregation, 
this will be examined further in future work.
To further examine the optimization of loop invariant redistribution operations, another ex­
ample is shown in Figure 5.15. In this example, a redistribution operation on A is performed 
at the deepest level of a nested loop. If there are no uses of A before the occurrence of the re­
distribution (and no further redistribution performed in the remainder of the loop), then A will 
always have a (BLOCK, BLOCK) distribution within the loop body. This situation is detected by 
the framework and the redistribution operation is re-synthesized to occur outside of the entire 
loop nest. It could be argued that even when it appeared within the loop, the underlying redis­
tribution library could be written to be smart enough to only perform the redistribution when it 
is necessary (i.e., only on the first iteration) so that we have not really optimized away N2 redis­
tribution operations. Even in this case, this optimization has still completely eliminated (N2-l)
90
!HPF$ DISTRIBUTE (CYCLIC, *) :: A !HPF$ DISTRIBUTE (CYCLIC, *) :: A
DO i = 1,N !HPF$ REDISTRIBUTE (BLOCK, BLOCK) :: A
DO j = 1,N DO i = 1,N
DO j = 1,N
!HPF$ REDISTRIBUTE (BLOCK, BLOCK) :: A
A(i,j) = ... A(i,j) = ...
ENDDO END DO
ENDDO’ END DO
(a) Before optimization (b) After optimization
Figure 5.15: Example of loop invariant redistribution
check operations that would have been performed at run time to determine if the redistribution 
was required.
5.4 Summary
In this chapter, we have presented an interprocedural data-flow framework that provides a 
means to convert between a distribution-based representation (as specified by the data parti- 
tioner) and redistribution-based form (as specified by HPF) while optimizing the amount of re­
distribution that must be performed. In addition to supporting the data partitioner, this frame­
work also allows us to optimize dynamic HPF programs (using REDISTRIBUTE and REALIGN 
directives) as well as convert such programs into equivalent static versions through a process we 
refer to as Static Distribution Assignment (SDA) providing dynamic HPF support for existing 
subset HPF compilers. Both the capability of generating redistribution-based codes as well as 
the SDA transformation will be instrumental in the evaluation of data distributions, which are 
presented in Chapter 6.
91
CHAPTER 6
EXPERIMENTAL RESULTS
In the three previous chapters, we described the algorithms for optimizing communication, 
data distribution and redistribution that we have implemented in the PARADIGM compiler frame­
work. In this chapter we demonstrate these techniques using a number of benchmark kernels 
and larger example programs. The effect of the optimizations on the performance of the test 
programs is evaluated using an Intel Paragon and Thinking Machines CM-5.
6.1 Optimizing Communication
A group of small scientific programs are used to examine the performance of the presented 
communication optimizations. These programs include individual Fortran 77 kernels and sub­
routines which range in size from roughly 20 to 50 lines of code and exhibit stencil, or nearest- 
neighbor, access patterns. Since nearest-neighbor patterns will generate significant amounts of 
communication that can severely degrade performance if they are not properly optimized, they 
are good candidates for evaluating the automatic message combining and placement optimiza­
tions while their size is small enough to be able to reason about the decisions the compiler makes 
during the compilation process. The selected program fragments, in increasing order of com­
plexity, include:
•  Jacobi’s Iterative Method [95]
•  ADI Integration, Livermore kernel 8 (ADI) [106]
92
Table 6.1: Data partitioning for initial tests
Array Dimensions Data Distribution
Jacobi 1000 x 1000 BLOCK, BLOCK
ADI 4 x 10,000 x 2 * , BLOCK, *
EXPL 10,000 x 7 BLOCK, *
EFLUX 5000 x 34 BLOCK,*
• 2-D Explicit Hydrodynamics, Livermore kernel 18 (EXPL) [106]
•  Euler Fluxes (EFLUX, from FLOS2 in the Perfect Club) [107]
Array dimensions are statically specified and loop bounds are determined at compile time 
when possible. Both the mesh configuration as well as the data partitioning of the arrays are au­
tomatically selected by the compiler. The sizes of the major arrays are shown in Table 6.1 along 
with the distributions chosen for each of the programs. Other than a few annotations which were 
added to EFLUX to specify that several scalar variables could be privatized.1 The evaluation of 
the optimizations is performed using 16 processors of an Intel Paragon and a Thinking Machines 
CM-5. Traces are also taken on the Paragon using the Portable Instrumented Communication 
Library (PICL) [15].
PICL traces are analyzed using the ParaGraph [109] visualization tool to examine the effect 
of each optimization. The ParaGraph “Space time diagram” is used to visualize the execution 
profile to examine the frequency and amount of communication that take place. The space time 
diagram displays processors along the vertical axis and time along the horizontal. Continuous 
horizontal lines indicate uninterrupted execution while a break in a line indicates that a proces­
sor has blocked awaiting communication. Communication operations are indicated as vertical 
lines between the cooperating processors’ execution profiles. Snapshot views of the execution 
of EXPL on an Intel iPSC/2 compiled with each of the selected optimizations are shown in each 
figure while performance data is presented for all test programs (see Figures 6.1 to 6.4).
1 For EFLUX, privatization [108] is used for all optimizations except for run-time resolution. Scalar expansion 
was performed for the evaluation of run-time resolution in order to improve the determination of ownership at run­
time no other directives were added to the programs.
93
Figure 6.1: Run-time resolution Figure 6.2: Reduced loop bounds and coalescing
□  Ovhd 
■  Busy
 ^  ^  ^ sip dp
Jacobi ADI EXPL EFLUX Jacobi ADI EXPL EFLUX
Intel Paragon Thinking Machines CM-5
Jacobi ADI EXPL EFLUX Jacobi ADI EXPL EFLUX
Intel Paragon Thinking Machines CM-5
Figure 6.3: Message vectorization Figure 6.4: Message aggregation
UNI Uniprocessor Execution VECT Message Vectorization
RT Run-time Resolution AGGR Message Aggregation
RLB Reduced Loop Bounds & Message Coalescing ALL All Optimizations applied
Traces taken from Explicit Hydrodynamics (Livermore kernel 18) on an Intel Paragon 
(each time unit represents 1 ms, except for Figure 6.1 in which each is 10 ms)
94
For comparison purposes, the reported execution times have been normalized to the serial 
execution of the corresponding program (i.e., ideal speedup will equal ^  =  0.0625) and are 
further separated into two quantities:
•  Busy -  amount of time spent on useful computation (where useful refers only to the code 
which carries out the actual computation)
•  Overhead -  time spent executing code related to computation partitioning and communi­
cation
The relative effectiveness of each optimization can therefore be determined by examining 
the amount of overhead eliminated as the optimizations are incrementally applied.
Although several of the supported communication libraries provide tracing options, such 
profiling will only measure the portion of the overhead which is actually spent in communica­
tion operations. Other related costs (such as buffer packing and unpacking and the evaluation of 
communication masks) would be assigned to the busy time possibly giving a false impression of 
how well each optimization is performing. To avoid this problem, the busy time is instead com­
puted by taking a partitioned version of each test program and removing all of the code related 
to communication and buffer management. The communication buffers are then initialized with 
appropriate data to keep the computation valid, and the elapsed time for the partitioned compu­
tation alone is measured. Even though the results of the partitioned computation alone will not 
match the actual solution computed by the full parallel program, the useful operation count will 
be identical, therefore, yielding a tight lower bound on how much improvement can be expected 
from the optimizations. Overhead is now simply the time that the parallel run takes in excess 
of the measured busy time.
6.1.1 Results of run-time resolution
A direct application of the owner computes rule without any further optimization leads to 
run-time resolution [9]. Since each processor must execute the entire iteration space (to deter­
mine if any other processor needs locally owned data), different instances of a communication 
operation for a given reference are effectively serialized across the processors in a mesh dimen­
95
sion (see the trace in Figure 6.1). Multiple communication operations appear to be pipelined, 
but the time spent between successive messages can be quite large.
Examining the execution time for each program, it can be seen that traversing the entire it­
eration space to compute ownership results in large amounts of overhead. The communication 
performed for run-time resolution is also very inefficient since it is comprised of a large num­
ber of single element (sometimes even redundant) messages resulting in high communication 
overhead. The net result is a reduction in performance when compared to the original program’s 
serial performance.
6.1.2 Results of m essage coalescing and loop bounds reduction
By applying loop bounds reduction and statically generating communication operations [12, 
89], the serialization present in the baseline resolution cases can be eliminated. The statically 
generated communication operations are effectively collapsed in time as compared to the seri­
alized communication present in run-time resolution (see the trace in Figure 6.2). By statically 
generating communication operations, the loop bounds can be statically partitioned instead of 
requiring every processor to check for the possibility of communication for each reference over 
the entire iteration space at run time.
The overhead is dramatically reduced as all ownership and communication are now stati­
cally determined. Note, however, that the size of the messages is still identical to that in run­
time resolution (single elements), but redundant messages have been eliminated by application 
of the coalescing optimization. It is still apparent that there is an excessive amount of small 
messages being communicated since communication overhead is still a dominant factor in the 
execution time (as seen in Figure 6.3).
6.1.3 Results of m essage vectorization
It is also possible to vectorize the communication operations after loop bounds reduction 
and message coalescing have been applied. Using dependence information to determine when 
vectorization is applicable, communication operations can be lifted out of the innermost loops
96
thereby reducing the communication frequency, as shown in the trace in Figure 6.3. Recall that 
this also increases the size of the messages, but since there is no change in the transmission rate 
for larger messages on either the Paragon or the CM-5, this results is a large gain in performance 
as many communication operations have been replaced by a single operation. As a secondary 
effect, it is also possible that the target node compiler can do a better job of optimizing the core 
computation since the communication operations have been moved as far out of the innermost 
loops as possible separating it from the bulk of the computation.
For ADI, it is actually possible to vectorize all communication completely out of the entire 
loop nest. Because there is no synchronization due to communication within the loop nest, the 
reduction of the array bounds in the absence of communication resulted in super-linear speedup 
for the CM-5. This can be attributed to a gain in cache performance due to the reduction in the 
size of the working set as a result of the data partitioning.
6.1.4 Results of m essage aggregation
Aggregation also reduces the frequency of communication by grouping multiple commu­
nication operations, increasing the size of the resulting message. (See the trace in Figure 6.4.) 
Applying aggregation after loop bounds reduction and message coalescing, a performance gain 
is seen for ADI, EXPL, and EFLUX. There was no improvement using aggregation on Jacobi’s 
method since it communicated only one message for each source/destination pair.
By applying aggregation after vectorization, the communication overhead is further reduced 
as seen in the “thinning” of communication when comparing the traces in Figures 6.3 and 6.4. 
Since groups of messages to be communicated at the same point in the program are combined, 
the performance improvement is related to the number of different array sections that must be 
communicated. The scope of improvement is much smaller than vectorization for the test pro­
grams since the number of communication operations at any given point (anywhere from one to 
three) is much smaller in comparison to the iteration counts over which the messages were vec­
torized (related to the size of the array bounds along processor boundaries, see Table 6.1). If the 
properties were reversed, (a large number of arrays were to be communicated at a given point
97
but were not vectorizable, such as is possible with recurrences), it is possible that aggregation 
could have a larger impact on program performance than vectorization for some programs.
6.1.5 Results of all non-pipelined optimizations
Figure 6.5 shows the relative speedup and the amount of overhead while Table 6.2 reports the 
efficiencies measured for each optimization. Eliminating the overhead of traversing the entire 
iteration space to compute ownership, the combination of static communication generation and 
loop bounds reduction attained roughly half of the total available performance for most of the 
programs. By applying message vectorization it was possible to reduce most of the remaining 
communication overhead resulting in near-ideal performance while message aggregation was 
beneficial for those programs that contained references to a number of different arrays.
It is also interesting to compare the performance of the Paragon with results previously ob­
tained for the Intel iPSC/2 (second generation hypercube) and the iPSC/860 (third generation 
hypercube) [13]. The iPSC/2 has a 16MHz i386/387, the iPSC/860 has a 40MHz i860, and the 
Paragon has a 50MHz i860 with a second i860 for handling communication operations. The two 
hypercubes have identical interconnection networks with similar communication bandwidths, 
but have dramatically differing computational capabilities. The Paragon has similar computa­
tion performance compared to the iPSC/860 but with a much higher communication bandwidth. 
When comparing the iPSC/860 to the iPSC/2, the change in communication performance is 
about a factor of 5 while the change in computation is about a factor of 100. When comparing 
the Paragon to the iPSC/860, the change in communication is about a factor of 20 for large mes­
sages while the computation capabilities improve by only roughly 25%. For this reason, on the 
iPSC/860 the ratio of the cost of communication to computation is fairly high compared to the 
iPSC/2 or the Paragon. Comparing Figure 6.5(a) to the previous results [13], the Paragon and 
the iPSC/2 exhibit similar ratios between the busy and overhead times while the iPSC/860 tends 
to exhibit higher percentages of overhead (which can be expected when examining the above 
factors). From these observations, it is apparent that performing communication optimizations 
becomes more critical when the computational capabilities are increased more quickly than the
98
Sp
ee
du
p 
Sp
ee
du
p
Table 6.2: Program efficiencies vs. optimization
RT RLB AGGR VECT ALL
Paragon
Jacobi 1.4% 48.9% 48.9% 98.1% 98.1%
ADI 0.7% 70.4% 75.7% 92.7% 99.1%
EXPL 1.2% 62.9% 67.3% 94.2% 95.3%
EFLUX 1.2% 63.4% 64.4% 83.3% 94.3%
CM-5
Jacobi 0.7% 16.8% 16.8% 103.9% 103.9%
ADI 4.7% 65.2% 76.9% 126.7% 126.9%
EXPL 3.9% 57.9% 64.2% 107.2% 109.0%
EFLUX 1.5% 65.8% 68.0% 101.7% 101.9%
(a) Intel Paragon (16 processors)
(b) Thinking Machines CM-5 (16 processors) 
Figure 6.5: Comparison of combined optimizations
99
communication (iPSC/2 -* iPSC/860) or the communication capabilities decrease in compari­
son to computation (Paragon —y iPSC/860).
To examine their scalability, each of the fully optimized test programs were also executed 
with varying numbers of processors. In Table 6.3, each program’s mesh configuration is shown 
as selected by the automatic partitioning pass while the speedup curves for each run on the 
Paragon and the CM-5 are shown in Figure 6.6. The super-linear speedup is again observed for 
ADI on the CM-5 and can be attributed to the cache effect previously described in Section 6.1.3.
The performance of the test programs was also compared to an existing data-parallel com­
piler available on the CM-5 by converting them into a data-parallel language known as Connec­
tion Machine Fortran (CMF 2.1-2).2 The reduction in performance of the programs compiled 
with CMF can be attributed to the fact that it uses an SIMD (Single Instruction Multiple Data) 
model of execution for compilation (carried over from the CM-2). Simulating an SIMD exe­
cution model on a MIMD multicomputer (such as the CM-5) does not require every operation 
to be executed in full synchronization, but entails synchronizing the processors between blocks 
of computation at points of communication. This comparison demonstrates how, for a MIMD 
style architecture such as the CM-5, more performance can be gained through the use of a less 
synchronous model of computation (SPMD) with the application of these optimizations.
6.1.6 Results of m essage pipelining
To evaluate the quality of the strip size estimate developed in Section 3.4, a group of pro­
grams which require pipelining to extract parallelism are selected. These programs include in­
dividual Fortran 77 kernels and subroutines which range in size from roughly 15 to 60 lines 
of code (the amount of computation performed within the inner loop increasing with the com­
plexity of the programs) and all exhibit inner-loop cross-iteration dependencies which give rise 
to pipelined communication. Also, since the pipelining transformation has not yet been com­
pletely integrated into the compiler, their size is small enough to be able to examine the loop 
partitioning performed by PARADIGM and insert the pipelined communication operations by
2The CM-5 vector units were not utilized for either the CMF or optimized message-passing runs since the C 
and Fortran node compilers could not generate vector code for the message passing programs.
100
Sp
ee
du
p 
Sp
ee
du
p
Table 6.3: Mesh configurations
Test Procs. Mesh Test Procs. Mesh
ADI
EXPL
EFLUX
1 l x l
Jacobi
1 1 x 1
2 2 x 1 2 2 x 1
4 4 x 1 4 4 x 1
8 8 x 1 8 4 x 2
16 16 x 1 16 4 x 4
32 32 x 1 32 8 x 4
64 64 x 1 64 8 x 8
(b) ADI Integration (LK 8)
(c) Explicit Hydrodynamics (LK 18) (d) Euler Fluxes (from FL052)
Figure 6.6: Performance comparison
101
Table 6.4: Data partitioning for pipeline tests
L Array Dimensions Data Distribution
SOR 100 512 x 512 BLOCK, *
IMPL 100 512 x 512 * , BLOCK
ADI2D 100 512 x 512 BLOCK, *
BLTS 22 [5x]5 x 32 x 24 x 24 [* ,]* , BLOCK, * , *
hand based on information provided by the compiler. The selected program fragments, in in­
creasing order of the amount of computation performed within the pipelined loop, include:
• Successive Over-Relaxation Iterative Method (SOR) [95]
•  Implicit Hydrodynamics, Livermore kernel 23 (IMPL) [106]
•  2-D Alternating Direction Implicit (ADI2D3) iterative method [95]
• Block Lower Triangular Solver (BLTS, from APPLU in the NAS benchmarks) [110]
SOR and IMPL fit the two-level model quite well while ADI2D and BLTS contain differ­
ent properties. For ADI2D, even though there is an outer loop, there is actual no chaining be­
tween successive pipelines since they are separated by parallel sections containing bidirectional 
shift communication operations (which tend to serve as primitive self-synchronizing barriers). 
Therefore, L is set to 1 for the two-level ADI2D estimate. For BLTS, the computation does not 
contain any forward references so there is not any synchronization between successive pipelines. 
For this reason, the non-synchronizing form of the two-level estimate (Eq. (3.4)) will be used 
for BLTS.
»
The sizes of the major arrays and the chosen distributions are shown in Table 6.4 for each 
of the programs.4 Each of the programs was executed with varying strip sizes to compare the
3 Even though their names are similar, it is important to note that the computation performed in ADI2D is very 
different from the computation performed in the ADI kernel previously used in Section 6.1. One main difference 
is that several of the accesses within ADI2D exhibit cross-iteration dependencies while ADI is fully parallel.
4In order to prevent poor serial performance from cache-line aliasing due to the power of two problem size, 
the arrays were also padded with an extra element at the end of each column. This optimization, although here 
performed by hand, is automated by aggressive serial optimizing compilers such as the KAP preprocessor from 
KAI.
102
Table 6.5: Comparison of the granularity estimates to the optimal granularity 
(Reporting both the strip size and time in seconds)
Procs. Fine Optimal FineOpt. One-Level %A0pt Two-Level %A0pt
4 9.00 32 4.28 2.10 10 4.49 4.9% 20 4.31 0.7%
SOR 8 7.44 32 2.58 2.88 10 2.84 10.1% 27 2.60 0.8%
16 6.69 44 1.71 3.91 10 2.02 18.1% 37 1.73 1.2%
4 56.86 40 17.67 3.22 8 21.71 22.9% 16 18.44 4.4%
IMPL 8 30.13 34 11.18 2.69 8 13.06 16.8% 22 11.41 2.1%
Paragon
16 16.85 34 6.23 2.70 8 7.26 16.5% 30 6.24 0.2%
4 181.44 6 175.04 1.03 9 175.87 0.5% 10 176.20 0.7%
ADI 8 96.47 4 90.78 1.06 9 92.04 1.4% 10 92.44 1.8%
16 54.49 4 48.89 1.11 9 50.58 3.5% 9 50.58 3.5%
4 0.50 10 0.42 1.19 1 0.50 20.2% 6 0.42 0.5%
BLTS 8 0.27 6 0.23 1.17 1 0.26 16.5% 6 0.23 0 %
16 0.16 6 0.13 1.25 1 0.16 20.7% 7 0.15 8.9%
4 22.05 32 10.02 2.20 10 10.61 5.9% 20 10.12 1.0%
SOR 8 17.99 34 5.82 3.09 10 6.53 12.2% 28 5.88 1.0%
16 15.83 64 3.48 4.55 10 4.28 23.0% 37 3.50 0.6%
4 114.58 64 20.98 5.46 8 30.23 44.1% 17 23.97 14.3%
IMPL 8 57.58 64 10.90 5.28 8 15.65 43.6% 23 11.71 7.4%
CM-5
16 36.64 58 5.66 6.47 8 8.80 55.5% 31 5.97 5.5%
4 133.91 10 113.13 1.18 9 113.33 0.2% 11 113.21 0.1%
ADI 8 81.44 10 61.78 1.32 9 61.84 0.1% 10 61.78 0 %
16 54.26 10 32.86 1.65 9 33.29 1.3% 10 32.86 0 %
4 0.84 11 0.72 1.17 1 0.84 16.9% 6 0.72 0.4%
BLTS 8 0.59 8 0.43 1.38 1 0.59 37.8% 7 0.44 1.4%
16 0.49 6 0.27 1.80 1 0.49 77.7% 8 0.28 0.7%
actual performance of the pipelined programs to the estimates. Both the two-level strip size 
estimate (Eq. (3.3)) as well as the simpler one-level strip size estimate (Eq. (3.5))* developed in 
the Fortran D project, will be compared to the experimental results.
The optimal granularity is compared with the two forms of strip size estimates in Table 6.5. 
Both the selected strip size and resulting execution time are presented for each. The speedup 
achieved using the optimal granularity as compared to fine-grained execution (strip size of one) 
is also reported as well as the change in performance for each of the estimates as compared to 
the optimal.
103
For both the Paragon and the CM-5, the execution time using the two-level estimate is usu­
ally within a few percent of the optimal (except for IMPL on the CM-5 for which the estimate 
of the computational cost was not very accurate). Note that even though strip sizes predicted 
by the two-level estimate are not exactly the same as for the optimal, the execution times tend 
to be very similar because there is a range of strip sizes for which the performance curves are 
fairly flat. For ADI2D, both models did quite well because the pipelines could not be chained 
for this program. Overall, however, the one-level estimate was, at times, more than 20% (and 
in some case over 70%) worse than the optimal, predicting strip sizes of roughly half to a third 
of the size of the two-level estimate. In fact, a further approximation is made in the Fortran D 
project, resulting in an estimate of which proves to be even farther from the minimum.
As can be expected, the major gain seen with the two-level estimate comes from modeling 
the chaining of consecutive pipelines. In the one-level estimate, the pipeline startup costs have 
a more significant contribution and tend to reduce the strip size in order to minimize the total 
execution time. When chaining is taken into account in the two-level estimate, the contribution 
of the startup phase is much less in comparison to the overall execution time. For both ma­
chines examined, however, modeling the communication rate did not have a significant effect 
t>n the two-level estimate since only small messages were communicated (from Table 3.1, the 
overhead of communication was two to three orders of magnitude greater than the transmission 
rate). For machines in which the ratio of the communication overhead to the transmission rate 
is decreased, the effect of the communication rate on the model will become more significant.
In Figure 6.7, speedup curves are shown for the measured data as well as that predicted by 
the two-level estimate (along with the predicted optimal granularity). The predicted speedup 
curves in Figure 6.7 were computed by dividing the pipeline estimate by the estimated serial 
time. Since there tends to be a range of strip sizes which all provide similar levels of perfor­
mance, predicting the trend is more important than exactly predicting the optimal point. With 
the correct trend, it is possible to automatically select a granularity that approaches the min­
imum execution time. The computation estimates for the inner loop instructions usually dif­
fered from the experimentally measured value by a constant factor, but even though they were
104
Sp
ee
du
p 
Sp
ee
du
p 
Sp
ee
du
p 
Sp
ee
du
p
Intel Paragon TMC CM-5
(a) Successive Over-Relaxation
(b) Implicit Hydrodynamics (Livermore kernel 23)
(c) 2-D Alternating Direction Implicit Iterative Method
(d) Sparse Block Lower Triangular Solver 
Figure 6.7: Performance comparison of pipelining vs. granularity 
(shown with performance estimates and the predicted optimal granularities)
105
(a) Fine-grain pipelining (b) Coarse-grain pipelining
Figure 6.8: Traces from SOR with pipelining on an Intel Paragon 
(each time unit represents 100 /is)
not extremely accurate it can be seen from the graphs that the pipeline model still followed the 
performance trend quite well.
Comparing the performance of the different applications, it is possible to see that as the 
amount of computation performed within the pipelined loop is increased, the optimal granular­
ity decreases, as would be expected. By further examining the estimates, it is also interesting to 
note that the granularity required to achieve the minimum execution time will slightly increase 
with increasing machine sizes when the data set size remains constant. Based on this observa­
tion, the optimal granularity should also increase for decreasing data set sizes when the machine 
size remains constant. Both will effectively reduce the amount of data owned by each proces­
sor and, therefore, also reduce the amount of computation that each processor must perform 
while the amount of communication remains constant.5 This serves to confirm how, intuitively, 
granularity must be increased to compensate for an increasing communication to computation 
ratio.
Finally, traces from the SOR kernel run on the Paragon (again using PICL [15] and Para­
Graph [109]) are shown in Figure 6.8. By examining these traces, the improvement in perfor­
mance can also be seen as the coarse grain execution (using the granularity selected by the two- 
level estimate) has completed the first pipeline and is working on the second while the fine grain
5It was not always possible to observe this trend in the measured data most likely due to improvements in cache 
utilization as the size of the working set is reduced.
106
execution is still working on the first. These figures also show the correlation of the execution 
profile to the framework presented in Section 3.4.
6.2 Optimizing Data Distribution and Redistribution
To further evaluate the quality of the data distributions selected using the techniques pre­
sented in Section 4.3 as implemented in the PARADIGM compiler, we analyze two programs 
which exhibit different access patterns over the course of their execution. These programs are 
individual Fortran 77 subroutines which range in size from roughly 60 to 150 lines of code:
•  2-D Alternating Direction Implicit iterative method (ADI2D) [95]
• Shallow water weather prediction benchmark [111]
6.2.1 2-D Alternating Direction Implicit (ADI2D) iterative method
In order to evaluate the effectiveness of dynamic distributions, the ADI2D program, with 
a problem size of 512 x 512,6 is compiled with a fully static distribution (one iteration shown 
in Figure 6.9(a)) as well as with the selected dynamic distribution7 (one iteration shown in Fig­
ure 6.9(b)). These two parallel versions of the code were run on an Intel Paragon and a Thinking 
Machines CM-5 to examine the performance of each on the different architectures.
The static scheme illustrated in Figure 6.9(a) performs a shift operation to initially obtain 
some required data and then satisfies two recurrences in the program using software pipelin­
ing [10,13]. Since values are being propagated through the array during the pipelined compu­
tation, processors must wait for results to be computed before continuing with their own part 
of the computation. The amount of computation performed before communicating to the next
6To prevent poor serial performance from cache-line aliasing due to the power of two problem size, the arrays 
were also padded with an extra element at the end of each column. This optimization, although here performed by 
hand, is automated by aggressive serial optimizing compilers such as the KAP preprocessor from KAI.
7In the current implementation, loop peeling is not performed on the actual code. As previously mentioned in 
Section 4.3.4, the single additional startup redistribution due to not peeling will not be significant in comparison 
to the execution of the loop (containing a dynamic count of 600 redistributions)
107
(a) Static (pipelined)
(b) Dynamic (redistribution)
Figure 6.9: Modes of parallel execution for ADI2D
processor in the pipeline will have a direct effect8 on the overall performance of a pipelined 
computation. Since we had previously shown, in Figure 6.7(c), that the performance of a static 
partitioning for ADI can be improved for both the CM-5 and Paragon, both a fine-grain and the 
optimal coarse-grain static partitioning will be compared with the dynamic partitioning.
The redistribution present in the dynamic scheme appears as three transposes9 performed 
at two points within an outer loop (the exact points in the program can be seen in Figure 4.8). 
Since the sets of transposes occur at the same point in the program, the data to be communicated 
for each transpose can be aggregated into a single message during the actual transpose. It has 
been previously shown that aggregating communication improves performance by reducing the 
overhead of communication [13], so we will also examine aggregating the individual transpose 
operations here.
In Figure 6.10, the performance of both dynamic and static partitionings for ADI2D is shown 
for an Intel Paragon and a Thinking Machines CM-5. For the dynamic partitioning, both aggre­
gated and non-aggregated transpose operations were compared. For both machines, it is appar­
ent that aggregating the transpose communication is very effective, especially as the program is
8 According to the ratio of communication vs. computation performance for a given machine.
9 This could have been reduced to two transposes at each point if we allowed the cuts to reorder statements and 
perform loop distribution on the innermost loops (between statements 17, 18 and 41, 42), but these optimizations 
are not examined here.
108
Figure 6.10: Performance of ADI2D
executed on larger numbers of processors. This can be attributed to the fact that the start-up cost 
of communication (which can be several orders of magnitude greater than the per byte transmis­
sion cost) is being amortized over multiple messages with the same source and destination.
For the static partitioning, the performance of the fine-grain pipeline was compared to a 
coarse-grain pipeline using the optimal granularity. The coarse-grain optimization yielded the 
greatest benefit on the CM-5 while still improving the performance (to a lesser degree) on the 
Paragon. For the Paragon, the dynamic partitioning with aggregation clearly improved perfor­
mance (by over 70% compared to the fine-grain and 60% compared to the coarse-grain static 
distribution). On the CM-5, the dynamic partitioning with aggregation showed performance 
gains of over a factor of two compared to the fine-grain static partitioning but only outperformed 
the coarse-grain version for extremely large numbers of processors. For this reason, it would 
appear that the limiting factor on the CM-5 is the performance of the communication.
As previously mentioned in Section 4.3.4, the static partitioner currently makes a very con­
servative estimate for the execution cost of pipelined loops [8]. For this reason a dynamic par­
titioning was selected for both the Paragon as well as the CM-5. If a more accurate pipelined 
cost model [13] were used, a static partitioning would have been selected instead for the CM-5. 
For the Paragon, the cost of redistribution is still low enough that a dynamic partitioning would 
still be selected for machine configurations with a large number of processors.
109
To complete our analysis of ADI2D, it is also interesting to estimate the cost of performing 
a single transpose in either direction (Px 1 1 xP) from the communication overhead present
in the dynamic runs. Ignoring any performance gains from cache effects, the communication 
overhead can be computed by subtracting the ideal run time (serial time divided by the selected 
number of processors) from the measured run time. Given that three arrays are transposed 200 
times, the resulting overhead divided by 600 yields a rough estimate of how much time is re­
quired to redistribute a single array as shown in Table 6.6.
From Table 6.6 it can be seen that as more processors are involved in the operation, the time 
taken to perform one transpose levels off until a certain number of processors is reached. After 
this point, the amount of data being handled by each individual processor is small enough that 
the startup overhead of the communication has become the controlling factor. Aggregating the 
redistribution operations minimizes this effect thereby achieving higher levels of performance 
than would be possible otherwise.
Table 6.6: Empirically estimated time (ms) to transpose a 1-D partitioned matrix 
(512 x 512 elements; double precision)
Intel Paragon TMC CM-5
processors individual aggregated individual aggregated
8 36.7 32.0 138.9 134.7
16 15.7 15.6 86.8 80.5
32 14.8 10.5 49.6 45.8
64 12.7 6.2 40.4 29.7
128 21.6 8.7 47.5 27.4
6.2.2 Shallow water weather prediction benchmark
Since not all programs will necessarily need dynamic distributions, we also examine another 
program which exhibits several different phases of computation. The Shallow water bench­
mark is a weather prediction program using finite difference models of the shallow water equa­
tions [111] written by Paul Swarztrauber from the National Center for Atmospheric Research.
110
As the program consists of a number of different functions, the program is first inlined since 
data partitioner is not yet fully interprocedural. Also, a loop which is implicitly formed by a 
GOTO statement is replaced with an explicit loop since the current performance estimation frame­
work does not handle unstructured code. The final input program, ignoring comments and dec­
larations, resulted in 143 lines of executable code.
In Figure 6.11, the phase transition graph is shown with the selected path using costs based 
on a 32 processor Intel Paragon with the original problem size of 257 x 257 limited to 100 iter­
ations. The decomposition resulting in this graph was purposely bounded only by the produc­
tivity of the cut, and not by a minimum cost of redistribution in order to expose all potentially 
beneficial phases. This graph shows that by using the decomposition technique presented in 
Figure 4.9, Shallow contains six phases (the length of the path from the start to stop node) with 
a maximum of four (sometimes redundant) candidates for any given phase.
As there were no alignment conflicts in the program, and only BLOCK distributions were nec­
essary to maintain a balanced load, the distribution of the 14 arrays in the program can be in­
ferred from the selected configuration of the processor mesh. By tracing the path back from the 
stop node to the start, the figure shows that the dynamic partitioner selected a two-dimensional 
(8 x 4) static data distribution. Since there is no redistribution performed along this path, the 
loop peeling process previously described in Section 4.3.4 is not performed on this graph to 
improve its clarity.10
As the communication and computation estimates are best-case approximations (they don’t 
take into account communication buffering operations or effects of the memory hierarchy), it is 
safe to say that for the Paragon, a dynamic data distribution does not exist which can out-perform 
the selected static distribution. Theoretically, if the communication cost for a machine were in­
significant in comparison to the performance of computation, redistributing data between the 
other phases revealed during the decomposition of Shallow would be beneficial. The dynamic 
partitioner performed more work to come to the same conclusion that a single application of 
the static partitioner would have found, but it is interesting to note that even though it was con-
10 As peeling is only used to model the possibility of implicit redistribution due to the back-edge of a loop, it is 
only necessary when there is actually redistribution present within a loop.
I ll
Figure 6.11: Phase transition graph and solution for Shallow 
(Displayed with the selected phase transition path and cummulative cost.)
112
Figure 6.12: Performance of Shallow
sidering any possible redistribution, it will still select a static data distribution if that is what is 
predicted to have the best performance.
To evaluate the performance of the selected distribution, a parallel MPI program is gener­
ated using the prototype PARADIGM compiler [11,89]. Since the PARADIGM compiler does 
not fully support statements with differing left-hand side index expressions for partitioned array 
dimensions, loop distribution was first performed by hand on several loops to remove the need 
for any masks around such statements. Shallow also contains a number of loops that simply 
copy data from one region of the array to another. When such a loop is parallelized, vector­
ized communication operations are generated by PARADIGM, which perform the data move­
ment directly between the local address spaces of the communicating processors. However, 
PARADIGM does not currently detect when such loops are made unnecessary by the data move­
ment of the communication so such loops were also removed by hand after the parallel program 
was generated.
In Figure 6.12, the performance of the selected static 2-D distribution (BLOCK, BLOCK) is 
compared to a static 1-D row-wise distribution (BLOCK, *) which appeared in some of the sub­
phases. The 2-D distribution matches the performance of the 1-D distribution for small numbers 
of processors while outperforming the 1-D distribution for both the Paragon and the CM-5 (by 
up to a factor of 1.6 or 2.7, respectively) for larger numbers of processors.
113
Kennedy and Kremer have also examined the Shallow benchmark, but predicted that a one­
dimensional (column-wise) block distribution was the best distribution [112] for up to 32 pro­
cessors of an Intel iPSC/860 hypercube (also showing that the performance of a 1-D column­
wise distribution was almost identical a 1-D row-wise distribution). They chose this distribution 
because they only exhaustively enumerated one-dimensional candidate layouts for the opera­
tional phases.11 As their operational definition already results in 28 phases (over four times as 
many in comparison to our approach), the complexity of their resulting 0-1 integer program­
ming formulation will also only increase further by considering multidimensional layouts.
6.3 Summary
In this chapter, we have presented an experimental evaluation of the optimization techniques 
presented in this thesis. All of these techniques have been implemented in the PARADIGM 
compiler framework except for the coarse-grain pipelining transformation and optimal granu­
larity selection technique which were performed by hand simulating the actions that would have 
been taken by the compiler. In the final chapter, Chapter 7, we present our conclusions on this 
’work, and discuss directions for future work in this area. 1
11 They justify this decision [67] since the Fortran D prototype compiler can not compile multidimensional dis­
tributions [80].
114
CHAPTER 7
CONCLUSIONS
In this thesis, we have presented our contributions which were focused in three main areas 
of performance optimization compilation techniques for distributed-memory multicomputers, 
each of which was shown to build upon the previous:
(1) Optimizing communication,
(2) Optimizing data distribution using the available communication optimizations, and
(3) Optimizing redistribution for dynamic data distributions.
7.1 Summary of Contributions
One of the most complex tasks facing a user when writing parallel programs for distributed- 
memory machines (using a message-passing programming model) is efficiently managing in­
terprocessor communication. In this thesis, it has been shown that this task can be performed by 
the compiler at a higher level through the use of static analysis and good estimates of the costs 
of communication and computation. It has also been shown that the application of the presented 
optimization techniques yields high performance on several different distributed-memory mul­
ticomputers. By applying the presented optimizations, it was possible to amortize communica­
tion overhead obtaining large gains in performance. As observed, the performance of the op­
timizations is directly influenced by the relative costs of communication and computation on a 
given machine. The estimates presented in this thesis account for variation in both computa­
115
tional power as well as the communication latency and bandwidth of different machines. By 
comparing the estimated optimal granularity for message pipelining to that for the measured 
data, it is also apparent that automatic selection of a granularity that gives near-optimal perfor­
mance is possible.
Currently, the PARADIGM compiler can generate programs based upon the owner-computes 
rule using either run-time resolution or loop bounds reduction through the use of static program 
analysis. The message coalescing, message vectorization, and message aggregation optimiza­
tions have all been implemented as described in this thesis. For this research, the coarse grain 
pipeline transformation and optimal granularity selection technique were performed by hand 
simulating the actions that would have been taken by the compiler. In future work, these tech­
niques will be integrated into the compiler.
We have also demonstrated how dynamic data partitionings can provide higher performance 
than purely static distributions for programs containing competing data access patterns. The 
distribution selection technique presented in this thesis provides a means of automatically de­
termining high quality data distributions (dynamic as well as static) in an efficient manner taking 
into account both the structure of the input program as well as the architectural parameters of the 
target machine. Heuristics, based on the observation that high communication costs are a result 
of not being able to statically align every reference in complex programs simultaneously, are 
used to form the communication graph. By removing constraints between competing sections 
of the program, better distributions can potentially be obtained for the individual sections and if 
the gains in performance are high enough in comparison to the cost of redistribution, dynamic 
distributions are formed. Communication still occurs, but data movement is now a dynamic re­
organization of ownership as opposed to simply obtaining any required remote data based on a 
(compromise) static assignment of ownership.
A key requirement in automating this selection process is to be able to obtain estimates of 
communication and computation costs which accurately model the behavior of the program un­
der a given distribution. Furthermore, by building upon existing static partitioning techniques, 
the number of phases examined as well as the amount of redistribution considered are kept to 
a minimum. Other than the additional analysis required for handling cut windows during the
116
phase decomposition process, and some additional support required for automatically perform­
ing loop peeling in the phase transition graph, these techniques have been fully implemented in 
the PARADIGM compiler.
In this thesis we have also presented an interprocedural data-flow technique that can be used 
to transform redistribution to distributions or distributions to redistribution as well as optimize 
redistribution while always maintaining the semantics of the original program. Aliasing infor­
mation is also used to ensure that the resulting distributions specified by the HPF program yield 
identical distributions for all aliased arrays. Other than for the support to physically perform re­
distribution in the middle of a statement (required for a return from a function call with different 
distributions) and support for array reshaping (which can occur through function parameters), 
these techniques have been fully implemented in the PARADIGM compiler.
HPF compilers that are currently available as commercial products [113-116] or those that 
have been developed as research prototypes [3, 80,117,118] do not yet support transcriptive 
argument passing or the REDISTRIBUTE and REALIGN directives as there is still much work re­
quired to provide efficient support for the HPF subset (which does not include these features). 
Since the techniques presented in this thesis can convert all of these features into HPF constructs 
which are in the HPF subset (through the use of SDA), this framework can be used to provide 
these features as a front end to an existing commercial or research compiler.
7.2 Future Work
In this research, we have covered a wide spectrum of interrelated optimization problems 
from communication optimization to distribution selection and redundant redistribution elimi­
nation. The techniques we have developed and implemented as part of the PARADIGM com­
piler framework form a solid basis from which a number of different research directions could 
be pursued in any of the areas of program optimization addressed in this thesis.
117
7.2.1 Optimizing communication
For this research, the coarse-grain pipeline transformation and granularity selection tech­
nique were performed by hand since the prototype PARADIGM compiler does not currently 
detect opportunities for pipelining. Future research will examine different approaches for de­
termining when the pipeline transformation is valid and automatically applying the granularity 
optimization technique as presented in this thesis.
The communication optimizations presented in this thesis currently optimize communica­
tion within a loop nest. By extending these techniques using existing data-flow analysis tech­
niques, [33,35] it would be possible to further optimize communication between separate loop 
nests by removing communication that can be determined to be redundant when taking into ac­
count the program’s control flow
Since communication is also currently generated as send/receive pairs that are placed at the 
same point in a program, this has the potential for causing more synchronization than is neces­
sary. By using data-flow analysis techniques, it should also be possible to move send operations 
to the earliest point in the program at which the data is read forming a window over which the 
communication operation can be performed. The earlier communication is started in this win­
dow, the higher the message buffer requirements become when taking into account overlapping 
windows from other communication operations. This trade-off between latency hiding and the 
memory requirements for the operating system to buffer incoming messages must be taken into 
account when applying such an optimization.
7.2.2 Optimizing data distribution
The data partitioning algorithms can also be further extended by defining a common inter­
face (similar in spirit to the libsum  pass in Parafrase-2 [119]) for defining input and output dis­
tributions and performance estimation profiles for library functions. Further investigation into 
the application of statement reordering and loop distribution during the selection of the cut in the 
phase decomposition step could also be performed to reduce the amount of required redistribu­
tion. To handle more general control flow during the phase selection process, the applicability
118
of existing static and profile-driven trace selection techniques should also be investigated. To 
be able to perform whole program partitioning analysis, future work will also address interpro­
cedural partitioning analysis using representations such as the Hierarchical Component Affinity 
Graph.
As we have concentrated on applications with regular access patterns that are amenable to 
static analysis, future work will also address data partitioning for programs that also contain ir­
regular references (i.e., subscripted subscripts in which the access patterns are not determined 
until run time) [120]. Although much of the static analysis will no longer apply, the data par- 
titioner may still be able to recommend alignments between irregularly accessed arrays which 
are indexed by the same run-time dependent pattern or to optimize the distribution for regular 
accesses while minimizing the worst-case effect for irregular references.
In this thesis we have only considered optimizing data distributions using only the data par­
allelism present in an application. Other recent work in the PARADIGM project that utilizes 
both task and data parallelism simultaneously [121] currently assumes that the task graphs [ 122] 
and data distributions for the individual tasks are selected separately. Using the techniques pre­
sented in this thesis, the data distributions for the individual tasks could be selected in isolation, 
but would ignore the intertask communication. Ideally, the data distributions for the individual 
tasks should be selected while taking into account any distribution preferences due to the task 
dependencies.
The current static cost estimation framework requires all program variables to be known 
constants at compile time [8]. As this is not always possible for most real programs which can 
have variable sized data sets that may be constant but are not determined until run time, it would 
be desirable to be able to manipulate compile-time unknowns symbolically within the cost ex­
pressions. Since most applications that use cost estimates only require comparisons to be made 
between two different costs, symbolic comparison could be performed in situations where nu­
merical costs are not available. Another option would be to integrate the performance estima­
tion framework with feedback from a parallel performance profiling system, similar to the joint 
work currently being performed between the Fortran D and Pablo projects [123]. By using a 
hybrid approach of static estimation and profiling techniques, it would also be possible to take
119
into account the effects due to the memory hierarchy to further improve the accuracy of the es­
timates.
7.2.3 Optimizing data redistribution
Further work will focus on the application of code transformations (such as loop peeling), 
which will help reduce the number of reaching distributions to uses of variables with multiple 
active distributions. When it is not possible to completely eliminate such regions, code repli­
cation approaches, similar to the selective function cloning techniques used in this thesis, will 
also be examined to allow an HPF compiler to efficiently compile such code regions. This type 
of situation is not currently supported in PARADIGM due to the difficulty in handling such sit­
uations.
Currently, redistribution communication is synthesized in a demand-driven fashion (just in 
time, or lazy, redistribution). Although not examined in this thesis, it would also be possible to 
take a lazy redistribution solution and determine the earliest possible time that the redistribution 
could be performed forming a window over which redistribution could occur. It would be advan­
tageous to position multiple redistribution operations in overlapping windows to the same point 
in the program in order to aggregate the communication thereby reducing the amount of com­
munication overhead. It would also be possible to consider split phase redistribution operations 
in order to hide communication latency by sending all data in an eager fashion and receiving 
it in a lazy fashion. The trade-offs of such an approach are similar to those found when hiding 
latency by separating communication send and receive pairs.
7.2.4 Distributed-shared memory architectures
Finally, as all of the research in this thesis has focused on purely distributed-memory archi­
tectures, future research could also address the application of these optimizations in the scope 
of distributed-shared memory architectures. In these architectures, memory is still physically 
distributed among the processors, but additional hardware in the memory system is used to pro­
vide a global name space (thereby providing a simpler mechanism to access remote memory
120
in addition to also being able to support message-passing programming models). As this type 
of architecture will still have differing costs between local and remote memory accesses, it is 
still beneficial to optimize interprocessor communication (now in the form of data prefetching 
and forwarding, as previously discussed in Section 2.1) as well as optimize the distribution and 
redistribution of program data. These techniques now reduce the amount of remote memory 
references generated by underlying hardware mechanisms as opposed to individual messages.
121
REFERENCES
[1] Office of Science and Technology Policy, Grand Challenges 1993: High Performance
Computing and Communications. Washington, D.C.: National Science Foundation,
1992.
[2] C. Koelbel, D. Loveman, R. Schreiber, G. Steele, Jr., and M. Zosel, The High Perfor­
mance Fortran Handbook. Cambridge, MA: The MIT Press, 1994.
[3] P. Banerjee, J. A. Chandy, M. Gupta, E. W. Hodges IY, J. G. Holm, A. Lain, D. J. 
Palermo, S. Ramaswamy, andE. Su, “The PARADIGM compiler for distributed-memory 
multicomputers,” IEEE Computer, vol. 28, no. 10, pp. 37^4-7, Oct. 1995.
[4] Message-Passing Interface Forum, “MPI: A message-passing interface standard,” Uni­
versity of Tennessee, Knoxville, TN, Tech. Rep. CS-94-230, 1994.
[5] High Performance Fortran Forum, “High Performance Fortran language specification, 
version 1.1,” Center for Research on Parallel Computation, Rice University, Houston, 
TX, Tech. Rep., Nov. 1994.
[6] C. D. Polychronopoulos, M. Girkar, M. R. Haghighat, C. L. Lee, B. Leung, and 
D. Schouten, “Parafrase-2: An environment for parallelizing, partitioning, synchroniz­
ing and scheduling programs on multiprocessors,” in Proceedings o f the 18th Interna­
tional Conference on Parallel Processing, St. Charles, EL, Aug. 1989, pp. 11:39-^48.
[7] E. W. Hodges IV, “High Performance Fortran support for the PARADIGM compiler,” 
M.S. thesis, Department of Electrical and Computer Engineering, University of Illinois, 
Urbana, IL, Oct. 1995, CRHC-95-23/UILU-ENG-95-2237.
[8] M. Gupta, “Automatic data partitioning on distributed memory multicomputers,” Ph.D. 
dissertation, Department of Computer Science, University of Illinois, Urbana, IL, Sept. 
1992, CRHC-92-19/UILU-ENG-92-2237.
[9] D. Callahan and K. Kennedy, “Compiling programs for distributed-memory multipro­
cessors,” The Journal o f Supercomputing, vol. 2, no. 2, pp. 151-169, Oct. 1988.
[10] S. Hiranandani, K. Kennedy, and C. Tseng, “Compiling Fortran D for MIMD distributed 
memory machines,” Communications of the ACM, vol. 35, no. 8, pp. 66-80, Aug. 1992.
122
[11] E. Su, D. J. Palermo, and P. Banerjee, “Processor Tagged Descriptors: A data structure 
for compiling for distributed-memory multicomputers,” in Proceedings of the 1994 In­
ternational Conference on Parallel Architectures and Compilation Techniques, Montreal, 
Canada, Aug. 1994, pp. 123-132.
[12] E. Su, A. Lain, S. Ramaswamy, D. J. Palermo, E. W. Hodges IV, and P. Banerjee, “Ad­
vanced compilation techniques in the PARADIGM compiler for distributed memory mul­
ticomputers,” in Proceedings of the 9th ACM International Conference on Supercomput­
ing, Barcelona, Spain, July 1995, pp. 424-433.
[13] D. J. Palermo, E. Su, J. A. Chandy, and P. Banerjee, “Compiler optimizations for dis­
tributed memory multicomputers used in the PARADIGM compiler,” in Proceedings of 
the 23rd International Conference on Parallel Processing, St. Charles, IL, Aug. 1994,
pp. 11:1-10.
[14] G. A. Geist, A. Beguelin, J. J. Dongarra, W. Jiang, R. Manchek, and V. S. Sunderam, 
PVM 3.0 User’s Guide and Reference Manual, Oak Ridge National Laboratory, Oak 
Ridge, TN, Feb. 1993.
[15] G. A. Geist, M. T. Heath, B. W. Peyton, and P. H. Worley, “PICL: A portable instru­
mented communication library, C reference manual,” Oak Ridge National Laboratory, 
Oak Ridge, TN, Tech. Rep. ORNL/TM-11130, July 1990.
[16] J. Li and M. Chen, “The data alignment phase in compiling programs for distributed- 
memory machines,” Journal of Parallel and Distributed Computing, vol. 13, no. 2, pp. 
213-221, Oct. 1991.
[17] W. Abu-Sufah, D. J. Kuck, and D. H. Lawrie, “Automatic program transformatoins for 
virtual memory computers,” in Proceedings of the 1979 National Computer Conference, 
June 1979, pp. 969-974.
[18] W. Jalby and U. Meier, “Optimizing matrix operations on a parallel multiprocessor with 
a hierarchical memory system,” in Proceedings of the 15th International Conference on 
Parallel Processing, St. Charles, EL, Aug. 1986, pp. 429-432.
[19] K. Gallivan, W. Jalby, and D. Gannon, “On the problem of optimizing data transfers for 
complex memory systems,” in Proceedings of the Second ACM International Conference 
on Supercomputing, Saint Malo, France, 1988, pp. 238—253.
[20] M. S. Lam, E. E. Rothberg, and M. E. Wolf, “The cache performance and optimizations 
of blocked algorithms,” in Proceedings of the Fourth International Conference on Archi­
tectural Support for Programming Languages and Operating Systems, Santa Clara, CA, 
Apr. 1991, pp. 63-74.
[21] M. J. Wolfe, “More iteration space tiling,” in Proceedings o f Supercomputing ’89, Reno, 
NV, Nov. 1989.
123
[22] M. E. Wolf, “Improving locality and parallelism in nested loops,” Ph.D. dissertation, 
Computer Systems Laboratory, Stanford University, Stanford, CA, Aug. 1992, CSL- 
TR-92-538.
[23] D. Callahan, K. Kennedy, and A. Porterfield, “Software prefetching,” in Proceedings 
of the Fourth International Conference on Architectural Support for Programming Lan­
guages and Operating Systems, Santa Clara, CA, Apr. 1991, pp. 40-52.
[24] T. C. Morwry, M. S. Lam, and A. Gupta, “Design and evaluation of a compiler algorithm 
for prefetching” in Proceedings o f the Fifth International Conference on Architectural 
Support for Programming Languages and Operating Systems, Boston, MA, Oct. 1992, 
pp. 62-73.
[25] W. Y. Chen, S. A. Mahlke, W. W. Hwu, T. Kiyohara, and P. P. Chang, “Tolerating data 
access latency with register preloading,” in Proceedings of the Sixth ACM International 
Conference on Supercomputing, Washington D.C., July 1992, pp. 104-113.
[26] R. L. Lee, P.-C. Yew, and D. H. Lawrie, “Data prefetching in shared memory multipro­
cessors,” in Proceedings o f the 16th International Conference on Parallel Processing, 
St. Charles, IL, Aug. 1987, pp. 28-31.
[27] E. H. Gomish, E. D. Granston, and A. V. Veidenbaum, “Compiler-directed data prefetch­
ing in multiprocessors with memory hierarchies,” in Proceedings o f the Fourth ACM In­
ternational Conference on Supercomputing, Amsterdam, The Netherlands, May 1990, 
pp. 354-368.
» [28] D. K. Poulsen, “Memory latency reduction via data prefetching and data forwarding 
in shared memory multiprocessors,” Ph.D. dissertation, Department of Electrical and 
Computer Engineering, University of Illinois, Urbana, IL, Sept. 1994, CSRD-1377.
[29] A. Rogers and K. Pingali, “Process decomposition through locality of reference,” in 
Proceedings o f the ACM SIGPLAN ’89 Conference on Programming Language Design 
and Implementation, Portland, OR, June 1989, pp. 69-80.
[30] M. Gemdt, “Updating distributed variables in local computations,” Concurrency: Prac­
tice and Experience, vol. 2, no. 3, pp. 171-193, Sept. 1990.
[31] V. Balasundaram, G. Fox, K. Kennedy, and U. Kremer, “An interactive environment for 
data partitioning and distribution,” in Proceedings o f the Fifth Distributed Memory Com­
puting Conference, Charleston, SC, Apr. 1990, pp. 11:1160-1170.
[32] S. Hiranandani, K. Kennedy, and C.-W. Tseng, “Evaluation of compiler optimizations 
for Fortran D on MIMD distributed-memory machines,” in Proceedings o f the Sixth ACM 
International Conference on Supercomputing, Washington D.C., July 1992, pp. 1-14.
[33] C. Gong, R. Gupta, and R. Melhem, “Compilation techniques for optimizing communi­
cation on distributed-memory systems,” in Proceedings of the 22nd International Con­
ference on Parallel Processing, St. Charles, IL, Aug. 1993, pp. 11:39-46.
124
[34] S. P. Amarasinghe and M. S. Lam, “Communication optimization and code generation 
for distributed memory machines,” in Proceedings o f the ACM SIGPLAN ’93 Conference 
on Programming Language Design and Implementation, Albuquerque, NM, June 1993, 
pp. 126-138.
[35] M. Gupta, E. Schonberg, and H. Srinivasan, “A unified data-flow framework for optimiz­
ing communication,” in Proceedings of the 7th Workshop on Languages and Compilers 
fo r Parallel Computing, Ithica, NY, 1994, vol. 892 of Lecture Notes in Computer Science, 
pp. 263-282, Springer-Verlag, 1995.
[36] A. Rogers and K. Pingali, “Compiling for distributed memory architectures,” IEEE 
Transactions on Parallel and Distributed Systems, vol. 5, no. 2, pp. 281-298, Feb. 1994.
[37] Y. Muraoka, “Parallelism exposure and exploitation in programs,” Ph.D. dissertation, 
Department of Computer Science, University of Illinois, Urbana, IL, Feb. 1971, UICS 
71-424.
[38] R. M. Karp, R. E. Miller, and S. Winograd, “The organization of computations for uni­
form recurrence equations,” Journal o f the ACM, vol. 14, no. 3, pp. 563-590, July 1967.
[39] L. Lamport, “The parallel execution of DO loops,” Communications of the ACM, pp. 
83-93, Feb. 1974.
[40] R. H. Kuhn, “Optimization and interconnection complexity for: Parallel processors, 
single-stage networks, and decision trees,” Ph.D. dissertation, Department of Computer 
Science, University of Illinois, Urbana, EL, Feb. 1980, UICS 80-1009.
[41] M. J. Wolfe, “Optimizing supercompilers for supercomputers,” Ph.D. dissertation, Cen­
ter for Supercomputing Research and Development, University of Illinois, Urbana, IL, 
Oct. 1982, CSRD-329.
[42] D. I. Moldovan, “On the design of algorithms of VLSI systolic arrays,” Proceedings of 
the IEEE, vol. 71, no. 1, pp. 113-120, 1983.
[43] D. I. Moldovan, Parallel Processing: From Applications to Systems. San Mateo, CA: 
Morgan Kaufman Publishers, 1993.
[44] U. Banerjee, “Unimodular transformations of double loops,” in Proceedings o f the Third 
Workshop on Languages and Compilers for Parallel Computing, Irvine, CA, Aug. 1990, 
pp. 192-219, The MIT Press, 1991.
[45] U. Banerjee, Loop Transformations for Restructuring Compilers: The Foundations. 
Boston, MA: Kluwer Academic Publishers, 1993.
[46] H. T. Kung, “Why systolic architectures?,” IEEE Computer, vol. 15, no. 1, pp. 37^16, 
Jan. 1982.
i
125
[47] D. I. Moldovan and J. A. B. Fortes, “Partitioning and mapping algorithms into fixed size 
systolic arrays,” IEEE Transactions on Computers, vol. C-35, no. 1, pp. 1-12, 1986.
[48] R. Cytron, “Doacross: Beyond vectorization for multiprocessors,” in Proceedings o f the 
15th International Conference on Parallel Processing, St. Charles, IL, Aug. 1986, pp. 
836-814.
[49] D. Padua and M. Wolfe, “Advanced compiler optimizations for supercomputers,” Com­
munications o f the ACM, vol. 29, no. 12, pp. 1184-1201, Dec. 1986.
[50] P. S. Tseng, “A parallelizing compiler for distributed-memory parallel computers,” Ph.D. 
dissertation, Electrical and Computer Engineering, Carnegie Mellon University, Pitts­
burgh, PA, May 1989, CMU-CS-89-148.
[51] P S .  Tseng, “Compiling programs for a linear systolic array,” in Proceedings o f the 
ACM SIGPLAN ’90 Conference on Programming Language Design and Implementation, 
White Plains, NY, June 1990, pp. 311-321.
[52] H. Zima, H. Bast, and M. Gemdt, “SUPERB: A tool for semi-automatic MIMD/SIMD 
parallelization,” Parallel Computing, vol. 6, pp. 1-18, 1988.
[53] M. Mace, Memory Storage Patterns in Parallel Processing. Boston, MA: Kluwer Aca­
demic Publishers, 1987.
[54] K. Knobe, J. Lukas, and G. Steele, Jr., “Data optimization: Allocation of arrays to reduce 
communication on SIMD machines,” Journal o f Parallel and Distributed Computing, 
vol. 8, no. 2, pp. 102-118, Feb. 1990.
[55] J. Ramanujam and P. Sadayappan, “Compile-time techniques for data distribution in dis­
tributed memory machines,” IEEE Transactions on Parallel and Distributed Systems, 
vol. 2, no. 4, pp. 472-481, Oct. 1991.
[56] S. Wholey, “Automatic data mapping for distributed-memory parallel computers,” in 
Proceedings of the Sixth ACM International Conference on Supercomputing, Washington 
D.C., July 1992, pp. 25-34.
[57] M. Gupta and P. Banerjee, “PARADIGM: A compiler for automated data partitioning 
on multicomputers,” in Proceedings o f the 7th ACM International Conference on Super­
computing, Tokyo, Japan, July 1993.
[58] S. Chatterjee, J. R. Gilbert, R. Schreiber, and S. H. Teng, “Automatic array alignment 
in data-parallel programs,” in Proceedings o f the 20th ACM SIGPLAN Symposium on 
Principles o f Programming Languages, Charleston, SC, Jan. 1993, pp. 16-28.
[59] J. M. Anderson and M. S. Lam, “Global optimizations for parallelism and locality on 
scalable parallel machines,” in Proceedings o f the ACM SIGPLAN ’93 Conference on 
Programming Language Design and Implementation, Albuquerque, NM, June 1993, pp. 
112-125.
126
[60] D. Bau, I. Koduklula, V. Kotlyar, K. Pingali, and P. Stodghill, “Solving alignment using 
elementary linear algebra,” in Proceedings o f the 7th Workshop on Languages and Com­
pilers for Parallel Computing, Ithica, NY, 1994, vol. 892 of Lecture Notes in Computer 
Science, pp. 46-60, Springer-Verlag, 1995.
[61] D. E. Hudak and S. G. Abraham, “Compiler techniques for data partitioning of sequen­
tially iterated parallel loops,” in Proceedings of the Fourth ACM International Confer­
ence on Supercomputing, Amsterdam, The Netherlands, June 1990, pp. 187-200.
[62] D. E. Hudak and S. G. Abraham, Compiling Parallel Loops for High Performance Com­
puters -  Partitioning, Data Assignment and Remapping. Boston, MA: Kluwer Academic 
Publishers, 1993.
[63] B. Chapman, T. Fahringer, and H. Zima, “Automatic support for data distribution on 
distributed memory multiprocessor systems,” in Proceedings of the Sixth Workshop on 
Languages and Compilers for Parallel Computing, Portland, OR, Aug. 1993, vol. 768 of 
Lecture Notes in Computer Science, pp. 184-199, Springer-Verlag, 1994.
[64] T. Fahringer, “Automatic performance prediction for parallel programs on massively par­
allel computers,” Ph.D. dissertation, University of Vienna, Austria, Sept. 1993, TR93-3.
[65] B. Chapman, P. Mehrotra, and H. Zima, “Programming in Vienna Fortran,” Scientific 
Programming, vol. 1, no. 1, pp. 31-50, Aug. 1992.
[66] R. Bixby, K. Kennedy, and U. Kremer, “Automatic data layout using 0-1 integer pro­
gramming,” in Proceedings o f the 1994 International Conference on Parallel Architec­
tures and Compilation Techniques, Montreal, Canada, Aug. 1994, pp. 111-122.
[67] U. Kremer, “Automatic data layout for High Performance Fortran,” Ph.D. dissertation, 
Rice University, Houston, TX, Oct. 1995, CRPC-TR95559-S.
[68] J. Garcia, E. Ayguade, and J. Labarta, “A novel approach towards automatic data dis­
tribution,” in Proceedings of the Workshop on Automatic Data Layout and Performance 
Prediction, Houston, TX, Apr. 1995.
[69] U. Kremer, J. Mellor-Crummey, K. Kennedy, and A. Carle, “Automatic data layout for 
distributed-memory machines in the d programming environment,” in Automatic Paral­
lelization -  New Approaches to Code Generation, Data Distribution, and Performance 
Prediction, 1993, pp. 136-152.
[70] T. J. Sheffler, R. Schreiber, J. R. Gilbert, and W. Pugh, “Efficient distribution analysis via 
graph contraction,” in Proceedings o f the 8th Workshop on Languages and Compilers for  
Parallel Computing, Columbus, OH, Aug. 1995, vol. 1033 of Lecture Notes in Computer 
Science, pp. 377-391, Springer-Verlag, 1996.
[71] T. J. Sheffler, J. R. Gilbert, R. Schreiber, and S. Chatteijee, “Aligning parallel arrays 
to reduce communication,” in Frontiers ’95: The Fifth Symposium on the Frontiers of 
Massively Parallel Computation, McLean, VA, 1995, pp. 324-331.
127
[72] S. Chatterjee, J. R. Gilbert, R. Schreiber, and T. J. Sheffler, “Array distribution in data- 
parallel programs,” in Proceedings o f the 7th Workshop on Languages and Compilers for  
Parallel Computing, Ithica, NY, 1994, vol. 892 of Lecture Notes in Computer Science, pp. 
76-91, Springer-Verlag, 1995.
[73] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools. 
Reading, MA: Addison-Wesley Publishing, 1986.
[74] M. Burke and R. Cytron, “Interprocedural dependence analysis and parallelization,” in 
Proceedings o f the ACM SIGPLAN Symposium on Compiler Construction, Palo Alto, 
CA, June 1986, pp. 162-175.
[75] M. W. Hall, B. R. Murphy, and S. P. Amarasinghe, “Interprocedural analysis for paral­
lelization,” in Proceedings of the 8th Workshop on Languages and Compilers for Parallel 
Computing, Columbus, OH, Aug. 1995, vol. 1033 of Lecture Notes in Computer Science, 
pp. 61-80, Springer-Verlag, 1996.
[76] P. Havlak and K. Kennedy, “Experience with interprocedural analysis of array side ef­
fects,” in Proceedings o f Supercomputing ’90, New York, NY, Nov. 1990, pp. 952-961.
[77] R. Triolet, F. Irigion, and P. Feautrier, “Direct parallelization of call statements,” Pro­
ceedings o f the ACM SIGPLAN Symposium on Compiler Construction, vol. 21, no. 7, pp. 
176-185,July 1986.
[78] L. Choi and P.-C. Yew, “Interprocedural array data-flow analysis for cache coherence,” in 
Proceedings of the 8th Workshop on Languages and Compilers for Parallel Computing, 
Columbus, OH, Aug. 1995, vol. 1033 of Lecture Notes in Computer Science, pp. 81-95, 
Springer-Verlag, 1996.
[79] M. W. Hall, S. Hiranandani, K. Kennedy, and C. Tseng, “Interprocedural compilation of 
Fortran D for MIMD distributed-memory machines,” in Proceedings o f Supercomputing 
’92, Minneapolis, MN, Nov. 1992, pp. 522-534.
[80] C. W. Tseng, “An optimizing Fortran D compiler for MIMD distributed-memory ma­
chines,” Ph.D. dissertation, Rice University, Houston, TX, Jan. 1993, COMP TR93-199.
[81] F. Coelho and C. Ancourt, “Optimal compilation of HPF remappings (extended ab­
stract),” Centre de Recherche en Informatique, Ecole des mines de Paris, Fontainebleau, 
France, Tech. Rep. CRIA-277, Nov. 1995.
[82] E. T. Kalns and L. M. Ni, “Processor mapping techniques toward efficient data redistribu­
tion,” in Proceedings o f the 8th International Parallel Processing Symposium, Cancun, 
Mexico, Apr. 1994, pp. 469-476.
[83] S. D. Kaushik, C.-H. Huang, R. W. Johnson, and P. Sadayappan, “An approach to 
communication-efficient data redistribution,” in Proceedings of the 8th ACM Interna­
tional Conference on Supercomputing, Manchester, U.K., July 1994, pp. 364-373.
128
[84] S. Ramaswamy and P. Banerjee, “Automatic generation of efficient array redistribution 
routines for distributed memory multicomputers,” in Frontiers ’95: The Fifth Symposium 
on the Frontiers o f Massively Parallel Computation, McLean, VA, Feb. 1995, pp. 342- 
349.
[85] R. Thakur, A. Choudhary, and G. Fox, “Redistribution of arrays in HPF,” in Proceedings 
of the 1994 Scalable High Performance Computing Conference, Knoxville, TN, May
1994, pp. 309-318.
[86] A. Wakatani and M. Wolfe, “A new approach to array redistribution: Strip mining redis­
tribution,” in PARLE ’94, July 1994.
[87] A. Wakatani and M. Wolfe, “Optimization of array redistribution for distributed memory 
multicomputers,” Parallel Computing, 1995.
[88] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser, “Active messages: A 
mechanism for integrated communication and computation,” in Proceedings of the 19th 
International Symposium on Computer Architecture, Queensland, Australia, May 1992, 
pp. 256-266.
[89] E. Su, D. J. Palermo, and P. Banerjee, “Automating parallelization of regular computa­
tions for distributed memory multicomputers in the PARADIGM compiler,” in Proceed­
ings o f the 22nd International Conference on Parallel Processing, St. Charles, EL, Aug. 
1993, pp. 11:30-38.
[90] T. Fahringer, R. Blasko, and H. P. Zima, “Automatic performance prediction to support 
parallelization of Fortran programs for massively parallel systems,” in Proceedings o f the 
Sixth ACM International Conference on Supercomputing, Washington D.C., July 1992, 
p p .347-356.
[91] V. Sarkar, “Determining average program execution times and their variance,” in Pro­
ceedings o f the ACM SIGPLAN ’89 Conference on Programming Language Design and 
Implementation, Portland, OR, June 1989, pp. 298-312.
[92] J. Li and M. Chen, “Index domain alignment: Minimizing cost of cross-referencing be­
tween distributed arrays,” in Frontiers ’90: The Third Symposium on the Frontiers of 
Massively Parallel Computation, College Park, MD, Oct. 1990, pp. 424-433.
[93] M. Gupta and P. Banerjee, “Compile-time estimation of communication costs on multi­
computers,” in Proceedings o f the Sixth International Parallel Processing Symposium, 
Beverly Hills, CA, Mar. 1992, pp. 470^175.
[94] V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction to Parallel Computing: De­
sign and Analysis o f Algorithms. Redwood City, CA: Benjamin/Cummings Publishing,
1995.
129
[95] G. Golub and J. M. Ortega, Scientific Computing: An Introduction with Parallel Com­
puting. San Diego, CA: Academic Press, 1993.
[96] R. E. Tarjan, Data Structures and Network Algorithms. Philadelphia, PA: Society for 
Industrial and Applied Mathematics, 1983.
[97] J. A. Fisher, “Trace scheduling: A technique for global microcode compaction,” IEEE 
Transactions on Computers, vol. c-30, pp. 478-490, July 1981.
[98] W. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Waiter, R. A. Bringmann, R. G. 
Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery, “The 
Superblock: An effective technique for VLIW and superscalar compilation,” The Journal 
of Supercomputing, vol. 7, no. 1, pp. 229-248, Jan. 1993.
[99] E. Ayguade, J. Garcia, M. Girones, M. L. Grande, and J. Labarta, “Data redistribution in 
an automatic data distribution tool,” in Proceedings of the 8th Workshop on Languages 
and Compilers for Parallel Computing, Columbus, OH, Aug. 1995, vol. 1033 of Lecture 
Notes in Computer Science, pp. 407-421, Springer-Verlag, 1996.
[100] M. Girkar and C. D. Polychronopoulos, “Automatic extraction of functional parallelism 
from ordinary programs,” IEEE Transactions on Parallel and Distributed Systems, vol. 
3, no. 2, pp. 166-178, Mar. 1992.
[101] D. J. Palermo, E. W. Hodges IV, and P. Banerjee, “Dynamic data partitioning for 
distributed-memory multicomputers,” Journal o f Parallel and Distributed Computing, 
1996, to appear in a special issue on Compilation Techniques for Distributed Memory 
Systems.
[102] D. J. Palermo, E. W. Hodges IV, and P. Banerjee, “Interprocedural array redistribution 
data-flow analysis,” in Proceedings o f the 9th Workshop on Languages and Compilers 
for Parallel Computing, Santa Clara, CA, Aug. 1996, to appear.
[103] A. Holub, Compiler Design in C. Englewood Cliffs, NJ: Prentice Hall, 1990.
[104] R. von Hanxleden and K. Kennedy, “G iv e-N-Take -  A balanced code placement 
framework,” in Proceedings o f the ACM SIGPLAN ’94 Conference on Programming 
Language Design and Implementation, Orlando, FL, June 1994, pp. 107-120.
[105] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck, “Efficiently com­
puting static single assignment form and the control dependence graph,” ACM Transac­
tions on Programming Languages and Systems, vol. 13, no. 4, pp. 451^490, Oct. 1991.
[106] F. McMahon, “The Livermore Fortran kernels: A computer test of the numerical per­
formance range,” Lawrence Livermore National Laboratory, Livermore, CA, Tech. Rep. 
UCRL-53745, Dec. 1986.
130
[107] Perfect Club, “The Perfect Club benchmarks: Effective performance evaluation of su­
percomputers,” The International Journal o f Supercomputing Applications, vol. 3, no. 3, 
pp. 5-40, Fall 1989.
[108] D. J. Palermo, E. Su, E. W. Hodges IV, and P. Banerjee, “Compiler support for privati­
zation on distributed-memory machines,” in Proceedings of the 25 th International Con­
ference on Parallel Processing, Bloomingdale, IL, Aug. 1996, to appear.
[109] M. T. Heath and J. A. Etheridge, “Visualizing the performance of parallel programs,” 
IEEE Software, vol. 8, no. 5, pp. 29-39, Sept. 1991.
[110] D. Bailey, J. Barton, T. Lasinski, and H. Simon, “The NAS parallel benchmarks,” NASA 
Ames Research Center, Moffett Field, CA, Tech. Rep. RNR-91-002, 1991.
[111] R. Sadoumy, “The dynamics of finite-difference models of the shallow-water equations,” 
Journal o f the Atmospheric Sciences, vol. 32, no. 4, Apr. 1975.
[112] K. Kennedy and U. Kremer, “Automatic data layout for High Performance Fortran,” in 
Proceedings o f Supercomputing ’95, San Diego, CA, Dec. 1995.
[113] Applied Parallel Research, Inc., Placerville, CA, XHPF User’s Guide, version 2.0, 1995.
[114] M. Gupta, S. Midkiff, E. Schonberg, V. Seshadri, D. Shields, K.-Y. Wang, W.-M. Ching, 
and T. Ngo, “An HPF compiler for the IBM SP2,” in Proceedings of Supercomputing 
’95, San Diego, CA, Dec. 1995.
[115] D. B. Loveman, “The DEC High Performance Fortran 90 compiler front end,” in Fron­
tiers ’95: The Fifth Symposium on the Frontiers o f Massively Parallel Computation, 
McLean, VA, 1995, pp. 46-53.
[116] The Portland Group, Inc., Wilsonville, OR, PGHPF User’s Guide, 1995.
[117] S. P. Amarasinghe, J. M. Anderson, M. S. Lam, and A. W. Lim, “An overview of a com­
piler for scalable parallel machines,” in Proceedings o f the Sixth Workshop on Languages 
and Compilers for Parallel Computing, Portland, OR, Aug. 1993, vol. 768 of Feature 
Notes in Computer Science, pp. 253-272, Springer-Verlag, 1994.
[118] Z. Bozkus, “Compiling Fortran 90D/HPF for distributed memory MIMD computers,” 
Ph.D. dissertation, Syracuse University, Syracuse, NY, June 1995, SCCS-730.
[119] C. D. Polychronopoulos, M. Girkar, M. R. Haghighat, C. L. Lee, B. Leung, and 
D. Schouten, “Parafrase-2 manual,” Center for Supercomputing Research and Devel­
opment, University of Illinois, Urbana, IL, Tech. Rep., Aug. 1990.
[120] A. Lain, “Compiler and run-time support for irregular computations,” Ph.D. dissertation, 
Department of Computer Science, University of Illinois, Urbana, IL, Oct. 1995, CRHC- 
92-22/UILU-ENG-95-2236.
131
[121] S. Ramaswamy, “Simultaneous exploitation of task and data parallelism in regular scien­
tific applications,” Ph.D. dissertation, Department of Electrical and Computer Engineer­
ing, University of Illinois, Urbana, EL, Jan. 1996, CRHC-96-03/UILU-ENG-96-2203.
[122] M. Girkar and C. Polychronopoulos, “Partitioning programs for parallel execution,” 
in Proceedings of the Second ACM International Conference on Supercomputing, Saint 
Malo, France, 1988, pp. 216-229.
[123] V. S. Adve, J. Mellor-Crummey, M. Anderson, K. Kennedy, J.-C. Wang, and D. A. Reed, 
“An integrated compilation and performance analysis environment for data parallel pro­
grams,” in Proceedings of Supercomputing ’95, San Diego, CA, Dec. 1995.
132
VITA
Daniel Joseph Palermo received the B.S. degree in Computer and Electrical Engineering 
from Purdue University, West Lafayette, Indiana, in May of 1990 and the M.S. degree in Com­
puter Engineering from the University of Southern California, Los Angeles, California, in De­
cember of 1991. He will receive the Ph.D. degree in Electrical Engineering from the University 
of Illinois at Urbana-Champaign in 1996.
During the summer of 1988, Palermo worked as a technical aide in the area of distributed 
information systems at The Mitre Corporation in Bedford, MA. During the summer of 1989, 
he joined Texas Instruments in Lewisville, TX, to work on an expert systems project for pre­
briefing operational flight software, and during the summers of 1990 and 1991 he worked as a 
software engineer at TI in the area of image processing. From January 1992 to the present, he 
has worked as a research assistant in the Center for Reliable and High-Performance Computing 
at the University of Illinois.
After completing his doctoral dissertation, Palermo will join the Hewlett-Packard Convex 
Technology Center in Dallas, TX. His research interests include high-performance computing 
systems, parallelizing and optimizing compilers, and parallel languages and application pro­
gramming.
133
