Compilation techniques for multicomputers by Michael, Gavin Constantine
C o m p ila tio n  T echniques for 
M u ltic o m p u te rs
G avin  C o n stan tin e  M ichael
A thesis submitted for the degree of 
Doctor of Philosophy of 
The Australian National University
January 1996
Except where otherwise indicated, this thesis is my own original work.
Gavin Michael 
24 January 1996
A b s tra c t
This thesis considers problems in process and data partitioning when compiling 
programs for distributed-memory parallel computers (or multicomputers). These 
partitions may be specified by the user through the use of language constructs, 
or automatically determined by the compiler.
Data and process partitioning techniques are developed for two models of 
compilation. The first compilation model focusses on the loop nests present in a 
serial program. Executing the iterations of these loop nests in parallel accounts for 
a significant amount of the parallelism which can be exploited in these programs. 
The parallelism is exploited by applying a set of transformations to the loop 
nests. The iterations of the transformed loop nests are in a form which can be 
readily distributed amongst the processors of a multicomputer. The manner in 
which the arrays, referenced within these loop nests, are partitioned between the 
processors is determined by the distribution of the loop iterations. The second 
compilation model is based on the data parallel paradigm, in which operations 
are applied to many different data items collectively. High Performance Fortran 
is used as an example of this paradigm.
Novel collective communication routines are developed, and are applied to 
provide the communication associated with the data partitions for both com­
pilation models. Furthermore, it is shown that by using these routines the 
communication associated with partitioning data on a multicomputer is greatly 
simplified. These routines are developed as part of this thesis.
The experimental context for this thesis is the development of a compiler for 
the Fujitsu AP1000 multicomputer. A prototype compiler is presented. Experi­
mental results for a variety of applications are included.
A cknow ledgem ents
I should like to express my indebtedness and great appreciation to the following 
people who have helped during the execution of the work reported in this thesis...
T o Richard Brent, Brian Molinari and Robin Stanton for their encouragement 
and counsel.
T o  Chris Johnson, Brendan McKay, John Zigman for their fruitful discussions 
and advice.
T o Peter Bailey, Trevor Vickers and Emma Yarrow for their comments on 
various drafts.
T o Sally Begg and Kath Read for administrative assistance.
T o David Hawking, David Walsh and the programming staff for technical as­
sistance.
T o James Popple and Richard Walker for T^Xmcal assistance.
C on ten ts
A bstract iii
A cknow ledgem ents iv
C ontents v
Figures viii
Glossary of sym bols x
Chapter 1: Introduction 2
1.1 Compilation for multicomputers....................................................................  2
1.2 Research program m e......................................................................................  3
1.3 Thesis structure................................................................................................ 4
Chapter 2: Com pilers for m ulticom puters 6
2.1 Parallel architectures and exploiting parallelism ........................................  6
2.1.1 Parallel architectures..........................................................................  7
2.1.2 Moving information a ro u n d ..............................................................  9
2.1.3 Exploiting parallelism .......................................................................  9
2.2 Restructuring com pilers................................................................................  10
2.2.1 Loop transformation and restructuring...........................................  11
2.2.2 General program transformations..................................................... 12
2.2.3 Restructuring programs for m ulticom puters..................................  13
2.2.4 Case studies.......................................................................................... 14
2.3 Multicomputer compilers .............................................................................  15
2.3.1 Transformation driven com pilers..................................................... 16
2.3.2 Language driven compilers ..............................................................  17
2.3.3 Analysis driven com pilers.................................................................  19
v
Contents VI
C h ap ter 3: C om m unication  routines 22
3.1 Fujitsu AP1000 architecture.........................................................................  22
3.1.1 System configuration.........................................................................  22
3.1.2 Cell a rch itec tu re ................................................................................ 24
3.2 Stride message transfer................................................................................... 25
3.2.1 Addressing elements of a two-dimensional a rra y ............................ 25
3.2.2 Stride message transfer operation....................................................  27
3.3 Collective communication ro u tin e s .............................................................  30
3.3.1 Generalised addressing of array elements........................................  32
3.3.2 Distribute collective communication ro u tin e ..................................  35
3.3.3 Concatenate collective communication routine...............................  40
C h ap te r 4: Im plicit d a ta  p a rtition ing  47
4.1 Loop s t r u c tu r e ...............................................................................................  47
4.1.1 The iteration space............................................................................. 48
4.1.2 Array access function ......................................................................  50
4.2 Tiling loop iteration sp ace ............................................................................. 51
4.2.1 Normalising the loop n e s t ................................................................  51
4.2.2 Deriving the tile sp ace ......................................................................  52
4.2.3 Allocating tiles to p rocesso rs..........................................................  55
4.2.4 Allocating iterations to p rocesso rs.................................................  56
4.3 Implicit data distribution and retrieval.......................................................  57
4.3.1 Determining the array read s e t .......................................................  60
4.3.2 Deriving the array receive partition.................................................  63
4.3.3 Determining the array write s e t .......................................................  66
4.3.4 Deriving the array send p a rtitio n ....................................................  68
C h ap ter 5: Explicit d a ta  d istribu tion : the  HPF directives 71
5.1 Data distribution m o d e l................................................................................ 72
5.2 The align d irec tiv e ......................................................................................... 73
5.2.1 Exact match ...................................................................................... 73
5.2.2 Intra-dimension alignm ent................................................................  74
5.2.3 Inter-dimension alignm ent................................................................  75
5.3 The processors d irec tive ................................................................................ 75
5.4 The distribute d ire c tiv e ................................................................................ 76
5.4.1 Cyclic distribution fo rm a t................................................................  77
5.4.2 Block distribution f o r m a t ................................................................  78
5.4.3 * distribution fo rm at.........................................................................  80
5.5 Explicit data d is tr ib u tio n ............................................................................. 80
5.5.1 Allocating array data to abstract processors..................................  81
5.5.2 Allocating abstract processors to the target multicomputer . . .  84
5.5.3 Allocating array data to physical processors..................................  88
5.5.4 Allocating aligned array data to physical p ro cesso rs ................... 92
Contents vii
C h ap ter 6: E xperim enta l resu lts  96
6.1 Structure of the experimental com piler........................................................ 97
6.2 Performance study method ..........................................................................  99
6.2.1 Performance measurement.................................................................  99
6.2.2 Analysis and comparison ....................................................................100
6.3 Automatic data partitioning results.................................................................100
6.3.1 Matrix m u ltip lica tion .......................................................................... 100
6.3.2 Computation of the Mandelbrot f r a c t a l ............................................104
6.4 NCAR Community Climate M o d e l.................................................................108
6.4.1 Brief overview of the m o d e l.................................................................109
6.4.2 Parallel implementation of the model..................................................109
6.4.3 Analysis of the parallel m odel.............................................................. I l l
C h ap ter 7: Conclusions 115
7.1 Future research ...................................................................................................115
7.2 The contributions of this th e s is ....................................................................... 116
A ppendix  A: M em ory space allocation function 120
A ppendix  B: C om m unication operations specification 123
B.l Distribute communication routine................................................................... 123
B. 2 Concatenate communication ro u t in e ............................................................. 125
A ppendix  C: Pseudocode descrip tion  127
C. l Declaration of variables ...................................................................................127
C.1.1 In te g e rs ..................................................................................................127
C.1.2 Single dimension a r r a y s ...................................................................... 128
C .l.3 S tructures...............................................................................................128
C .l.4 Buffers..................................................................................................... 128
C.2 L o o p s ..................................................................................................................129
C.3 Conditional Statements ...................................................................................130
C. 4 Declaring ro u t in e s ............................................................................................ 130
A ppendix  D: Syntax of th e  H PF directives 132
D. l The distribute d ire c tiv e ...................................................................................132
D.2 The align d irec tive ............................................................................................ 134
D. 3 The processors d irec tiv e ...................................................................................135
A ppendix  E: A com piler generated  program  136
E. l Host p rogram ............................................  136
E.2 Host lib ra ry ........................................................................................................ 136
E.3 Cell p ro g ra m ..................................................................................................... 137
E.4 Cell l ib r a r y ........................................................................................................ 137
Bibliography 142
F ig u res
2.1 Parallel computer architectures.......................................................................... 7
3.1 Fujitsu AP1000 system design............................................................................. 23
3.2 Fujitsu A PI000 cell architecture ....................................................................... 24
3.3 Layouts for a two-dimensional a r r a y ................................................................  26
3.4 Access pattern for the stride message transfer o p e ra tio n ............................... 28
3.5 Stride message transfer operation....................................................................... 29
3.6 Function sdma .....................................................................................................  30
3.7 Data movements associated with collective communication routines............  31
3.8 Dope vector of an rc-dimensional a r ra y .............................................................  33
3.9 Stride vector of a m-dimensional memory sp ace ..............................................  35
3.10 Declaration of the distribute communication ro u tin e ..................................... 36
3.11 Examples of the distribute collective communication routine......................... 37
3.12 Function d i s t ........................................................................................................  39
3.13 Declaration of the concatenate communication prim itive............................... 41
3.14 Examples of concatenate collective communication routine............................ 42
3.15 Function d co n ca t..................................................................................................  45
4.1 Perfectly nested loop n e s t ...................................................................................  48
4.2 Example perfectly nested loop n e s t ...................................................................  49
4.3 Normalised loop n e s t ............................................................................................  52
4.4 Tiled loop n e s t .....................................................................................................  53
4.5 Example tiled loop n e s t ......................................................................................  54
4.6 Partition of a tiled loop nest executed by a processor.....................................  55
4.7 Example partition of a tiled loop nest executed by a processor ................... 56
4.8 Allocation of loop iterations to processors.......................................................  57
4.9 Array reference enclosed in a tiled loop nest......................................................  58
4.10 Tiled and partitioned iteration and data sp aces ..............................................  59
4.11 Array reference enclosed in a partition of a tiled loop n e s t ............................ 62
viii
Figures ix
4.12 Partitioning array data ......................................................................................  66
4.13 Retrieving modified array d a t a ..........................................................................  70
5.1 HPF Data Distribution M odel.............................................................................  72
5.2 Example CYCLIC array d istribu tion ....................................................................  78
5 .3  Example BLOCK array d i s t r i b u t i o n .......................................................................  79
5.4 HPF code fragment involving distribute directives............................................  82
5.5 Block index set resulting from a HPF distribution...........................................  83
5.6 Allocation of abstract processors to physical processors..................................  86
5.7 Example allocation of abstract processors onto physical p rocesso rs ............. 87
5.8 Allocation of array data to physical processors ..............................................  88
5.9 HPF code fragment involving the align and distribute directives...................  92
5.10 Transformation of the source array data space.................................................. 95
6.1 Architecture of the experimental com piler........................................................ 98
6.2 Execution times for matrix m ultiplication...........................................................102
6.3 Performance of matrix multiplication .................................................................103
6.4 Execution times for Mandelbrot fractal.................................................................106
6.5 Performance of Mandelbrot fractal p rog ram ........................................................107
6.6 Decomposition of the CCM1 d o m a in .................................................................... 110
6.7 Execution times for the NCAR CCM1 .................................................................112
6.8 Performance of the NCAR C CM 1.......................................................................... 113
G lossary  of sym bols
G e n e r a l  S p a c e s  
R  
R n 
Z
Z n
T he set o f all real num bers 
T he set o f all n -dim ensional real vectors 
T he set o f all n a tu ra l num bers (including 0) 
T he set o f all n -dim ensional in teger vectors
L o g ic a l r e l a t io n s
a => b 
a V b 
a A b 
3 a  
V a
a im plies 6; b if a; a only if b 
a or b 
a and  b
T here exists an a 
For all a
S e ts
{ a , 6 , c }
0
a £ B  
A H B  
A C B
{ a e X  | P( a ) }
A set conta in ing  a, b and  c 
T he empty  s e t , {} 
a is an  elem ent of B  
A  in tersection  B  
A is a  sub se t of B
T he subset of the  set X  satisfying the  p ro p erty  P
F u n c t io n s
T \ X  -> Y  
T ~ l
one-to-one
A function  T  assigning a value in  Y  to  every elem ent in 
T he inverse of T \  T ~ l : Y  —> X
T  is one to one if and  only if T ( a )  is different for every 
value of a; T ( x )  = T ( y )  => x = y.
T o  g T  com posed w ith  Q\ T  o Q{x) = T{Q{x) )
G eneral operators
n
E
V ector operators 
a 
0 
1
a < b 
a /  6 
a  * b 
a — b 
a  4" b 
a ■ b 
a  div b 
a mod b 
a
_b_
a '
b
Glossary of symbols xi
Product
Sum
The n-dimensional vector (ai ,  <22,..., an ) 
The n-dimensional vector (0 ,0 , . . . ,  0) 
The n-dimensional vector (1 ,1 , . . . ,  1) 
a i <  &i A 02 <  bo A . . .  A an <  bn 
A 0-2 ~f~ ^2 A . . .  A CLji bn 
(a \ 6i, 02^2, . . . ,  an bn )
(ai — &i, Ü2 — 62,. • •, an — &„)
(ai 4- 61, 02 4- 2^5 • • • 5 an 4- bn)
(di b\ 4- a262 + . . .  4- 0Ln bn)
(ai div &i, a2 div • • •, dn div &„)
(ai mod 61, a.2 mod 62, . . . ,  an mod bn )
C hapters
1In tro d u c tio n
1.1 C om pilation  for m u lticom pu ters
Multiprocessor systems are classified into two categories, namely shared-memory 
parallel computers (or multiprocessors) and distributed-memory parallel com­
puters (multicomputers). Multiprocessors provide all processors with uniform 
access to a common global memory. The Alliant FX/2800, the Sequent Balance 
[122], the Cedar [76] and the IBM RP3 [102] are examples of multiprocessors. In 
a multicomputer the memory is distributed among the processors, each of which 
has access only to its local component. For example, the Connection Machine 
CM-5 [124], the IBM SPl, the IBM SP2 and the Fujitsu AP1000 [64] are all 
multicomputers.
In the context of building massively parallel systems, multicomputers offer 
an advantage in terms of cost and scalability. When the system size grows the 
system does not suffer from the memory contention as in large multiprocessors be­
cause the physical memory modules are local to the processors. However, writing 
efficient programs for multicomputers is a great challenge for the programmer.
The following issues arise in programming multicomputers:
• Process partitioning: parallelism must be made explicit in the program. 
This involves breaking a computation into a collection of tasks which are 
assigned to different processors and executed in parallel.
• Data partitioning: since there is no global memory on a multicomputer, 
the data structures (such as arrays) in a program must be partitioned and 
distributed to the processors.
• Communication and coordination: the distributed tasks in a parallel com­
putation generally need to share data or synchronise with each other.
2
§ 1.2 Research programme 3
The naive approach to programming multicomputers involves programming 
each processor with a conventional language, such as C or Fortran, using explicit 
communication commands in the program for sending and receiving messages. 
There are at least two difficulties associated with this approach: (1) machine ef­
ficiency may be very low if the computation is dominated by communication and 
synchronisation: and (2) the user is forced to manually partition processes and 
data, explicitly manage communication among processes and be aware of all data 
movements in the program. The advantage of this approach is that the program­
ming model matches the machine architecture, requiring the implementation of 
fewr new compilation techniques. Almost all commercial multicomputers support 
this approach. For example, C and Fortran with message passing primitives are 
available on the Fujitsu AP1000 and the Connection Machine CM-5.
One approach to overcoming the difficulties associated with programming a 
multicomputer is to emulate a shared-memory facility, known as virtual shared 
memory [2, 16, 24, 28, 32, 67, 82, 87]. However the overhead of supporting such 
a virtual shared memory, especially when the multicomputer consists of a large 
number of processors, may greatly offset the benefits of this solution.
Another approach to programming multicomputers employs a compiler to do 
much of the work. The mechanisms by which parallelism is achieved are either 
explicit or implicit. Implicit parallelism relies on the compiler to automatically 
uncover and exploit parallelism through comprehensive analysis of the structures 
inherent in serial computations. Explicit parallelism requires the user to invoke 
the parallelism in a computation through the use of some language construct.
The compiler generates an appropriate parallel program for the target mul­
ticomputer. For most compilers this parallel program corresponds to the single 
program multiple data (SPMD) model [66], where all processors execute the same 
program but operate on distinct data items. The user is relieved of the bur­
den of generating explicit message passing primitives. Research efforts include 
the Fortran D compiler [57, 58], Superb [131], Parafrase-2 [105, 106] and High 
Performance Fortran (HPF) [53]. Commercially available compilers for multicom­
puters which use this approach include MIMDizer [117] and Aspar [62].
The compilation approach to programming multicomputers can be viewed as 
a means of restructuring existing sequential programs into a form more amenable 
to multicomputer architectures, as well as encouraging the development of new 
programs. However, questions related to compilation techniques for multicom­
puters are still largely unanswered. The many challenges and promises of the 
compilation approach are the motivations for this thesis.
1.2 Research programme
This thesis considers problems in process and data partitioning when compiling 
programs for multicomputers. These may be specified by the user, through the 
use of compiler directives, or automatically determined by the compiler. The
§1.3 Thesis structure 4
experimental context for this thesis is the development of a compiler for the 
Fujitsu AP1000 multicomputer.
To this end two models of compilation for multicomputers are examined. The 
first compilation model focusses on the loop nests present in a serial program. 
Executing the iterations of these loop nests in parallel accounts for a significant 
amount of the parallelism which can be exploited in these programs. The par­
allelism is exploited by applying a set of transformations to the loop nests. The 
iterations of the transformed loop nests are in a form which can be readily dis­
tributed amongst the processors of a multicomputer. The manner in which the 
arrays, referenced within these loop nests, are partitioned between the processors 
is determined by the distribution of the loop iterations. The second compilation 
model is based on the data parallel paradigm, in which operations are applied to 
many different data items collectively. High Performance Fortran is used as an 
example of this paradigm. HPF provides the user with explicit control over the 
data partitioning utilising data alignment and distribution compiler directives.
Collective communication routines are developed, and are applied to provide 
the communication associated with the data partitions for both compilation mod­
els.
1.3 T h esis  structure
The body of this thesis is divided into seven chapters.
• Chapter 2 discusses previous work of relevance to the development of tech­
niques for programming multicomputers. The discussion focusses on par­
allel language models, restructuring compilers and compilers developed 
specifically for multicomputers. Several major research projects are de­
scribed.
• In chapter 3, the collective communication routines distribute and concat­
enate are introduced. These form the basis of the data partitioning methods 
outlined in the subsequent chapters of this thesis. A new message handling 
mode for the Fujitsu AP1000, called stride message transfer This message 
handling mode subsequently supports the implementation of the collective 
communication routines on the Fujitsu AP1000. An overview of the ar­
chitecture of the Fujitsu AP1000 together with the implementation of the 
collective communication routines on this multicomputer is incorporated in 
this discussion.
• The implicit data partitioning of array data is presented in chapter 4. The 
primary task of this data partitioning method is to identify the elements 
of an array required by each processor in the multicomputer and to utilise 
the collective communication routine distribute to deliver these elements to 
each processor by a single operation. Furthermore, it is also demonstrated
§1.3 Thesis structure 5
that it is possible to retrieve these elements back into a single array by a 
single operation using the collective communication routine concatenate.
• The High Performance Fortran (HPF) data distribution model is introduced 
in chapter 5. The focus of this discussion is the implementation of this 
model using the distribute collective communication routine. A new meth­
odology for analysing the HPF alignment and distribute compiler directives 
is developed. This methodology is then used to implement the HPF data 
distribution model on the Fujitsu AP1000.
• Chapter 6 introduces an experimental compiler for the Fujitsu AP1000 
which implements the process and data partitioning techniques developed 
in chapters 4 and 5. The results of a series of test cases, conducted using 
this compiler, which are designed to demonstrate the effectiveness of these 
techniques are reported.
• In chapter 7, conclusions are drawn about the approach to data partitioning 
taken in this thesis. Some enhancements to the collective communication 
routines are suggested, and avenues of further research are identified. Fi­
nally, the nature of the contribution made by this thesis is discussed.
There are five appendices to this thesis. Proof that the memory space alloc­
ation function, developed in chapter 3, is one-to-one is presented in appendix A. 
Appendix B contains a formal specification in the specification language Z [120], 
of each of the collective communication routines presented in chapter 3. A 
description of the pseudocode used in chapter 3 to describe each collective com­
munication routine is presented in appendix C. The syntax of the HPF compiler 
directives, described in chapter 5, is presented in appendix D. A complete ex­
ample of the output from the experimental compiler for one of the test cases 
presented in chapter 6 is given in appendix E.
C om pile rs  for m u ltico m p u te rs
This chapter examines alternative approaches to the compilation of programs for 
multicomputers. A model which establishes the place of such compilation in the 
wider context of parallel computing is presented in §2.1. Restructuring compilers, 
designed to automatically transform a sequential program to a semantically equi­
valent parallel program for a multicomputer, are described in §2.2. A description 
of compilers developed specifically for multicomputers then follows.
2.1 P aralle l arch itectures and exp lo itin g  
parallelism
For descriptive purposes through this chapter the following three layered model 
has been adopted
Programming model/High level language
Communication model/Low level language
Parallel architecture
Whilst being limited this model is sufficient for the exposition. Compiling pro­
grams for multicomputers can be viewed as mapping the parallelism exploited in 
a program, using some high level language, to its implementation on a parallel 
architecture. The remainder of this section discusses each layer of the model.
6
§2.1 Parallel architectures and exploiting parallelism 7
Memory
Processors
• • •
Processor interconnect network
□
 □
 _
 
I 
|
□  r
□  L □ 
n 
_
1 
i
• • •
□
 □
 _
 
I 
i
(a) Architecture of a shared-memory parallel computer or mutlipro- 
cessor. A single bus is used to connect the processors and memory. 
Processors often have substantial cahces providing closely coupled 
private memory.
Processors
Memory
Processor interconnect network
I
□
□
□
□
□
□ . . .
□
□
. . .
r“nrr----nr. —T-rr-----------------
' Memory interconnect network >
(b) Architecture of a distributed-memory parallel computer or multicom­
puter. Each processor of a multicomputer is coupled with a local memory 
module. In some architectures the memory modules may be connected 
to assist in supporting a virtual shared memory structure.
Figure 2.1: The two categories of parallel computer architectures: shared- 
memory and distributed-memory.
2 .1 .1  P ara lle l arch itectu res
In this discussion parallel architecture refers to the method by which the pro­
cessors and memory are connected to realise the parallel computer. The parallel 
architecture must be taken into account by the software.
The simplest approach to building a parallel machine is to connect several 
processors and memory modules through a single shared bus to create a shared- 
memory parallel computer (or multiprocessor). Figure 2.1(a) illustrates this 
architecture. The processors coordinate and communicate through the common 
memory space (see §2.1.2). The major advantage of this approach is that there is
§2.1 Parallel architectures and exploiting parallelism 8
no need to explicitly move data from processor to processor. The greatest weak­
ness of bus-based multiprocessor designs is that the bandwidth of the bus does 
not scale. As the number of processors increases it becomes more likely that when 
one processor requires the bus, another will already be using it. This contention 
is the major limitation of the practical size of based-based multiprocessors. Given 
current implementation techniques, this limitation leads to a machine of a few 
tens of processors.
One solution to this single-bus bottleneck is to provide multiple communica­
tion paths between the processors and memory. A crossbar switch will connect 
each processor directly with every memory module, so contention is entirely elim­
inated. This is the most efficient interconnection mechanism. However, the cost 
of connecting n processors with m memory modules—which can be expressed 
as a function of the number of cross-points required in the switch—is O(nm), 
compared with 0(n)  for the bus. Hence, the architecture becomes costly to 
implement as the number of processors increases.
Instead of viewing the close connection between each processor and memory as 
merely a means of eliminating memory contention, a distributed-memory parallel 
computer (or multicomputer) uses it as a basic building block. Each processor of 
multicomputer is coupled with a local memory module. This processor/memory 
combination is replicated and interconnected by a network. Figure 2.1 schemat­
ically illustrates a typical multicomputer architecture. Each processor can only 
directly access its own local memory module. The processors typically com­
municate and coordinate by transmitting data between senders and receivers. 
Most frequently this is done by some form of message passing of which there are 
many different variants (see §2.1.2). As the number of processors in a multicom­
puter increases it does not suffer from the memory contention present in large 
multiprocessors because the memory modules are local to the processors. The 
implementation of this architecture leads to machines comprised of thousands of 
processors.
Of course, it is also possible to emulate shared memory on a multicomputer, 
referred to as virtual shared memory. Virtual shared memory masks both the sep­
arate processor address spaces and the communication between processors, and 
may be implemented in hardware or software. Systems such as Alewife [24], 
April [2], DASH [82] and KSR-1 [67] create this virtual shared address space 
through innovations in hardware and operating systems. These include using 
a memory interconnect network as shown in Figure 2.1(b). Alternatively, tech­
niques have been developed to provide the shared address space entirely through 
software. For example, Amber [28], Ivy [87], Munin [16] and Platinum [32] utilise 
the operating system to automatically detect and invoke the necessary commu­
nication between the processors.
§2.1 Parallel architectures and exploiting parallelism 9
2 .1 .2  M o v in g  in form ation  around
The processors of a parallel computer cooperate to solve a common task. A pro­
gram is run as a system of parallel processes, where each process is the execution 
of a sequential instruction stream and has its own private memory space. Dur­
ing execution the problem of communication between processes arise. There are 
wide variety of methods used to support communication between processes. The 
method chosen inherently reflects the underlying parallel architecture.
In a shared-memory architecture processors communicate via shared data 
structures. One process updates or writes a data value which is then read by 
another process. Since several processes may execute simultaneously and require 
access to the shared data, problems of data coherency arise. These are solved by 
serialising access to shared data; often managed through a protection mechanism, 
such as locks, semaphores [34] or critical regions [19], which indicate the data 
is being accessed by a process. An alternate solution is to adopt a coherent 
programming model such as monitors [60] or transactions [88, 95].
Processes may only communicate by transmitting data between different mem­
ories on a distributed-memory machine. In the simplest case, one process sends 
a message and another process receives it. Many extensions exist which ex­
pand the number of senders (scatter-gather routines) or receivers (broadcast or 
multicast routines). Collective communication, one such extension, is defined as 
communication which involves a group of processes, and is developed in chapter 3. 
There are a number of alternate mechanisms used to send or receive data. For 
example, a message may be buffered in the sender or receiver, or the message 
may be sent synchronously or asynchronously. These mechanisms are addressed 
and supported in the recent standardisation of a message passing interface for 
parallel computing (MPI) [94]. Message passing may also be used on top of a 
shared-memory architecture in order to support particular models of interaction, 
such as pipelining or client-server systems.
2 .1 .3  E x p lo itin g  p ara lle lism
The mechanisms by which parallelism is achieved can be either implicit or explicit.
Implicit parallelism is used primarily as a technique to gain speedup without 
user intervention. Typically this is performed at the compilation phase of a se­
quential program. Compilers which exploit implicit parallelism are known as 
restructuring compilers. These compilers are based on the premise that the ex­
ploitable parallelism that scale with the problem size is largely present in loops 
(see §2.2).
In contrast, explicit parallelism requires the user to invoke the parallelism in 
program execution through the use of some language construct. Elegant archi­
tecture independent parallel programming models such as such as Bird-Meerters 
formalism [119] and the Bulk-Synchronous bridging model [127]. However, these 
parallel programming models are inherently unsuited to assisting programmers to
§2.2 Restructuring compilers 10
write parallel programs as no language or compiler support is available. Altern­
atively, the Linda model [23] provides a global shared memory called the tuple 
space; processes can read, input and output tuples from the tuple space.
Imperative languages, such as C or Fortran, are probably more widely used for 
explicit parallel programming than any other language style. Since they were not 
originally designed as parallel languages, the ability to specify parallel computa­
tion is achieved by using them with message passing libraries, such as MPI, or by 
adding parallel extensions to their syntax. Some examples of the latter include 
C* [51, 52, 111], High Performance Fortran [53] and Fortran D [39, 56, 57, 58]. 
Numerous compilers which support this model of parallel programming have been 
developed (see §2.3).
2.2 R e s tru c tu rin g  com pilers
The earliest significant research in restructuring compilers is documented in 
[78, 77, 98, 99]. Restructuring compilers, such as Parafrase-2 [105, 106], PFC 
[8, 10], Ptran [5] and Parascope [20], transform code written in a conventional 
programming language into an equivalent program that can be executed on a 
parallel architecture. This is accomplished using program transformations, based 
upon dependence analysis [14, 108], to expose and exploit the hidden parallelism 
present in the input program.
For a transformation to be valid it must preserve the semantics of the pro­
gram. A dependence is a relationship between two computations that places 
constraints on their execution. Dependence analysis [14, 129] identifies these re­
lations, which are then used to determine whether a particular transformation 
can be applied without changing the semantics of the program. Dependence rela­
tions can be manipulated as the program is transformed so that the dependences 
in the transformed program can be obtained from the dependences in the original 
program.
For a restructuring compiler to apply a transformation to a program, it must 
do three things:
1. Decide upon a part of the program to transform and an appropriate program 
transformation to apply;
2. Verify that the transformation does not change the semantics of the pro­
gram;
3. Transform the program.
The following sections briefly describe commonly used program transforma­
tions. A thorough survey of program transformations is available in Bacon et al. 
[12]. A description of how restructuring compilers might be applied to multicom­
puter architectures, and two case studies which illustrate the use of program 
transformations in a restructuring compiler then follows.
§ 2.2 Restructuring compilers 11
2.2 .1  L oop  tran sform ation  and restru ctu r in g
Loop level parallelism accounts for most parallelism which can be exploited in 
scientific code by restructuring compilers [29]. Therefore, the major goal of re­
structuring compilers is to discover and exploit parallelism in loops.
Two iterations of a loop can be executed in parallel in there is no dependence 
relation from one iteration of the loop to another iteration of the loop. The trans­
formations discussed in this section are applied to enable the iterations of a loop 
to be executed in parallel. The transformations discussed here are summarised 
by Aho et al. [3].
Many of the transformations presented in this section are only applicable to 
perfectly nested loop nests. A loop nest is perfectly nested if the body of every 
loop other than the innermost loop consists of only the next loop in the nest (see 
§4.1).
Loop in te rc h a n g e
Loop interchange exchanges the position of two loops in a perfectly nested loop, 
generally moving one of the outer loops to the innermost position [10, 9, 129]. 
Interchange is one of the most powerful of the loop transformations because it 
can enable vectorisation or parallelisation of a loop nest.
Loop skew ing
Loop skewing is an enabling transformation that is primarily useful in combina­
tion with loop interchange [80, 96, 129]. Loop skewing is used to handle wavefront 
computations, so called because the updates of the array propagate like a wave 
across the iteration space.
An alternative method for handling wavefront computations is supernode par­
titioning [63].
S tr ip  m in in g
Strip mining is a method of adjusting the granularity of an operation [1, 7, 89]. 
One of the most common uses of strip mining is to choose the number of inde­
pendent computations in the innermost loop of a nest.
L oop til in g
Loop tiling is a multidimensional abstraction of strip mining. Tiling divides an 
iteration space into tiles and transforms the loop nest to iterate over them (see 
§4.2).
§ 2.2 Restructuring compilers 12
Loop d istrib u tion
Loop distribution (also known as loop fission or loop splitting) breaks a single 
loop into many. Each of the new loops has the same iteration space as the original 
loop, but contains a subset of the statements present in the original loop [75, 96]. 
Loop distribution may used to create a perfectly nested loop nest, to which other 
distributions (such as loop interchange) may be applied.
Loop unrolling
Loop unrolling replicates the body of a loop n times and iterates by step n instead 
of 1. It is a fundamental technique for exploiting instruction level parallelism. 
Most restructuring compilers unroll the innermost loop of a loop nest.
Loop coalescing
Loop coalescing combines a loop nest into a single loop [104]. It used primarily 
to improve the scheduling of a loop on a parallel machine.
Loop norm alisation
Loop normalisation converts all loops so that the iterations start at one (or zero) 
and are incremented by one [10]. The most important use of normalisation is to 
permit the compiler to apply other transformations that may require normalised 
iteration ranges.
Loop spreading
Spreading takes two serial loops and moves some of the computation from the 
second to the first so that the bodies of both loops can be executed in parallel [49].
Scalar expansion
Loops often contain variables that are used as temporaries within the loop body. 
Such variables create a dependence relation that prevents the loop being paral­
lelised. Allocating one temporary for each iteration removes the dependence and 
makes the loop a candidate for parallelisation [98, 129].
2 .2 .2  G en era l p rogram  tra n sfo rm a tio n s
A large number of transformations exist that are not tailored for loops. Fre­
quently, they are used as the first step in uncovering parallelism present in 
programs. Two important transformations of this type are discussed here.
§2.2 Restructuring compilers 13
C o n s ta n t  p ro p a g a tio n
Constant propagation [21, 128] is one of the most important transformations 
that a restructuring compiler can perform. Typically programs contain many 
constants; by propagating them through the program, the compiler can do a signi­
ficant amount of precomputation. More important, the propagation reveals many 
opportunities where other transformations can be applied. For example, loops 
may contain constants in their bounds which can be eliminated using constant 
propagation. These may then be transformed using transformations outlined in 
§ 2 .2 . 1 .
P ro c e d u re  in lin in g
Procedure inlining replaces a procedure call with a copy of the body of the called 
procedure [6, 13, 118]. Each occurences of a formal parameter is replaced with a 
version of the corresponding actual parameter. When a procedure is inlined, all 
overhead for the invocation is eliminated. More importantly it improves compiler 
analysis. In many circumstances a loop containing a procedure call cannot be 
parallelised because its behaviour is unknown. After the procedure is inlined the 
behaviour of the loop can be analysed and if possible parallelised.
An alternative to inlining is to perform interprocedural analysis. The ad­
vantage of interprocedural analysis is that it can be applied uniformly, since it 
does not cause code expansion the way inlining does. However, interprocedural 
analysis significantly increases the complexity of the restructuring compiler.
2 .2 .3  R estru ctu r in g  program s for m u ltico m p u ters
Restructuring compilers can assist the programming process on multicomputers. 
Clearly, the main goal of restructuring compilers is to detect and exploit hidden 
parallelism present in a program. On multicomputers non-local memory access 
costs are much higher than those for local memory. This must be taken into 
account when applying the transformations of the type present in restructuring 
compilers to programs for multicomputers.
Compilation techniques have been developed to improve data locality. These 
techniques can enhance temporal and spatial locality of scientific codes. The ex­
ploitation of parallel loops, detected using data dependence analysis, combined 
with these data locality optimisations is an effective approach to compiling pro­
grams for multicomputers.
The basic premise is that since parallel loops exhibiting data locality compute 
somewhat independent data sets, the distribution of the parallel loop iterations 
among processors can also be used to assign data to each processor. Where 
the data sets are not entirely independent grouping methods, such as tiling, are 
applied to reduce interprocessor communication between the loop iterations. This 
premise is further developed in chapter 4 as implicit data partitioning.
§ 2.2 Restructuring compilers 14
Though this compilation approach effectively reduces non-local memory ac­
cesses, some limitations still persist. Such an approach has proven inefficient 
because the resulting data partition may be highly complex and may frequently 
change between loop nests. Additionally, no compiler support is provided for 
generating or optimising the required communication. These problems are ad­
dressed in chapter 4 using the collective communication operations developed in 
chapter 3.
2.2.4 C ase s tud ies
Parafrase-2 and PFC are discussed to illustrate the use of program transformations 
in restructuring compilers.
P a ra fra se -2
Parafrase-2 was the first restructuring compiler to address the problem of sys­
tematically transforming sequential Fortran programs into parallel programs for 
a variety of architectures. These architectures include register-to-register and 
memory-to-memory vector processors, array processors and shared-memory ma­
chines.
Parafrase-2 is organised a series of passes through the input program. Each 
pass applies some program transformation. These passes are assembled into a 
sequence by the user. When applied to a Fortran program Parafrase-2 yields a 
semantically equivalent parallel program. Altering the sequence of these passes 
allows the resulting program to be tailored to a specific architecture.
Parafrase-2 performs comprehensive data dependence analysis, primarily the 
GCD and Banerjee tests [14], on the input program to detect hidden parallel­
ism. A series of transformations are then applied to create the final program. 
These transformations are classified into three classes: architecture independent 
transformations, intermediate transformations and machine dependent trans­
formations.
• Architecture independent transformations do not require any knowledge 
of the target architecture. These include scalar dataflow analysis, scalar 
renaming, scalar expansion and subroutine expansion.
• The intermediate transformations adapt the program to the requirements of 
the target architecture ignoring the specific attributes of the target machine.
• The machine dependent transformations adapt the program to the require­
ments of the target machine.
Parafrase-2 is used as the starting point for the experimental compiler de­
veloped as part of this thesis.
§2.3 Multicomputer compilers 15
PFC
PFC (Parallel Fortran Converter) is an automatic source-to-source vectoriser de­
signed to translate Fortran 77 into Fortran 8x. PFC was initially based on 
Parafrase.
PFC initially transforms the input program into an internal representation 
which is essentially based upon an abstract syntax tree and the associated symbol 
table. Each transformation performed by PFC manipulates this internal repres­
entation.
PFC transforms the input program into the final program in four steps, where 
each step consists of one or more passes over the program.
• Step one: interprocedural analysis is performed; the call graph is construc­
ted.
• Step two: standard transformations are applied; These include if-conversion, 
do-loop normalisation, induction variable recognition and substitution, for­
ward substitution, constant propagation, and recognition and deletion of 
dead code.
• Step three: dependence analysis is performed; dependence analysis applies 
the separability, GCD and Banerjee tests (the latter with an adaptation for 
triangular loops).
• Step four: vector code is generated; scalar expansion and renaming, and 
loop interchange have been built into the code generator.
2.3 M u lticom p u ter  com pilers
Compilation techniques for multicomputers start much later than the efforts tar­
geted at restructuring compilers [22, 30, 112, 131]. Multicomputer compilers 
achieve performance through the compile-time optimisation of both parallelism 
and interprocessor communication. For the purposes of this discussion they are 
classified as follows:
• Transformation driven: First generation systems that perform optimising 
transformations on naive run-time resolution code.
• Language driven: Aspects of multicomputer architectures can be modelled 
by extending a standard programming language. These extensions per­
mit the programmer to formulate parallel algorithms explicitly. Language 
driven systems rely on the language extensions to guide the compilation 
and optimisation process.
• Analysis driven: Compilers that rely on solely on compile-time analysis.
§2.3 Multicomputer compilers 16
It should be noted that such a classification is fluid. A number of systems are 
evolving, including simple language translators and complex tools that perform 
impressive compile-time analysis and optimisations. The following sections de­
scribe compilers representative of each of these groups.
HPF has been adopted as a defacto standard by many researchers developing 
multicomputer compilers. Therefore, it is used throughout this section as a point 
of comparison with the compilers presented.
2.3 .1  T ran sform ation  driven  com p ilers
The earliest multicomputer compilers are Superb and Id Nouvea. These compilers 
perform extensive compile-time analysis but adopt a rather cumbersome compil­
ation approach. Both systems first insert guards and element-wise messages to 
generate a naive program performing run-time resolution.
To produce a more efficient program, transformations are then applied to 
this naive program. In order to apply these transformations, the compiler must 
keep track of intermediate program versions. The presence of guards and explicit 
communication affects the program’s semantics in ways that are difficult to as­
certain. This is particularly the case in trying to assess whether the application 
of a program transformation would cause deadlock [48]
S u p e rb
Superb [45, 46, 131] is a semi-automatic parallelisation tool designed for MIMD 
multicomputers. Each node of the multicomputer contains a pipelined vector 
unit. Program transformation is performed in two distinct phases. Firstly, the 
coarse grained parallelism present in the program is determined, and the compu­
tation partitioned over a set of processors. Secondly, the code for each processor 
is vectorised. Dependence analysis [14, 129] guides these program transforma­
tions l .
Superb supports user-specified data distribution, through the use of contigu­
ous rectangular distributions. Unlike HPF [53], Superb does not support data 
alignment, cyclic distributions or dynamic data decomposition. Superb origin­
ated overlaps as a means to both specify and store nonlocal data accesses, and 
hence avoid locality checks during execution.
Once program analysis and transformation is complete, the necessary com­
munication is automatically generated and vectorised using data dependence 
information. Computation is partitioned via explicit guards, which may be elim­
inated by loop bounds reduction [47]. Superb performs interprocedural data flow 
analysis of parameter passing. The algorithm is similar to the that used in the 
Fortran D compiler but less involved, since Superb does not provide dynamic 
data decomposition.
Tn a manner similar to the ParaScope Editor [74].
§2.3 Multicomputer compilers 17
Id N ou vea
Id Nouvea [103, 110] is a functional language extended with single assignment 
arrays called I-structures. Simple user specified block data distributions are 
provided. Send and receive communication primitives are inserted to commu­
nicate each non-local array access, resulting from the specified data distribution. 
Global accumulation and reduction operators are supported. The use of the 
I-structure representation considerably simplifies the program analysis. Unlike 
other systems, the compiler typically generates a distinct program for each pro­
cessor.
2.3 .2  L a n g u a g e  d r iv e n  c o m p ile rs
Language driven compilers target programs containing explicit parallelism, data 
distribution and potential user specified communication. Compilation is lim­
ited to partitioning the computation across the processors, and automatically 
detecting the communication generated by the resulting process partition. Im­
plementation of the communication is based on matching language features or on 
user annotations to routines available in the run time system.
Some of these languages specify parallelism utilising a local view of computa­
tion, where the user specifies computation for a single data point of the problem, 
and relies on the compiler and run-time system to replicate this computation for 
all data points of the problem [51, 115]. Comparatively, HPF supports a global 
view of computation where the user specifies the entire computation, relying on 
the compiler to partition the computation across the processors of a multicom­
puter. This is illustrated in the discussion of developed for HPF in chapter 5.
C rystal
Crystal [31, 83, 84, 85, 86] is a high-level function language. As such, the compiler 
possesses significantly different program analysis techniques from those used for 
compiling imperative languages such as Fortran. The compiler performs compile­
time analysis and optimisation that pioneered both automatic data decomposition 
and techniques for collective communication generation.
During compilation, the compiler first separates programs into phases where 
each phase has a different computation structure, represented by an index do­
main. Heuristics are employed to align data arrays with these index domains, 
both within and across dimensions, and to then derive the control structure of the 
program. Communication patterns are then matched with the collective commu­
nication operations. Finally, the compiler generates a message-passing C program 
to be executed on each node processor of the target machine—iPSC/2 or nCUBE 
1 hypercubes.
§2.3 Multicomputer compilers 18
K ali
Kali [68, 69, 71, 72, 73, 93] is a Fortran based language. The compiler developed 
for Kali was the first system to support both regular and irregular computations 
on MIMD multicomputers. The user of the Kali system specifies the following 
information: the processor array on which the program is to be executed; the 
distribution of the data structures across these processors, using block, cyclic or 
user specified arbitrary distributions; and the parallel loops and their decompos­
ition onto the processor array. Interprocessor communication is then generated 
automatically based on the parallel loop and data decompositions.
C M  F o r tra n
CM Fortran [4, 123] is an implementation of Fortran 77 extended with vector 
notation, data layout and alignment specification. The user explicitly specifies 
data-parallelism via array assignments and the fo ra l l  loop construct. The 
compiler distributes the data arrays and generates communications. It also 
automatically aligns arrays based on the analysis of the usage patterns of ar­
ray occurrences. The compiler relies heavily on the underlying SIMD architecture 
of the Connection Machine, and hence does not address many issues that are 
critical to MIMD computers.
F o r tra n  D
Fortran D [39, 56, 57, 58] is a version of Fortran 77 augmented with powerful data 
decomposition specifications. The decomposition statement is used to declare a 
problem domain for each computation; the align statement is used to specify 
how arrays should be aligned with respect to one another; and the distribute 
statement is used to map the problem and its associated arrays to the phys­
ical machine. Fortran D is designed to provide an efficient machine-independent 
parallel programming model.
C*
C* [51, 52, 111] is an extensions of C that supports SIMD data parallel programs. 
It provides a local view of the computation. The only mechanism by which par­
allelism is achieved is through the simultaneous application of a single operation 
to an entire data set. Parallel computations are specified as actions on a domain. 
Data is described as either mono (local) or poly (distributed). There are no align­
ment or distribution specifications. The compiler automatically determines the 
data decomposition and generates the associated interprocessor communication.
Despite the SIMD semantics, C* can be efficiently compiled for both SIMD and 
MIMD architectures. Researchers have also examined synchronisation problems 
encountered when translating the SIMD programs into equivalent SPMD programs, 
as well as several communication optimisations [51, 107].
§2.3 Multicomputer compilers 19
DINO
DINO [113, 114] is a C-like explicit parallel language. The goal of DINO has 
been to simplify programming parallel numerical algorithms for multicomputers 
without hindering efficiency. Like C*, it provides the user with a local view 
of the computation. A construct called an environment is used to represent 
an abstract parallel machine. The compiler generates multiple processes per 
physical processor when large numbers of virtual processors are declared in the 
environment. Parallel computations are specified via composite functions.
The DINO system requires the user to explicitly specify the array distribu­
tions. DINO supports block, cyclic and stencil-based distributions with overlaps, 
but provides no alignment specifications. Nonlocal memory accesses resulting 
from the data distribution must be annotated by the user. The DINO compiler 
translates these references into communications. Special language constructs are 
provided for reductions.
DINO is a powerful and flexible language. It can be used to specify optimisa­
tions such as coarse grain pipelining and iteration reordering for pipelined and 
parallel computations [97]. However, its many features may prove burdensome 
to users. DIN02 has been proposed to address this problem. It provides a large 
set of additional language features to support parallel task creation, data and 
computation decomposition, synchronisation and communication.
PARTI and ARF
PARTI [116] is a set of run-time library routines that support irregular computa­
tions on MIMD multicomputer architectures. PARTI is the first system to propose 
and implement user specified irregular decompositions, and uses a hash code to 
support non-local memory accesses.
ARF [70, 130] provides a Fortran interface for accessing PARTI run-time rout­
ines. The ARF compiler provides block and cyclic distributions, and is the first 
compiler to support user specified irregular decompositions. The goal of ARF is 
to demonstrate that inspector/executors can be automatically generated by the 
compiler for user specified parallel loops.
2.3.3 A nalysis d riven  com pilers
Analysis driven compilers rely heavily on compile-time analysis rather than lan­
guage features or user annotations to exploit parallelism in a program. Typically 
they accept Fortran 77 or Fortran 90 programs with data decomposition annota­
tions, and perform extensive program analysis to automatically detect parallel 
operations.
Compared with a HPF compiler, these compilers shift much of the burden of 
parallelisation to the run-time system. Calls to library routines are inserted at 
compile-time and are subsequently invoked at run-time to calculate informations
§2.3 Multicomputer compilers 20
such as local loop bounds, array indices, and the data to communicate. As the 
burden on the compiler is reduced by this approach, the compiler has greater 
flexibility allowing it to handle complex cases such as work arrays or array re­
shaping at run-time. However, it does limit the extent of the optimisation that 
can be applied at compile-time for simple computations.
M IM D izer
MIMDizer[117] is an interactive parallelisation system for multicomputers. It 
performs data-flow and dependence analyses, and also supports powerful loop 
level transformations. Associated tools graphically display call graph, control 
flow dependence and profiling information.
Users may interactively select block or cyclic distributions for selected array 
dimensions. Parallelism in loops is exploited by applying code spreading. This can 
be done interactively or automatically. Distributed arrays are linearised in order 
to simplify ownership computations at run-time. MIMDizer automatically gener­
ates communications resulting from nonlocal memory accesses at the conclusion 
of the parallelisation process.
V ien n a  Fortran
Superb has been adapted for a new language called Vienna Fortran [27]. Vienna 
Fortran provides data alignment and distribution specifications similar to HPF. It 
supports explicit processor array declarations, irregular computations, perform­
ance estimation and automatic data decomposition [18, 25, 37]. Dynamic data 
decomposition is permitted.
Vienna Fortran allows the user to specify additional attributes for each dis­
tributed array [26]. Restore forces an array to be restored to its original decom­
position at the start of a procedure. Notransfer causes this remapping to be 
performed logically, rather than copying the values in the array. Nocopy guar­
antees that the procedure’s formal and actual parameters have the same data 
decomposition. This type of attribute not found in the HPF data distribution 
model.
AL, iW arp
AL is a language designed for the WARP distributed-memory systolic processor 
[125, 126]. The user specifies that an array is to be distributed by declaring it 
as a darray. The AL compiler applies data relations to automatically align and 
distribute each darray, detect and exploit parallelism, and generate the result­
ing interprocessor communication. Only one dimension of each darray may be 
distributed.
§2.3 Multicomputer compilers 21
The successor to AL, the iWARP compiler [50, 55, 79], addresses the difficult 
problem of simultaneously supporting data-parallelism, coarse-grain task paral­
lelism, and fine-grained systolic parallelism.
3
C om m u n ica tio n  ro u tin e s
This chapter establishes the machine-level context for the data partitioning tech­
niques developed in this thesis. Section 3.1 presents and overview of the ar­
chitecture of the Fujitsu AP1000 experimental multicomputer. A new message 
handling mode, called stride message transfer, is developed to support new col­
lective communication routines and is described in detail in §3.2. New collective 
communication routines distribute and concatenate, used to realise the data par­
titioning techniques presented in chapters 4 and 5, are introduced in §3.3.
3.1 Fujitsu  A P 1000 arch itecture
The Fujitsu API000 [64] is an experimental large-scale MIMD multicomputer 
system developed by Fujitsu Laboratories, Japan. System configurations range 
from 16 to 1024 individual SPARC processors or cells, connected by two inde­
pendent high-bandwidth communication networks: the B-net: and the T-net\ 
and a synchronisation network: the S-net. The cells do not share memory. The 
Fujitsu AP1000 is connected to and controlled by a host computer (typically a 
Sun SPARCServer).
3 .1 .1  S y ste m  configuration
Figure 3.1 illustrates the Fujitsu AP1000 system design. The T-net is used for 
point-to-point communication between cells, the B-net for host to cell or cell to 
cell 1-to-N communication, and the S-net allows the host and cells to perform 
barrier synchronisation.
All cells and the host computer are connected by the B-net. The B-net is 
used for broadcasting data, data distribution and data collection. It acts as if it 
were a single common bus, but is implemented using compound networks which
22
§3.1 Fujitsu AP1000 architecture 23
Torus Network (T-net, 25MB/s)
Figure 3.1: Fujitsu AP1000 system design.
include hierarchical common buses and a ring structured network. Up to eight 
cells share the lowest level bus. Four buses are connected to the ring network, so 
that up to 32 cells form a hierarchical common bus which is connected with a ring 
node. The host computer is connected by a host interface, which also appears 
as one of the ring nodes. The B-net has a 32-bit data path and a data transfer 
rate of 50 Mbyte/sec. The B-net protocol is similar to that of a simple common 
bus. Before data transfer, a cell or host attempting to transfer data must flag a 
request to the B-net controller, then wait until the request is granted. After the 
B-net is granted to the requester, data transfer can commence.
The T-net implements a two dimensional torus topology. Each port of the 
T-net has a 16-bit data interface and each link has a data transfer rate of 25 
Mbyte/sec. The message routing scheme implemented on the T-net is struc­
tured channel routing [61]. This routing incorporates the structured buffer pool 
algorithm into wormhole routing [33].
In wormhole routing the intermediate node stores only a few bytes of routing 
information, called a flow control digit or flit. When a routing node receives the 
message header, the node selects the channel of the next route. The node then 
transfers all subsequent blocks for that message header to the route selected. 
Although wormhole routing has the advantage of reduced latency, deadlock may 
occur and throughput may suffer because the message transfer blocks the channel.
These problems are overcome by incorporating the structured buffer pool 
algorithm, whereby a message does not block the channel it is using while it is
§3.1 Fujitsu AP1000 architecture 24
50 Mbyte/s 
S-net
LBUS
DRAM
Controller
SPARC 
IU + FPU
DRAM 
16 Mbyte
MSC
Cache
Memory
25 Mbyte/s 
25 Mbyte/s ( R T C 25 Mbyte/s 
25 Mbyte/s T-net
F ig u re  3.2: Fujitsu AP1000 cell architecture.
being transferred. Messages are routed in a fixed way, first in the x direction 
and then in the y direction. Provided there is no contention in the network, the 
latency is given by
160 x (D + N/4  + 1) nsec,
where D is the physical distance or hop count between the source and destination 
cell and N is the message size in bytes.
All cells and the host computer are also connected by the S-net, which is 
used for barrier synchronisation. The S-net has a tree topology and data is 
transferred serially. Data from each cell moves up to the root of the S-net along 
the tree branches. At intermediate nodes data is merged using a logical AND 
operator. Therefore, the logical AND of all the data is delivered to the root node. 
The result is then distributed to all cells along the tree from the root of the S-net. 
The S-net also allows cells to synchronise with the host.
3 .1 .2  C ell arch itectu re
Figure 3.2 illustrates the configuration of a Fujitsu AP1000 cell. Each cell 
has a 25 Mhz SPARC integer unit (IU), a 8.3 MFLOP floating point unit (FPU), 
a message controller (MSC), a routing controller (RTC), a B-net interface (BIF), 
16 Mbyte of dynamic memory (DRAM) and 128 Kbyte of cache memory. The IU, 
FPU, and cache memory are directly connected to the MSC. The MSC works as a
§3.2 St ride message transfer 25
direct-mapped cache memory controller with a copy-back policy. The line size is 
four words.
The MSC, RTC, BIF and DRAM controller in each cell are connected via the 
LBUS (a 32-bit 40 Mbyte/sec synchronous bus). Each cell has an external LBUS 
connector. This enables the installation of hardware options such as a high-speed 
input/output interface, disc interface, vector processors and additional memory.
3.2 S tride  m essage tra n sfe r
The stride message transfer mode is a new message handling mode, developed for 
the Fujitsu AP1000, to support the collective communication routines presented 
in §3.3. This mode can operate in one of two ways. Either it transfers data 
items that are regularly located in a single message into a contiguous area of a 
cell’s local memory, or it stores the data items contained in a single message into 
regular locations of a cell’s local memory. This section describes in detail the 
operation of this mode and outlines its implementation.
3 .2 .1  A d d ressin g  e lem en ts o f  a tw o -d im en sio n a l array
A two-dimensional array is placed in memory so that elements take up a con­
tiguous block of memory. The array can be stored in either row-major ordering 
or column-major ordering. This thesis focusses on the Fortran programming lan­
guage. Because of the reliance on column-major ordering inherent in Fortran 
only column-major ordering is discussed.
If a two-dimensional array is stored using column-major ordering the columns 
of the array are placed in memory one after another. In general, for column-major 
ordering the elements are first sorted on the last index, then on the next-to-last 
index, and so on.
Exam ple 3.1 Figure 3.3(b) illustrates how the two-dimensional array M(0:3, 0:2) 
is stored in successive memory addresses in column-major order. ■
Consider the two-dimensional array A. Let A(0:ui,0:u2) declare the array. 
The two-dimensional array data space, denoted S 2, described by the array A is 
defined as
S 2 = {i e Z 2 I (0, 0) < (z'i, i2) < (ui, u2)}, 
where Uj is the upper bound of the j th (1 < j  < 2) dimension of the array A.
Exam ple 3.2 The array A(0:3,0:2) describes the two-dimensional array data 
space
S2 = {i g Z2 | (0,0) < (ii, ii) (3,2)}.
This array data space S2 is illustrated in Figure 3.3(a). ■
§3.2 Stride message transfer 26
0 1 2
( 0 , 0 ) ( 0 , 1 ) ( 0 , 2 )
( 1 , 0 ) ( 1 , 1 ) ( 1 , 2 )
( 2 , 0 ) ( 2 , 1 ) ( 2 , 2 )
( 3 , 0 ) ( 3 , 1 ) ( 3 , 2 )
(a) The two-dimensional array data space S 2 described by the array 
4(0:3,0:2).
First column Second column
( 0 , 0 ) ( 1 , 0 ) ( 2 , 0 ) ( 3 , 0 ) ( 0 , 1 ) ( 1 , 1 ) ( 2 , 1 ) ( 3 , 1 ) ( 0 , 2 ) ( 1 , 2 ) ( 2 , 2 ) ( 3 , 2 )
0 1 2 3 4 5 6 7 8 9 10 11
Third column
(b) Array 4(0:3,0:2) stored in column-major order.
Figure 3.3: Layouts for a two-dimensional array.
The two-dimensional array is placed in memory as a contiguous block of data. 
This block of data, denoted S, is defined as
S — {2 (E Z I 0 < i ^  (ui + l ) (u2Tl )  — l}.
E x a m p le  3.3 The array 4(0:3, 0:2) is placed in the contiguous block of data
S = { i e Z \ 0 < i <  11}.
This block of data S is illustrated in Figure 3.3(b). ■
Each element of the array data space S2 maps to exactly one element in the 
block of data S. This mapping is represented by function C formulated as
£  : S 2 —> 5, £(i) = (1, U\ 4- 1) • (?i, *2)- 
E x a m p le  3 .4 Function £  for the array 4(0:3, 0:2) can be formulated as
C \ S2 S. C(i) = (1,4) • (*i, *2).
§3.2 Stride message transfer 27
The stride message transfer mode allows individual data items of the array 
data space S 2 to be transferred by a cell. The data items transferred by this 
mode are specified using the parameters hcnt, hskip, vent and vskip. These data 
items are described by a two-dimensional memory space. This two-dimensional 
memory space is defined as
M2 =  {j  G Z2 I (0,0) < (ji,j2) < {vent -  l, hcnt -  1)}.
Exam ple 3.5 Consider the stride message transfer operation schematically in­
dicated in Figure 3.4(a). The shaded areas indicate the data items actually 
transferred by the operation. The two-dimensional memory space describing 
these data items is
M 2 = {j€  Z 2I (0,0) <  (ji’h )<  (2,3)}.
The labels on the shaded areas in Figure 3.4(a) illustrate this memory space. ■
Each element of the memory space M 2 corresponds to exactly one location 
in the block of data S in which the array data space S 2 is stored. This mapping 
can be expressed as function A4 formulated as
A4 : M 2 —> S, A4{j) =  addr + {hskip, hskip x {hcnt — 1) +  vskip) • (A, j2),
where addr is the first location in the block of data S to be transferred.
Exam ple 3.6 Function A4 for the stride message transfer operation illustrated 
in Figure 3.4 is
A4 : M 2 —>• S,A4{j) = addr +  (3,10) • (A, j2)-
This function is illustrated in Figure 3.4(b). The data item (2,0) belonging to 
the memory space M 2 corresponds to memory location
A4(2, 0) — addr + (3,10) • (2, 0) =  addr 4- 6,
as illustrated in Figure 3.4(b). ■
3.2 .2  S tr id e  m essa g e  tran sfer o p era tio n
The stride message transfer operation is schematically indicated in Figure 3.5. 
The operation is applied to the contiguous block of memory with origin addr. 
The shaded items in Figure 3.5 indicate the individual data items written to the 
contiguous block of memory with origin buf. The relationship between these areas 
of memory is shown by the labels on the shaded areas in this figure.
This operation is programmed using the function in Figure 3.6. Throughout 
this chapter code will be specified in sections of pseudocode adapted from [40].
§3.2 Stride message transfer 28
addr
hskip =  3
hskip =  3
hcnt -  4
vskip =  4
\
1
(0,0) ( 0, 1) (0,2) ( 0,3)
\
1
( 1,0) ( 1, 1) ( 1,2) ( 1,3)
\
1
(2,0) (2, 1) (2,2) (2,3)
vent =  3
(a) The two-dimensional memory space M2 describing the data items 
transferred using stride message transfer.
hskip =  3 hskip =  3
--------------------------- 3 *  ------------------------------------> tC
( 0 , 0 ) ( 1,0 )
vskip = 4
( 2 ,0 ) ( 0 , 1 )
2 3 4  5 6  7 8  9 10 11
addr
(b) Each element of the memory space M2 corresponds to exactly one 
location in the block of data S.
F igu re  3.4: Access pattern for the stride message transfer message handling 
operation.
This pseudocode is representative of the high-level language used in a particular 
implementation. A description of the pseudocode is given in Appendix C.
Function sdma is applied to the contiguous block of data with address addr. 
The parameters to the function are:
buf: the origin of a block of memory used in one of two ways. Either data
items regularly located in the contiguous block of memory with origin 
addr are transferred into this block of memory, or data items contained 
in this block of memory are stored into regular locations of a block of 
the cell’s local memory with origin addr.
§3.2 St ride message transfer 29
source
hcnt =  2 vent =  1
hskip =  3 hskip =  3 vskip — 4
(0,0) ( 1, 0) ( 2,0)
hskip =  3 hskip =  3
(0 ,1) (1, 1)
addr
( 2 , 1)
destination ( 0 ,0 ) ( 1 ,0 ) ( 2 ,0 ) ( 0 , 1) ( 1, 1) ( 2 , 1)
0 1 2 3 4 5
buf
F igure 3.5: Stride message transfer operation.
dir: a flag which determines whether data items are read from or written
to the memory with origin buf;
vskip, vent, hskip, hcnt:
specify the individual data items transferred by the stride message 
transfer operation.
The function sdma is an iterative function. The array buf is either read from or 
written to (depending on the value of the parameter dir) at each iteration.
Example 3.7 The stride message transfer operation in Figure 3.5 is achieved by 
applying function sdma to the contiguous block of memory pointed to by addr in 
the following way,
dma(6u/, addr, 1, 2, 4, 1).
At the completion of this call The data items extracted from the memory block 
addr are stored to the area of memory with origin buf ■
The stride message transfer operation implemented by function sdma is illus­
trated by the following commutative diagrams,
S2
sdma'(i)
M 2
S s2
sdma(l) sdma'(O)
L  M M  ,
sdma(0),
s M 2 S
where sdma(l) indicates function sdma is applied with dir = 1 and sdma(0) indic­
ates function sdma is applied with dir = 0. The operation sdma' is that operation
§3.3 Collective communication routines 30
func
sdma(buf, addr, dir, hskip, hcnt, vskip, vent)
declare_buf buf, addr;
declare_int dir, hskip, hcnt, vskip, vent-,
iflend
naddr = naddr +  hskip 
buf = buf +  1
for_end
maddr = maddr +  vskip
for_end
Figure 3.6: The stride DMA operation is achieved using function sdma.
required to make the operations commute. It is indicated by the mapping from 
Figure 3.3(a) to Figure 3.4(a).
3.3 C o llec tiv e  c o m m u n ic a tio n  ro u tin e s
This section describes new communication routines, which can be thought of as 
multidimensional generalisations of the sdma operation (see §3.2). These rout­
ines are subsequently used to realize the data partitioning methods outlined in 
chapters 4 and 5. The precursors of these routines were presented by Fox et al. 
[40], in the context of developing scientific application programs on multicom­
puters. Several researchers have looked at the problem of developing efficient 
algorithms using such routines [59, 65].
These communication routines are referred to as collective communication 
routines. Collective communication is defined to be communication that involves 
a group of processors, referred to as the processor group. High-level collective 
communication routines are widely used by application programmers on mul­
ticomputers. A number of such communication libraries have been developed 
[31, 40, 44, 94, 101, 116]. Two new collective communication routines are de­
veloped in this section: distribute and concatenate.
§3.3 Collective communication routines 31
Source processor Processor group
(a) The data movements associated with the distribute collective com­
munication operation. The source processor sends a different partition 
of an array to each processor belonging to the processor group.
Destination processor Processor group
(b) The data movements associated with the concatenate collective 
communication operation. An array that has been partitioned over a 
processor group is concatenated and the resultant array is stored in the 
destination processor.
F igure  3.7: The data movements associated with the collective communication 
routines.
D is tr ib u te
A single source processor sends a different partition of an array to each processor 
belonging to the processor group in a single operation. The data movement 
associated with this operation is schematically indicated in Figure 3.7(a). Each 
shaded square in the figure represents a partition of the array sent by the source 
processor to a processor belonging to the processor group.
C o n c a te n a te
An array that has been partitioned over a processor group is concatenated in a 
single operation, and the resultant array is stored in a single destination processor. 
The data movement associated with this operation is schematically indicated in 
Figure 3.7(b). Each shaded square in the figure represents a partition of the array 
sent by a processor belonging to the processor group to the destination processor.
§3.3 Collective communication routines 32
The operation of the collective communication routines is captured by the 
following commutative diagrams
S n C S S n c S
dist dconcat.
These diagrams are direct analogues of the diagrams illustrating the operation 
of the stride message transfer operation (see §3.2.2). The functions are relevant 
generalisations of the functions in those diagrams. The remainder of this section 
develops the appropriate details.
3 .3 .1  G en era lised  ad d ressin g  o f  array e lem en ts
The collective communication routines require the functions introduced in §3.2.1 
to be extended to operate on n-dimensional arrays. Consider the n-dimensional 
array A. Let A(l\\u\ , . . . ,  ln:un) declare the array.
Definition 3.1 (Array data space) The n-dimensional array data space, de­
noted S n, described by the array A is defined as
S n =  {i G Z n I 0 < i < u},
where the vector u = (iq, u2, . . . ,  un) is called the upper limit vector of the array.
Assume that each element of the array A can be stored in one memory location 
and the array is stored in column-major order. The n-dimensional array is placed 
in memory as a contiguous block of data.
Definition 3.2 (Linear array data space) The one-dimensional data space, 
denoted S , described by the array data space S n is defined as
n
S = { i e Z \ t i < i <  (Uj -(- 1) — 1},
j = i
where Uj is the upper bound of the jth  (1 < j  < n) dimension of array A.
Each element of the array data space S n maps to exactly one element in the 
one-dimensional data space S. This is a generalisation of the data layout shown 
in Figure 3.3 and is illustrated in Figure 3.8. The array linearisation function C 
is a function from the n-dimensional array data space S n to the one-dimensional
§3.3 Collective communication routines 33
• • •
A\l
 ^ - - ' '
A\i
d3
S i » » : M M R H M
------------------------------------5)-
Figure 3.8: The construction of the dope vector of an n-dimensional array.
data space S . To simplify this function, the following variables dj (j = 1, . . .  ,n )  
are introduced where
dj —
{uj-1 + l )dj-u if 2 < j  < n;
1 , j  =  l ;
where Uj is the upper bound of the j th (1 < j  < n) dimension of array A. The 
values of di (1 < i < n) of an array A are contained within a descriptor called the 
dope vector, denoted d. The dope vector for an n-dimensional array is defined 
by d =  (dl3 d2. . . . ,  dn). Figure 3.8 illustrates the construction of the dope vector 
for an n-dimensional array. The shaded regions in the figure demonstrate each 
element of the dope vector.
Definition 3.3 (Array linearisation function) The array linearisation func­
tion C can be formulated using the dope vector d as
C : S n -> S,£(i )  — d • i.
The collective communication routines allow individual data items of an array 
data space Sn to be sent or received by a cell. These data items are described by 
a m-dimensional memory space.
§3.3 Collective communication routines 34
Definition 3.4 (Memory space) The m-dimensional memory space describ­
ing the data items of an array data space S n to be sent or received by a cell is 
defined
Mm = { j  e Zm I 0 < j  < c -  1},
where c is the m integer vector (ci, C2 , • • •, cm) called the count vector.
Each element of the data space M m corresponds to one location in the block 
of data S in which the array data space S n is stored. This is a generalisation 
of the data layout shown in Figure 3.4. It is illustrated in Figure 3.9. In the 
first dimension of M m each data item, represented by the shaded regions in row 
one of the figure, are followed by a\ spaces or skips. Each block of the second 
dimension of M m, represented by the shaded regions in row two of the figure, is 
made up of the structure of the first dimension. Each of these blocks is followed 
by <72 spaces or skips. Each of the remaining dimensions of Mm is structured in 
a similar fashion as shown in the figure.
The memory space allocation function is a mapping from the m-dimensional 
memory space Mm to the one-dimensional array data space S. To formulate this 
function the variables Sj (j  = 1 , . . . ,  m) are introduced where
Sj-iCj-i + cry, if 2 < j  < m;
Sj  —
1 + (7i, j  = i;
where crz is the number of skips or spaces between the individual data items of the 
zth (1 < i < m ) dimension of M m. The values of s* (1 < i < m) are contained 
within a descriptor called the stride vector, denoted s. The stride vector for a m- 
dimensional memory space is defined by s =  (si, S2 , •. . ,  sm). Figure 3.9 illustrates 
the construction of the stride vector for an m-dimensional memory space.
Definition 3.5 (Memory space allocation function) The memory space al­
location function A4 can be formulated using the stride vector s as
M  : Mm -> S , M( j )  = b + s - j
where b is the first location in the block of data S to be transferred, called the base 
offset.
Clearly, the memory space allocation function A4 is a generalisation of the 
array linearisation function C. Intuitively function A4, and hence function C, is 
one-to-one. A proof of this result is given in appendix A.
§3.3 Collective communication routines 35
3 .3 .2  D is tr ib u te  co llec tiv e  co m m u n ica tio n  ro u tin e
The declaration of the distribute communication routine is shown in Figure 3.10. 
The declaration of the dist communication primitive used by a processor belong­
ing to the processor group onto which an array is to be partitioned is given in 
Figure 3.10(a). The declaration of the sdist communication primitive used by the 
source processor is shown in Figure 3.10(b). Use of the distribute communication 
routine comprises of a call to the sdist communication primitive by the source 
processor and a call to the dist communication primitive by all other processors 
in the group.
The sdist communication primitive broadcasts a block of contiguous data, with 
address sbuf of size m, from the source processor to the processor group. The dist 
communication primitive specifies which elements of the data, broadcast by the 
source processor, are to be received by a processor belonging to this processor 
group. The elements received are stored into the area in the processor’s local 
memory indicated by the pointer Ibuf. The remainder of this section describes 
how the dist communication primitive is used by each processor to specify which 
elements are to be received from the source processor.
Assume a processor group composed of k processors is executing the distribute 
collective communication routine. Figure 3.11(a) schematically illustrates the 
use of the sdist communication primitive for the source processor and the dist
§3.3 Collective communication routines 36
func
dist(/6u/, b. I, t) 
declare_buf lbuf\ 
declare_int b, n\
declare_array t declare_struct {
declare_int s; 
declare_int c; }[n];
(a) The declaration of the dist collective communication primitive, 
func
sdist(sbuf, m) 
declare_buf sbuf; 
declare_int m;
(b) The declaration of the sdist collective communication primitive.
Figure 3.10: The declaration of the distribute communication primitives for 
both the calling processors and source processor.
communication primitive for each processor belonging to the processor group. 
Formally, this processor group is defined as
P = {p e Z \ 0 < p < k — 1}.
The source processor ps executes the sdist communication primitive, and broad­
casts a contiguous block of data, of size m with address sbuf, to the processor 
group. This data space is described as
S = { i € Z \ 0 < i < m  — 1}.
An 71-dimensional array defines the 71-dimensional array data space denoted S n 
(Definition 3.1). The data space S that is broadcast by the source processor 
to the processor group is the one-dimensional data space described by this 71- 
dimensional array data space S n (Definition 3.2). Each element of the array data 
space S n maps to one element in the data space S.  This mapping is represented 
by the array linearisation function from Definition 3.3.
Example 3.8 Consider the use of the collective communication routine distrib­
ute shown in Figure 3.11(b). The source processor broadcasts 12 elements to the 
processors belonging to the processor group, described as
S = { i e Z \ 0 < i <  11}.
§3.3 Collective communication routines 37
Communication network
dist(/6u/, 6, n, (si, 52, - ■ •, sn), (ci, c2, . . . ,  c„))
dist(/6u/, 6, n, (si, 52, * • * j ^n), (ci, c2, ♦ ♦ ♦, Cj^ ))
k — 1
Processor Group
(a) Typical use of the distribute collective communication operation on 
a processor group composed of k processors.
Communication network
lbuf( 1:6)
dist(/6u/, 2, 2, (2,3), (3,2))
Processor Group
(b) Use of the distribute collective communication operation on a 
processor group composed of two processors to partition the array 
A(0:3,0:2).
Figure 3.11: Examples of use of the distribute collective communication 
routine by a processor group. Each processor belonging to the processor group 
independently specifies the parameters to the dist communication primitive.
§3.3 Collective communication routines 38
Consider processor p (0 < p < k — 1) belonging to the processor group P. 
The dist communication primitive specifies which elements of the data space S 
broadcast by the source processor ps are to be received by processor p. These 
elements are stored into the contiguous block of the processor’s local memory 
indicated by the pointer Ibuf, given as a parameter to the dist communication 
primitive. An m-dimensional structure is imposed onto this contiguous block of 
local memory. The m-dimensional local memory space of processor p, denoted 
M™, is described using Definition 3.4 as
M™ = {j  € Zm \ 0 < j  < c — 1},
where c is the m integer vector (ci, C2 , . . . ,  cm). Each ct (1 < i < m) is given as 
a parameter to the dist communication primitive.
Exam ple 3.9 Consider the use of the dist communication primitive by processor 
p = 0 in Figure 3.11(b). The two integer vector c given as a parameter to the 
dist communication primitive is (3,2). Therefore, the processor imposes a two- 
dimensional structure onto the area of local memory pointed to by Ibuf. This 
two-dimensional local memory space M02 is defined as
M,2 =  {j  e z2 1 (0,0) <  (i,2)}. ■
Each element of the local memory space Mpm corresponds to one element of 
the data space S broadcast by the source processor. The local memory alloca­
tion function, denoted M p, is a function from M™ to S. Function A4P can be 
formulated using Definition 3.5 as
•Adp : —> S , JClp( j ) =  b +  s • j ,
where b is a positive integer and s is the m integer vector (si, 5 2 , . . . ,  sm). The base 
offset b and each (1 < % < m) are given as parameters to the dist communication 
primitive. The base offset refers to the first element of S required by processor 
P-
Definition 3.6 The elements of the data space broadcast by the source processor 
which are received by processor p are described by the set
Dp = { s e s \ 3 j  e Mpm . s = M p(j)}
= {s € S \ 3 j  € 0 .. c —1 • s = b + s - j } .
Exam ple 3.10 Consider the use of the dist communication primitive by pro­
cessor p = 0 in Figure 3.11(b). The two integer vector s given as a parameter to 
the dist communication primitive is (2, 3) and the base offset is 2. Therefore, the 
local memory allocation function for this processor can be expressed as
Mo  : M2 -4 S, M 0(j) =  2 +  (2,3) ■ (jx,j2).
§3.3 Collective communication routines 39
func
dist(Ibuf, b, n, t)
declare_buf Ibuf; 
declare_int 6, n; 
declare_array t declare_struct {
declare_int s; 
declare_int c; }[n];
iflbegin ( n = 1 ) then
sdma(/6u/, hmsg(b), 1 , £(0).s, £(0).c, £(l).s, t(l).c)
else
for_begin i = 0 to t(n).c — 1
d\st(lbuf b + (i * £(n).s), n — 1, t)
for_end
iflend
Figure 3.12: Pseudocode for the implementation of the dist communication 
primitive.
Hence, the elements of the data space S received by processor p = 0 are described 
by the set
Do — { 5  E S I 3 j  G Mq • s — Aio(j) — 2 + (2, 3) • (j7i, J2)}
=  {2, 4,6, 5, 7, 9}
The elements of S belonging to the set D0 correspond, in left-to-right order, to 
the following elements of S2
(2,0),(0,1),(2,1),(1,1),(3,1),(1,2)
under function C
C : .S'2 —> 5, C(i) =  (1,4) • (zi, i2). ■
The distribute collective communication routine has been implemented on the 
Fujitsu AP1000 multicomputer (see §3.1). The pseudocode for the dist commu­
nication primitive is shown in Figure 3.12.
The function dist is applied to the contiguous block of data broadcast from the 
source processor to the processor group. This data is broadcast to the processor 
group using the sdist communication primitive. The parameters to the function 
dist are:
Ibuf: the area in the processor’s local memory where the elements selected
from the data broadcast by the source processor are stored;
§3.3 Collective communication routines 40
b\ the base offset;
n: the number of elements of the array t;
t: the array of structures representing the m integer vector s and the m
integer vector c.
The function dist is a recursive function. The parameter Ibuf is updated at each 
recursive stage. The function works as follows.
Assume that the execution of the function reaches the beginning of the condi­
tional statement. In the case when n is one, function dist results in a sequence of 
data items selected by the function sdma (see Figure 3.6) from the data broadcast 
by the source processor which is stored at location msg in the processor’s local 
message queue. These data items are stored in the contiguous block of memory, 
in the processor’s local memory, indicated by the pointer Ibuf. In the general 
case, when n is not one, the result of dist is the result of the recursive stage, 
where function dist is applied t(n).c times with base offsets
6, b + t(n).s, b 4- 2t(n).s, . . . ,  b + (t(n).c — 1 )t(n).s.
Appendix B.l presents a formal specification of the dist collective communication 
primitive in the specification notation Z [120].
3 .3 .3  C o n ca ten a te  co llec tiv e  co m m u n ica tio n  ro u tin e
Declaration of the concatenate collective communication routine is given in 
Figure 3.13. Figure 3.13(a) shows the declaration of the concat communication 
primitive used by processor belonging to the processor group from which an 
array is to be gathered. The declaration of the dconcat communication prim­
itive used by the destination processor is shown in Figure 3.13(b). Use of the 
concatenate collective communication routine comprises of a call to the dconcat 
communication primitive by the destination processor, and a call to the concat 
communication primitive by all other processors in the group.
The dconcat communication primitive receives, from each processor in the 
processor group, a partition of the array being gathered. The data received is 
stored into the area in the destination processor’s memory of size m indicated 
by the pointer rbuf. The concat communication primitive specifies the elements 
belonging to the array partition sent to the destination processor by each pro­
cessor. These elements are stored in the area of the processor’s local memory 
indicated by the pointer Ibuf. The remainder of this section describes how the 
array being gathered is composed by the destination processor from the elements 
sent by each processor belonging to the processor group.
Assume the processor group P composed of k processors is executing the 
collective routine concatenate. Figure 3.14(a) schematically illustrates the use of 
the dconcat communication primitive for the destination processor and the concat
§3.3 Collective communication routines 41
func
concat(/6u/, b, n, t)
declare_buf lbu}\ 
declare_int &, n;
declare_array t declare_struct {
declare_int s; 
declare_int c; }[n];
(a) The declaration of the concat communication primitive.
func
dconcat (t'&u/, m) 
declare_buf rbuf; 
declare_int m;
(b) The declaration of the dconcat communication primitive.
Figure 3.13: The declaration of the concatenate communication primitive for 
both the calling and root process.
communication primitive for each processor belonging to the processor group. 
The destination processor pd executes the dconcat communication primitive and 
receives array data from each processor belonging to the processor group. This 
array data is stored to a contiguous block of data of size m with address rbuf., 
which is described as
S = {i £ Z I 0 < i < m — 1}.
An n-dimensional array defines the 72-dimension array data space, denoted S n 
(Definition 3.1). The data space S into which the array data received from the 
processor group is stored is the one-dimensional data space described by the 72- 
dimensional array data space S n (Definition 3.2). Each element of the array data 
space S n maps to one element of the data space S. This mapping is represented 
by the array linearisation function from Definition 3.3.
Exam ple 3.11 Consider the use of the collective communication routine concat­
enate shown in Figure 3.14(b). The destination processor allocates a block of 12 
elements into which the array data sent by processors belonging to the processor 
group is to be received. The data space for the destination processor is described 
as
S = { i £ Z \ t ) < i <  11}. ■
Consider processor p (0 < p < k — 1) belonging to the processor group P. The 
concat communication primitive sends a structured message from this processor
§3.3 Collective communication routines 42
Processor Group
(a) Typical use of the concatenate collective communication operation 
on a processor group composed of k processors.
Communication Network
- o —
lbuf{ 1:9)
concat(Ibuf, 3, 2, (2,3), (3,3)) ^
/6m/ (1:8)
concat(Z6m/ ,  2, 2, (2,3), (4,2)) q
4(0:3,0:2) 
dconcat(4, 12)
P d
Processor Group
(b) Use of the concatenate collective communication operation on a 
processor group composed of two processors which gather the array 
4(0:3,0:2).
F igure  3.14: Examples of use of the concatenate collective communication 
routine by a processor group. Each processor belonging to the processor group 
independently specifies the parameters to the concat communication primitive.
§3.3 Collective communication routines 43
to the destination processor. This structured message consists of array data 
together with information specifying which elements of the array being gathered 
have been sent. These messages are received by the destination processor in an 
order defined by the processor number of the sending processor. That is, the 
message sent by processor p is received by the destination processor before the 
message sent by processor p +  1.
The array data, contained in the structured message, sent by processor p 
is stored in a contiguous block of the processor’s local memory pointed to by 
Ibuf. The pointer Ibuf is given as a parameter to the concat communication 
primitive. An m-dimensional structure is imposed onto this array data when 
received by the destination processor using the dconcat communication primitive. 
The m-dimensional data space for processor p, denoted Mpm, is described using 
Definition 3.4 as
M™ = { j  £ Z m \ 0 < j  < c — 1},
where c is the m integer vector (ci, C2 , . . ., cm). Each C{ (1 < i < m) is given as 
a parameter to the concat communication primitive executed by processor p.
Example 3.12 Consider the use of the concat communication primitive by pro­
cessor p = 0 shown in Figure 3.14(b). The two integer vector c given as a 
parameter to the concat communication primitive is (4, 2). Therefore, the destin­
ation processor imposes a two-dimensional structure onto the array data received 
from this processor. This two-dimensional memory space M02 is defined as
Mq =  { j  € Z2 I (0,0) < (3,1)}. ■
Each element of the m-dimensional memory space Mpm corresponds to one 
element of the data space S allocated by the destination processor. The memory 
allocation function M.p is a function from Mpm to S. Function M.p can be for­
mulated using Definition 3.5 as
AAp : —> S , Joip(j) =  b -I- s • j ,
where b is a positive integer and s is the m integer vector (si, s2, . . . ,  sm). The 
base offset and each st (1 < i < m) are given as parameters to the concat 
communication primitive executed by processor p. The base offset refers to the 
first element of the data space S sent by processor p.
Definition 3.7 The elements of S belonging to the array partition sent to the 
destination processor by processor p are described by the set
Gp -  { s e s \ 3 j  e M pm. s  = M p(j)}
= {5 e 5 I 6 0 . . C -  l » s  = b + s - j }
Example 3.13 Consider the use of the concat communication primitive for pro­
cessor p = 0 shown in Figure 3.14(b). The two integer vector s given as a
§3.3 Collective communication routines 44
parameter to the concat primitive is (2, 3) and the base offset is 2. Therefore, the 
memory allocation function for this processor can be expressed as
j\4o : Mq —> S , jMo(j ) — 2 + (2, 3) • (juh)-
Hence, the array elements belonging to the partition sent by processor p = 0 to 
the destination processor are described by the set
Go — {s e S \ 3 j  € • s = Mo(j)  =  2 +  (2,3) • ( jhh)}
= {2,4,6,8,5,7,9,11}
The elements of S belonging to the set Go correspond, in left-to-right order, to 
the following elements of S2
(1,0),(0,1),(2,1),(0,2),(1,1),(3,1),(1,2),(3,2),
under function C from Example 3.10.
The concatenate collective communication routine has been implemented on 
the Fujitsu AP1000 multicomputer (see §3.1). The pseudocode for the dconcat 
communication primitive is shown in Figure 3.15.
Function dconcat applies the recursive function insert to each structured mes­
sage containing an array partition sent by each processor belonging the processor 
group from which the array is being gathered. The structured message is sent 
by each processor using the concat communication primitive. As shown in Fig­
ure 3.15, the structured message msg contains:
b: the base offset;
n: the number of elements of array t;
t: the array of structures representing the m integer vector s and the m
integer vector c.
Function dconcat is an iterative function. The parameter rbuf is updated at each 
iteration by the recursive function insert. Function insert works as follows.
Assume that the execution of function insert reaches the beginning of the 
conditional statement. In the case when n is one, function insert results in a 
sequence of elements selected by function sdma (see Figure 3.6) from the data 
received by the destination processor from a processor belonging to the processor 
group. This sequence is stored in the contiguous block of memory in the des­
tination processor’s memory indicated by the pointer rbuf. In the general case, 
when n is not one, the result of insert is the result of the recursive stage, where 
function insert is applied t(n).c times with base offsets
6, b + t(n).s, b + 2t(n).s, . . . ,  b + (t(n).c — 1 )t(n).s.
§3.3 Collective communication routines 45
func
insert(6w/, rbuf, b, n, t)
declare_buf buf ; 
declare_buf rbuf ; 
declare_int 6, n;
declare_array t declare_struct {
declare_int s; 
declare_int c; }[n];
if_begin ( / =  1 ) then
sdma(üm/, hrbuf(b ), 0, £(0).s, t(0).c, t(l).s, £(l).c)
else
for_begin i =  0 to t(l).c — 1
insert {buf. rbuf b +  (i *£(/).s), l — 1, t)
for_end
if_end
func
dconcat(r6u/, m) 
declare_buf buf; 
declare_int m;
declare_array msg declare_struct {
declare_int /; 
declare_int 6;
declare_array t declare_struct {
declare_int c; 
declare_int s; }[n];
declare_buf data; }[&];
for_begin i =  0 to k — 1
insert(ms#(z).da£a, rbuf, msg(i).b, msg(i).l, msg(i).t)
for_end
Figure 3.15: Pseudocode for the implementation of the dconcat communica­
tion primitive.
§3.3 Collective communication routines 46
Appendix B.2 presents a formal specification of the dconcat collective communic­
ation primitive in the specification notation Z.
4
Im p lic it d a ta  p a r titio n in g
A significant limitation of most compilers for multicomputers is that they require 
the programmer to determine much of the data partitioning scheme, with little 
support from the compiler itself. A significant amount of the current research 
has focussed on the development of techniques for automating interprocessor com­
munication generation in an attempt to correct this inherent failing. However, 
many of the fundamental objectives controlling the development of these com­
pilers remain incomplete because the programmer is still obliged to decide on the 
distribution of data among the processors.
This chapter develops a novel approach to the problem of data partitioning 
on multicomputers, referred to as implicit data partitioning. Each computation 
performed by a processor belonging to a multicomputer requires data. Implicit 
data partitioning determines the data partitioning from the distribution of the 
computation amongst the processors of the multicomputer. The communication 
associated with the distribution of the data partitions is greatly simplified by the 
use of collective communication routines (see §3.3).
The chapter begins with a description of the class of nested loops relevant 
to this analysis. Transformation of these loops into a form which can be readily 
distributed amongst the processors of the multicomputer is presented in §4.2. A 
detailed description of the implicit data partitioning approach to data partition­
ing on the Fujitsu AP1000 then follows.
4.1 Loop structure
The discussion in this chapter is restricted to the class of nested loops known 
as uniform recurrences. A uniform recurrence is a perfectly nested loop with 
constant data dependences between the loop iterations. This class includes many
47
§4.1 Loop structure 48
DO z"i -  Li, U\
DO ?2 = L21 U2
DO im — Lm, Um 
S
END
END
END
Figure 4.1: An m-dimensional perfectly nested loop nest.
important types of scientific computations, including iterative methods for linear 
systems to solve partial differential equations.
Formally, a perfectly nested loop is recursively defined as a loop in which the 
body is either a perfectly nested loop or a code segment not containing any loops, 
conditionals or jumps. This definition, although restrictive, ensures that
• the loop body has a single point of entry—the first statement in the loop 
body, and
• the loop body has a single point of exit—the completion of the last state­
ment in the loop body.
For the purpose of this thesis, the body of a perfectly nested loop is restricted to 
a single assignment statement. Figure 4.1 illustrates a loop nest which satisfies 
this definition. Symbol S in this figure represents the loop body. This loop will 
be used as a reference for the definitions in this section.
4 .1 .1  T h e  itera tio n  sp ace
All manipulation of data other than of the loop index variables takes place in the 
body of the loop nest.
Example 4.1 The loop index variables in Figure 4.1 are z1}. . . ,  im. ■
The value range for the loop index variables of the loops surrounding a state­
ment represent the iterations over which the statement is executed. This is 
captured by the notion of an iteration space.
The m nested loops in Figure 4.1 define an iteration space that is a subset of 
Z m. The iteration space is bounded in each dimension by the upper and lower
§4.1 Loop structure 49
DO k  =  0,7 
DO i2 = 0,7 
S
END
END
Figure 4.2: Example of a two-dimensional perfectly nested loop nest.
bounds of the corresponding loop index variable. Therefore, i =  (fi,. . . ,  im) £ Zm 
is in the iteration space defined by the m nested loops if and only if
Li < i\ A . . .  A Lm < im A i\ < JJ\ A . . .  A im < Um: (4-1)
where Lj (1 < j  < m) and Uj (1 < j  < m) are constants, and each loop index 
variable ij (1 < j  < m) is incremented by one at each loop iteration.
Definition 4.1 (Iteration vector) The m vector i =  (h, . . . ,  zm) belonging
to the iteration space defined by m nested loops is called an iteration vector.
Example 4.2 The iteration vector i = (h, i2) is in the iteration space defined 
by the two nest loops in in Figure 4.2 if and only if
0 < i\ A ii < 7 A 0 < Z2 A i2 <  7. ■
More generally, the space defined by statement 4.1 is called a polytope [109]. 
A polytope is a convex hull bounded by a set of hyperplanes. The A:th bounding 
hyperplane can be expressed as a linear inequality of the form
+  Ck2x 2 +  . . . - (-  Ckn x n <  c*;. (4.2)
The polytope is defined as
P n = {x e Rn I Cx < c},
where C is a matrix with n columns and c is a n x 1 vector. The kth. row of 
matrix C is the vector of coefficients (c^x, , • • ■, Dtn), and the A:th element of
the vector c is the constant c*.
The iteration space defined by m nested loops can be modelled using the 
integer points within the bounding hyperplanes.
Definition 4.2 (Iteration space) The integer polytope describing the iteration 
space is
I m = { i e Z Tn\ C i <  c}.
where C is an integer matrix with m columns and c is a m x 1 integer vector.
§4.1 Loop structure 50
This formal definition of the iteration space is the starting point of the iteration 
space tiling transformation developed in §4.2.
Example 4.3 The inequalities stated earlier for the loop nest in Figure 4.2 can 
be expressed in the form given in Statement 4.2 as
— i\ < 0 A i\ < 7 A — 2 2 < 0 A z2 < 7 .
Therefore, the iteration space defined by the two nested loops in Figure 4.2 can 
be described as
where
= { i € Z 2 1 Ci < c},
1
0 ^ (  7 \
0 1 7
- 1 0 c = 0
0 - 1  / l  0 j
In a sequential program, the iterations of a loop are executed in a particular 
order. This order corresponds to the lexicographic ordering of the iteration vec­
tors i = ( i x , . . . ,  im) E I m. This is sometimes referred to as the sequential order. 
Formally, an iteration point i E I m is said to be executed before the iteration 
point i E / m, denoted i -< i', if and only if
(3j G {1 .. n} • ij < I- A (V k G {1 .. j  — 1} • ik = h))-
4 .1 .2  A rray  access fu n ction
An array access (or array reference) occurs whenever the assignment statement, 
enclosed in a m-dimensional perfectly nested loop nest, reads or writes an array 
element. The assignment statement conforms to the single assignment rule. That 
is, each array element can be assigned a value only once. However, an array can 
appear on the left-hand side of many assignment statements providing different 
array elements are assigned values each time.
The general form of a reference to an n-dimensional array A enclosed in the 
m nested loops is
Each ipj(i) (1 < j  < n) is an affine function of the loop index variables i =  
(il5 z2, . . . ,  im) of the enclosing loops of the form
771
¥>;(*) =  Y lh *  +
1= 1
The array access function for an array reference is an affine function which 
maps an iteration vector to the subscripts for the array access.
§4.2 Tiling loop iteration space 51
Definition 4.3 (Array access function) The array access function can be ex­
pressed as a function from I m to the array data space S n (Definition 3.1 page 32).
ip : I m -> S n, <p{i) = Fi +  c,
where F is a n x m integer matrix F =  (fy) called the coefficient matrix and c is 
a n x 1 integer vector called the displacement vector.
Example 4.4 The coefficient matrix F and displacement vector c for the array 
reference A(i\ + i%, h + 3) enclosed in the loop nest shown in Figure 4.2 are
F = 1 1 1 0 c =
Therefore, the array access function for this array reference can be formulated as
/ 1 1 \ . ( 0 \
1 0 i  +
The array access function differs from what is traditionally known as a sub­
script function in that the subscript function is a function that maps an iteration 
vector to only one of the dimensions of the array. Clearly, the array access func­
tion is a vector of subscript functions.
4.2 T iling loop itera tion  space
Iteration space tiling is defined as dividing the iteration space into tiles (or blocks) 
of some size and shape (typically squares or cubes or higher order equivalents), 
and traversing between the tiles to cover the whole iteration space. Iteration space 
tiling is an extension of the strip-mining transformation in vectorising compilers 
[129]. Ancourt [11] formulated a general theory for iteration space tiling.
4 .2 .1  N o rm a lis in g  th e  loop  n est
The transformations presented in this section assume the loop nest has been 
normalised. Normalisation converts all loops so that each loop index variable ij 
(1 < j  < m) is initially zero and is incremented by one at each loop iteration 
[1 0 ]. '
The result of normalising the m nested loops is a loop nest with m loops of 
the form shown in Figure 4.3. Symbols S' in this figure represents a loop body 
which references the normalised loop variables.
§4.2 Tiling loop iteration space 52
DO i\ — 0, U\ — L\
DO «2 — 0, U2 — L2
DO zm = 0, Ufn Lm 
S '
END
END
END
Figure 4.3: Normalised loop nest.
4.2.2 D eriv ing  th e  tile  space
Consider the loop nest shown in Figure 4.3. Recall from Definition 4.2 that the 
iteration space of the loop nest is
I m = {i e Z m I Ci < c},
where C is an integer matrix with m columns and c is a m x 1 integer vector, 
which together define an integer polytope (see §4.1.1). The tile with origin i Q is 
the subset of I m
T (i0) =  {i G I m I i 0 < i < i0 + k -  1},
where k = (/ci, k2, . . . ,  km) is the tile size vector of the tiling and each kj (1 < 
j  < m) is a positive integer. The tile origins i0 form a lattice in I m and can be 
generated by an integer matrix L, as follows
i 0 —  L t f
where t  = (£1 , t2, . . . ,  tm) is called the tile index of the corresponding tile. In a 
simple rectangular tiling L is
( h 0 •■■ 0 \
0 h •• • 0
\ 0 0 • j
Since Lt — (&i£i, k2t2 , . . . ,  ^m£m)5 the tile origin iQ can be expressed as
ic = k *t.
§4.2 Tiling loop iteration space 53
DO t\ = Q,(U\ — L\)  div k\
DO £2 =  0,(172 — L2 ) div k2
DO tm = 0,(Um -  Lm) div km
DO i\ = max(0, t\k\) ,min(U\ — L\, (t\ +  l)k\  —  1)
DO ?2 — max(0, t2 k2 ),min(U2 — £ 2 , (£ 2  +  1)A:2 — 1)
D O  lm —  TTl&X ( 0 ,  tm km ),777.277. (Um 7 m , ( tm “I" 1 )  km 1 )
5'
END
END
END
END
END
END
F ig u re  4.4: The 2m-dimensional tiled loop nest derived from tiling a m- 
dimensional normalised perfect nested loop nest.
.......... . ...
Therefore, the tile with tile index t , denoted 7^, is defined
Tf =  T(k  * t) = {i € I m I k * t < i < k * t  + k — 1}.
The tile index set, denoted Tm, is the set of all tile indices whose tiles are non­
empty. The tile index set is defined as follows
T m — {t G Z m I 3 i  G I m • k * t  < i < k * t  + k — 1},
or
T m = {t G Z m \ 3 i e l m* k * t < i < k * t  + k — 1 A Ci  < c}. (4.3)
The inequalities stated in Equation 4.3 define a convex polyhedron in the 2m-  
dimensional space spanned by vectors t and i. The tile index set T m is an integer 
programming projection onto the m-dimensional subspace spanned by t. Using 
the row-echelon algorithm [11], it is possible for compilers to generate the loop 
bounds to scan T m accurately.
§4.2 Tiling loop iteration space 54
DO = 0,3 
DO t2 = 0,3
DO i\ = max(0,2t\),min(7,2t\ -f 1)
DO «2 = max(0,2t2),min(7,2t2 + 1)
5
END
END
END
END
Figure 4.5: The four-dimensional tiled loop nest derived from tiling a two- 
dimensional normalised perfect nested loop nest.
Example 4.5 Consider the loop nest shown in Figure 4.2. Assume the tile size 
vector k = (2, 2) is used to tile the iteration space of this loop nest. The tile with 
index t is
Tt — {i £ I 2 I (2£i, 2^) < (h, *2 ) < (2£i + 1,2t2 + 1)}.
The tile index set for this example is
r 2 = { t e Z 2 | ( 0 , 0 ) < ( i 1;«2)<(3,3)} .  ■
Iteration space tiling alters the execution of the iterations to finish all itera­
tions in one tile before executing the next tile. The result of the iteration space 
tiling of m normalised nested loops, shown in Figure 4.3, is a loop nest with 2m 
loops of the form shown in Figure 4.4.
Example 4.6 The result of tiling the loop nest in Figure 4.2 is shown in Fig­
ure 4.5. ■
Each iteration i in the loop index set 1171 belongs to exactly one tile in the 
tile index set T m.
Definition 4.4 (Iteration allocation function) The iteration allocation T  is 
a function from Im to T m. For rectangular tiling function T is defined as
T :  I m —> Tm,T(z) =  i div k.
Example 4.7 The iteration allocation function for the tiled loop nest shown in 
Figure 4.5 is
T - . I 2 ^  T 2,T (i)  = {iuii) div (2,2).
§4.2 Tiling loop iteration space 55
/*  receive partitioned array data * /
DO ti = pi,{Ui — Li) div k\,P\
DO t2 = P2 i{U2 -  L2) div k2,P2
DO tm - Pmi{Um — Lm) div km,Pm
DO z’i =  max(0, t\k\),m in( U\ — L\, (t\ +  l)&i — 1)
DO i2 = max(0, t2k2), min(U2 -  L2, (t2 +  1 )k2 -  1)
DO im — max (0, tm km),min ( Um Lm, (tm -j- 1) km 1)
5 '
END
END
END
END
END
END
/*  send partitioned array data * /
Figure 4.6: The partition of a 2m-dimensional tiled loop nest executed by a 
processor p  belonging to the processor space P m.
4 .2 .3  A llo c a tin g  tile s  to  p ro c e s s o rs
The purpose of iteration space tiling is to control the granularity of the parallel 
execution of the loop nest. The computation of the loop iterations of a tile are 
executed sequentially by a single processor.
The processors of the multicomputer are assumed to be organised as an 771- 
dimensional mesh. The use of a virtual topology allows this approach to iteration 
space tiling to be developed in a machine independent manner, as long as the 
virtual topology is supported by the actual target machine. The mesh topology 
can, in fact, be easily embedded on most multicomputers.
D efin ition  4.5 (Processor space) The m-dimensional processor space is
P m = {p e Z m I 0 < p  < P -  1}.
where P  = (Pi, P2, . . . ,  Pm) is the size vector of the processor space, and Pj >  1 
(1 < j  < m) is the size of the j th  dimension of the processor space.
§4.2 Tiling loop iteration space 56
/* receive partitioned array data * /
DO ti = 1,3,2 
DO t2 = 0,3
DO i\ = 2t\.2t\ + 1 
DO i2 — 2t2.2t2 -t- 1 
5
END
END
END
END
/* send partitioned array data */
Figure 4.7: The partition of a four-dimensional tiled loop nest executed by 
the processor p = (1,0) belonging to the processor space P2.
To execute the nested loop in parallel, the tiles need to be allocated and 
scheduled to the processors of the multicomputer.
Definition 4.6 (Tile allocation function) The tile allocation function A  is 
the function from the tile index set T m to the processor space P m
A : T m -> P m, A(t) = t  mod P.
The tiles allocated to the same processor are executed in lexographic order of 
the tile index t. The code for each processor p  6 P m is illustrated in Figure 4.6.
Example 4.8 Consider the tiled loop nest introduced in Example 4.6. Assume 
this code is to be executed on a two-dimensional processor space. The size of the 
processor space is P  = (2,1). The processor space P2 is defined as
p 2 = {P e z 2 \ (o,o) < {pup2) < (i,o)}.
The tile allocation function is defined as
A : T 2 -+ P2,A {(t1:t2)) = {tl : t2) mod (2,1).
The code for processor p  — (1, 0) is shown in Figure 4.7. ■
4 .2 .4  A llo c a tin g  itera tio n s  to  p ro cesso rs
According to the allocation function the tiles allocated to processor p  are de­
scribed by the set Tp
Tp = {t € T m I t  mod P  — p]
§4.3 Implicit data distribution and retrieval 57
T pi A pi
Figure 4.8: The relationship between the iteration allocation function T  and 
the tile allocation function A. The composite of these mappings allocates the 
loop iterations to processors.
=  {t e T m I t € p ■ ■ T  • t mod P  = p} (4.4)
and T  is defined as
T  =  (U -  L  + 1 div k) -  1,
where
U = (Uu v2,.
L  =  ( Z u ,  L/21 • • • j  I 'm  )  5
and each Li (1 < i < m) and U{ (1 < i < m), respectively, are the lower and 
upper bound of the zth (1 < i < m) loop in the original loop nest.
The tile set Tp defines which iterations of the loop nest are executed by 
processor p. These iterations are described by the iteration set, denoted Ip .
D efinition 4.7 (Itera tion  set) The iterations allocated to processor p by tiling 
the iteration space are described by
Ip =  {i € I \ 3 t  e Tp, j  =  +
where k is the tile size vector for the tiling.
The operation of the iteration space tiling functions developed in this section 
is illustrated in Figure 4.8. Clearly, Ip is a subset of the iteration space I m 
mapped to a particular processor p by the composite map A ° T .
4.3 Im plicit data distribution and retrieval
A multicomputer, such as the Fujitsu AP1000 (see §3.1), consists of a number of 
cell processors connected to and controlled by a host computer. The host com­
puter is used to manage the cell processors by creating parallel tasks, sending 
initial data to and collecting the results of the computation from the cell pro­
cessors. The tiled loop nest shown in Figure 4.6 represents the partition of the 
2m-dimensional tiled loop nest executed by a cell processor of the multicomputer.
§4.3 Implicit data distribution and retrieval 58
DO h = 0.3 
DO t2 = 0,3
DO z"i = max(0,2£i),mm(7,2t\ -f 1)
DO i2 = max(0,2t2),min(7,2t2 + 1) 
B(ii,i2) = A(ii + «2, t'i +3) 
END 
END 
END 
END
Figure 4.9: An array reference enclosed in a four-dimensional tiled loop nest 
derived from tiling a two-dimensional normalised perfect nested loop nest.
There are two code sections that have not yet been discussed: receive par­
titioned array data and send partitioned array data. This section develops a 
novel method of statically partitioning array data on a multicomputer, referred 
to as implicit data partitioning. Implicit data partitioning determines the code 
for a cell processor to both partition and receive the initial array data from the 
host computer, and return the modified array data to the host computer at the 
completion of the computation.
Implicit data partitioning takes advantage of the distribute collective com­
munication routine (see §3.3.2) to partition and send the relevant array data to 
all processors from the host computer in a single operation. Similarly, the modi­
fied array data is returned to the host computer using the concatenate collective 
communication routine (see §3.3.3).
Implicit data partitioning exploits the basic relationship between data and 
computation. That is, if iteration i of a tiled loop nest is mapped to the pro­
cessor p  then the array data accessed by iteration i must also be mapped to this 
processor.
This relationship between data and computation is illustrated by the following 
example. Consider the tiled loop nest shown in Figure 4.9. The tile size vector 
k = (2. 2) was used to tile the iteration space of this loop nest. This code is 
to be executed on the two-dimensional processor array P2. Assume the size 
vector of the processor array is P  = (2,1). Figure 4.10(a) illustrates the result 
of allocating this tiled loop nest onto the processor array P2 (see §4.2). The 
small circles represent the iterations of the loop nest. Each square in the figure 
represents a tile. The shaded tiles are allocated to processor p  =  (1,0), the others 
are allocated to processor p = (0, 0).
Consider the read array reference to array A enclosed in the tiled loop nest 
shown in Figure 4.9. Assume array A is declared as A(0:14, 0:10). Figure 4.10(b)
§4.3 Implicit data distribution and retrieval 59
03o«3
a
CO
a3ts
73
73
J D
H
S '
03oa3a.
CO
G
.2
t5
5
.■s
S
H
S '
_ £ h
CO
G
.2
d
S
S-4
2
faß
.2
ts£
co
5
s
oT
CD
d
O h
CO
CÖ
s
73
73
Geö
CD
O
ct3
a
CO
pj
.2
d
CD
.ts
~d
.2
co
g
03
G
-3
6
£
- t j
73
03
G_o
5  «3
cö 3a
73 73
G Es a3 ctf
73 G
^  .2
-  S
. .  *—S s. o
G
03 
03 
£
r «fa Xl
0)U
3
b J D
§4.3 Implicit data distribution and retrieval 60
represents the elements of array A accessed by each tile. The small circles rep­
resent array elements. Each region in the figure corresponds to a tile. The labels 
on the tiles indicate the tile index t. The shaded tiles are allocated to processor 
p = (1,0), the others are allocated to processor p = (0, 0).
The indexes of the array elements required to execute the iterations encom­
passed by the tiles assigned to processor p = (1,0) are described by the set
£(i,o) — {(h + 2^ , h + 3) £ Z 2 I
(2*i + 2*2, 2*i + 3) < (q + 22, h T 3) < (2*i 4- 2*2 -T 2, 2*i + 4)
A *i £ 1 .. 3 A *i mod 2 — 1 A *2 £ 0 .. 3}.
The set R(i,o) is illustrated in Figure 4.10(b) by the shaded regions. Therefore, 
only array elements A(i) where i £ #(i,o) need be sent to processor p = (1, 0) by 
the host computer. Similarly, only array elements A(i ) where i £ -ß(o,o) need be 
sent to processor p = (0, 0) by the host computer.
Next, consider the write array reference to array B enclosed in the loop nest 
shown in Figure 4.9. Assume array B is declared 5(0:7, 0:7). The indexes of the 
array elements modified by the iterations encompassed by the tiles assigned to 
processor p = (1,0) are described by the set W(i>0),
W(i,0) =  {(*i, 22) F Z 2 I (2*i, 2*2) < (z'i, *2) < (2*i + 1, 2*2 + 1)
A *i £ 1 .. 3 A *1 mod 2 = 1 A *2 £ 0 .. 3}.
Therefore, only array elements B(i) where i £ W(i>0) need be sent from processor 
p = (1,0) to the host computer. Similarly, only array elements B(i)  where 
i £ PT(o,o) need be sent from processor p = (0, 0) to the host computer.
This example illustrates the basic approach of implicit data partitioning. The 
remainder of this section formally introduces general array read and write sets, 
illustrated above, and develops a systematic methodology for implicit data par­
titioning.
4.3.1 D eterm ining the array read set
As discussed in §4.2.3, the tiled loop nest shown in Figure 4.6 is executed on the 
m-dimensional processor array P m ( Definition 4.5)
P m =  {p e Z m I 0 <  p < P  -  1},
where P  = (Pi, P2, . . . ,  Pm) is the size vector of the processor array. The loop 
nest in Figure 4.6 represents the partition of a tiled loop nest executed by pro­
cessor p  £ P m. The tile size vector for the tiling is k = (k\, k2, . . . ,  km). The
§4.3 Implicit data distribution and retrieval 61
iterations allocated to this processor by the tiling are described by the iteration 
set (Definition 4.7)
Ip = {i e I \ 3 t  e Tp, j  G 0 .. k — 1 • i = k  * t  + j }, 
where Tp is the set of tiles allocated to processor p.
Exam ple 4.9 Continuing Example 4.8, the tiles allocated to processor p = (1, 0) 
are indicated by the shaded squares in Figure 4.10(a). These tiles are described 
by the set
I ( i , o )  — {t £  T 2 I ( t i ,  t2) E  (1,0) .. (3, 3) A (ti, t2) mod (2,1) — (0,0)}.
The iterations executed by processor p = (1,0) are described by the set
/ (1,0) = {i  ^ I~ \ 3 t e T(1)0), j  G (0,0) .. (1,1) • i =  (21\ +  ji, 2 t2 +  j2)}. ■
Next, consider an n-dimensional array A that is referenced within the body S 
of the tiled loop nest shown in Figure 4.6. Let l2:u2, . . . ,  ln:un) declare the
array. Array A describes the n-dimensional array data space S n (Definition 3.1 
page 32) extended to account for the non-zero lower bound of array A
S n = {i e Z n \ l < i  < u } ,
where l = (h, k, • • •, L) and u = (ui, u2, . . . ,  un).
Each iteration belonging to Ip reads one element of the array data space S n. 
This mapping is represented by the array access function (Definition 4.3)
Lp : I m —»■ Tn, tp(i) =  Fi + c.
The elements of the array data space S n accessed by the set of iterations Ip 
assigned to processor p are described by the array read set, denoted Rp,
Rp =  { j  F Sn I 3i  G I p» j  =
Exam ple 4.10 The loop nest given in Figure 4.11 represents the partition of a 
four-dimensional tiled loop nest executed by processor p = (1,0). Assume the 
two-dimensional array A accessed in the tiled loop nest shown in Figure 4.11 is 
declared as y4(0:14, 0:10). The two-dimensional array data space defined by this 
array can be expressed as
S 2 = { i e z 2 \ (0.0) < (h.ij)  < (14,10)}.
§4.3 Implicit data distribution and retrieval 62
/* receive partitioned array data * /
DO t\ = 1,3,2 
DO t2 = 0,3
DO i\ = 2t\,2t\ -f- 1 
DO ?2 — 2^ 2,2t2 + 1
B (ii,i2) = A(n + z2, i\ + 3)
END
END
END
END
/* send partitioned array data * /
Figure 4.11: An array reference enclosed in the partition of a tiled loop nest 
allocated to processor p = (1,0) belonging to the processor space P2.
The array access function for the array read reference in Figure 4.11 (see Ex­
ample 4.4) is
V ■ C  S 2,ip(i) ^  j J
The array read set R(i$) can be formulated as
#(i,o) -  {j £ S 2 \ 3 i  e  /(1>0) • j  — <^(i)}-
The n-dimensional array A is placed in memory as a contiguous block of data. 
This block of data can be expressed as the linear array data space (Definition 3.2 
page 32) extended to account for the non-zero lower bound of array A
n
S =  { i € z Z \ 0 < i ' ^ i  (Uj — lj + 1) — 1},
j = i
where lj and Uj, respectively, are the lower and upper bounds of the j th  (1 < j  < 
n) dimension of array A.
Example 4.11 Continuing Example 4.10, the two-dimensional array A refer­
enced in Figure 4.11 is placed in the block of data expressed as the linear array 
data space
S = { i e Z \ 0 < i <  164}.
§4.3 Implicit data distribution and retrieval 63
Each element of the array data space S n maps to one element in the linear 
array data space S. This mapping is represented by the array linearisation func­
tion (Definition 3.3 page 33) extended to account for the non-zero lower bound 
of array A
C: S n S ,£(i)  = d - (i — l).
The elements of S encompassed by Rp are described by the set R p ,
R'p = {s e S \ 3 j  e Rp • s = C(j)}
=  {s e S I 3 j  e Rp • 5 = d ■ (j -  /)}
— {5 G 5* | 3 i G Ip • s — d • [D[i) — Z)}
= {s e S \ 3 i  e Ip • s = d • (Fi  + c — /)}
Let q = dF  and v =  d ■ (c — /). Rp can be expressed as
Rp = {s E S \ 3 i  e Ip • s = q ■ i + v}.
Exam ple 4.12 Continuing Example 4.11, the elements of S encompassed by 
R(i$) are described by the set R'^
R{m  — {s G S I 3 i G /(i,o) • 5 = (15, 1) • i + 42}. ■
4.3.2 Deriving the array receive partition
Implicit data partitioning utilises the distribute collective communication routine 
(see §3.3.2) to send each array read set Rp to each processor p  belonging to the 
physical processor space in a single operation.
The host computer executes the sdist communication primitive, and broad­
casts a contiguous block of data to all processors in the m-dimensional processor 
space P m. Consider processor p  belonging to this processor space. This processor 
executes the dist communication primitive. The elements of data S broadcast by 
the host computer vchich processor p  receives are (Definition 3.6 page 38)
Dp = {5 G ^ I 3 jr G0 . . C — 1 • s = b + s ■ j} ,
where s is the stride vector, c is the count vector and b is the base offset. The 
remainder of this section derives these parameters to the dist communication 
primitive in order that processor p  receives the array data described by the set 
Rp from data S broadcast by the host computer.
The first step in the derivation of the parameters to the dist communication 
primitive for processor p  is to normalise the vector t =  (£1?. . . ,  tn) from Equa­
tion 4.4. Consider the general case of normalising the n vector x  = (xx, . . . ,  xn) 
such that
l < x < u  A x  mod S  =  Z,
§4.3 Implicit data distribution and retrieval 64
with x =  Sx' 4- l . where
/ Si 0  • • •  0 \
0 5 2  • • •  0
V 0 0  • • •  5 n )
Lower and upper bounds of x' can be expressed in terms of the lower and upper 
bounds of x  as follows.
I < x  < u A x  mod S =  l 
=> l <  Sx' +  l < u  A (S x ' + l) mod S  =  l 
=> 0 < Sx' < u — l A Sx' mod S' = 0 
=> 0 < S * x ' < u  — l A S * x '  mod 5  =  0
(4.5)
The following shows that the distribute collective communication routine can be 
used to deliver the set of elements Ftp to processor p  from the host computer.
q - (k * (Pt  + p) +  j )  =  q • (k * P t ’ + k * p) + q ■ j
=  q • (k * P t 1) +  q (k * p) +  q ■ j
=  q • (k * P  * t 1) +  q • (k * p) +  q • j
m
=  QikPit'i +  q { k * p )  +  q j  
1 =  1
=  (q * k * P) ■ t ’ +  q • (fe * p)  + q • j . (4.6)
y j  ^  \
sr*1 CL O
- l
The elements of the data space 5, broadcast by the host computer to all processors 
in the processor array P m, required by processor p  are described by the set
C
{ s e S \ 3 i e  Ip »s =  q-  i +  v}
{s e  S \ 3 t  e  Tp : j  E 0 .. k — 1 • s — q • (k * t  + j )  + v}  
{s e  S \ 3 t  E p  . . T,  j E 0 . . k  — l » s  =  q - ( k * t  +  j )  +  v 
A (t — p)  mod P  =  0}
(by Statement 4.5 normalise t  with t  = Pt '  + p)
{s € S 3t' e o. . T - pP , j  € 0 .. k — 1 •
s = q ■ (fc * (Pt ' + p) + j )  + u}
(by Equation 4.6)
§4.3 Implicit data distribution and retrieval 65
s = (q * k * P) ■ t' + q ■ j  + q - (k * p) + v}
{s e S \ 3t' e 0 ..
s = (q * k * P) • t' + q • j  + q • (k * p) + v}. 
Let b = q • (k * p) + v, s be the 2m integer vector
a  , Qm ^m Pm i Ql ? • • • j I m )
and c be the 2m integer vector
c =
The set Rp is a subset of Rp which can be expressed in terms of the scalar 6, 
and the vectors s and c
Clearly, the set Rp is in a form that can be used with the dist communication 
primitive. Processor p  receives the elements of S described by the set Rp from the 
elements broadcast by the host computer using the dist communication primitive 
with parameters b, s and c derived above.
Example 4.13 Continuing Example 4.12, the set R [i^  is a subset of R"10) which 
can be expressed
Let b = 72, s be the 4 integer vector s = (60, 2, 15, 1) and c be the 4 integer 
vector c = (2, 4, 2, 2). The set R"x can be expressed in terms of the scalar 6, 
and the vectors s and c as
Rri 0) = { s g 5 | B j G O . . c - 1 * s =  72 + s • j } .
Therefore, processor p = (1,0) is able to receive the elements described by the 
array read set R'^ using the call to the dist communication primitive shown in 
Figure 4.12. ■
Rp C Rp = {s G S I 3 j  6 0 . . C  — 1 • s = b + s ■ j } .
R"m  = { « € 5  I 3 t ' € (0 ,0 ) . .  (2 ,4)- (1,1) ,  j €  (0,0)..  (2 ,2) - (1 ,1)
. s = (60, 2) • f  +  (15, 1) • j 72}
§4.3 Implicit data distribution and retrieval 66
sdist(A, 164) dist(/u6/, 72, 4, (60, 2, 15, 1), (2, 4, 2, 2))
DO h = 1,3,2 
DO t2 = 0,3
DO i\ — 2 i^ ,2 i^ +  1 
DO i2 — 2t2:2t2 -f- 1
B{i\, 12) =  A(i\ + 22, *i +  3) 
END
END
END
END
/* send partitioned array data * /
Host computer Cell processor
Figure 4.12: The use of the distribute collective communication routine to 
partition array data. Use of both the sdist communication primitive by the 
host computer and the dist communication primitive by the cell processor are 
illustrated.
4.3.3 D ete rm in in g  th e  a rray  w rite  set
Recall from §4.3.1 th a t the tiled loop nest shown in Figure 4.6 is executed on 
the m-dimensional processor array P m. The loop nest in Figure 4.6 represents 
the partition of a tiled loop nest executed by processor p  € P m. The itera­
tions allocated to this processor by the tiling are described by the iteration set 
(Definition 4.7)
where Tp is the set of tiles allocated to processor p.
Consider an n-dimensional array B  tha t is w ritten within the body S  of 
the loop nest shown in Figure 4.6. Let the array be declared . . . ,  ln:un).
Array B describes the n-dimensional array data space S n (Definition 3.1 page 32) 
extended to account for the non-zero lower bound of the array
where l = (lu fe, • • •, h)  and u  =  (nl5 u2, • • •, w„).
The elements of the array data space S n described by array B  written by the 
iterations belonging to Ip are described by the array write set, denoted Wp,
Ip = {i € I m I 3 t  G T p . j  GO..fc-l»i = fcR + j},
S n = {i <E Z n \ l < i < u},
Wp = { j e S n \ 3 i e I p . j  = <p(i)},
§4.3 Implicit data distribution and retrieval 67
where the function p)  is the array access function (Definition 4.3).
Exam ple 4.14 Assume the two-dimensional array B accessed in the tiled loop 
nest shown in Figure 4.11 is declared 5(0:7, 0:7). The two-dimensional array data 
space defined by this array can be expressed as
■ S2 = { i <sZ2 \ (0,0) < {iuh)  < (7,7)}.
The array access function for the write array reference in Figure 4.11 can be 
formulated as
<P : I 2 - >  S 2:p{i) =
1 0 
0 1
h
*2
The array write set VF(i,0) can be formulated as
Wm  = { j  e S2 \ 3 i  e /(i,o) • j  =  <£(*)}> 
where the iteration set 7(1)0) is given in Example 4.9. ■
The n-dimensional array B is placed in memory as a contiguous block of data. 
This block of data can be expressed as the linear array data space (Definition 3.2 
page 32) extended to account for the non-zero lower bound of array B
n
S = {i £ Z | 0 < i < (Uj — lj + 1) — 1},
3 = 1
where lj and Uj, respectively, are the lower and upper bounds of the j th (1 < j  < 
n) dimension of array B.
Exam ple 4.15 Continuing Example 4.14, the one-dimensional array data space 
described by the two-dimensional array B referenced in Figure 4.11 is
S = { i e Z \ 0 < i <  49}. ■
Each element of the array data space Sn maps to one element in the one­
dimensional array data space S. The elements of S encompassed by Wp are 
described by the set Wp. This set can be formulated using the array linearisation 
function (Definition 3.3 page 33) extended to account for the non-zero lower 
bound of array B
C : Sn S , C(i) =  d • (i — l).
w'p -  {s e S \ 3 j e  wp .s = c(j )}
= {s e S \ 3 j  e Wp • s = d ■ (j - 1)}
as
§4.3 Implicit data distribution and retrieval 68
— {s E S' | 3 i E Ip • s — d • [P(i) — l)}
= {5  E S I 3 i E Ip • 5 =  d • (Fi  + c — Z)}.
Let g = dF  and v = d • (c — /). Wp can be expressed as
Vbp = {s e 5 I 3 i £ l p » s  = q- i + v}.
Example 4.16 Continuing Example 4.15, the elements of S encompassed by 
VE(ii0) are described by the set
^(i,o) = {5 E S I 3 i E /(i,o) • s =  (1, 7) • i + 0}- ■
4.3.4 D eriv ing  th e  a rray  send p a r tit io n
Implicit data partitioning utilises the collective communication routine concaten­
ate (see §3.3.3) to collect the array write set Wp from each processor p  belonging 
to the physical processor space in a single operation. The discussion in this sec­
tion relies on (k * P ) | ( U — L). This assumption ensures that each tile of the tiled 
iteration space contains the same number of iteration points and each processor 
is allocated an equal number of tiles.
The host computer executes the dconcat communication primitive and receives 
array data from each processor p  in the m-dimensional processor space P m. 
Consider processor p  belonging to this processor space. This processor executes 
the concat communication primitive. The elements of S sent to the host computer 
by processor p  are (Definition 3.7 page 43)
Gp =  { s E S I 3 j  E 0 . . C — l * s  = 6 + s - j r },
where s is the stride vector, c is the count vector and b is the base offset. The 
remainder of this discussion derives these parameters to the concat communic­
ation primitive in order that processor p sends the array elements described by 
Wp to the host computer.
{ s e S \ 3 i e  Ip •s  =  q- i + v}
{s e S \ 3 t  e Tp , j  E 0 . . k -  1 • s =  q • (k * t  -I- j )  +  v}
{s € S \ 3 t  e p . .T,  j  e 0 .. k — l » s  = q - ( k * t  +  j )  + v 
A (t — p) mod P  = 0}
(by Statement 4.5 normalise t with t = Pt' + p  and (k * P
{ s E i S l ^ t  E 0 .. T - pP
j  e O . T - 1 *
^max ))
5 =  q ■ (fc * ( Pt 1 + p)  + j )  + v}
(by Equation 4.6)
§4.3 Implicit data distribution and retrieval 69
{s g S I 3 ?  g o ..
T - p
, j  e o .. k -  l
s = (q * k * P) ■ t' + q ■ j  + q • (k * p) + v}
T - p
+ 1 - 1 ,  j  € 0..Ä5 -  1«= {s € S \ 3t '  e 0 .
s = (q * k * P) ■ t' + q ■ j  + q • (k * p) + v}. 
Let 6 =  g • (k * p) + v, s be the 2m integer vector
S (Qi ^ i P ii  • • • j Q u ik m ^ m i Qli ■ ■ • i Qm)
and c be the 2m integer vector
c = Pi
P i
+
Pm Pi
Pm +  l j  ^ l j  ■ •  •  )  K
The set Wp can be expressed in terms of the scalar 6, and the vectors s and c
Wp = {s € S I 3 j  G 0 .. c — 1*5 =  b + s • j }.
The assignment statement which modifies the array elements conforms to 
the single assignment rule (see §4.1). Therefore every array element modified is 
unique over the execution of the iteration space I m. That is,
(Vi, j  e I m* i ^  j  => </>(i) /  <p{j)),
where <p(i) is the array access function (Definition 4.3).
Next, consider processors p, p' G P m. The iteration space tiling transforma­
tion (see §4.2) ensures that
Ip n Ip' — 0.
Combining this condition with single assignment assumption gives the computer- 
owns rule,
(vp,p' e p m . p ^ p '  => wp n wP' = 0).
That is, all array elements written by a processor belonging to the processor space 
Pm are owned and allocated to that processor.
These conditions ensure the set Wp is in a form that can be used with the 
concat communication primitive. Processor p sends the elements of S described 
by the set WL to the host computer using the concat communication primitive 
with parameters 6, s and c derived earlier.
Exam ple 4.17 Continuing Example 4.16, the set Wp can be expressed
W!p = {s € 5 I 3 1' 6 (0,0) .. (2,4) — (1,1), j  6 (0 ,0).. (2,2) — (1,1) 
.(4 ,1 4 )-t'H- (1,7) - j  + 2}.
§4-3 Implicit data distribution and retrieval 70
sdist(^4, 164) 
dconcat(i?, 49)
Host computer
dist(/u6/, 72, 4, (60, 2, 15, 1), (2, 4, 2, 2))
DO t\ = 1,3,2 
DO t2 = 0,3
DO i\ — 2t\,2t\ -(- 1 
DO i2 =  2t2:2t2 + 1
B (ii,i2) = A(ii -f i2,i\ + 3) 
END 
END 
END 
END
concat(/u6/, 2, 4, (4, 14, 17), ((2, 4, 2, 2)) 
Cell processor
Figure 4.13: The use of the concatenate collective communication routine to 
retrieve array data. Use of both the dconcat communication primitive by the 
host computer and the concat communication primitive by the cell processor 
are illustrated.
Let b = 2, s  be the 4 integer vector s =  (4, 14, 1 7) and c be the 4 integer vector 
c =  (2, 4, 2, 2). The set Wp can be expressed in terms of the scalar b and the 
vectors s  and c as
tUp — {s £ »S' I 3 j  G 0 .. c — 1 • s — 2 T  s • c)-.
Therefore, processor p  =  (1, 0) is able to send the elements described by the array 
write set Wp using the concat communication primitive shown in Figure 4.13. ■
5
E xplic it d a ta  d is tr ib u tio n : th e  H P F  
d irec tives
High-level parallel languages such as Delirium [90], Linda [23] and Strand [38] are 
valuable when used to coordinate coarse-grained functional parallelism. However, 
these languages fail to elegantly describe data-parallel computation of the type 
described by Hillis and Steele [54] and Karp [66]. In particular, these languages 
lack both language and compiler support to assist in efficient data placement, 
that is the partitioning and mapping of data to individual processors [100].
High Performance Fortran (HPF) [53] has been designed to overcome this defi­
ciency. The goal of HPF is to provide a simple, yet efficient, machine independent 
parallel programming model. By shifting much of the burden of machine depend­
ent optimisation to the compiler, the user can easily write data-parallel programs 
that can be compiled and executed on a variety of architectures.
The specification of HPF provides a powerful set of compiler directives de­
signed to allow many different data mappings to be specified. The HPF approach 
is similar to other approaches [39, 56, 57, 58, 123, 132] in that descriptive names 
such as block and cyclic are used to specify different types of mappings. In HPF 
this approach has more flexibility because these modifiers may be applied to 
individual dimensions of a multidimensional array.
This chapter will concentrate on the implementation of the HPF data distri­
bution model on the Fujitsu AP1000 multicomputer. In this implementation, 
collective communication routines (see §3.3) simplify the communication associ­
ated with the data partitioning. This chapter begins with a description of the 
HPF data distribution model which allows the user to advise the compiler of how 
to assign array elements to processor memories. The HPF features for aligning 
array elements which are accessed together are discussed and illustrated in §5.2. 
The HPF processors directive is presented in §5.3. The HPF data distribution
71
§5.1 Data distribution model 72
D is tr ib u te
Group of 
aligned objects
Arrays or 
other data objects
Abstract
processors
Physical
processors
Figure 5.1: HPF Data Distribution Model.
directives are explained in §5.4, including examples which illustrate the type of 
data distributions that are possible using HPF. A description of an approach 
to implementing the HPF data distribution model on the Fujitsu AP1000 then 
follows.
5.1 D a ta  d is tr ib u tio n  m odel
HPF provides the user with explicit control over data partitioning utilising 
data alignment and distribution specifications. HPF has three main compiler 
directives derived from Fortran D and Vienna Fortran: TEMPLATE, ALIGN and 
DISTRIBUTE.
The TEMPLATE directive declares the name, dimensionality, and the size of each 
problem domain. In its simplest form a template is an abstract representation of 
a problem or index domain.
The ALIGN directive specifies fine-grain parallelism, mapping each array ele­
ment onto one or more elements of the template. Multiple templates may repres­
ent different problem mappings, but an array can only be aligned to one template 
at any one point in time. An array not explicitly aligned to any template serves 
as its own template.
The DISTRIBUTE directive specifies coarse-grain parallelism. It groups tem­
plate elements and maps them, together with aligned array elements, to the finite 
resources of the machine. Each dimension of the template is either distributed 
in a block or cyclic fashion, or replicated. The distribution which is selected can 
affect the compiler’s ability to minimise communication and load imbalance in 
the resulting program.
An HPF compiler maps arrays to physical processors by using the three stage 
process shown in Figure 5.1. This mapping is guided by the user via the HPF 
compiler directives.
Stage 1: ALIGN directives are processed to compute functions that map the array 
index domain to the template index domain.
§5.2 The align directive 73
Stage 2: Each dimension of the template is mapped onto the logical processor 
grid based on the DISTRIBUTE directive.
Stage 3: The logical processor grid is mapped onto the physical system. This 
mapping can change from one system to another, but the data mapping 
onto the logical grid does not need to change. This enhances portability 
across a large number of architectures.
By performing this three stage mapping, the program is decoupled from the 
specifics of a given machine or configuration.
5.2 T he align d irective
The ALIGN directive is used to specify which portions of two or more arrays will 
be mapped to the same processor for a particular data distribution. Clearly, if 
arrays involved in the same computation are aligned so that after distribution 
their respective sections are stored on the same processor, then the number of 
non-local accesses will be reduced.
The ALIGN directive is designed to make it particularly easy to specify explicit 
mappings for all the elements of an array at once. While it is possible in some 
cases to align array elements through the careful use of matching of DISTRIBUTE 
directives, the ALIGN directive is more general and in many cases more convenient. 
The general form of the ALIGN directive is
!HPF$ ALIGN WITH B(/i(ij,
The specified elements of A are aligned to those of B. This alignment is referred 
to as an alignment for the source array A with respect to the target array B. The 
target array B is eventually distributed onto a set of processors (see §5.4). The 
compiler must guarantee that the array elements of A aligned to the same element 
of array B will be mapped to the same processor. Each alignment function ) 
(1 < j  < n and aj G { 1 ,..., n}) is an affine function of the form
fj (Tj) Sjiüj T Oj, (3.1)
where the integer Sj (1 < j  < n) is referred to as the alignment stride and the 
integer Oj (1 < j  < n) is referred to as the alignment offset. Alignment of arrays 
can take place either within or between array dimensions. A description of the 
syntax for the ALIGN directive can be found in appendix D.2. The remainder of 
this section describes the use of the ALIGN directive.
5 .2 .1  E xact m atch
The simplest alignment occurs when the source array is exactly mapped onto the 
target array. The exact alignment of the source and target arrays implies that
§5.2 The align directive 74
any two elements of the arrays with the same index are guaranteed to reside in 
the same processor.
Example 5.1 In this example, the arrays XI and X2 are distributed onto the 
equivalent dimensions in the arrays A and B.
!HPF$ ALIGN XI(I)  WITH A(I)
!HPF$ ALIGN X2(I ,J )  WITH B (I ,J )  "
5.2.2 In tra -d im en sio n  a lignm ent
Intra-dimension alignment determines the data distributions within each dimen­
sion. Intra-dimension alignment is specified using the alignment function. The 
HPF language [53] requires that each ia (a; € { 1 ,..., n}) appears in exactly one 
alignment function of the form given in Equation 5.1. This restriction does not 
permit skew alignments such as aligning XI(I) with B (I,I)  or X 2(I,J) with 
A(I+J).
An alignment offset can be specified for each dimension of the source array. 
Constants are added to the placeholders in the target array to indicate the offset 
in that dimension.
Example 5.2 In this example, XI and X2 are aligned with respect to the target 
A
!HPF$ ALIGN XICD WITH A(I+1)
!HPF$ ALIGN X2(I) WITH A ( I - l )
The array element XI (I) is always mapped to the same processor as array element 
X2(I+2). Therefore, array element XI(1) is mapped to the same processor as 
array element X2 (3). ■
The ALIGN directive also allows a stride to be specified when performing intra- 
dimensional alignment. Alignment strides are used to determine the density of an 
array mapped to a dimension. They are introduced as coefficients of placeholders 
in the alignment function of the target array. Clearly, strides may also be used 
in combination with offsets.
Example 5.3 In this example, array XI has a stride of 2 with respect to the 
target A.
!HPF$ ALIGN X I ( I ,J )  WITH A(2*I,J)
!HPF$ ALIGN X2(I ,J )  WITH A(2*I-1 ,J )
Therefore, the elements of array XI are aligned with the even elements of A. Array 
X2 also has a stride of 2, but the alignment offset of —1 causes its elements to be 
aligned to the odd elements of A. ■
Alignment strides are easily extended to higher order arrays. Alignment 
strides with negative values are permitted; they correspond to mapping the re­
flection of the source array dimension onto the target array.
§5.3 The processors directive 75
5 .2 .3  In ter-d im en sion  a lign m en t
Inter-dimension alignment determines the data mapping between dimensions. In 
HPF the user can arbitrarily permute the dimensional alignment between arrays. 
A common application for index permutation is to perform array transpositions. 
Placeholders must be used to mark the aligned dimensions.
Example 5.4 In this example, the transpose of XI is mapped to the distribution 
of B, as indicated by the reversed placeholders I and J. Similarly, the third and 
first dimensions of X2 are mapped to the first and second dimensions of the 
decomposition of B.
!HPF$ ALIGN X 1 ( I , J )  WITH B ( J , I )
!HPF$ ALIGN X 2 ( I , J , K )  W I T H B ( K , I , J )  "
5.3 T he processors d irective
The PROCESSORS directive declares a rectilinear abstract processor array onto 
which the array data is to be partitioned. The directive specifies a name, the 
number of dimensions, and the size of each dimension of the abstract processor 
array. The general fo r m  of the PROCESSORS directive is
!HPF$ PROCESSORS P(Pi,P2, • • • ,Pm)
where each Pi > 0 (1 < i < m).  A description of the syntax for the PROCESSORS 
directive can be found in appendix D.3.
Example 5 . 5  Consider the following PROCESSORS directive,
!HPF$ PROCESSORS P ( 4 , 3 )
P is a two-dimensional abstract processor arrangement with 4  abstract processors 
in the first dimension and 3 abstract processors in the second dimension. There­
fore, the total number of abstract processors is 12. ■
Definition 5.1 (Abstract processor space) The m-dimensional abstract 
processor space defined by the PROCESSORS directive is described by
Pm = {p e Z m I 0  <  p < P  -  1 } ,
where P  =  (P1? P2, . . . ,  Pm) is the size vector of the abstract processor space, and 
Pj > 1 (1 < j  < m) is the size of the jth  dimension of the processor space.
Example 5.6 The PROCESSOR directive given in the previous example defines the 
two-dimensional abstract processor space
P2 = {p e Z2 \ (0,0) < (pi,P2) < (3,2)}.
The size vector for P2 is P  =  (4,3).
§5.4 The distribute directive 76
5.4 T h e  d is t r ib u te  d ire c tiv e
The DISTRIBUTE directive allows the user to advise the compiler how to assign 
array elements to the abstract processor memories. The general form of the 
DISTRIBUTE directive is
!HPF$ DISTRIBUTE A(di, d2, .. . , dn) ONTO P
where each distribution format di (1 < i < n) is either CYCLIC, BLOCK or *, and 
P is an abstract processor array (see §5.3). The number of dimensions of the 
array partitioned using the distribution formats CYCLIC or BLOCK is equal to the 
number of dimensions of the abstract processor array P .
Assume the n-dimensional array A is to be distributed onto the m-dimensional 
(m < n ) abstract processor arrangement P using the DISTRIBUTE directive. Let 
the array be declared A(li:ui, . . . ,  ln:un). The dimensions of the abstract pro­
cessor array P correspond, in left-to-right order, to those dimensions of array 
A distributed with formats CYCLIC or BLOCK. Those dimensions of array A dis­
tributed with format * are not associated with the dimensions of the abstract 
processor array. The * specifies that every element belonging to the array dimen­
sion is allocated to the same processor. Implicit in this definition is that there is 
only one processor associated with the * distribution format.
The m-dimensional abstract processor space P m, from Definition 5.1, is ex­
tended to the n-dimensional processor space,
Pn = {p e Z n I 0 < p < (P ' T ) -  1},
where P'  =  (Pl5 P2, . . . ,  PTO, 1, . . . ,  1), and T E Z nxn is a transformation matrix 
which maps the dimensions of P'  to the corresponding distribution formats of 
array A. Assume the kth (1 < k < n) distribution format of array A corresponds 
to the j th  (1 < j  < n) dimension of the abstract processor space P n. The matrix 
element
Tjk =  1,
where j  < k and each row of the matrix has only one non-zero element.
Exam ple 5.7 Assume the array A(1:10,1:8 ,1:6) is to be mapped onto a two- 
dimensional abstract processor space using the DISTRIBUTE directive.
!HPF$ PROCESSORS P(3,2)
INTEGER A(1:10,1:8,1:6)
!HPF$ DISTRIBUTE A(CYCLIC(3),* ,BLOCK(3)) ONTO P 
The PROCESSOR directive defines the two-dimensional abstract processor space 
P2 = { p e Z 2 1 (o,o) < (PuP2) < (2,1)}.
§5.4 The distribute directive 77
The size vector for P2 is P  = (3,2). The three-dimensional extended processor 
space is defined as
P 3 = {p G Z3 I 0 < p  < (P ' T ) -  1}, 
where P' = (3, 2,1) and the transformation matrix T G Z nxn is
T =
/  1 ° ° \  
0 0 1
VO 1 0 )
Definition 5.2 (Extended abstract processor space) The n-dimensional
extended abstract processor space is defined as
P n = { p e Z n \ 0 < p < P '  - 1 } ,
where P'  =  (P1? P2, . . . ,  Pm, 1 ,. . . ,  1) T is the size vector of the extended abstract 
processor space, and T G Z nxn.
Example 5.8 Continuing the previous example, the extended processor space 
can be expressed as
P3 = { p e  Z3 I (0, 0, 0) < {pup2, Ps) < (2, 0,1)}.
The size vector for P 3 is P' = (3,1,2). ■
A description of the syntax for the DISTRIBUTE directive can be found in appen­
dix D.l. The remainder of this section explains the three distribution formats 
CYCLIC, BLOCK and *.
5.4.1 Cyclic distribution format
Assume the kth (1 < k < n) distribution format of an array A is CYCLIC (m), 
where m is a positive integer. The element of A whose index in the kth. dimension 
is j  (Ik < j  < Uk) is mapped to the abstract processor whose index along the 
corresponding dimension of the extended abstract processor space is pi (0 < Pi < 
Pk — 1) where
P i = mod Pk. (5.2)
The HPF language requires [53] the distribution format CYCLIC to be interpreted 
as CYCLIC(1).
Example 5.9 Assume the array B(l:100) is to be mapped onto 16 abstract 
processors using the CYCLIC distribution format,
§5.4 The distribute directive 78
Abstract Processors
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 4 7 10 13 16 19 22 25 28 31 34 37 40 43 46
2 5 8 11 14 17 20 23 26 29 32 35 38 41 44 47
3 6 9 12
_
15 18 21 24 27 30 33 36 39 42 45 48
49 52 55 58 61 64 67 70 73 76 79 82 85 88 91 94
50 53 56 59 62 65 68 71 74 77 80 83 86 89 92 95
51 54 57 60 63 66 69 72 75 78 81 84 87 90 93 96
97 100
98
99
F igure 5.2: CYCLICO) mapping of the array B(1:100) onto the abstract pro­
cessor array P (16).
!HPF$ PROCESSORS P(16)
INTEGER B(1:100)
!HPF$ DISTRIBUTE B(CYCLIC(3 ))  ONTO P
Using the distribution format CYCLICO) element B (34) is mapped to the abstract 
processor whose index in P is 16 calculated using Equation 5.2. The complete 
CYCLIC(3) mapping of array B onto P is shown in Figure 5.2. ■
5 .4 .2  B lo ck  d is t r ib u t io n  fo rm a t
Assume the kth (1 < k < n) distribution format of an array A is BLOCK(ra), 
where m is a positive integer. The element of A whose index in the kth dimension 
is j  (4 < j  < Uk) is mapped to the abstract processor whose index along the 
corresponding dimension of the extended abstract processor space is pi (0 < Pi < 
Pk — 1) where
Pi = (5.3)
and PkTn > (tijt — 4 +  !■)• The HPF language [53] requires the distribution format 
BLOCK is to be equivalent to the distribution format BLOCK (m) where
m = 'U 'k 2 4 T 1
P k
Exam ple 5.10 Continuing Example 5.9, assume the array C (1:100) is to be 
mapped onto the 16 abstract processors using the block distribution format,
§5.4 The distribute directive 79
Abstract Processors
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 9 17 25 33 41 49 57 65 73 81 89 97
2 10 18 26 34 42 50 58 66 74 82 90 98
3 11 19 27 35 43 51 59 67 75 83 91 99 .
4 12 20 28 36 44 52 60 68 76 84 92 100
5 13 21 29 37 45 53 61 69 77 85 93
6 14 22 30 38 46 54 62 70 78 86 94
7 15 23 31 39 47 55 63 71 79 87 95
8 16 24 32 40 48 56 64 72 80 88 96
F igu re 5.3: BL0CK(8) mapping of the array C(1 : 100) onto the abstract pro­
cessor arrangement P (16).
INTEGER C(1:100)
!HPF$ DISTRIBUTE C(BLOCK(8 ))  ONTO P
Using the distribution format BLOCK(8) element C (47) is mapped to the abstract 
processor whose index in P is 11 calculated using Equation 5.3. The complete 
BLOCK(8) mapping of array C onto P is shown in Figure 5.3. ■
The BLOCK distribution format can be shown to be a special case of the CYCLIC 
distribution format. The HPF language [53] requires the BLOCK distribution format 
to assert
Therefore,
Pkm > uk -  lk + 1 => Pk >
=> Pk >
Uk — h  +  1 
m
U k  —  4
=> (Since lk < j  < uk) 
j  ~ kPk >
=> Pk >
m
j  ~ k
Pz =
J  ~  k  
m
j ~ k mod Pk-
§5.5 Explicit data distribution 80
Hence, the BLOCK distribution format is a special case of the CYCLIC distribution 
format.
5.4.3 * d is tr ib u tio n  fo rm at
Assume the fcth (1 < k < n) distribution format of an array A is *. The * 
specifies that every array element belonging to this array dimension is allocated 
to the same processor.
The * distribution format can be shown to be a special case of the CYCLIC 
distribution format. Recall the CYCLIC allocation function from Equation 5.2
Vz =
j  ~ k
m
mod TV
From Definition 5.2 Pk = 1. Let m = 1. Therefore, for all j  (If, < j  < it*), Pi = 
0. Clearly, every array element belonging to this array dimensions is allocated 
to the same processor in the extended abstract processor space. Hence, the * 
distribution format is a special case of the CYCLIC distribution format.
5.5 Explicit data distribution
A multicomputer, such as the Fujitsu AP1000 (see §3.1), consists of a number 
of cell processors connected to and controlled by a host computer. The host 
computer is used to manage the cell processors by creating parallel tasks and 
sending initial array data to the cell processors. The HPF compiler directives 
provide the user with explicit control over the allocation of array data to the cell 
processors.
The PROCESSORS directive defines the structure of the abstract processor ar­
rangement onto which the array data is to be partitioned. The user advises 
how to distribute the initial data to these abstract processors via the DISTRIBUTE 
and ALIGN compiler directives. The HPF compiler then assigns the array data to 
physical processors using the three stage mapping described in §5.1.
This section develops a new method of analysing the HPF compiler directives 
in order to partition the array data on a multicomputer, referred to as explicit 
data distribution. Explicit data distribution determines the code for a physical 
processor to both partition and receive array data from the host computer.
Explicit data distribution utilises the distribute collective communication 
routine (see §3.3.2) to partition and send the relevant array data to all pro­
cessors from the host computer in a single operation. The remainder of this 
section develops a systematic methodology for explicit data partitioning based 
on the three stage mapping from §5.1.
§5.5 Explicit data distribution 81
5.5.1 A llocating array data to abstract processors
The DISTRIBUTE directive divides the array data into blocks of particular size 
and allocates these blocks onto the abstract processors in either block or cyclic 
fashion. Recall from §5.4 that the BLOCK and * distribution formats are shown to 
be special cases of the CYCLIC distribution format. Therefore, the discussion in 
this section need only focus on the CYCLIC distribution format.
Consider an 72-dimensional array A which is distributed onto the n-dimension- 
al extended abstract processor space P n (Definition 5.2) using the DISTRIBUTE 
directive. Let A(li:ui, /2:tz2, ■ ■., ln'un) declare the array. Array A describes the 
n-dimensional array data space S n (Definition 3.1 page 32) extended to account 
for the non-zero lower bound of the array
S n = {i e Z n \ l < i < u},
where l = (/i, /2, . . . ,  ln) and u — (u1? uq, . . . ,  un). The block with origin i0 is the 
subset of the array data space S n
B(i0) =  {i e S n I i 0 < i < i0 -I- m  -  1 },
where m  — (mi, 7?22, . . . ,  mn) is the block size vector of the distribution and each 
rrij (1 < j  < n) is a positive integer. The block origin i0 can be expressed as
i0 = m  * b + i,
where b = (61, 62, . . . ,  bn) is called the block index of the corresponding block. 
Therefore, the block with index 6 , denoted is defined
Bfo = B (m  * b + l) = {i e S n \ m  * b + l < i < m  * b + l m  — 1}.
The block index set, denoted Rn, is the set of all block indices whose blocks are 
non-empty, is defined as
B n = {b € Z n | 3 z € S n * m * b  + l < i < m * b  + l - \ -m  — 1 }.
Example 5.11 Consider the HPF code fragment shown in Figure 5.4. There 
is one array B (1:20,1:15). This array defines the two-dimensional array data 
space
S2 = {i £ Z2 \ (1,1) < (>'i, *2) < (20,15)}.
The block size vector of the distribution is m  = (3, 4) taken from the DISTRIBUTE 
directive in Figure 5.4. The block with index b is
Bfo = {i G S~ I (36i + 1,462 + 1) < (fi, z2) < (36i -(- 3, 462 -I- 4)}.
§5.5 Explicit data distribution 82
INTEGER B ( 1 : 2 0 , 1 : 1 5 )
!HPF$ PROCESSORS P ( 4 , 3 )
!HPF$ DISTRIBUTE B (C Y C L IC (3 ) ,C Y C L IC (4 ))  ONTO P 
Figure 5 .4 :  H P F  co d e  fragm en t in v o lv in g  d is tr ib u te  d irectives .
The block index set for this example is
B2 = {b e Z 2 \ (0,0) < (4i, 62) < (6,3)}.
The block index set B 2 is illustrated in Figure 5.5. The small circles represent 
the elements of the array data space S 2. Each region in the figure represents an 
element of the block index set B 2. ■
Each element of the array data space S n belongs to one block in the block 
index set B n.
Definition 5.3 (Element allocation function) The element allocation func­
tion is a function from S n to B n and can be expressed as
B : S n B n,B(i) =
i -  l 
m
Example 5.12 The element allocation function for the block index set B 2 illus­
trated in Figure 5.5 is
B : S2 ^  B 2,B((iu h)) (juh) ~  ( 1, 1)
(3,4)
The blocks need to be allocated to the extended abstract processor space.
Definition 5.4 (Block allocation function) The block allocation function is 
a function from the block index set B n to the extended abstract processor space 
Pn defined as
A  : B n —> P n:A(b) = b mod P ', 
where P ' is the size vector of Pn.
Example 5.13 The block allocation function for the block index set B 2 illus­
trated in Figure 5.5 is
A  : B 2 -> P2, A{(bu f>2)) =  (bu b2) mod (4,3). ■
§5.5 Explicit data distribution 83
o__oo__g
o oo o
o oo o
o oo oo o
o o
o__g
o oo o
o o
o o
o oo o
o oo o
o oo o
c_> LOo oo o
o oo o
o —e-■ e — a
§5.5 Explicit data distribution 84
Clearly, the function in Equation 5.2 is equivalent to the composition of the block 
allocation and the processor allocation functions
A o B  = A(B(i)) = A mod P' .
According to the block allocation function A, the blocks allocated to abstract 
processor p G Pn are described by the set
Bp = {b G B n I b e p .. B • b mod P' =  p},
where B  is
m
Example 5.14 Continuing Example 5.11, the array B is to be distributed onto 
the two-dimensional abstract processor array P(4,3). The blocks allocated to 
the abstract processor (1,0) are described by the set
£(i,o) — {b G B 2 I (öi, b2) G (1,0) .. (6, 3) • (6X, b2) mod (4,3) — (1, 0)}.
The set £(i,o) is represented by the shaded regions in Figure 5.5. ■
5 .5 .2  A llo ca tin g  a b stra ct p rocessors to  th e  ta rg e t  
m u ltico m p u ter
The abstract processor space P n must be allocated to the finite set of physical 
processors of the Fujitsu AP1000. The physical processors of the multicomputer 
are assumed to be organised as an n-dimensional mesh.
Definition 5.5 (Physical processor space) The n-dimensional physical pro­
cessor space is defined as
Cn = {c G I 0 < c < C  -  1},
where C  = (Ci, C2, . . . ,  Cn) is the size vector of the physical processor space, and 
1 < Ci< Pz (1 < i < n).
Example 5.15 Continuing Example 5.14, assume P2 is to be allocated to the 
two-dimensional physical processor space
C2 = {c G Z2 I (0, 0) < (cx, c2) < (2,1)}.
The size vector for C2 is C  = (3, 2). ■
Next, consider the ith. (1 < i < n) dimension of the n-dimensional abstract 
processor space which is to be allocated to C2 physical processors. A block layout
§5.5 Explicit data distribution 85
strategy divides the Pt abstract processors into N{ segments. These segments are 
then assigned to an array of Cz physical processors, with the j th segment assigned 
to the jth  physical processor. There are two cases to be considered when using 
this strategy.
In the first case where Pi is a multiple of Cu the Ci segments can be specified 
simply as
Pi Pi 2Pi
0 —  — 1 —  — -
"  Ci ’ Ci ' '  Ci - 1, . . .
{Ci -  1 )Pj 
Ci . . P i - 1.
Clearly, each segment is of size s = Pi/Ci.
In the second case where is not a multiple of Ci, there are two standard 
approaches. The first approach is to let the first p segments be of size s = \PijCi\ 
and the (p+l)th segment is of a smaller size s' =  Pi — p\Pl/Cf], where
Pi
\Pi/ Ci\
The second approach is to let the first p segments be of size s =  J +1, and
the remaining segments are of size s' = [Pi/Cl\.
P — P i  C i
Pi,
.Ci.
Pz mod Ci.
This second approach guarantees that the segments differ in size by at most 
one, and is adopted for allocating the abstract processor space to the physical 
processor space. It should be noted that the case when P* is a multiple of Ci is 
a special case of this more general approach.
Example 5.16 Consider an n-dimensional abstract processor space which is to 
mapped onto an n-dimensional physical processor space. Assume there are 37 
abstract processors in the ith (1 < i < n) dimension of the abstract processor 
space which are to mapped onto 6 physical processors. Figure 5.6(a) illustrates 
the layout using the first approach. Clearly, processor 5 does substantially less 
work than the other processors. The layout using the second approach is shown 
in Figure 5.6(b). ■
Definition 5.6 (Abstract processor allocation function) The allocation of 
the abstract processors to the physical processors is a function from the abstract 
processor space P n to the physical processor space Cn. The abstract processor 
allocation function, denoted ot, can be expressed as
a. : P n —» Cn, a(p) =  p  mod C.
This abstract processor allocation function implements the second layout strategy 
as illustrated in Figure 5.6(b).
§5.5 Explicit data distribution 86
Physical processor 0 1 2 3 4 5
Number of allocated abstract processors 7 7 7 7 7 2
(a) [37/6] = 7  abstract processors are allocated to the first five physical 
processors, and 37 — 5 [37/6] = 2 abstract processors are allocated to 
the last physical processor.
Physical processor 0 1 2 3 4 5
Number of allocated abstract processors 7 6 6 6 6 6
(b) [37/6J +  1 = 7  abstract processors are allocated to the first physical 
processor, and [37/6J =  6 abstract processors are allocated to the last 
five physical processors.
Figure 5.6: Two approaches to the problem of allocating the abstract pro­
cessors to the physical processors.
E xam ple 5.17 Consider the allocation P 2 onto C 2 from Example 5.15. The 
processor allocation function is
öl : P 2 -+ C2, a(p)  =  (pi,P2 ) mod (3,2).
According to this allocation function the abstract processor p  =  (3,1) is allocated 
to the physical processor c =  (0,1). Figure 5.7 illustrates the complete allocation 
of P 2 onto C2. The squares represent the abstract processors in P 2. The labels on 
these abstract processors indicate the corresponding physical processor belonging 
to C2. m
The set of all abstract processors allocated to the physical processor c  is called 
the domain of c, denoted Sc-
D efin ition  5.7 (D om ain  o f  c) For each physical processor c, its domain Sc is 
defined
Sc =  { p e  P n I a{p)  =  c}
=  {p e  P n \ p  € c .. P'  -  1 • p  mod C  =  c}.
E xam p le 5.18 Consider the physical processor c =  (0,0) from Example 5.15. 
The domain of this processor, denoted 5(0,op is
<W) =  {P € p2 I (PhPi) € (0,0) .. (3,2) • (pu p2) mod (3,2) =  (0,0)}.
§5.5 Explicit data distribution 87
( 0 ,0 ) ( 1 ,0 ) ( 2 ,0 ) ( 0 ,0 )
( 0 , 1 ) ( 1 , 1) ( 2 , 1) ( 0 , 1)
( 0 ,0 ) ( 1 ,0 ) ( 2 ,0 ) ( 0 ,0 )
0 1 2 3
Figure 5.7: The allocation of the abstract processor space P2 onto the physical 
processor space C2.
The domain 5(0,o) of the physical processor (0,0) is illustrated by the shaded 
squares in Figure 5.7. ■
The abstract processor allocation function allocates the abstract processor 
space P n to the physical processor space Cn. The use of this logical structure 
allows this function to be developed in a machine-independent manner, as long as 
the logical structure of the physical processor space Cn is supported by the actual 
target multicomputer. The logical structure of Cn can, in fact, be easily embed­
ded on most multicomputers. The remainder of this section derives a mapping 
to embed the physical processor space Cn onto the target multicomputer.
Each processor of the Fujitsu AP1000 is identified by a unique integer g, 
0 < q < n?=i(C< — 1), called a sequence number. In order to embed the logical 
structure of Cn onto the multicomputer, it is necessary to define the function T  
to map the sequence number n to the corresponding physical processor c £ Cn. 
Function T  is a function from Z to Cn. That is, T : Z —» Cn.
Definition 5.8 (Sequence number allocation function) Function T  can be 
formulated as follows
T  : Z —> Cn,T(q)  — (ci, C2 , . . . ,  cn),
where
Cz =
C f + l  C i - i f - 2  ■ • ■ C n
mod Ci, (1 < i < n).
Example 5.19 Consider the two-dimensional physical processor space from Ex­
ample 5.15. The size vector for C2 is C  =  (3,2). Function T  for this physical 
processor space can be formulated as
T{q) = (1_^ /2J mod 2, q mod 2).
§5.5 Explicit data distribution 88
S n B n A p n cn
Figure 5.8: The relationship between the element allocation function B, the 
block allocation function A and the abstract processor allocation function a . 
The composite of these mappings allocates array data to the physical processors.
Next, consider the processor with sequence number q — 3. The result of function 
T  applied to this sequence number is (1,1). Thus, the processor with sequence 
number q = 3 corresponds to the physical processor c = (1,1). ■
5.5.3 A llocating array data to physical processors
The abstract processor space P n has been allocated to the physical processor 
space Cn. The abstract processors allocated to physical processor c G Cn are 
described by the domain 6q (Definition 5.7). Therefore, the blocks belonging to 
the block index set Bn allocated to the physical processor c are described by the 
set
Bc = { b e  Bn \ 3 p  e 6c,b e p .. B  • b mod P' = p}.
Example 5.20 Consider the physical processor (0,0) from Example 5.18. The 
blocks belonging to the block index set B 2 allocated to physical processor (0, 0) 
are described by the set
£ (o,o) ~ {( i^> M  € B 2 I 3(pi, P2) G 5(0,0), (h,  M  € (pi, p2) • • (0, 3)
. (&l5 b2) mod (4,3) =  (pi,p2)}.
The elements of the array data space S n (Definition 3.1 page 32) encompassed 
by the set of blocks Be are described by the set Rq
Rc — {i € S n I 3 b G Be, j  G O . .m  — l » i  = m * b  + l + j } .
The set Rq represents the partition of the array specified by the distribute direct­
ive that is to be allocated to physical processor c. The operation of the allocation 
of array data to physical processors is captured by Figure 5.8. Operationally, Rq 
is the subset of S n mapped to a particular processor c by the composite map 
a  o A  o B.
§5.5 Explicit data distribution 89
Example 5.21 The elements of the array data space S2 encompassed by the set 
of blocks 5(0,0), from Example 5.20, are described by the set
5(o,o) — {{h, *2) G S2 I 3(6i, b2) G 5(o,o), ( juh)  G (0,0) .. (2,3)
• (®i, *2) = (36i + 1 + ji,462 + 1 + .72)}•
The n-dimensional array A, which is to be partitioned according to the 
DISTRIBUTE compiler directive, is stored in memory as a contiguous block of data. 
This block of data can be expressed as the linear array data space (Definition 3.2 
page 32) extended to account for the non-zero lower bound of array A
n
S = { i e Z \ 0 < i <  [uj — lj + 1) — 1}.
j=1
where lj and Uj respectively are the lower and upper bounds of the j th  (1 < j  < n) 
dimension of array A.
Example 5.22 The array B from the HPF code fragment shown in Figure 5.4 is 
placed in the block of data expressed as the linear array data space
S = {i e Z I 0 < z < 266}. ■
Each element of the array data space S n maps to one element in the linear 
array data space S. This mapping is represented by the array linearisation func­
tion (Definition 3.3 page 33) extended to account for the non-zero lower bound 
of array A
C : Sn S , C(i) =  d • (i — l).
The elements of S encompassed by Rc are described by the set R'c which can be 
formulated using function C as
Rq — G S I 3 i G Rq • s = £(z)}.
Example 5.23 The elements of S encompassed by 5(0,0) from Example 5.21 are 
described by
5(o,o) =  {s G S I 3 i G 5(o,o) • s — (1? 20) • (fi — 1, i2 — 1)}. ■
Explicit data partitioning utilises the distribute collective communication 
routine (see §3.3.2) to send each array partition Rq to each processor c belonging 
to the physical processor space Cn in a single operation.
The host computer executes the sdist communication primitive which broad­
casts a contiguous block of data to all processors in the physical processor space 
Cn. Consider physical processor c belonging to this processor space. This phys­
ical processor c executes the dist communication primitive. The elements of the
§5.5 Explicit data distribution 90
data S broadcast by the host computer which processor c receives are (Defini­
tion 3.6 page 38)
Dc = {5 € S I 3 j  G 0 . . C  -  1» s = b + s • j } ,
where s is the stride vector, c is the count vector and b is the base offset. The 
remainder of this section derives these parameters to the dist communication 
primitive to enable the physical processor c to receive the elements contained in 
R'c from the data S broadcast by the host computer.
The following shows that the distribute collective communication routine can 
be used to deliver the set R'c to physical processor c.
d ■ (m  * (Pb' + Cp -fi c) + j )
= d ■ (m  * Pb' + m  * Cp' + m  * c + j)
= d • (m * P' * b') + d • (m  * C  * p') +  d ■ (m * c) + d j
n n
= dimiP'iK + H  CiP'i + d-  (m  * c) + d j
i = l  i =  l
= (d * m  * P') ■ b' + (d * m  * C) • p' + d ■ j  + d • (m  * c) (5.4)
The elements of the data space S , broadcast by the host computer to all physical 
processors belonging to (7n, required by the physical processor c are described 
by the set
Rq — {s E 5 I 3 i E Rq • s — T(z)}
= {s e S \ 3b e Bc , j E 0 . . m - l » s  =  C(m  * b + Z + j )}
= {s e S \ 3b e Bc , j  E 0 .. m  — 1 • s = d • (m  * b + j )}
= {s e S \ 3b e p  .. B, j  6 0 .. m  — 1, p  e 6C* s = d • (m  *b + j)
A b mod P' — p }
C (by Equation 4.5 normalise b with b =  Pb' + p)
B  -  p
{s E S I 3 b' E 0 .. P' , j  E 0 .. m  -  1, p  E Sc
s = d ■ (m * (Pb' + p) + j ) }
= {s E S I 3b' E 0
B  -  p
, j  E 0 .. ra  — 1, p E c .. P' — 1
• s = d • (m  * (Pb' + p) + j )  A p  mod C  — c]
C (by Equation 4.5 normalise p with p = Cp' + c)
B - C p '  - c
{s e S \ 3b' e 0..
P' - l - c
P'
, j  E 0 . m  -  1,
p' E 0 ..
C
§5.5 Explicit data distribution 91
• 5 = d • (m  * (Pb' + Cp' + c) +  j)}  
=  (by Equation 5.4)
B -  Cp' - c
{s e S \ 3b' e o ..
P ' - l - c
P'
, j  G 0 . m  -  1,
p' € 0 c
• s = (d * m  * P') b' + (d * m  * C) ■ p' + d ■ j  + d ■ (m * c)} 
C (since [(B -  Cp' -  c)/P'J < [{B -  c ) / P' \ )
B - c
{s e S \ 3b' e o
P' , j  e 0 . m  -  1,
p' e o . P' - l - c
C
s = (d * m  * P') b' 4- (d * m  * C) p' +  d j  +  d • (m  * c)}
B - c
+  1 - 1 ,  j  E O . . m  -  1,=  {s e S I 3b' e o
p' e o . P'  -  1 -  c
C
P'
+ 1
s = (d * m  * P') • b' + (d * m  * C) • p' + d ■ j  + d • (m * c)}
Let b — d - (m * c), s be the 3n integer vector
S — ( d \  Tfl\ P y , • • • , d n TTLn P n, d \  TYl\ C\ , • • • , d n TTln Cn, d \ , . . . , ),
and c be the 3n integer vector
c = ( ^ i  ~  c i )
P[
(P'l-  1 -  Cl)
(Bn -  cn)
PL + I?
+  ! > • • • >
( p ; - i - c n)'
+  l , m i , . . . , m n )
The set a subset of R!'c which can be expressed in terms of the scalar 6, and 
the vectors s and c as
Rc e. Rq = {s e S I 3 j  e o . . c  — i »5  — b +  s • j }.
Clearly, the set R!'c is in a form that can be used with the dist communication 
primitive. Physical processor c receives the elements of S described by the set R'c 
from the elements broadcast by the host computer using the dist communication 
primitive with parameters 6, s and c derived above.
§5.5 Explicit data distribution 92
INTEGER 0DD(50), EVEN(50), IDENT(IOO), B(IOO)
!HPF$ PROCESSORS P(16)
! HPF$ DISTRIBUTE B(CYCLIC(3)) ONTO P 
!HPF$ ALIGN IDENT(i) WITH B(i)
!HPF$ ALIGN ODD(i) WITH B(2*i-1)
!HPF$ ALIGN EVEN(i) WITH B(2*i)
Figure 5.9: HPF code fragment involving the align and distribute directives.
Example 5.24 Continuing Example 5.23, the set i?[0 0) is a subset of Rj'0?0) which 
can be expressed as
R 'm  = {s E S I 3 6' £ (0,0) .. (2,2) — (1,1), j  € (0, 0) .. (3,4) — (1,1),
p'e (0,0)..(2,2)-(1,1)
. s = (12, 240) • b' + (9,160) • p' + (1, 20) • j '  +  0}
Let b = 0, s be the 6 integer vector s = (12,240,9,160,1,120) and c be the 6 
integer vector c = (2, 2, 2, 2,3, 4). The set R"0 can be expressed in terms of the 
scalar b and the integer vectors s and c as
R'(oo) — E S j 3 j  E O.. c — 1 • s =  0 -t- s • j} .
Thus, processor c = (0, 0) is able to receive the elements described by the set 
R'^oj using the dist communication primitive with the parameters 6, s, and c. ■
5 .5 .4  A llo c a t in g  aligned  array d ata  to  p hysica l  
p rocessors
Alignment determines which portions of twro or more arrays will be assigned to 
the same processor for a particular distribution. Consider an n-dimensional array 
A which is to be aligned with the n-dimensional array B. Recall from §5.2 the 
general form of the alignment directive is
!HPF$ ALIGN A { hWITH ß(/i(i01),
This alignment is referred to as an alignment for the source array A with respect 
to the target array B. The discussion in this section does not consider the use of 
replication in array alignment.
Example 5.25 Consider the HPF code fragment shown in Figure 5.9. There are 
three arrays ODD(50), EVEN(50) and IDENT(100). The elements of array ODD are 
aligned with the odd elements of array B. Similarly, the elements of array EVEN
§5.5 Explicit data distribution 93
are aligned with the even elements of array B. The elements of array IDENT are 
aligned identically with B. Hence, the elements of arrays ODD and EVEN are aligned 
with the odd and even elements of IDENT, respectively, because they are aligned 
to the same align target. ■
Let the source array be declared . . . ,  Zn:it„). This array defines the
n-dimensional source array data space, denoted S f,
S f = {i e Z n \ l < i < u},
where l = (Z1}. . . ,  ln) and u  =  (u1}. . . ,  un). Next, let the target array be declared 
B(l[:u[ , . . . ,  I'n'-u'n)- This array defines the n-dimensional target array data space, 
denoted 5 ”,
S? = { i e Z n \ l' < i <  u'}, 
where l' = (/(, and u' =  {u[ , . . . ,  < ) .
Exam ple 5.26 Consider the alignment of array ODD shown in Figure 5.9. The 
source array ODD defines the one-dimensional source array data space
Si = {i e Z I 1 < i < 50}.
The target array B defines the one-dimensional target array data space
Sl = { i e Z \ l < i <  100}. ■
Recall from §5.2 that each alignment function (1 < j  < n) is required
to be an affine function of ia (a; G {1, . . . ,  n}) of the form
fj (ia.j) Sj ißj T Oj.
The array alignment function for an aligned source array is an affine function 
which maps an element of the source array data space to its corresponding element 
in the target array data space.
Definition 5.9 (A rray alignm ent function) The array alignment function is 
a function from 5sn to S f ,
T :  S f  -> S? ,T (i) = S i + k ,
where S is an n x n integer matrix called the stride matrix and k is a n integer 
vector called the offset vector.
Exam ple 5.27 The stride matrix S and the offset vector k for the array ODD in 
Figure 5.9 are
S  =  (2) fc =  (-!)■
§5.5 Explicit data distribution 94
Therefore, the array alignment function for this array alignment can be formu­
lated as
f : S s1 ^ S t1,F (i)  =  (2)i + ( - l ) .  ■
The distribute collective communication routine (see §3.3.2) is used to parti­
tion the source array A onto the processors of the multicomputer in a way that 
guarantees that the elements of the source array A, aligned to the same element 
of the target array B , are mapped to the same processor. To achieve this map­
ping the source array data space Sj1 is extended to match the size of the target 
array data space 5}n, creating the extended source array Ae.
Assume, in each of the n dimensions the source array A is equal to or smaller 
than the target array B. If the source array A is equal to the target array B 
in the j th (1 < j  < n) dimension, then the size of the j th dimension of the 
extended source array A e is the size of the jth  dimension of the source array A. 
If the source array A is smaller than the target array B in the jth  (1 < j  < n) 
dimension, then the jth  dimension of the extended source array Ae is the size of 
the jth  dimension of the target array B.
The extended source array Ae defines the n-dimensional extended source array 
data space, denoted S'",
5" = {i e Zn I l" < i < u"},
where l" =  ( /" , . . . ,  /") and u" =  (u" , . . . ,  u”). Each l" and u” (1 < j  < n) are 
defined
lj , i f  =
-------
IIAT
uJ:
•vriisT• r-H
{ , i f  k < i f  Uj <  u'j
Many commercial compilers such as Dec MPP Fortran [92], CM-Fortran [123] 
and Cray MPP Fortran [91] extend the source array data space to the nearest 
power of two. However, it is only necessary to extend the source array data 
space to match the target array data space when using the distribute collective 
communication routine to partition the source array.
Exam ple 5 . 2 8  From Example 5.26 the source array ODD defines the one-dimen­
sional source array data space
S] = {i € Z I 1 < i < 50}.
The one-dimensional extended source array data space used for the alignment of 
array ODD shown in Figure 5.9 is
Sl = { i e Z \ l < i <  100}. ■
Next, each array reference to the source array A is transformed from a refer­
ence to an element in 5 ” to a reference to an element in 5". Clearly, the data
§5.5 Explicit data distribution 95
1
2
3
4
5
50
Source array
0 D D ( 1 )  
QDD(2) 
0DD(3) 
ODD(4) 
0DD(5)
0DD(50)
F(i)  = (2)i + ( -1 )
Sj =  {: € Z I 1 <  * < 50}
Extended source
99
100
0DD(1)
0DD(2)
0DD(3)
0DD(4)
0DD(5)
0DD(50)
array 
(^1) = 1
JF(2) =  3
^ ( 3 )  =  5
T { 4 )  =  7
^■(5) =  9
.F(50) =  99
S \  =  {* € Z  I 1 < t <  100}
Figure 5.10: The transformation of the source array data space to match the 
target array data space. The shaded regions represent dummy variables in the 
extended source array data space. These values will never be referenced.
space 5en is identical to the target array data space 5fn. Therefore, the array 
alignment function T  (Definition 5.9) is used to perform this transformation.
Exam ple 5.29 Figure 5.10 illustrates the transformation of the source array ODD 
from the source array data space S\ into the extended array data space 5* using 
the array alignment function from Example 5.27
F  '■ S \  S l , F ( i )  =  (2)t +  (—1).
The shaded regions in this figure represent dummy variables in the extended 
source array data space. Clearly, these values will never be referenced. ■
Finally, the extended source array is partitioned onto the same abstract pro­
cessor arrangement as the target array according the DISTRIBUTE directive given 
for the target array. The extended source array is partitioned using the technique 
developed in §5.5.3. This approach to the partitioning of aligned array data en­
sures that the elements of the source array A, aligned to the same element of the 
target array B. are mapped to the same processor.
6
E x p e rim e n ta l re su lts
The process and data partitioning techniques developed in this thesis have been 
implemented as part of an experimental parallelising compiler, named E c l ips e . 
E clipse  has been developed for the Fujitsu AP1000 multicomputer. This com­
piler accepts a restricted class of Fortran 77 programs (see §4.1) and a subset of 
the HPF compiler directives (see §5.1).
For a program of either class the compiler automatically determines the pro­
cess partitioning to be used for the computation, and the data partitioning to be 
used for all arrays declared in it, and yields a semantically equivalent extended 
Fortran [41, 42, 43] program that is tailored to the Fujitsu AP1000 multicom­
puter. Case studies are used to illustrate how the compiler implements both the 
data and process partitioning techniques developed in this thesis. The perform­
ance of the compiler-generated code is compared against hand-crafted versions 
on the Fujitsu AP1000 multicomputer. Results of these studies are presented in 
both graphical and tabular form.
The techniques developed in this thesis are not restricted merely to compiler 
application. If the automated element is removed from the compilation pro­
cess, and the data and process partitioning techniques are applied to programs 
manually, they provide a methodology for the systematic construction of paral­
lel programs. Typically, large complex scientific programs are unsuitable for the 
class of compiler of which E clipse  is a member. Therefore, manual application 
is necessary. Within this chapter this is illustrated by developing a parallel imple­
mentation of the National Centre for Atmospheric Research (NCAR) Community 
Climate Model version 1 (CCM1).
This chapter begins with a brief description of the experimental compiler. 
The performance study method is presented in §6.2. Reports Results for the case 
studies used to illustrate the experimental compiler are reported in §6.3. Finally,
96
§6.1 Structure of the experimental compiler 97
performance results for the parallel implementation of the NCAR Community 
Climate Model (CCM1) are presented in §6.4.
6.1 Structure o f th e  exp erim en ta l com piler
E c l i p s e  has been developed as an extension to the Parafrase-2 System [105] and, 
in keeping with the structure of Parafrase-2, is organised as a series of passes 
through the program and written in C (see §2.2.4). Figure 6.1 illustrates the 
components of the compiler. A source program is transformed through several 
phases into code suitable for compilation and execution on the Fujitsu AP1000. 
Collective communication routines are used to realise the data partitioning.
The compilation process is broken into six modules: (1) parsing; (2) de­
pendence analysis; (3) parallel loop detection; (4) task scheduling; (5) data 
partitioning; and (6) code generation. The internal representation of the program 
is constructed by Parafrase-2 and provides information such as the parsetree, data 
dependences and control flow. It also performs the data dependence analysis and 
parallel loop detection. E c l i p s e  requires more information about a program 
than Parafrase-2 does, such as the dope vector for each array declared in the 
program, the array access function for each array reference in the program, array 
read and write sets, and information pertaining to any HPF directives used. The 
internal program representation has been extended accordingly.
Figure 6.1 illustrates that for a syntactically correct Fortran 77/HPF pro­
gram, the first step of the compilation process is the generation of a parse tree. 
The parallel loop detection module analyses the data dependences in each do loop 
and (if possible) converts the sequential loop into a parallel doall loop. Since 
this output program has no information concerning the process distribution or 
data distribution it is referred to as a shared memory parallel program.
The task scheduling module tiles the parallel iteration space and allocates 
these tiles to the physical processors of the Fujitsu AP1000 multicomputer (see 
§4.2). According to the array referencing in the body of the tiled loops the data 
partitioning module determines a mapping of array data to physical processors 
(see §4.3). The data partitioning module also processes the HPF data distribution 
directives; namely, DISTRIBUTE, ALIGN and PROCESSORS. Using these directives it 
partitions the array data among the physical processors (see §5.5).
Finally, the code generator module produces code suitable for compilation 
and execution on the Fujitsu AP1000 multicomputer. Collective communication 
primitives are constructed to realise the data partitions determined by the data 
partitioning module.
The generated code is structured as distinct phases of collective communica­
tion and local computation. Collective communication includes the partitioning 
and retrieval of the array data. Local computation consists of operations by each 
physical processor on the data in its local memory. In such a model, the pro­
cessors do not need to synchronise during local computation. However, if two
§6.1 Structure of the experimental compiler 98
Partitioner
Shared Memory Parallel Program
'I
Task Sc 
ai
Data Pai
neduling
id
'titioning
Message Passing Program
Figure 6.1: Architecture of the ECLIPSE compiler system.
§ 6.2 Performance study method 99
or more processors participate in a collective communication routine they are 
implicitly synchronised by the communication primitive.
The final output of the compiler is an Fujitsu AP1000 Fortran host program 
and an Fujitsu AP1000 Fortran cell program. The cell program is executed on 
each physical processor of the Fujitsu AP1000 multicomputer.
To run the compiler, the user invokes the compiler with a Fortran 77/HPF 
source program. The user can compile a program step by step, and examine the 
output of each step. All the intermediate results can be directed to a log file.
6.2 P erform ance stu d y  m eth od
This section introduces the metrics that are used to measure the performance of 
the compiler-generated programs. For the purpose of the experiments reported 
in this chapter the average of these metrics across all processors is stated.
6 .2 .1  P erform an ce m easu rem en t
The performance of each compiler-generated program is measured by elapsed time 
and the two related parameters: speedup and efficiency.
Elapsed time is defined to be the total execution time of a compiler-generated 
program. The start point of the timing is the first executable statement in a cell 
program, and the end point of the timing is after the last executable statement.
An important measure of a multicomputer is the speedup factor associated 
with a particular program. Speedup is a measure that captures the relative benefit 
of solving a problem in parallel, rather than in sequence, and demonstrates how 
a program scales with the size of a parallel machine.
Although the sequential and parallel implementations of a problem are differ­
ent programs they are encompassed by a single family of programs, denoted 
Each family member ipn represents an implementation of the problem designed 
to execute on n processors. The family member ipi is taken to represent a one 
processor (or sequential) implementation of the problem. Speedup is defined as 
the ratio of the n processor execution time to the single processor execution time.
Let Ti denote the elapsed time of the program fa  (running on a single pro­
cessor), and Tn be the elapsed time of program fa  (running on n processors). 
The n processors used to obtain the parallel execution time are assumed to be 
identical to the single processor used to obtain the one processor execution time. 
Formally the speedup, denoted S , is defined
In an ideal case the speedup will be n. Such ideal behaviour is not achieved 
because while executing a parallel program it is not possible for the processors to 
devote 100 percent of their time to the computations of the program. Efficiency
§6.3 Automatic data partitioning results 100
is a measure of the fraction of time for which a processor of the multicomputer 
is usefully employed; it is defined as the ratio of speedup to the number of pro­
cessors.
For an ideal program executing on n processors, speedup is equal to n and 
efficiency is equal to one. In practice, speedup is less than n and efficiency is 
between zero and one, depending on the degree of effectiveness with which the 
processors are utilised. Formally efficiency, denoted E, is defined
6.2.2 Analysis and com parison
The goal of the performance study is to determine and indication of the effect­
iveness of the process partitioning and data partitioning techniques presented in 
this thesis.
In order to achieve this the compiler-generated programs are compared to 
hand-crafted code for several applications. These programs were written by 
independent experienced programmers. The performance of the hand-crafted pro­
grams was measured under the same set of parameters as the compiler-generated 
programs. The results show that the performance of the compiler-generated pro­
grams is within a factor of one or two of the hand-crafted code.
The experimental data was obtained using a 128 processor Fujitsu AP1000 
multicomputer, running under CELLOS 1.3, augmented with the collective com­
munication routines described in §3.3, with a Sun SPARCServer 4/390 host 
computer. Timings were taken using the Fujitsu AP1000 cell Fortran library 
routine cdtime [41] which retrieves the elapsed runtime in seconds since the start 
of execution of the cell program. The programs were compiled with the Sun 
Fortran compiler f77 version 1.3 [121] with no optimisations.
6.3 A u tom atic  data  p artition in g  resu lts
Results are presented on two Fortran programs of different complexity. The 
first program is a parallel matrix multiplication program based on the standard 
algorithm. The second program computes the Mandelbrot fractal by iterating on 
each point in the fractal.
6.3.1 M atrix m ultiplication
Matrix multiplication is an important operation in linear algebra. It forms part 
of the Level 3 Basic Linear Algebra Subprogram (BLAS3) [35, 36] widely used in 
many supercomputing applications. It is also used to implement more extensive 
linear algebra subroutine libraries, such as LINPACK and LAPACK. Because of
§6.3 Automatic data partitioning results 101
its various uses, such as these, matrix multiplication has become an important 
standard example and benchmark for parallel programming.
Matrix multiplication can be expressed with a simple computational proced­
ure and has large potential for parallelisation. The purpose is to illustrate the 
effectiveness of the process and data partitioning techniques developed in this 
thesis.
The source program computes the product of two n x n  matrices a and b using 
the standard matrix multiplication algorithm. The matrices are represented by 
two-dimensional arrays. Each element of the matrix product is defined
C(zj) öj • bj,
where a, represents the zth row of matrix a and bj represents the jth  column of 
matrix b. The computation of each single element in the matrix product requires 
the use of an entire row of matrix a and an entire column of matrix b.
The data matrices a and 6, and the product matrix c are stored in memory 
on the host computer. The host program generated by the compiler distributes 
the data matrices a and b to the cells, and collects the product matrix c from the 
cells at the completion of the computation. The cell program generated by the 
compiler consists of three phases: data distribution, local computation and data 
collection. The host and cell programs use collective communication routines 
(see §3.3) to partition matrix a by rows, matrix b by columns and to collect the 
product matrix c.
The purpose of iteration space tiling is to control the granularity of the par­
allel execution of a loop nest (see §4.2). In this implementation of the matrix 
multiplication problem there is no interprocessor communication. Hence, the size 
of each tile is not critical to the performance of this problem. Consequently the 
experimental compiler uses a tile size of one to tile the iteration space described 
by the loops in this problem.
Execution times, speedups and the ratio of execution times between hand­
crafted and compiler-generated code for different problem and machine sizes are 
shown in Figure 6.2. Figure 6.3 displays speedups and efficiency for the exper­
iments graphically, with these values plotted along the y-axis and number of 
processors plotted along the x-axis. Each line of the graph represents the spee­
dup and efficiency for a given problem size. The experimental data was obtained 
as follows: the matrix size was first fixed, and the same problem was then run on 
different numbers of processors, from 1 to 128.
The compiler-generated code achieved excellent speedups (114-126), even 
for small problem sizes. Figure 6.3(a) shows the speedup to be linear. The 
hand-crafted versions exhibited improvements over the compiler generated code 
(70%-90%), except when the problems were parallelised on many processors (32- 
128). On small numbers of processors (1-16) the matrix multiplication problem 
is dominated by computation. The computation in the hand-crafted versions is
§6.3 Automatic data partitioning results 102
Compiler-generated Hand-crafted T 'x n
Problem Size n Tn S V■L n S ' T n
100 x 100 1 69.30 1.00 4.21 1.00 0.06
2 34.17 2.03 3.05 1.38 0.08
4 17.03 4.07 2.61 1.61 0.15
8 8.79 7.88 2.47 1.70 0.28
16 4.40 15.75 2.56 1.65 0.58
32 2.20 31.46 2.39 1.76 1.09
64 1.17 59.06 2.40 1.75 2.05
_ 128 0.60 114.96 2.35 1.79 3.92
128 x 128 1 86.98 1.00 6.85 1.00 0.08
2 43.45 2.00 5.04 1.36 0.12
4 21.76 3.10 3.70 1.85 0.17
8 10.94 7.95 3.55 1.93 0.32
16 5.50 15.79 3.01 2.28 0.55
32 2.76 31.51 2.98 2.30 1.08
64 1.40 61.91 3.39 2.02 2.42
128 0.71 123.34 2.93 2.34 4.13
200 x 200 1 561.47 1.00 21.48 1.00 0.04
2 278.62 2.02 12.93 1.66 0.05
4 140.67 3.99 8.36 2.57 0.06
8 70.07 8.02 6.31 3.41 0.09
16 35.18 15.96 5.58 3.85 0.16
32 17.25 32.56 5.17 4.15 0.30
64 8.61 65.20 5.48 3.92 0.64
128 4.97 113.06 5.49 3.91 1.10
256 x 256 1 708.15 1.00 41.49 1.00 0.06
2 354.31 2.00 23.32 1.78 0.07
4 177.18 4.00 14.50 2.86 0.08
8 88.80 7.97 10.46 3.97 0.12
16 44.58 15.89 8.13 5.10 0.18
32 22.32 31.73 7.49 5.54 0.34
64 11.34 62.45 6.85 6.06 0.60
128 5.69 124.50 6.82 6.08 1.20
512 x 512 1 5662.69 1.00 297.80 1.00 0.05
2 2831.46 2.00 156.71 1.90 0.06
4 1416.42 4.00 84.20 3.54 0.06
8 708.93 7.99 48.55 6.13 0.07
16 355.25 15.94 32.12 9.27 0.09
32 177.67 31.87 23.65 12.59 0.13
64 89.24 63.46 19.49 15.28 0.22
128 44.88 126.17 17.32 17.20 0.39
Figure 6.2: Fujitsu AP1000 execution times for matrix multiplication in 
seconds.
§6.3 Automatic data partitioning results 103
1 128 
Processors
(a) Speedup.
100 x 100
128 X 128
200 x 200
256 x 256
512 x 512
1 128 
Processors
(b) Efficiency.
Figure 6.3: Performance of the compiler-generated matrix multiplication code.
§6.3 Automatic data partitioning results 104
highly optimised when compared to the compiler-generated code. This accounts 
for the comparatively slow performance of the compiler-generated code. On large 
numbers of processors (32-128) the communication involved in partitioning the 
matrices a, b and c, and collecting the result matrix c contributes to a large 
proportion of the execution time. These costs are significantly reduced through 
the use of the collective communication routines by the compiler-generated code. 
This is the reason the compiler-generated code improved when compared to the 
hand-crafted code for large numbers of processors. Figure 6.3(b) illustrates the 
effective utilisation of the processors achieved by the compiler-generated code.
Appendix E presents the compiler-generated matrix multiplication program.
6.3.2 Com putation of the M andelbrot fractal
The performance results of a parallel computation of the Mandelbrot fractal are 
reported in this section. A fractal is loosely described as a geometrical figure that 
consists of an identical motif repeating itself on an ever reduced scale. The pur­
pose is to illustrate the applicability of the data partitioning techniques developed 
in this thesis to operating on large data sets (up to 16 Mbyte).
Mandelbrot was responsible for substantial developments in the mathematical 
theory of a type of fractal generated by a so called iterative conformal transform­
ation (one that leaves angles unchanged [81]). The canonical example is provided 
by
/  9  9x = x — y + a,
V = 2 xy + b, (6.1)
where a and b are integers representing the real and imaginary parts of the 
complex number a 4- bi. The resulting fractals are known as Julia fractals [81], 
denoted J(a , b).
Julia fractals are divided into two classes—wholly disconnected or wholly 
connected. The integers from Equations 6.1 determine the class to which the 
Julia fractal J(a, b) belongs. The Mandelbrot fractal is a complex plane on which 
the point (a, b) is drawn if the resulting Julia fractal J(a , b) is wholly connected.
To determin Julia fractal J(a, b) is wholly connected it is only necessary to 
examine the orbit of the starting point xo = a and yo = b under the iterative 
process
xn+i = x2n - y 2n + a,
yn+1 2xnyn -t~ b. (6-2)
If this orbit goes to infinity then J(a , b) is wholly disconnected; otherwise J(a , b) 
is wholly connected. All points (a, b) for which the resulting Julia fractal J(a , b) 
is connected constitute the Mandelbrot fractal.
§6.3 Automatic data partitioning results 105
The points of the Mandelbrot fractal can be represented as a two-dimensional 
array M . Let the array M be declared MfLrq, l :u2)- The indexes i (1 < i < Ui) 
and j  (1 < j  < u2) of the array element M(i , j )  correspond to the pair of values 
(a, b) in the complex plane for which the resulting J(a,b)  needs to be tested 
for connectivity. Assume both the real and imaginary axis of the complex plane 
range from -2 to 2. The point (i, j) can be mapped to the point (a, b) in this 
complex plane using the transformation
a
b
2 ,
2 -
4
J — • 
U2
The test to determine if the orbit of the point (a, b) goes to infinity consists of 
a repeated application of the iterative process given in Equations 6.2 with a and 
b as initial values. If any point of the orbit calculated by the iterative process 
lies outside the circle defined by the equation
z 2 +  y1 =  4,
it is certain that the orbit goes to infinity. Therefore, the iterative process is 
terminated when a point of the orbit lies outside this circle. For an each starting 
point (a, b) the number of the iteration step at which a point of the orbit leaves 
this circle for the first time is determined, denoted k. The value k is assigned to 
the array element M(i , j )  which corresponds to the (a, b) in the complex plane. 
Since the value of k may be very large—for points within the Mandelbrot fractal it 
is infinite—the total number of iterations is limited. The value of k is interpreted 
as the colour of the point in the Mandelbrot fractal represented by the array 
element M(a, b). Therefore, the total number of iterations is limited to 256. The 
program is highly computation intensive, but also highly parallelisable.
The array M is stored on the host computer. The host program generated 
by the compiler distributes this array to the cells and collects the array from the 
cells at the completion of the computation. The cell program generated by the 
compiler computes the values of k for a partition of the array M  determined by 
the tile allocation (see §4.3). The host and cell programs use collective commu­
nication routines (see §3.3) to partition the array M and collect the resulting 
values of k assigned to each element of M . Finally, the host program graphically 
displays the computed Mandelbrot fractal.
As was the case with the matrix multiplication problem, there is no in­
terprocessor communication in this implementation of the computation of the 
Mandelbrot fractal. Hence, the size of each tile is not critical to the performance 
of this problem, and so the experimental compiler uses a tile size of one to tile 
the iteration space described by the loops in this problem.
§6.3 Automatic data partitioning results 106
Problem Size n
Compiler-generated 
T„ S
Hand-c
T 'n
rafted
5 '
T 'M n
T n
128 x 128 i 8.37 1.00 6.26 1.00 0.75
2 4.25 1.97 3.30 1.90 0.78
4 2.18 3.85 2.91 2.15 1.33
8 1.15 7.27 1.82 3.43 1.58
16 0.64 13.02 1.03 6.10 1.61
32 0.46 18.32 0.57 1 1 . 0 0 1.24
64 0.35 23.94 0.32 19.35 0.91
128 0.22. 37.36 0.51 12.29 2.32
256 x 256 1 33.52 1.00 25.12 1.00 0.75
2 16.92 1.98 12.70 1.98 0.75
4 8.64 3.88 11.48 2.19 1.33
8 4.48 7.49 7.14 3.52 1.59
16 2.42 13.68 3.94 6.38 1.63
32 1.38 24.35 2.07 12.15 1.50
64 0.99 33.76 1.12 22.33 1.13
o
o
CMt—H 0.62 54.06 0.63 40.16 1.02
512 x 512 1 134.08 1.00 99.75 1.00 0.74
2 67.68 1.98 50.09 1.99 0.74
4 34.33 3.91 45.38 2.20 1.32
8 17.67 7.59 28.15 3.54 1.59
16 9.38 14.29 15.45 6.46 1.65
32 5.37 24.97 8.00 12.46 1.49
64 3.17 42.34 4.16 24.00 1.31
128 2.20 60.91 2.18 45.78 0.99
1024 x 1024 1 536.96 1.00 399.12 1.00 0.74
2 270.17 1.99 202.00 1.98 0.75
4 137.46 3.91 179.63 2.22 1.31
I 8 70.45 7.62 110.94 3.60 1.57
16 38.05 14.11 60.82 6.56 1.60
32 20.85 25.76 31.40 12.72 1.51
64 12.08 44.46 16.10 24.79 1.33
128 7.35 73.06 8.86 45.02 1.06
2048 x 2048 1 2256.00 1.00 3574.40 1.00 1.58
2 1128.00 2.00 1787.20 2.00 1.58
4 556.21 4.06 893.60 4.00 1.61
8 281.28 8.02 446.94 8.00 1.59
16 148.92 15.15 244.38 14.63 1.64
32 82.07 27.49 125.46 28.49 1.53
64 47.86 47.14 63.98 55.87 1.34
128 29.80 75.70 32.69 109.34 1.03
F ig u re  6.4: Fujitsu AP1000 execution times for calculating the Mandelbrot 
fractal in seconds.
§6.3 Automatic data partitioning results 107
128 x 128 pixels 
256 x 256 pixels 
512 x 512 pixels 
1024 x 1024 pixels 
2048 x 2048 pixels
Processors
(a) Speedup.
128 x 128 pixels -O— 
256 x 256 pixels —|—  
512 x 512 pixels -Q— 
1024 x 1024 pixels -X— 
2048 x 2048 pixels -A-
Processors
(b) Efficiency.
Figure 6.5: Performance of the compiler-generated Mandelbrot fractal pro­
gram.
§6.4 NCAR Community Climate Model 108
Figure 6.4 contains timings for calculating the Mandelbrot fractal. The figure 
presents speedups and the ratio of the elapsed time between hand crafted code 
and compiler generate code. Speedup and efficiency for the experiments are 
shown graphically in Figure 6.5. The experimental data was obtained as follows: 
the size of the Mandelbrot fractal is first fixed, and the same problem is then run 
on different numbers of processors, from 1 to 128.
The compiler-generated code achieved good speedup (37-70). Figure 6.5(a) 
shows the speedup to be near linear. The hand-crafted versions of the Mandelbrot 
computation exhibited only slight improvements (1%—20%) over the compiler- 
generated code when the code was parallelised over a small number of processors 
(1-2). The compiler-generated code far excels the hand-crafted code in each of 
the other experiments as shown Figure 6.4. Figure 6.5(b) illustrates effective util­
isation of processors when the program was parallelised over a moderate number 
of processors (1-32).
6.4 N C A R  C om m unity  C lim ate M odel
Computer models of the atmosphere circulation are used both to predict weather 
and to study the mechanisms of global change. The NCAR CCM1 is such a 
model. It can be used for numerical weather prediction and climate simulation. 
The concept of utilising the one model for both forecasting and climate studies 
has potential scientific value since a major part of long-range (one to two week) 
forecast errors is due to the drift toward a model climate which differs from that 
of the transient atmospheric conditions. This section reports results for a parallel 
implementation of the CCM1 developed for the Fujitsu AP1000.
The CCM1 was developed by NCAR and appeared in its present form in 1987. 
The code of the CCM1 evolved from the NCAR CCMOB model which in turn was 
based on an adiabatic, inviscid version of the spectral model developed at the 
European Centre for Medium Range Weather Forecasts (ECMWF). The spectral 
aspects are based on the Australian spectral model developed by Bourke [17], 
and the basic structure of the ECMWF code has been retained.
The CCM1 code is written in Fortran and comprises 151 modules containing 
some 25 000 lines of code. The structure of the code is very complex. Com­
munication between the model’s modules is based on the use of common blocks. 
The model stores data in a large array in a blank common, known variously as 
buf, bu ffer or b, depending on the version of blank common being used [15]. 
Pointers then further divide this blank common into eight logical buffers. Due 
to the nature of this structure, the complexity of the loops and the experimental 
compiler’s inability to handle subroutines the CCM1 code is unsuitable for the 
experimental compiler. The implementation of the model reported in this section 
was parallelised, without the use of a compiler, by applying the process and data 
partitioning techniques developed in this thesis.
§6.4 NC AR Community Climate Model 109
The development of this parallel implementation demonstrates the suitability 
of this approach to partitioning data stored in a file system, as opposed to data 
stored in the memory of the host computer. Using the data and process parti­
tioning techniques to develop the parallel implementation of the CCM1 achieves 
good speedup relative to the predicted speedup derived from Amdahl’s law.
6 .4 .1  B r ie f  ov erv iew  o f  th e  m o d e l
The CCM1 is primarily a spectral model, meaning that the time integration is 
done in the spectral domain. The physics calculations are done in the grid point 
domain. Moisture is handled non-spectrally, using a semi-Lagrangian solver. The 
version for the present study is a spectral model with rhomboidal truncation at 
wavenumber 15 (R15) and a corresponding physical grid of 7.5 by 4.5 degrees, 
giving 48 vertical by 40 horizontal grid points. The CCM1 has 12 vertical levels.
The CCM1 numerically solves a closed set of time-dependent (dynamic) equa­
tions for the conservation of mass (both dry air and water vapour), momentum 
and energy. These equations are called prediction equations. The time integra­
tion of these prediction equations results in a set of prognostic variables—variables 
depending explicitly on time. These variables are horizontal winds, vertical mo­
tions, atmospheric temperatures, surface pressure and specific humidity. The 
CCM1 also outputs a number of diagnostic quantities which are specific functions 
of the prognostic variables. Among the more important are surface temperat­
ures, surface radiative and energy fluxes, upper atmosphere radiative fluxes and 
precipitation. Altogther, the CCM1 outputs 55 diagnostic variables covering a 
broad spectrum of physical information.
The algorithm for the climate model can be summarised as follows:
1. Compute the physics and non-linear interactions in the physical grid-space.
2. Convert the variables into spectral space.
3. Compute the tendencies, and update the variables (excluding moisture) to 
the latest time step.
4. Perform inverse transform of the variables to physical space.
5. Compute and update moisture in the physical space using the semi-Lagra­
ngian method and repeat steps 1-5.
6 .4 .2  P ara lle l im p lem en ta tio n  o f  th e  m o d el
On a sequential machine the solution of the model would be obtained by travers­
ing the entire domain of the CCM1. Recall from §4.2 that one common approach 
for dividing work between processors is to decompose the problem space into 
tiles (or subtasks) and allocate these tiles to the individual processors of the
§6.4 NCAR Community Climate Model 110
Latitudinal index
10
0
18 24 30 36
Longitudinal index
48
Processor 0 
Processor 1 
Processor 2 
Processor 3
Figure 6.6: The CCM1 domain decomposed over a 2x2 grid of processors. Two 
processors decompose the latitudinal dimension and two processors decompose 
the longitudinal dimension. The latitudinal dimension is decomposed so that 
latitudes about the equator are paired on a processor, a property that simplifies 
implementation of the parallel spectral transformation.
multicomputer. The processors then execute these tiles and communicate with 
each other when necessary. Decomposition can be either by latitude or longitude 
alone (one-dimensional decomposition) or by latitude and longitude combined 
(two-dimensional decomposition).
This methodology has been employed for the parallel implementation of the 
CCM1. The Fujitsu API000 is viewed as a logical two-dimensional processor 
grid. The grid-point domain is patch decomposed over the processors in both 
the latitudinal and longitudinal dimensions, with the added constraint that each 
processor has both the northern and corresponding southern latitudes. Latitudes 
that are symmetric about the equator are paired on each processor to simplify 
the parallel implementation of the spectral transform algorithm.
Exam ple 6.1 Figure 6.6 illustrates the decomposition of the CCM1 over four 
processors. ■
The fast Fourier transformation (FFT) and the time evolution in spectral space 
are decomposed in terms of latitude. The latitudinal and longitudinal decompos­
ition of the model is focussed on the scalability of the parallel implementation, 
workload distribution and communication.
The latitudinal decomposition of the workload dominates the parallelisation 
of the code. To keep the original logic of the code intact the number of latitudinal
§6.4 NCAR Community Climate Model 111
divisions may not be greater than half the number of latitudes (at present resol­
ution up to 20 processors can be employed in the latitudinal direction). Further, 
the number of latitudes on a processor must be even. This restriction is due to 
the storage scheme for the latitude records in the model data buffer where the 
northernmost latitude occurs first, followed by the southernmost, followed by the 
next northernmost and so on. Northern and southern latitudes are paired on 
processors to take advantage of the symmetry in the spectral transform. The 
latitudinal decomposition provides a very effective coarse grained parallelisation. 
The large computational modules are distributed between the processors. The 
parallelisation commences at the very beginning of the time evolution. The con­
tiguous blocks are statically distributed between the processors.
The longitudinal decomposition is a fine grained division of the workload. 
The parallelisation is at the do loop level, and only mathematical expressions 
are distributed between processors. All of the longitudinal do loops have their 
range reduced according to the number of processors in the longitudinal direction 
using the process partitioning techniques presented in §4.2. The parallelisation 
of the model in the longitudinal direction invokes a much greater communication 
overhead than the latitudinal parallelisation.
The input data sets for the parallel implementation of the CCM1 are staged 
from the host computer to disks local to each processor. This staging is necessary 
because the size of these data sets is such that they cannot be stored in the 
memory of the host computer. Clearly, these data sets need to be partitioned for 
the parallel implementation of CCM1. The way in which data is allocated to the 
processors involved in the computation is derived from the decomposition of the 
grid-point domain described earlier in this section. This approach is the same as 
the one developed in chapter 4. Each processor reads input data from the data 
sets by applying the algorithm for the dist communication primitive (see §3.3.2) 
to a file (as opposed to data sent from the host computer).
6.4.3 A nalysis of th e  para lle l m odel
Figure 6.7 presents the elapsed time of the parallel implementation of the 
CCM1 on the Fujitsu AP1000. The experimental data was obtained by executing 
the CCM1 for one model day (48 time steps) on different numbers of processors: 
from 1 to 120. Speedup and efficiency of the parallel model are illustrated graph­
ically in Figure 6.8.
The necessity of manual application of the process and data partitioning tech­
niques renders a comparison with hand-crafted versions irrelevant. Hence, the 
performance of the parallel implementation of the model is compared with the 
predicted performance derived from a two-dimensional version of Amdahl’s law.
§6.4 NC AR Community Climate Model 112
n
Initial model 
Tn S
Optimised model 
Tn S'
T'x n
Tn
1 3330.51 1.00 3330.51 1.00 1.00
10 442.26 7.53 365.94 9.10 0.83
20 305.81 10.89 213.49 15.60 0.70
30 245.52 13.57 147.36 22.60 0.60
40 228.11 14.60 132.26 25.18 0.58
60 217.34 15.32 126.63 26.30 0.58
120 210.81 15.80 125.20 26.60 0.59
Figure 6.7: Fujitsu AP 1000 execution times of the parallel implementation of 
the NCAR CCM1 in seconds.
The predicted speedup of the current implementation of the parallel model 
is illustrated by the graph in Figure 6.8(a). These figures are based on profiling 
results and the two-dimensional version of Amdahl’s law
S =
fx + NrN„
where Nx is the number of processors in the latitudinal decomposition; Ny is the 
number of processors in the longitudinal decomposition; f x is the fraction of the 
total execution time spent performing operations that are not partitioned along 
the latitudinal directional; and f y is the fraction of the total execution time spent 
performing operations that are not partitioned along the longitudinal direction. 
The values of f x and f y were determined by extensive profiling of the original 
sequential code of the CCM1. The sections of code that are not partitioned along 
the latitudinal direction account for 1.5% of the total sequential execution time 
of the CCM1. The sections of code that are not partitioned along the longitudinal 
direction account for 10.7% of the total sequential execution time of the CCM1.
The speedup of the parallel implementation of the model are marked by o 
on the graph in Figure 6.8(a). The graph illustrates that for a one-dimensional 
decomposition of the workload (less than 20 processors involved in the compu­
tation), the speedup measured is very close to the theoretical curve predicted 
by Amdahl’s Law. As the workload is spread across a two-dimensional network 
(more than 40 processors involved in the computation), the costs of the inter­
processor communication in the longitudinal direction tend to limit the gains 
from this decomposition of workload. The incorporation of an efficient global 
summation operation and a systolic communication scheme in the longitudinal 
direction lowered these costs and thus enhanced the performance of the parallel 
model. The benefit of these optimisations is illustrated by A on the graph in
§6.4 NCAR Community Climate Model 113
Ideal linear case —I—  
Predicted speedup -X— 
Parallel model -O— 
Optimised parallel model A -
120
Processors
(a) Speedup.
Parallel model -O— 
Optimised parallel model -A—
120
Processors
(b) Efficiency.
F igure 6.8: Performance of the parallel implementation of the NCAR CCM1 
for one model day (48 time steps).
§6.4 NCAR Community Climate Model 114
Figure 6.8(a) by the difference between the plot of the parallel model speedup 
and the optimised parallel model speedup.
Conclusions
This thesis has examined the problems of process and data partitioning when 
compiling programs for a multicomputer, with the goal of determining these auto­
matically or implementing them as specified by the user. Appropriate collective 
communication routines have been developed to provide the required communic­
ation. In using these routines the communication associated with partitioning 
data on a multicomputer is greatly simplified.
Avenues of future research are identified in §7.1. The contributions made by 
this thesis to the development of compilers for multicomputers are discussed in 
§7.2.
7.1 Future research
The collective communication routines presented provide support for a diverse 
range of data partitions in the context of multicomputers. The operation of 
these routines could be extended in two directions.
Firstly, these routines are restricted to handling data stored in contiguous 
block of addressable memory. Clearly data stored in a contiguous part of a file 
system is analogous to data stored in a contiguous block of memory. Hence, the 
semantics of these operations are applicable to data stored in a file system. This 
has been demonstrated in the parallel implementation of the NCAR CCM1 (see 
§6.4). It is relatively straight forward to enhance standard file access routines 
with these semantics.
Secondly, the implementation of the collective communication routines relies 
on the regularity of location of the data items. A useful future development 
would be the ability of the collective communication routines to handle irregularly
115
§7.2 The contributions of this thesis 116
located data items. It should be noted that the data partitioning techniques 
presented in this thesis do not apply to irregular data structures.
*  *  *
The implicit data partitioning in this thesis deals with a perfectly nested loop 
nest with a single array assignment statement as the body. The most obvious 
avenue of future research is the extension of this loop structure to incorporate 
multiple array assignment statements into the body of the loop nest. When an 
array is referenced in many assignment statements it is possible in principle to 
merge each of the array read sets into a single array read set. By extending the 
implicit data partitioning techniques it should then be possible to use this single 
set in conjunction with the collective communication routines.
This thesis has shown that by restricting the arrays references in a perfectly 
nested loop nest to be regular and static, the compiler is able to compute the 
necessary information to automatically partition these arrays. For applications 
that need to use irregular data structures compile time analysis alone is no longer 
sufficient. Various run time techniques have been developed to handle dynamic 
problems [116, 70]. More work needs to be done to see how to combine both 
compile time analysis and run time techniques into a coherent framework.
*  *  *
The HPF language allows the user to dynamically remap array data during the 
execution of a program. This function could be implemented using the collective 
communication routines concatenate and distribute one after the other. However, 
techniques need to be developed to achieve this in an efficient manner.
7.2 The contributions of this thesis
Stride message transfer is a new message handling mode for the Fujitsu AP1000, 
and is presented in chapter 3. It provides basic support for building powerful 
communication primitives through the selection of individual data items from a 
message buffer and the transfer of these to a contiguous area of a processor’s 
local memory, or vice-versa. Communication primitives which utilise this stride 
message transfer mode can simplify the communication associated with data par­
titioning. Additionally, they can be used in a compiler in the context of automatic 
data partitioning.
*  *  *
Collective communication is defined to be communication that involves a 
group of processors. Two communication routines belonging to this class are 
presented in chapter 3: distribute and concatenate. These routines are developed 
as multidimensional abstractions (in both the data space and processor space) of 
stride message transfer.
§7.2 The contributions of this thesis 117
Using the distribute collective communication routine it is possible for a single 
source processor to send a different partition of an array to each processor be­
longing to the processor group. An array partitioned over a processor group can 
be reassembled into its original form, and stored in a single destination processor, 
using the concatenate collective communication routine.
The inherent flexibility of these routines allows them to be utilised in a 
compiler to implement a diverse range of data partitioning schemes. One such 
compiler is presented in this thesis. Because of the inherent flexibility of these 
routines it is irrelevant whether these are determined automatically in the com­
piler or specified by the user.
*  *  *
The implicit data partitioning technique presented in chapter 4 is a novel 
method of partitioning data on a multicomputer. Each processor of a multicom­
puter requires data to perform the computations allocated to it. Implicit data 
partitioning determines how the data is to be distributed from the distributions 
of these computations. Collective communication routines simplify significantly 
the communication associated with distributing data in this manner.
The implicit data partitioning technique distributes the iterations of a per­
fectly nested loop, by tiling the iteration space defined by these loops. The 
nature of this distribution is used to automatically determine the distribution of 
the arrays referenced within these loops.
*  *  *
High Performance Fortran provides a powerful set of compiler directives which 
allow the user to advise the compiler how to assign array data to an abstract 
processor arrangement. The user defines the structure of this arrangement and 
how the array data is to be partitioned onto these processors. The compiler 
then assigns the array data to the physical processors of a multicomputer using 
a mapping defined by the specification of HPF.
Chapter 5 developed a new methodology for analysing this mapping, referred 
to as explicit data distribution. It provides an elegant formulation of and a 
practical solution to the problem of implementing data partitions specified by 
the user.
Explicit data distribution has been used to implement, in the context of a 
compiler, the partitioning of array data onto the physical processors of a mul­
ticomputer as specified by the user via the HPF compiler directives. Explicit 
data distribution utilises collective communication routines to simplify the com­
munication associated with implementing the data partitions as specified by the 
user.
*  * *
The process and data partitioning techniques developed in this thesis have 
been incorporated into an existing experimental compiler. The compilation model
§7.2 The contributions of this thesis 118
is based upon program transformations and user directives, via a subset of the 
HPF compiler directives. Although this design has been developed for the Fujitsu 
A PI000 it could potentially be adapted to suit different multicomputers.
The implementation of the experimental compiler demonstrates that the pro­
cess and data partitioning techniques are practically sound. This is supported by 
a series of case studies presented in chapter 6.
A ppend ices
119
AM em o ry  space a lloca tion  func tion
This appendix presents a proof that demonstrates the memory space allocation 
function (Definition 3.5 page 34) is one-to-one.
The function is defined
M  : M m —> S, M( j )  =  b + s - j
where b > 0 is a given integer, and the vector s = (s1} s2, . . . ,  sm) where
Sj_iCj_i + (Ti, if 2 < j  < n;
Sj =  <
1 + 5i5 j  — i;
as illustrated in Figure 3.9 (page 34). The m-dimensional space M m is defined 
(Definition 3.4 page 33)
M m = { j  e z m I o < j  < c -  l},
where c is the m integer vector (ci, C2 , . . . ,  cm).
Consider the following family of functions
M l ( j l) =  b + diji 0 < j i < c i - l
M 2{j2) =  b + diji +  d2j2 0 < j i < C i - l  (1 < i <  2)
A4171 ( j m) = b 4- d\j\ + d2j2 + •.. + dmjm 0 < ji < c* — 1 (1 < i < m)
where j n is used to represent the n vector (ji, j2, . . . ,  jn).
120
§A Memory space allocation function 121
Lem m a A .l
b < (A.l)
M m( j m) < b + dn+i (A.2)
Proof: Clearly di > 0 (1 < i < m). Hence Statement A.l follows.
Statement A.2 is proved by induction. Denote the predicate
P(m) = ( M m(jm) < b + dm+1)
P( l ) :
M ^ j 1) -  b + di j1
^ b +  d i ( c \  — l )
< b +  d i ( c \  — 1) 4- d\ +  Si 
— b -f- d\ C\ + Si 
== b 4- C?2
P(n)  —> P{n 4- 1) :
M n+l{jn+l) = 
< 
<
<
dn+ljn+l +
dn+i(cn+i — 1) + M n{j)) 
(by Predicate P(n)) 
dn+i{cn+i — 1) + dn+1 +  b
dn+lCn+l +  b
dn+l C n + i  +  Sn + 1 4- b 
dn + 2 +  b
Hence Statement A.2 follows. □
T heorem  A .l
is one-to-one.
Proof: Theorem A.l is proved by induction. Consider the predicate
P(m) = ((M m( j m) -  M m(km)) =$► (j m = k m))
P( 1):
M ' i j 1) = M 1(k1) 
^  b 4- d\j^ — b 4- d\k}
= > 3 1 =  f c 1
§A Memory space allocation function 122
P{n)  — >■ P{n  4- 1)
M n+l y n + l )  =  ^ n + i ^ n + i )
^ n + l in + 1  +  M n( j n) — dn+ikn+i +  Ml n( k n)
Without loss of generality, consider j n+1 > fcn+1.
4 + 1 Ü H + 1  -  * n + i )  =  A T  (fcn ) -  A 4 n ( i n ) 
Assume jn+1 > kn+1. In Equation A.3 
lhs > dn+1
rhs < A4n(kn) — b by Statement A.l
<  b 4- dn+i — b by Statement A.2
dn+l
This is a contradiction. Hence jn+\ =  &n+i and
M n( j n) =  M n{kn).
From Equation A.3 and the Predicate P(n)
Hence
j n = k n.
j U+1 = kn+l
(A.3)
(A.4)
□
BC o m m u n ica tio n  o p e ra tio n s  
specification
In order to define precisely the collective communication routines developed in 
this thesis a formal specification of these routines is presented in the specification 
notation Z [120].
B . l  D istr ib u te  com m unication  rou tin e
In this account of the distribute collective communication routine, it is necessary 
to deal with a sequence of bytes broadcast by the source processor to all processors 
belonging to a processor group. For present purposes, it does not matter what 
form these bytes take, so the set of all bytes is introduced as a given set of the 
specification:
[BYTE]
The first aspect of the communication primitive to describe is the array of struc­
tures t representing the m integer vector s and the m integer vector c. Each 
element of t consists of positive integer s and natural number c. An element of 
this array is represented by the schema template:
_template________________________________________________
5  : N i  
c : N
The length of this array is modelled by the positive integer dim:
123
§B.l Distribute communication routine 124
dim : N i
The Z primitive 1, extract, operates on a set of indices U and a sequence 
s. The operation U ) s defines a new sequence s' that contains exactly those 
elements of s that appear at an index given in U, in the same order as in s. The 
Z primitive concatenation, operates on sequences s and t. The operation s ^ t  
is the concatenation of s and t. It contains the elements of s followed by the 
elements of t.
The dist collective communication primitive is modelled by two mutually re­
cursive functions distribute and extract. The function distribute is applied to a 
sequence buf of BYTE from the source processor, a sequence t of length dim of 
template, the dimension d, and a base offset b. In the case, when the dimension 
d is one, the function distribute results in a sequence whose elements are selec­
ted from the input sequence. The elements selected are those whose indices are 
dimension one s apart starting at b where dimension one c elements are selected. 
In the general case, when the dimension d is not one, the result of distribute is the 
result of the function extract applied to the c belonging to the current dimension 
of t.
The function extract is applied to a sequence buf of BYTE from the source 
processor, a sequence t of length dim of template, the dimension d, a count n , 
and a base offset b. In the case, when the count n is zero, the function extract 
results in the empty sequence. In the general case, when the count n is not 
zero, extract returns the concatenation of extract applied to a decremented c and 
distribute applied to the front of the current sequence template (denoted by d—1) 
and the appropriate base offset. The function extract applies distribute, c times 
with base offsets
6, 6-l-s, 6-t-2s,. . . ,  6-f-(c—l)s
for a s and a c belonging to dimension d of the sequence t of template.
distribute : seq BYTE  -+* seqdlTn template -+» Ni -+* N -+* seqBYTE  
extract : seq BYTE  -+» seqdzm template -+» Ni -+» N -+» N -+» seq BYTE
(V buf : seq BYTE] t : seqdim template; d : 1 . .  dim; b : N • 
d =  1 ==> distribute buf t d b =
{ i  : N I i mod t ( l ) . s = b mod t ( l ) . s  
b < i < b + (t( l ).c — 1) x t ( l ) . s}  1 buf 
d /  1 => distribute buf t d b =  extract buf t d t(d).c b)
(V buf : seq BYTE; t : seqdim template; d : 1 .. dim; n : N; b : N • 
n = 0 => extract buf t d nb = {} 
n /  0 => extract buf t d n b = 
extract buf t d (n—1) b ^  
distribute buf t (d—1) (b -f (rz — 1) x t(d).s))
§B.2 Concatenate communication routine 125
B .2  C on caten ate  com m unication  routine
The Z specification in §B. 1 is extended to include the dconcat communication 
primitive.
The first aspect of the communication primitive to describe is the structured 
message sent by each processor belonging to the processor group from which the 
array is being gathered to the destination processor. Each structured message 
consists of positive integer b and a sequence t of template of length dim. The 
array partition sent by a processor to the destination processor is modelled as 
block, a sequence of sequences of BYTE. Each of the component sequences that 
constitute block are of length c, where c is defined in the first template of t. The 
number of these component sequences in block is the product of the remaining c 
in t. A structured message is represented by the schema msg:
_msq____________________________________________________
b :
t : seqd-m template 
block : seqseq.BTT'E
(V i : (1 .. #block • #block(i) = t(l).c)
#block =  xfj^t(i).c
The number of processor in the processor group is modelled by the positive integer 
procs:
procs : Nx
The Z primitive ®, overriding, operates on two relations Q and R. The 
relation Q ® R relates everything in the domain of R to the same objects as R 
does, and everything else in the domain of Q to the same objects as Q does.
The insertion of a single message into the receive buffer of the destination 
processor is modelled by two mutually recursive functions insert and place. The 
function insert is applied to a message m, a dimension d of t the sequence of 
template, an index of a component sequence of block /, and an offset b. In the 
case, when the dimension d is one, the function insert results in a mapping from 
indices to bytes. The indices specify a location in receive buffer for each element 
of a component sequence of block. The indices are dimension one s apart starting 
at an offset b where dimension one c indices are generated. In the general case, 
when the dimension d is not one, the result of insert is the result of the function 
place applied to the c belonging to the current dimension.
The function place is applied to a message m, a dimension d of t the sequence 
of template, a count n, an index of a component sequence of block /, and a base 
offset b. In the case, when the count n is zero, the function place results in 
the empty sequence. In the general case, when the count n is not zero, place
§B.2 Concatenate communication routine 126
returns the result of overriding place applied to a decremented count and insert 
applied to the front of the current sequence template (denoted by d—1) and the 
appropriate component sequence of block and offset b. The function place applies 
insert, c times with offsets
6, b-\-2 ,. . . ,  6t (c—1)5
for s and c belonging to dimension d of t the sequence of template.
insert : msg -+» Ni -+> N -+>• Ni -4* (Ni -+* BYTE)
place : msg -+> Ni -+> N -4* Ni ■+> Ni (Ni -+>• BYTE)
(V m : msg; d : 1 .. dzm; / : ; b : N
/ < Jfm.block 
d = l => insert m d l b =
{k : Ni I 1 < k < m .t(l).c • 
b + {{k—1) x m .t(l).s) m.block(l)(k)} 
d /  1 =$■ insert m d l b = place m d m.t(d).c l b)
(V m : msg; d : 1 .. dzm; n : N; l : Ni; b : N • 
l < # m .block
n = 0 => place m d n l b = {}
place m d n ib  = place m d (n—1) l b ® 
insert m (d — 1) (n + (/—1) x m.t(d).c)(b + (n—1) x m.t(d).s))
The dconcat communication primitive is modelled by the recursive function 
concatenate. This function is applied to a sequence of messages q, a processor 
number n and the receive buffer of the destination processor rbuf. In the case, 
when the processor number n is zero, the function concatenate results in an 
unchanged receive buffer rbuf. In the general case, when the processor number 
n is not zero, the result of concatenate is the result of overriding concatenate 
applied to the processor (n—1) with the function insert applied to the message 
sent by processor n.
concatenate : seqprocs msg -+>• Ni -+»• seq BYTE  -+>• seq BYTE
(V q : seqpr0CS msg; n : 1 .. procs; rbuf : seq BYTE • 
n — 0 => concatenate q n rbuf = rbuf 
n /  0 => concatenate q n rbuf =
concatenate q (n—1) rbuf ® insert q(n) dim 1 q(n).b)
c
P se u d o co d e  d esc rip tio n
Pseudocode has been used in chapter 3 to present the structure of the collective 
communication routines concisely and clearly. This pseudocode is intended to be 
representative of high-level languages such as C and Fortran.
In describing pseudocode the equivalent expression in C and Fortran77 are 
given. Pseudocode is not intended to be a complete language in that not every 
construct of C and Fortran has a strict pseudocode equivalent. The constructs 
of the pseudocode described in this appendix are those used in chapter 3.
C .l  D ec la ra tio n  of variables
The data type of a variable is declared at the beginning of a routine in which it 
is used. Pseudocode permits four types of variables to be declared.
C .1 .1  In teg ers
Integers are declared with the following syntax:
declare_int variable;
This has a direct equivalent in C: 
in t  v a riab le ;
The Fortran77 equivalent is also straightforward: 
in teger variab le
127
§c.l Declaration of variables 128
C.1.2 Single dim ension arrays
Arrays are declared with the following syntax:
declare_array variable type [,size]:; 
where is type is valid. This has a direct equivalent in C: 
type v a r ia b le [0 :s iz e -1 ] ;
In Fortran77 this is declared as: 
type v a riab le (s ize )
C.1.3 Structures
A data structure is a special kind of array in which the subunits contained in the 
array are explicitly listed. The syntax is as follows:
declare_struct { list of declarations };
Pseudocode data structures are directly equivalent to structures in C.
Example C .l Consider the following data structure taken from Figure 3.12:
declare_struct { declare_int s;
declare_int c; };
This data structure can be represented in C as:
s tru c t { in t s ;
in t c; } x;
The equivalent data structure in Fortran77 can be represented with a common 
block:
in teger s, c
common /x /s ,  c ■
C.1.4 Buffers
In pseudocode a buffer is a general way of referring to a set of contiguous memory 
locations. The syntax of declaring a buffer is:
declare_buf buffer-name;
There is no single equivalent in C or Fortran77 to a pseudocode buffer. In C 
a buffer may be referenced by means of a pointer to the location in memory of 
the start of the buffer. In Fortran77, however, a buffer is referenced simply by a 
variable name.
§ C-2 Loops 129
E xam ple C .2  Consider the pseudocode declaration taken from Figure 3.12: 
declare_buf lbuf;
This buffer could be represented in C by any of the  following:
i n t  lb u f  [] ; 
f l o a t  lb u f  [1 2 ]; 
c h a r  * lb u f ;
Fortran77 equivalents include:
in t e g e r  lb u f O )
d oub le  p r e c i s io n  l b u f (12)
c h a ra c te r* 8  lb u f  ■
C.2 Loops
Only a for  loop is defined in pseudocode. The syntax of a for loop is as follows:
for_begin specification of num ber of loops 
statement
for_end
The pseudocode for a for loop corresponds directly to  the C for loop.
E xam ple C.3 Consider the following loop taken from Figure 3.6:
for_begin j  — 1 to  hcnt 
statement
for_end
The equivalent for loop in C is:
f o r  ( j  = 1; j  <= h c n t ;  j++) 
statement
A for loop can be w ritten  in Fortran77 as a do loop:
do 10 j  = 1, h c n t
statement
10 c o n tin u e  ■
§ C.3 Conditional Statements 130
C.3 C onditional S ta tem en ts
In pseudocode conditional branching is provided by the following construct:
iflb eg in  ( expression ) then  
statement 1
else
statement 2
end_if
This construct causes statement 1 to be executed if expression is true otherwise 
statement 2 is executed. The equivalent constructs in C and Fortran77 are similar.
E xam ple C .4  Consider the following conditional statem ent taken from Fig­
ure 3.12:
iflb eg in  ( / =  1 ) th en
statement 1
else
statement 2
iflen d
The equivalent conditional in C is:
i f  (1 == 1) { 
statement 1 
} e ls e  {
statement 2
}
The conditional can be expressed in Fortran77 as:
i f  ( l . e q . l )  then
statement 1
e ls e
statement 2
end i f  ■
C.4 D eclaring  rou tines
W hen defining the communication primitives in chapter 3 the argument list and 
data  type of each routine are given explicitly. The general form of a declaration 
is
func name ( list of arguments )
§C.4 Declaring routines 131
Example C.5 Consider the following declaration taken from Figure 3.15: 
func dconcat (rbuf, m)
The equivalent declaration in C is:
(void) dconcat (rbuf, m)
The procedure declaration in Fortran77 can be expressed as:
subroutine dconcat (rbuf, m) ■
S y n tax  of th e  H PF d irec tives
This appendix presents the syntax of the DISTRIBUTE, ALIGN and PROCESSORS 
directives discussed in chapter 5. The descriptions of the syntax are taken from 
the HPF language specification [53] and presented for completeness.
D .l  T he d is tr ib u te  d irective
The DISTRIBUTE directive specifies a mapping of data objects to abstract pro­
cessors in a processor arrangement.
Exam ple D .l The following DISTRIBUTE directive
REAL SALAMI(10000)
!HPF$ DISTRIBUTE SALAMI(BLOCK)
specifies that the array SALAMI should be distributed across some set of ab­
stract processors by slicing it uniformly into blocks of contiguous elements. If 
there are 50 processors, the directive implies that the array should be divided 
into groups of 200 elements, with SALAMI (1:200) mapped to the first processor, 
SALAMI (201:400) mapped to the second processor, and so on. If there is only one 
processor, the entire array is mapped to that processor as a single block of 10000 
elements. ■
Exam ple D.2 The block size may be specified explicitly.
REAL WEISSWURST(IOOOO)
!HPF$ DISTRIBUTE WEISSWURST(BL0CK(256))
This specifies that groups of exactly 256 elements should be mapped to suc­
cessive abstract processors. (There must be at least [10000/256] =  40 abstract 
processors if the directive is to be satisfied. The fortieth processor will contain a 
partial block of only 16 elements, namely WEISSWURST(9985:10000).) ■
132
§D.l The distribute directive 133
Exam ple D.3 HPF also provides a cyclic distribution format.
REAL CARDS(52)
!HPF$ DISTRIBUTE CARDS(CYCLIC)
If there are 4 abstract processors, the first processor will contain CARDS (1:49:4), 
the second processor will contain CARDS(2:50:4), the third processor will contain 
CARDS (3:51:4), and the fourth processor will contain CARDS (4:52:4). Successive 
array elements are dealt out to successive abstract processors in round-robin 
fashion. ■
Exam ple D.4 Distributions may be specified independently for each dimension 
of a multidimensional array.
INTEGER CHESS-BOARD(8,8), G0_B0ARD(19,19)
!HPF$ DISTRIBUTE CHESS-BOARD(BLOCK, BLOCK)
!HPF$ DISTRIBUTE G0_B0ARD(CYCLIC,*)
The CHESS-BOARD array will be carved up into contiguous rectangular patches, 
wrhich will be distributed onto a two-dimensional arrangement of abstract pro­
cessors. The GO-BOARD array will have its rows distributed cyclically over a one­
dimensional arrangement of abstract processors. (The * specifies that GO-BOARD is 
not to be distributed along its second axis; thus an entire row is to be distributed 
as one object.) ■
Formally, the syntax of the DISTRIBUTE and directive is:
distribute-directive
dist-directive-clause
distributee
dist-format-clause
dist-format-list
dist-format
dist- onto-clause 
dist-target
“!HPF$ DISTRIBUTE” distributee dist-directive-clause. 
dist-format-clause [ dist-onto-clause ]. 
object-name 
template-name.
“(” dist-format-list “)”
“(” dist-format-list )”
dist-format { uf  dist-format }.
“BLOCK” [ “(” int-expr “)” ]
“CYCLIC” [ “(” int-expr “)” ]
“ * ” .
“ONTO” dist-target. 
processors-name 
“*” processors-name
Constraint: An object-name mentioned as a distributee must be a simple name 
and not a subobject designator.
Constraint: An object-name mentioned as a distributee may not appear as an 
alignee in an ALIGN directive.
§D.2 The align directive 134
Constraint: If a dist-format-list is specified, its length must equal the rank of 
each distributee.
Constraint: If both a dist-format-list and a processors-name appear, the number 
of elements of the dist-format-list that are not * must equal the rank 
of the named processor arrangement.
Constraint: If a processors-name appears but not a dist-format-list, the rank 
of each distributee must equal the rank of the named processor ar­
rangement.
Constraint: If either the dist-format-clause or the dist-target in a DISTRIBUTE 
directive begins with * then every distributee must be a dummy 
argument.
Constraint: Any int-expr appearing in a dist-format of a DISTRIBUTE directive 
must be a specification-expr.
D .2  T he align  d irective
The ALIGN directive is used to specify that certain data objects are to be mapped 
in the same way as certain other data objects. Formally, the syntax of the ALIGN 
directive is
align-directive = “ ! HPF$ ALIGN” alignee “WITH” align-spec.
alignee = object-name [ “(” align-source-list “)” ].
align-source-list = align-source { “,” align-source }.
align-source
1
1
tt.W
align-sub script-name.
align-spec = align-target [ “(” align-sub script-list “)” ]
align-target
1
object-name
template-name.
align-sub script-list = align-sub script { “,” align-sub script }.
align-subscript — align-expr
align-expr specification-expr.
Constraint: An object-name mentioned as an alignee may not appear as a distributee 
in a DISTRIBUTE directive.
Constraint: The align-source-list (and its surrounding parentheses) must be omitted 
if the alignee is scalar.
Constraint: If the align-source-list is present, its length must equal the rank of the 
alignee.
§D.3 The processors directive 135
D .3 T he  p rocessors d irec tive
The PROCESSORS directive declares one ore more rectilinear processor arrangements, 
specifying for each one its name, its rank (number of dimensions), and the size of each 
dimension. Formally, the syntax of the PROCESSORS directive is
processors-directive 
process or s-decl-list 
processors-decl 
processors-name
“PROCESSORS” processors-decl-list. 
processors-decl { processors-decl }. 
processors-name { “(” explicit-shape-list “)” }. 
object-name.
A com piler g e n e ra ted  p ro g ra m
This appendix gives a complete example of a compiler-generated program for one of 
the case studies in §6.3: matrix multiplication.
The source program for the compiler computes the product of two 512 x 512 matrices 
a and b to form the product matrix c, and was parallelised by the compiler to execute 
on eight processors. The compiler-generated code is complete and unaltered.
E .l  H ost program
c This code was generated by Eclipse 
subroutine chmain{) 
integer a(512,512), 6(512,512), c(512,512) 
call ccnfxy{8,1)
call ccreat(IR,' . /m atM ul_cell' ,10,0) 
call dist(10, b, 1048576,ir)
call dist(10,a, 1048576,ir) 10
call dis£(10,c,1048576,zr) 
call retrieve Array c(c)
end
E .2 H ost library
c This code was generated by Eclipse
subroutine retrieve Array c(c) 
integer c(512,512)
136
Cell program 137§E.3
integer ms<?(0:262143) 
pointer (msgp,msg)
call cmallc(1048576,msgp,ir) 
call dconcat(lO,msg,ir)
do 100 iO = 1,512 
do 100 il = 1,512
c(i0,il) = msg((i0 — l) + 512* (il-l))  
100 continue
end
E.3 Cell p ro g ram
10
subroutine ccmain()
integer i, j, k 
integer &(*), a(*), c(*) 
pointer (bp,b), (ap,a), (cp,c) 
integer cztf(0:1), gtl
call getCid{cid)
call partitionArrayb(cid,bp) 10
call partition Array a(cid,ap) 
call partitionArrayc(cid,cp) 
do 10 j  = dd(0),512,8 
do 10 i = cid{ 1),512,1 
do 10 k = 1,512
c(gtl[" c" ,i,j)) = c{gtl[" c" ,i,j))
+ a{gtl{"a",i,k)) * b(gtl(''b",k,j))
10 continue
call retrieve Array c(cid,c)
end 20
E.4 Cell lib ra ry
/* This code was generated by Eclipse
*/
#include <varargs.h>
logicalMesh[2\ =  {8,1}; 
int saddr[4], index =  0;
§ E.4 Cell library 138
void  getcid-(cid) 
in t *cid]
{ _
in t acid;
acid = getcidQ;
cid[0] =  (acid /  1) % 8 +  1;
cid[ 1] =  (ad d  /  1) % 1 + 1;
}
void  partitionarrayb-(cid,b) 
in t *cid,**b]
{  _
in t *skip,* count, msg]
saddr[3] = (512 * dd[0]) +  (1 * 1) +  (-513);
skip = (int *)malloc(2*sizeof(int));
count =  (in t *)malloc(2*sizeof(int));
sdp[0] =  4; count[0] =  512;
skip[ 1] =  16384; count[ 1] =  64;
msg =  ccmn£[0] * coim£[l];
*6 =  (int *)ma//oc(msg*4); 
dzs£(*&,saddr[3]*4,4,2,sdp,coim£);
}
void  partitionarraya~(cid,a) 
in t *cid,**a]
{
in t * skip,* count,msg]
saddr[2] =  (1 * dd[l]) +  (512 * 1) +  (—513);
sdp  =  (int *)malloc(2*sizeof(int));
count =  (in t *)ma//oc(2*sizeof(int));
sdp[0] =  4; count[0] =  512;
sdp[l] =  2048; count[ 1] =  512;
msg = count[0] * count[1];
*a =  (int *)malloc(msg*4); 
did(*a,saddr[2]*4,4,2,sdp,coun£);
}
void  partitionarrayc-(cid,c) 
in t *cid,**c]
{
10
20
30
40
50
in t * skip,* count, msg]
§E.4 Cell library 139
saddr[ 1] =  (512 * cid[0]) +  (1 * cid[ 1]) +  (—513);
skip = (int *)malloc(2*sizeof(int));
count = (int *)malloc(2*sizeof(int));
sA:zp[0] =  4; count[0] =  512;
skip[ 1] =  16384; count[ 1] =  64;
msg = count[0] * count[ 1];
*c =  (int *)ma//oc(7nsg*4); 
c?z5i(*c,sa(idr[l]*4,4I2,5A;zp,count);
}
void retrievearrayc-(cid,-c) 
int *cid,*-C',
{ _
int * skip,* count,msg]
saddr[0] =  (512 * czd[0]) +  (1 * cid[ 1]) +  (—513);
skip = (int *)malloc{2*sizeof(int));
count =  (int *)ma/ioc(2*sizeof(int));
skip[0] =  4; couni[0] =  (512 — cid[ 1])/1 +  1;
skip[ 1] =  16384; couni[1] =  (512 — cid[0])/8 + 1;
=  couni [0] * couni [1]; 
concai(_c,msg*4,saGWr[0]*4,4,2,sA;zp,couni);
}
int gtL(va-alist) 
va-dcl 
{
char *ident] 
vaJist args]
int globalAddr,localAddr\ 
vastart(args)]
ident = va-arg{args, char *);
if (!sirc7np("b",i<ieni)) {
globalAddr — (*va.arg{args, int *) — 1); 
globalAddr + =  (*ua-arg(args, int *) — 1) * 512; 
globalAddr —= saddr[ 3]; 
localAddr = ((globalAddr % 4096) /  1)
+  (globalAddr /  4096) * 512;
} else if (!sircmp("a",zcieni)) {
globalAddr = (*va-arg(args, int *) — 1); 
globalAddr + =  (*va-arg{args, int *) — 1) * 512; 
globalAddr —= saddr[ 2]; 
localAddr = ((globalAddr % 512) /  1)
+ (globalAddr /  512) * 512;
60
70
80
90
Cell library
} else if (\strcmp("c",ident)) {
globalAddr = (*va-arg(args, int *) — 1); 
globalAddr += (*va-.arg{args, int *) — 1) * 512; 
globalAddr —= saddr[ 1]; 
localAddr = ((globalAddr % 4096) /  1)
+ (globalAddr /  4096) * 512;
}
va-end(args) 
return localAddr +  1;
B ib liog raphy
141
B ib liog raphy
[1] Abu-Sufah, W., Kuck , D., AND Lawrie, D. On the Performance Enhance­
ment of Paging Systems Through Program Analysis and Transformations. IEEE 
Transactions on Computers C-30, 5 (May 1981), 341-356.
[2] Agarwal, A., Lim , B., Kranz, D., and Kubiatowicz, J. APRIL: A Pro­
cessor Architecture for Multiprocessing. In Proceedings of the 17th International 
Symposium on Computer Architecture (Seattle, WA, May 1990).
[3] Aho , A., JOHNSON, S., and Ullman, J. Compilers: Principles, Techniques, 
and Tools. Addison-Wesley, Reading, MA, 1986.
[4] Albert, E., Knobe, K., Lucas, J., and Steele J r ., G. Compiling Fortran 
8x Array Features for the Connect Machine Computer System. In Proceedings 
of the ACM SIGPLAN Symposium on Parallel Programming: Experience with 
Applications, Languages and Systems (PPEALS) (New Haven, CT, July 1988).
[5] Allen, F., Burke, M., Charles, P., Cytron, R., and F erreante, J. An 
Overview of the PTRAN Analysis System for Multiprocessing. In Proceedings of 
the First International Conference on Supercomputing (Athens, Greece, 1986).
[6] Allen, F., and Cocke, J. A Catalogue of Optimizing Transformations. In 
Design and Optimization of Compilers, R. Rustin, Ed. Prentice-Hall, Englewood 
Cliffs, NJ, 1971, pp. 1-30.
[7] Allen, J. Dependence Analysis for Subscripted Variables and Its Application 
to Program Transformations. PhD thesis, Computer Science Department, Rice 
University, Houston, TX, 1983.
142
Bibliography 143
[8] A llen , J., Callahan, D., and K ennedy , K. Automatic Decomposition of 
Scientific Programs for Parallel Execution. In Proceedings of the Fourteenth An­
nual ACM Symposium on the Principles of Programming Languages (Munich, 
Germany, January 1987).
[9] Allen, J., and Kennedy, K. Automatic Loop Interchange. In Proceedings of 
the SIGPLAN Symposium on Compiler Construction (Montreal, Quebec, June 
1984), pp. 233-246.
[10] Allen, R., and Kennedy, K. Automotive Translation of Fortran Programs to 
Vector Form. ACM Transactions on Programming Languages and Systems 9, 4 
(1987), 491-542.
[11] Ancourt, C., and Irigoin, F. Scanning Polyhedra with DO Loops. In Pro­
ceedings of the Third ACM SIGPLAN Symposium on Principle and Practice of 
Parallel Programming (Williamsburg, VA, April 91), pp. 39-50.
[12] B aco n , D., G raham , S., and Sh a r p , O. Compiler Transformations for High 
Performance Computing. ACM Computing Surveys 26, 4 (December 1994), 345- 
420.
[13] B a ll , J. Predicting the Effects of Optimization on a Procedure Body. In 
Proceedings of the SIGPLAN Symposium on Compiler Construction (Denver, 
CO, August 1979), pp. 214-220.
[14] B a n e r je e , U. Dependence Analysis for Supercomputing. Kluwer Academic 
Publishers, 1988.
[15] Bath, L. M., Davis, M. A., Williamson, D. L., Williamson, G. S., and 
Wolski, R. J. Users’ Guide to CCM1. Tech. Rep. NCAR Technical Note 
NCAR/TN-286+IA, National Centre for Atmospheric Research, Boulder, Color­
ado, USA, 1987.
[16] Be n n e t t , J ., Carter , J ., and Zw aenepoel, W. Munin: Distributed Shared 
Memory Based on Type Specific Memory Coherence. In Proceedings of the Second 
ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming 
(Seattle, WA, March 1990).
[17] B o u r k e , W. A Multi-level Spectral Model. Mon. Wea. Rev 102 (1974), 687-701.
[18] B reza ny , P., G erndt , M., Sipkova , V., and Zim a , H. SUPERB Support 
for Irregular Scientific Computations. In Proceedings of the 1992 Scalable High 
Performance Computing Conference (Williamsburg, VA, April 1992).
[19] B rinch  Hansen , P. Concurrent Programming Concepts. ACM Computing 
Surveys 5, 4 (1973), 223-245.
[20] Callahan, D., Cooper, K., Hood, R., Kennedy, K., and Torczon, L. 
Parascope: A Parallel Programming Environment. The International Journal of 
Supercomputing Applications 2: 4 (1988), 84-99.
Bibliography 144
[21] Callahan, D., Cooper, K., Kennedy, K., and T orczon, L. Interproced­
ural Constant Propagation. In Proceedings of the SIGPLAN Symposium (Palo 
Alto, CA, June 1986), pp. 152-161.
[22] Callahan, D., and Kennedy, K. Compiling Programs for Distributed 
Memory Multiprocessors. The Journal of Supercomputing 2, 2 (1988), 151-170.
[23] Carriero, N., and Gelernter, D. Linda in Context. Communications of the 
ACM 32, 4 (April 1989), 444-458.
[24] Chaiken, D., Kubiatowicz, J., and Agarwal, A. LimitLESS directories: 
A Scalable Cache Coherence Scheme. In Proceedings of the Fourth International 
Conference on Architectural Support for Programming Languages and Operating 
Systems (Santa Clara, CA, April 1991).
[25] Chapman, B., Herbeck , H., and Zima, H. Automatic Support for Data 
Distribution. In Proceedings 6th Distributed Memory Computing Conference 
(Portland, Oregon, April 1991).
[26] Chapman, B., Mehrotra, P., and Zima, H. Handling Distributed Data in 
Vienna Fortran Procedures. In Proceedings of the Fifth Workshop on Languages 
and Compilers for Parallel Computing (New Haven, CT, August 1992).
[27] Chapman, B., Mehrotra, P., and Zima, H. Programming in Vienna Fortran. 
Scientific Programming 1, 1 (1992), 31-50.
[28] Chase, J., Amador, F., Lazowska, E., Levy, H., and Littlefield, R. 
The AMBER System: Parallel Programming on a Network of Multiprocessors. 
In Proceedings of the 12th ACM Symposium on Operating Systems and Principles 
(December 1989).
[29] Chen, D., Su , H., and Yew, P. The Impact of Synchronisation and Granu­
larity on Parallel Systems. In Proceedings of the 17th International Symposium 
on Computer Architecture (Seattle, WA, 1990).
[30] Chen, M., Choo, Y., and Li , J. Compiling Parallel Programs by Optimizing 
Performance. The Journal of Supercomputing 1, 2 (1988), 171-207.
[31] Chen, M., Choo, Y., and Li , J. Theory and Pragmatics of Compiling Efficient 
Parallel Code. Tech. Rep. YALEU/DCS/TR-760, Yale University, December 
1989.
[32] Cox, A., and Fowler, R. The Implementation of a Coherent Memory Ab­
straction on a NUMA Multiprocessor: Experiences with Platinum. In Proceedings 
of the Twelfth Symposium on Operating Systems Principles (1989).
[33] Dally, W. J., and Seitz, C. L. Deadlock Free Message Routing in Multipro­
cessor Interconnection Network. IEEE Transaction on Computers 36, 5 (1987), 
547-553.
Bibliography 145
[34] D ijk st r a , E. Cooperating Sequential Processes. In Programming Languages, 
F. Genuys, Ed. Academic Press, 1968, pp. 43-112.
[35] Dongarra, J., Croz, J. D., Hammarling, S., and Du ff , I. A Set of 
Level 3 Basic Linear Algebra Subprograms. ACM Transactions on Mathematical 
Software 16 (1990), 1-17.
[36] Dongarra, J., Croz, J. D., Hammarling, S., and Duff , I. Algorithm 679: 
A set of Level 3 Basic Linear Algebra Subprograms: Model Implementation and 
Test Programs. ACM Transactions on Mathematical Software 16 (1990), 18-20.
[37] Fahringer , T., B lasko, R., and Zima, H. Automatic Performance Prediction 
to Support Parallelization of Fortran Programs for Massively Parallel Systems. 
In Proceedings of the 1992 ACM International Conference on Supercomputing 
(Washington, DC, July 1992).
[38] F o ste r , I., AND Taylor, S. Strand: New Concepts in Parallel Programming. 
Prentice-Hall, Englewood Cliffs, NJ, 1990.
[39] Fox, G., Hiranandani, S., Kennedy, K., Koelbel, C., and Kremer, 
U. Fortran D Language Specification. Tech. Rep. TR90-141, Department of 
Computer Science, Rice University, April 1991.
[40] Fox, G., J ohnson, M., Lyzenga, G., O t to , S., Salmon, J., and Walker, 
D. Solving Problems on Concurrent Processors. Prentice Hall, 1988.
[41] F ujitsu  Laboratories Lt d . CAP-II Library Manual (Cell), version 1.0 ed. 
Kawasaki, Japan, March 1990.
[42] F ujitsu Laboratories Ltd . CAP-II Library Manual (Host), version 1.0 ed. 
Kawasaki, Japan, March 1990.
[43] Fujitsu Laboratories Ltd . CAP-II Program Development Guide: Fortran 
Language Interface, version 1.0 ed. Kawasaki, Japan, March 1990.
[44] Geist , G., and Sunderam, V. Network-Based Concurrent Computing on the 
PVM System. Concurrency: Practice and Experience f, 4 (June 1992).
[45] Ge r n d t , M. Automatic Parallelization for Distributed Memory Multiprocessor 
Systems. PhD thesis, Bonn University, 1989.
[46] Gerndt , M. Updating Distributed Variables in Local Computations. Concur­
rency: Practice and Experience 2, 3 (September 1990), 171-193.
[47] Gerndt, M. Work Distribution in Parallel Programs for Distributed Memory 
Multiprocessors. In Proceedings of the 1991 ACM International Conference on 
Supercomputing (Cologne, Germany, June 1991).
[48] G e r n d t , M. Program Analysis and Transformations for Message-passing 
Programs. In Proceedings of the 1992 Scalable High Performance Computing 
Conference (Williamsburg, VA, April 1992).
Bibliography 146
[49] GiRKAR, M., AND POLYCHRONOPOULOS, C. Compiling Issues for Supercom­
puters. In Proceedings of Supercomputing ’88 (Orlando, FL, 1988), pp. 164-173.
[50] Gross, T., Hinrichs, S., O ’Hallaron, D., Stichnoth, J., and Subhlok, 
J. Compiling Task and Data Parallel Programs for iWarp. In Proceedings of the 
Workshop on Languages, Compilers and Run-time Environments for Distributed 
Memory Multiprocessors (Boulder, CO, October 1992).
[51] Hatcher, P., AND Quinn, M. Data Parallel Programming on MIMD Com­
puters. MIT Press, Cambridge, MA, 1991.
[52] Hatcher, P., Quinn, M., Lapadula, A., Seevers, B., Anderson, R., and 
J ones, R. Data-parallel Programming on MIMD Computers. IEEE Transactions 
on Parallel and Distributed Systems 2, 3 (July 1991), 377-383.
[53] High P erformance F ortran Forum. High Performance Fortran Language 
Specification, version 1.0. Tech. Rep. CRPC-TR92225, Rice University, Houston, 
TX, January 1993.
[54] Hillis, W., AND Steele J r ., G. Data Parallel Algorithms. Communications 
of the ACM 29, 12 (December 1986), 2270-1183.
[55] Hinrichs, S., and Gross, T. Utilizing New Communications Features in Com­
pilation for Private Memory Machines. In Proceedings of the Fifth Workshop on 
Languages Compilers for Parallel Computing (New Haven, CT, August 1992).
[56] Hiranandani, S., Kennedy, K., Koelbel, C., Kremer, U., and T seng, 
C. An Overview of the F ortran D Programming System. Tech. Rep. TR90-154, 
Department of Computer Science, Rice University, March 1991.
[57] Hiranandani, S., Kennedy, K., and T seng, C. Compiler Support for 
Machine Independent Parallel Programming in Fortran D. In Compilers and 
Runtime Software for Scalable Multiprocessors, J. Saltz and P. Mehrotra, Eds. 
Elsevier, 1991.
[58] H iranandani, S., Kennedy, K., and T seng, C. Compiler Optimizations for 
Fortran D in MIMD Distributed-Memory Machines. In Proceedings of Supercom- 
puting’92 (1992).
[59] Ho, C. Optimal Communication Primitives and Graph Embeddings on Hyper­
cubes. PhD thesis, Yale University, 1990.
[60] H oare, C. Monitors: An Operating System Structuring Concept. Communica­
tions of the ACM 17, 10 (1974), 549-557.
[61] Horie , T., Ikesake, M., a nd  Ishihata, H. Interconnection Networks for Mul­
tiprocessors. In Proceedings of the First Fujitsu-ANU CAP Workshop (November 
1990), pp. 1-11.
Bibliography 147
[62] Ikudome, K., Fox, G., Kolawa, A., and F lower, J. An Automatic and 
Symbolic Parallelization System for Distributed Memory Parallel Computers. In 
Proceedings of the Fifth Distributed Memory Computing Conference (Charleston, 
SC, April 1990).
[63] Irigoin , F., and T riolet, R. Supernode Partitioning. In Proceedings of 
the AC M  Symposium on Principles of Programming Languages (San Diego, CA, 
January 1988), pp. 319-329.
[64] Ishihata, H., Horie, T., Shimizu, T., and Kato, S. CAP-II Architecture. In 
Proceedings of the First Fujitsu-ANU CAP Workshop (Kawasaki, Japan, Novem­
ber 1990).
[65] J ohnsson, S. L. Communication Efficient Basic Linear Algebra Computations 
on Hypercube Architectures. Journal of Parallel and Distributed Computing 4, 
2 (April 1987), 133-172.
[66] Ka rp , A. Programming for Parallelism. IEEE Computer 20, 5 (1987), 43-57.
[67] Kendall Square R esearch. KSR1 Technical Summary. 1992.
[68] Koelbel, C. Compiling Programs for Nonshared Memory Machines. PhD thesis, 
Purdue University, West Lafayette, IN, 1990.
[69] KOELBEL, C. Compile-Time Generation of Regular Communication Patterns. In 
Proceedings of Supercomputing ’91 (Albuquerque, NM, November 1991), pp. 101- 
110.
[70] Koelbel, C., Mehrotra, J., Saltz, J., and Berryman, S. Parallel Loops on 
Distributed Machines. In Proceedings of the Fifth Distributed Memory Computing 
Conference (Charleston, SC, April 1990).
[71] Koelbel, C., and Mehrotra, P. Compiling Global Name-Space Parallel 
Loops for Distributed Execution. IEE Transactions on Parallel and Distributed 
Systems 2, 4 (October 1991), 440-451.
[72] Koelbel, C., and Mehrotra, P. Programming Data Parallel Algorithms 
on Distributed Memory Machines using Kali. In Proceedings of the 1991 ACM  
International Conference on Supercomputing (Cologne, Germany, 1991).
[73] Koelbel, C., Mehrotra, P., and Van Rosendale, J. Supporting Shared 
Data Structures on Distributed Memory Machines. In Proceedings of the Second 
ACM  SIGP L A N  Symposium on Principles and Practices of Parallel Programming 
(Seattle, WA, March 1990).
[74] Kremer , U., Zima, H., Bast, J., and Gerndt, M. Advanced Tools and 
Techniques for Automatic Parallelisation. Parallel Computing 7 (1988), 387- 
393.
Bibliography 148
[75] Kuck, D. A Survey of Parallel Machine Organization and Programming. ACM  
Computing Surveys 9, 1 (March 1977), 25-59.
[76] Kuck, D., Davidson, E., Lawrie, D., and Sameh, A. Parallel Computing 
Today and the Cedar Approach. Science, 231 (February 1986), 967-974.
[77] Kuck, D., Kuhn, R., Leasure, B., and Wolfe, M. The Structure of and 
Advanced Retargetable Vectorizer. In Tutorial on Supercomputers, K. Hwang, 
Ed. IEEE Press, 1984, pp. 163-178.
[78] Kuck, D., Lawrie, D., Cytron, R., Sameh, A., and Gajski, D. The 
Architecture and the Programming of the Cedar System. Tech, rep., Laboratory 
for Advanced Supercomputers, Department of Computer Science, University of 
Illinois, August 1983.
[79] Kung, H., AND Subhlok, J. A New Approach for Automatic Parallelization 
of Blocked Linear Algebra Computations. In Proceedings of Supercomputing ’91 
(Albuquerque, NM, November 1991).
[80] Lamport, L. The Parallel Execution of DO Loops. Communications of the 
ACM 17, 2 (February 1974), 83-93.
[81] Laplante, P. Fractal Mania. Windcrest/McGraw Hill, New York, 1994.
[82] Lenoski, D., Laudon, J., Joe, T., Nakahira, D., Stevens, L., Gupta, A., 
AND Hennessy, J. The DASH Prototype: Implementation and Performance. 
In Proceedings of the 19th International Symposium on Computer Architecture 
(Gold Coast, Australia, May 1992).
[83] Li, J., AND Chen, M. Generating Explicit Communications from Shared 
Memory Program References. In Proceedings Supercomputing ’90 (November 
1990), pp. 865-876.
[84] Li, J ., and Chen, M. Index Domain Alignment: Minimizing Cost of Cross- 
Referencing Between Distributed Arrays. In Proceedings Frontiers ’90: The 3rd 
Symposium on the Frontiers of Massively Parallel Computation (October 1990), 
pp. 424-433.
[85] Li, J., AND Chen, M. Compiling Communication-Efficient Programs for 
Massively Parallel Machines. IEEE Transactions of Parallel and Distributed Sys­
tems 2, 3 (July 1991), 361-376.
[86] Li, J., and Chen , M. The Data Alignment Phase in Compiling Programs for 
Distributed Memory Machines. Journal of Parallel and Distributed Computing 
13, 2 (October 1991), 213-221.
[87] Li, K., AND Hudak, P. Memory Coherence in Shared Virtual Memory Systems. 
ACM Transactions on Computer Systems 7, 4 (November 1989), 321-359.
Bibliography 149
[88] LiSKOV, B., and Scheifler, R. Guardians and Actions: Linguistic support 
for robust, distributed programs. ACM Transactions on Programming Languages 
and Systems 5, 3 (1983), 381-404.
[89] LOVEMAN, D. Program Improvement by Source-to-source Transformation. 
Journal of ACM 1, 24 (January 1977), 121-145.
[90] Lucco, S., and Sharp, O. Parallel Programming with Coordinate Structures. 
In Proceedings of the Eighth Annual ACM Symposium on the Principles of Pro­
gramming Languages (Orlando, FL, January 1991).
[91] MacDonald, T., Pase, D., and Meltzer, A. MPP Fortran Programming 
Model. Tech, rep., Cray Research, 1992.
[92] Maspar C omputer Corporation. MasPar Fortran User Guide, version 
1.1 ed., August 1991.
[93] Mehrotra , P., and Rosendale, J. V. Programming Distributed Memory 
Architectures using Kali. In Advances in Languages and Compilers for Parallel 
Computing. MIT Press, Irvine, CA, August 1990.
[94] Message Passing Interface F orum. MPI: A Message Passing Interface 
Standard. Tech. Rep. CS-94-230, University of Tennessee, Knoxville, 1994.
[95] M O SS, J. Nested Transactions: An Approach to Distributed Computing. MIT 
Press, Cambridge, Mass, 1985.
[96] Muraoka, Y. Parallelism Exposure and Exploitation in Programs. PhD thesis, 
University of Illinois at Urbana-Champaign, 1971.
[97] Olander, D., and Schnabel, R. Preliminary Experience in Developing a 
Parallel Thin-layer Navier Stokes Code and Implications for Parallel Language 
Design. In Proceedings of the 1992 Scalable High Performance Computing Con­
ference (Williamsburg, VA, 1992).
[98] Padua, D., Kuck, D., and Lawrie, D. High Speed Multiprocessors and 
Compilation Techniques. IEEE Transactions on Computers 29, 9 (September 
1980), 763-776.
[99] PADUA, D., AND W olfe, M. Advanced Compiler Optimizations for Supercom­
puters. Communications of the ACM 29, 12 (1986).
[100] Pancake, C., and Bergmark, D. Do Parallel Languages Respond to the 
Needs of Scientific Programmers? IEEE Computer 23, 12 (December 1990), 
13-23.
[101] Parasoft Corporation. Express User’s Manual, 1989.
[102] P fister , G. The IBM Research Parallel Processor Prototype (RP3): Intro­
duction and Architecture. In Proceedings of the 1985 Conference on Parallel 
Processing (1985).
Bibliography 150
[103] PlNGALl, K., AND Rogers, A. Compiling for Locality. In Proceedings of the 
1990 International Conference on Parallel Processing (St. Charles, IL, 1990).
[104] POLYCHRONOPOULOS, C. Parallel Programming and Compilers. Kluwer Aca­
demic Publishers, Boston, MA, 1988.
[105] P olychronopoulos, C., G irkar, M., Haghighat, M., Lee , C., Leung,
B., AND SCHOUTEN, D. Parafrase-2: An Environment for Parallelizing,
Partitioning, Synchronizing and Scheduling Programs on Multiprocessors. In 
Proceedings of 1989 International Conference on Parallel Processing (August
1989) .
[106] P olychronopoulos, C., Girkar, M., Haghighat, M., Lee , C., Leung, 
B., and Schouten, D. The Structure of Parafrase-2: An Advanced Parallelizing 
Compiler for C and Fortran. In Languages and Compilers for Parallel Computing, 
D. Gelernter, A. Nicolau, and D. Padua, Eds. MIT Press, August 1989.
[107] Quinn, M., and Hatcher, P. Data Parallel Programming on Multicomputers. 
IEEE Software 7, 5 (September 1990), 69-76.
[108] R ibas, H. Obtaining Dependence Vectors For Nested-Loop Computations. In 
Proceedings of the 1990 International Conference on Parallel Processing (August
1990) , vol. II Software, pp. 212-219.
[109] R obertson, S. Polytopes and Symmetry. Cambridge University Press, 1984.
[110] Rogers, A., and P ingali, K. Process Decomposition Through Locality of 
Reference. In Proceedings of the SIGPLAN ’89 Conference on Programming 
Language Design and Implementation (Portland, OR, June 1989).
[111] Rose, J., and Steele J r ., G. C*: An Extended C Language for Data Par­
allel Programming. In Proceedings of the Second International Conference on 
Supercomputing (Santa Clara, CA, May 1987).
[112] R osing, M., and Schnabel, R. An Overview of DINO—A New Language 
for Numerical Computation on Distributed Memory Multiprocessors. Tech. Rep. 
CU-CS-385-88, University of Colorado, Boulder, March 1988.
[113] R osing, M., Schnabel, R., and W eaver, R. Expressing Complex Parallel 
Algorithms in DINO. In Proceedings of the 4th Conference on Hypercube Con­
current Computers and Applications (Monterey, CA, March 1989).
[114] R osing, M., Schnabel, R., and W eaver, R. The DINO Parallel Program­
ming Language. Journal of Parallel and Distributed Computing 13 (1991), 30-42.
[115] R osing, M., Schnabel, R., and W eaver, R. Scientific Programming Lan­
guages for Distributed Memory Multiprocessors: Paradigms and Research Issues. 
In Languages, Compilers and Run-time Environments for Distributed Memory 
Machines, J. Saltz and P. Mehrotra, Eds. North Holland, Amsterdam, 1992.
Bibliography 151
[116] Saltz, J., Crowley, K., Mirchandaney, R., and Berryman, H. Run­
time Scheduling and Execution of Loops on Message Passing Machines. Journal 
of Parallel and Distributed Computing 8 (1990), 303-312.
[117] Sawdayi, R., Wagenbreth, G., and W illiamson, J. MIMDizer: Functional 
and Data Decomposition; creating parallel programs from scratch, transforming 
existing Fortran programs to parallel. In Compilers and Runtime Software for 
Scalable Multiprocessors, J. Saltz and P. Mehrotra, Eds. Elsevier, Amsterdam, 
1991.
[118] SCHEIFLER, R. An Analysis of Inline Substitution for a Structured Programming 
Language. Communications of the ACM 20, 9 (September 1977), 647-654.
[119] Skillicorn, D. Architecture-independent Parallel Computation. IEEE Com­
puter 23, 12 (December 1990).
[120] Spivey , J. The Z Notation, second ed. Prentice Hall, 1992.
[121] SUN MICROSYSTEMS. Sun Fortran Programmers Guide, revision a ed. Mountain 
View, CA, March 1988.
[122] T hakkar, S., Gifford , P., and F ielland, G. The Balance Multiprocessor 
System. IEEE Micro (February 1988), 58-69.
[123] T hinking Machines C orporation. CM Fortran Reference Manual, version 
0.7-f ed. Cambridge, MA, November 1990.
[124] T hinking Machines Corporation. Connection Machine CM-5 Technical 
Summary. Cambridge, MA, November 1992.
[125] TSENG, P. A Parallelizing Compiler for Distributed Memory Parallel Computers. 
In Proceedings of the SIGPLAN  790 Conference on Programming Language 
Design and Implementation (White Plains, NY, June 1990).
[126] T seng, P. A Systolic Array Parallelizing Compiler. Journal of Parallel and 
Distributed Computing 9, 2 (June 1990), 116-127.
[127] Valiant, L. A Bridging Model for Parallel Computation. Communications of 
the ACM 33, 8 (August 1990), 103-111.
[128] W egman, M., and Zadeck, F. Constant Propagation with Conditional 
Branches. ACM Transactions on Programming Languages and Systems 13, 2 
(April 1991), 181-210.
[129] W olfe, M. Optimizing Compilers for Supercomputers. MIT Press, Cambridge, 
MA, 1989.
[130] Wu, J., Saltz, J., Berryman, H., and Hiranandani, S. Distributed 
Memory Compiler Design for Sparse Problems. Tech. Rep. ICASE Report 91-13, 
Institute for Computer Applications in Science and Engineering, Hampton, VA, 
January 1991.
Bibliography 152
[131] ZIMA, H., Bast, H., and Gerndt, M. SUPERB: A Tool for Semi-Automatic 
MIMD/SIMD Parallelization. Parallel Computing 6 (1988), 1-18.
[132] Zima, H., Brezany, P., Chapman, B., Mehrota, P., and Schwald, A. 
Vienna Fortran: A language Specification (Version 1.1). Tech, rep., 1991.
