Exploiting iteration-level parallelism in declarative programs by Roy, John M.A.
UC Irvine
ICS Technical Reports
Title
Exploiting iteration-level parallelism in declarative programs
Permalink
https://escholarship.org/uc/item/0j11x8nt
Author
Roy, John M.A.
Publication Date
1991
 
Peer reviewed
Department of Information and Computer Science 
University of California at Irvine 
Irvine, CA 92717 
Exploiting Iteration-Level Parallelism
in Declarative Programs
John M.A. Roy
March 1991
Notice: This Material 
may be protected 
by Copyright Law 
(Title 17 U.S.C.) 
Technical Report #91-24 
Abstract 
In order to achieve viable parallel processing, three basic criteria must be met: (1) the system must provide a
programming environment which hides the details of parallel processing from the programmer; (2) the 
system must execute efficiently on the given hardware; and (3) the system must be economically attractive. 
The first criterion can be met by providing the programmer with an implicit rather than explicit 
programming paradigm. In this way all of the synchronization and distribution are handled automatically. 
To meet the second criterion, the system must perform synchronization and distribution in such a way that 
the available computing resources are used to their utmost. And to meet the third criterion, the system 
must not require esoteric or expensive hardware to achieve efficient utilization. 
This dissertation reports on the Process-Oriented Dataflow System (PODS), which meets all of the above 
criteria. PODS uses a hybrid von Neumann-Dataflow model of computation supported by an automatic 
partitioning and distribution scheme. The new partitioning and distribution algorithm is presented along 
with the underlying principles. Four new mechanisms for distribution are presented: (1) a distributed array 
allocation operator for data distribution; (2) a distributed L operator for code distribution; (3) a range filter 
for restricting index ranges for different PEs; and (4) a specialized apply operator for functional parallelism.
Simulations show that PODS balances communication overhead with distributed processing to achieve 
efficient parallel execution on distributed memory multiprocessors. This is partially due to a new software 
array caching scheme, called remote caching, which greatly reduces the amount of remote memory reads.
PODS is designed to use off-the-shelf components, with no specialized hardware. In this way a real PODS 
machine can be built quickly and cost effectively. The system is currently being retargeted to the Intel 
iPSC/2 so that it can be run on commercially available equipment. 
Keywords: single assignment, dataflow, multiprocessor, declarative programming, 
matrix multiply, SIMPLE 
UNIVERSITY OF CALIFORNIA 
IRVINE 
Exploiting Iteration-Level Parallelism 
in Declarative Programs 
DISSERTATION 
submitted in partial satisfaction of the requirements for the degree of 
DOCTOR OF PHILOSOPHY 
in Information and Computer Science 
by 
John Marc Andre Roy 
Dissertation Committee: 
Professor Lubomir Bic, Chair 
Professor Nikil Dutt 
Professor Alexandru Nicolau 
1991 
© by John Marc Andre Roy 
All rights reserved.
The dissertation of John Marc Andre Roy is approved, 
and is acceptable in quality and form
for publication on microfilm:
University of California, Irvine
1991
DEDICATION 
To 
my wonderful wife 
for all of her love, support, and understanding.
I love you Charlene.
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS
CURRICULUM VITAE
PUBLICATIONS
ABSTRACT OF THE DISSERTATION
Chapter 1. Background
  1.1. Basic Issues in Parallel Processing
    1.1.1. Parallel Programming
    1.1.2. Distributed Memory MIMD
  1.2. Previous Research
    1.2.1. Single Assignment Principle
    1.2.2. ID Nouveau Dataflow Language
      Single Assignment Approach
      Iteration
      I-Structures
      Discussion
    1.2.3. Hybrid Dataflow
  1.3. Overview of PODS Execution Model
    1.3.1. Subcompact Processes (SP)
    1.3.2. State Transitions
    1.3.3. Distributed Memory Approach
    1.3.4. Discussion
  1.4. Contributions of this Research
    1.4.1. Execution Model Extensions
    1.4.2. Partitioning and Distribution Model
    1.4.3. Remote Array Caching
    1.4.4. Logical Architecture
    1.4.5. Simulations
Chapter 2. PODS Partitioning and Distribution Model
  2.1. Overview
  2.2. Underlying Principles
    2.2.1. Basic Principles
    2.2.2. PODS Specific Principles
      Grouping Principle
      Virtual Sources Principle
      Collector Writes Principle
  2.3. PODS Instructions and Processes
    2.3.1. Activity Names
    2.3.2. PODS Instruction Format
    2.3.3. PODS Dataflow Operator Implementation
      Arithmetic and Logical Operators
      switch and forkjump
      d and d_inverse
      l and l_inverse
      a and a_inverse
  2.4. Array Partitioning and Distribution
  2.5. Distributing Processes
    2.5.1. Data-Distributed Execution Principle
    2.5.2. Range Filters
      Objective and Usage
      Boundary Table
      Master Array
      Algorithm
    2.5.3. LCD Effects
    2.5.4. Remote Array Accesses
      Remote Reads
      Remote Writes
    2.5.5. For-Loop Distribution Algorithm
    2.5.6. Examples
      LCD Examples
      Matrix Multiply
  2.6. Functional Distribution
  2.7. Deadlock Handling
Chapter 3. PODS Logical Implementation
  3.1. System Overview
  3.2. Logical PE Architecture
    3.2.1. Execution Unit
    3.2.2. Routing Unit
    3.2.3. Array Manager
    3.2.4. Memory Manager
    3.2.5. Matching Unit
  3.3. Remote Array Caching
  3.4. Software Support
    3.4.1. ID World and GITA Compiler
    3.4.2. Translator
    3.4.3. Partitioner
    3.4.4. Simulator
Chapter 4. PODS Simulations
  4.1. Overview
    4.1.1. Simulator Approach
    4.1.2. Timing Assumptions
      Execution Unit
      Array Manager
      Routing Unit
      Memory Manager
      Matching Store
      Network
  4.2. Measures of Effectiveness (MOEs)
  4.3. Example Programs
    4.3.1. Matrix Multiply
      Discussion
      Results
    4.3.2. SIMPLE
      Discussion
      Results
  4.4. Summary
Chapter 5. Conclusions
  5.1. Related Work
    5.1.1. Iannucci's Hybrid Architecture
    5.1.2. Gao's Hybrid Machine
    5.1.3. Alfalfa
    5.1.4. Decoupled Multilevel Dataflow Model
    5.1.5. Dynamic Structured Dataflow
    5.1.6. Pingali and Rogers' Compiler
  5.2. Advantages and Disadvantages of Single Assignment
  5.3. Summary
  5.4. Future Research
    5.4.1. HyperPODS
    5.4.2. PODS Compiler
References
Appendix A: Range Filter Algorithms
LIST OF FIGURES
Figure 1.1. Lines of Research
Figure 1.2. ID Nouveau Quicksort Code
Figure 1.3. ID Nouveau Iteration Example
Figure 1.4. ID Nouveau I-Structure Example
Figure 1.5. Subcompact Process Example Code
Figure 1.6. PODS Subcompact Processes Example
Figure 1.7. Process State Transition Diagram
Figure 1.8. PODS Memory Accessing Scheme
Figure 2.1. Simple Array Assignment
Figure 2.2. Equal Distribution Principle
Figure 2.3. Grouping Principle
Figure 2.4. Virtual Sources Principle
Figure 2.5. Collector Writes Principle
Figure 2.6. Basic Dataflow Operator
Figure 2.7. Activity Name Components
Figure 2.8. SP Components
Figure 2.9. ID vs PODS Statement "Addressing"
Figure 2.10. PODS SWITCH and FORKJUMP Instruction Examples
Figure 2.11. PODS Branch
Figure 2.12. PODS Code Fragment for a Loop
Figure 2.13. Example L Operators
Figure 2.14. Example Apply and Inv_Apply Operators
Figure 2.15. Matrix Multiply ID Nouveau Source Code
Figure 2.16. PODS Partitioning of a 2-D Array
Figure 2.17. 2-D Array Read Pseudo-Code
Figure 2.18. Example 2-D Array Remote Read
Figure 2.19. Example 2-D Array Local Read
Figure 2.20. Partitioning a 2D Iteration Space
Figure 2.21. Partitioning a 3D Iteration Space
Figure 2.22. Simple 2-D Array Fill
Figure 2.23. 2-D Array Fill with Range Filter
Figure 2.24. Algorithm for Second Level, Descending Range Filter for A[ci*i+ki,cj*j+kj]
Figure 2.25. Non-rectangular Array Partitioning Example
Figure 2.26. Effects of Communication Speed on Overlapping Iterations
Figure 2.27. Remote Read Code Example
Figure 2.28. Remote Write Code Example
Figure 2.29. Impossible Collector Writes
Figure 2.30. Simple Array Filling Example Code
Figure 2.31. Simple Row-Major Array Partitioning
Figure 2.32. LCD Execution Wavefronts
Figure 2.33. Example Execution Trace for Matrix Multiply on 4 PEs
Figure 2.34. ID Nouveau Deadlock Code Example
Figure 3.1. Logical Units of a PODS PE
Figure 3.2. Routing Table
Figure 3.3. Routing Unit Block Diagram
Figure 3.4. Effects of Cache Size on Percentage of Remote Reads
Figure 3.5. Remote Reads for the Livermore Loops using Remote Caching
Figure 3.6. PODS Programming System
Figure 3.7. PODS Partitioner Block Diagram
Figure 4.1. 2-D Array Read Pseudo-Code
Figure 4.2. Matrix Multiply ID Nouveau Source Code
Figure 4.3. Utilization for Each Functional Unit (16 x 16 MM)
Figure 4.4. Average Execution Unit Utilization for Matrix Multiply
Figure 4.5. Utilization for each Execution Unit (16 x 16 MM on 8 PEs)
Figure 4.6. Utilization for each Execution Unit (16 x 16 MM on 16 PEs)
Figure 4.7. Speed-Up of Matrix Multiply
Figure 4.8. Sweep For-Loops in Conduction Code
Figure 4.9. Original Conduction Code with Multiple LCDs
Figure 4.10. Scalar Expanded Conduction Code Fragment
Figure 4.11. Utilization for Each Functional Unit (16 x 16 SIMPLE)
Figure 4.12. Execution Unit Utilization for SIMPLE
Figure 4.13. Execution Unit Utilization (16 x 16 SIMPLE on 32 PEs)
Figure 4.14. Execution Unit Utilization (32 x 32 SIMPLE on 32 PEs)
Figure 4.15. Execution Unit Utilization (64 x 64 SIMPLE on 32 PEs)
Figure 4.16. Speed-Up of SIMPLE
Figure A.1. Base Range Filter Algorithm for Outermost Level Distribution
Figure A.2. Base Range Filter Algorithm for Second Outermost Level Distribution
Figure A.3. Base Range Filter Algorithm for Third Outermost Level Distribution
Figure A.4. Range Filter Algorithm for Stepsize -1
Figure A.5. Second Level Distribution Range Filter for A[ci*i+ki,cj*j+kj]
Figure A.6. Range Filter for Third Level Distribution with Stepsize C
LIST OF TABLES
Table 2.1. PODS Array Header Information
Table 2.2. 2-D Array Example Header
Table 2.3. Example Boundary Table for a Given PE
Table 2.4. Effects of Outer Loop Distribution with No LCDs
Table 2.5. Effects of Inner Loop Distribution with No LCDs
Table 2.6. Effects of Inner Loop Distribution with LCDs
Table 2.7. Effects of No Distribution due to LCDs
Table 4.1. Measured Times of Operations on iPSC/2
Table 4.2. Percent Overhead Instructions for Matrix Multiply
Table 4.3. SP Statistics for Conduction
Table 4.4. Percent Overhead Instructions for SIMPLE
ACKNOWLEDGEMENTS 
I would like to express my thanks to my committee chair, Professor Lubomir Bic, for his
continuing insights throughout the years, and for his diligence in helping me prepare this
dissertation.
I would like to particularly thank my research associate, Mark Nagel, for his ideas and
programming expertise. Without Mark this work would still be in the programming stages.
Good luck Mark.
Special thanks to my parents for their support and encouragement throughout my entire
life.
Financial support has been provided by a number of sources throughout the years: Hughes
Aircraft Company, Fail-Safe Technology, JMAR Research Group, and from my wonderful
wife, Charlene. I would also like to thank the National Science Foundation for their
support of the PODS research through NSF grant #CCR-8709817.
CURRICULUM VITAE 
John M.A. Roy 
1982 B.S. in Electrical Engineering, University of California, San Diego. 
1982-1987 Systems Engineer, Hughes Aircraft Company, Fullerton, California. 
1984 M.S. in Electrical Engineering, University of Southern California. 
1987-1988 Senior Member of Technical Staff, Fail-Safe Technology, Los Angeles, 
California. 
1988-1989 Systems Consultant, JMAR Research Group, Irvine, California. 
1989 M.S. in Information and Computer Science, University of California,
Irvine. 
1989-1990 Vice-President, Engineering and Operations, Trintech USA, Irvine, 
California. 
1991-Present Vice-President, Engineering, National Paging, Santa Ana, California. 
1991 Ph.D. in Information and Computer Science, University of California,
Irvine. 
Dissertation: "Exploiting Iteration-Level Parallelism in Declarative 
Programs." 
Professor Lubomir Bic, Chair. 
PUBLICATIONS
L. Bic, M. D. Nagel, J. M. A. Roy. Automatic Data/Program Partitioning Using the
Single Assignment Principle. Supercomputing '89 (1989), pp. 551-556.
L. Bic, M. D. Nagel, J. M. A. Roy. Executing Matrix Multiply on a Process Oriented 
Dataflow Machine. Technical Report 90-08 (April 1990), Department of ICS, University 
of California, Irvine. 
L. Bic, M. D. Nagel, J. M. A. Roy. On Array Partitioning in PODS. In Advanced Topics 
in Data-Flow Computing. J. L. Gaudiot, L. Bic, Eds. (Prentice Hall, Englewood Cliffs, 
New Jersey, 1990), pp. 305-325. 
J. M. A. Roy, M. D. Nagel, L. Bic. Partitioning Declarative Programs into 
Communicating Processes. Supercomputing '90 (1990), pp. 846-855. 
ABSTRACT OF THE DISSERTATION
Exploiting Iteration-Level Parallelism
in Declarative Programs
by 
John Marc Andre Roy 
Doctor of Philosophy in Information and Computer Science 
University of California, Irvine, 1991 
Professor Lubomir Bic, Chair 
In order to achieve viable parallel processing three basic criteria must be met: (1) the system 
must provide a programming environment which hides the details of parallel processing 
from the programmer; (2) the system must execute efficiently on the given hardware; and 
(3) the system must be economically attractive. 
The first criterion can be met by providing the programmer with an implicit rather than 
explicit programming paradigm. In this way all of the synchronization and distribution are
handled automatically. To meet the second criterion, the system must perform 
synchronization and distribution in such a way that the available computing resources are 
used to their utmost. And to meet the third criterion, the system must not require esoteric 
or expensive hardware to achieve efficient utilization. 
This dissertation reports on the Process-Oriented Dataflow System (PODS), which meets 
all of the above criteria. PODS uses a hybrid von Neumann-Dataflow model of
computation supported by an automatic partitioning and distribution scheme. The new 
partitioning and distribution algorithm is presented along with the underlying principles. 
Four new mechanisms for distribution are presented: (1) a distributed array allocation 
operator for data distribution; (2) a distributed L operator for code distribution; (3) a range 
filter for restricting index ranges for different PEs; and (4) a specialized apply operator for
functional parallelism. 
Simulations show that PODS balances communication overhead with distributed 
processing to achieve efficient parallel execution on distributed memory multiprocessors. 
This is partially due to a new software array caching scheme, called remote caching, which
greatly reduces the amount of remote memory reads. PODS is designed to use
off-the-shelf components, with no specialized hardware. In this way a real PODS machine
can be built quickly and cost effectively. The system is currently being retargeted to the
Intel iPSC/2 so that it can be run on commercially available equipment.
CHAPTER 1
Background 
Scientific programmers are the primary users of parallel systems today. The current
parallel programming systems do not meet the needs of this important group. Recent user
surveys show that only one user program in twenty executed on the Cornell supercomputer
is parallel [P&B90]. These surveys also indicate that many more scientists would
program for parallel systems if they were not so difficult to program. Hand-coded
parallelism is too difficult and time consuming, while parallelizing compilers do not achieve
significant speed-up.
What is needed is a system which provides scientific programmers with a means to express
their problem clearly and to have it execute efficiently in parallel automatically. Add to this
the desire to run on standard MIMD architectures (e.g., iPSC/2) and the problem becomes
very difficult. MIMD architectures require that programs be decomposed into independent
processes, running asynchronously on the different processor nodes and communicating
with one another through message passing or through shared memory. The current state of
the art in programming such machines efficiently is to let the programmer explicitly
partition the program into processes and insert the necessary synchronization and
communication primitives. This is very time-consuming and error-prone. Automatic
generation of parallel programs from conventional languages has not, as yet, achieved
sufficient speed-up to warrant wide-spread usage.
To achieve these goals, many declarative programming languages [A&E88] have been
designed. Declarative programming languages are much better suited for program
decomposition than procedural languages such as C or FORTRAN. Declarative languages
allow the programmer to describe the problem using high-level constructs, yet their
semantics eliminate uncontrolled side-effects through functional expressions and single
assignment restrictions.
Declarative languages have been developed primarily in the context of radically
different computer architectures, in particular dataflow architectures, where parallelism is
to be exploited at the instruction level. For conventional loosely-coupled MIMD systems,
this level of parallelism is too low; the communication costs are too high. By moving to
iteration-level parallelism this problem can be overcome [Burns, 88]. Iteration-level
parallelism is achieved when different iterations (or groups of iterations) from the same
loop are run on different PEs.
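As a minimal sketch of iteration-level parallelism (not the PODS mechanism itself), the Python fragment below divides the iteration range of a single loop into contiguous groups, one per PE. The names `partition_iterations` and `run_group`, the block-wise grouping, and the PE count are assumptions made for illustration; the text only requires that different iterations, or groups of iterations, of the same loop run on different PEs.

```python
def partition_iterations(lower, upper, num_pes):
    """Split the inclusive loop range [lower, upper] into contiguous groups.

    Returns a list with one (first, last) pair per PE; a PE that receives
    no iterations gets None. Block-wise grouping is an assumption here.
    """
    if num_pes <= 0:
        raise ValueError("num_pes must be positive")
    total = max(upper - lower + 1, 0)
    base, extra = divmod(total, num_pes)
    groups = []
    start = lower
    for pe in range(num_pes):
        size = base + (1 if pe < extra else 0)  # spread the remainder evenly
        groups.append((start, start + size - 1) if size > 0 else None)
        start += size
    return groups


def run_group(group, body):
    """Execute the loop body sequentially for one PE's group of iterations."""
    if group is None:
        return []
    first, last = group
    return [body(i) for i in range(first, last + 1)]


if __name__ == "__main__":
    groups = partition_iterations(1, 10, 4)
    for pe, group in enumerate(groups):
        # Assuming the loop has no loop-carried dependencies, each group is
        # independent of the others, so every PE could run its group at once.
        print(pe, group, run_group(group, lambda i: i * i))
```

Each PE executes its own group sequentially, so communication is needed only between groups rather than between individual instructions, which is the motivation given above for moving up from instruction-level parallelism.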
Process-Oriented Dataflow Systems (PODS) make use of iteration-level parallelism and
declarative programming on distributed memory MIMD machines. The PODS line of
research is shown in Figure 1.1 as the bold arrow.
[Figure 1.1. Lines of Research. The diagram relates two classes of languages (Imperative Languages with Parallel Extensions; Declarative Languages) to two classes of machines (Networks of von Neumann Processors; Dataflow Architectures).]
Figure 1.1 shows the different lines of research in parallel processing. The first line
involves running imperative languages with parallel extensions (e.g., FORTRAN*
[K&B88]) on Networks of von Neumann Processors (e.g., iPSC/2 [Intel, 89]). This
approach is the least revolutionary and has had some commercial success. The second line
of research is to take imperative languages and execute them on dataflow architectures
(e.g., Monsoon [Pap88]). This direction has not seen much research; only the ASTOR
[U&Z89, Z&U87] project in Germany has looked into it. The next line of research is to
take declarative languages (e.g., ID [ANP87b], SISAL [A&085, MSS85]) and run them
on von Neumann networks. This is where PODS is, and there are a number of others,
notably Pingali and Rogers at Cornell [P&R90]. The final approach is the most
revolutionary: running declarative languages on dataflow architectures involves both new
hardware and software. P-RISC [N&A89] and the Monsoon project are both taking this
approach.
In [Bic87], the basic principles of PODS were presented. The algorithms for subdividing
dataflow graphs into communicating processes, however, were too simplistic,
concentrating on only functional parallelism. In scientific code, most parallelism comes
from loops iterating over large data structures (i.e., data parallelism). This issue has been
addressed in subsequent studies [BNR89a, Bic90, BNR90a, BNR90b] which show that,
for languages based on the single assignment principles (declarative languages), a simple
automatic partitioning of arrays exposes significant parallelism that can be exploited at
run-time.
In PODS, the programming language ID Nouveau [Nik88] is used because it is one of the
most developed and supported dataflow languages to date and has single-assignment.
Single-assignment is central to PODS. Given an ID Nouveau program, a compiler would
produce a dataflow graph, where nodes represent individual instructions and arcs show all
data dependencies. This graph is then used to generate light-weight processes, referred to
as "subcompact processes" (SPs). This is accomplished by partitioning the dataflow graph
into subgraphs, each of which is executed as a sequential process on a given processing
element (PE).
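To make the graph-to-process step concrete, here is a small sketch in Python. It models a dataflow graph for `(a + b) * (a - b)` as a dictionary, assigns each node to an SP by hand, and orders every SP's nodes so they can run sequentially. The node names, the hand-written `ASSIGNMENT`, and the helpers `build_sps` and `cross_sp_arcs` are all assumptions for illustration; the actual PODS partitioning algorithm is the subject of Chapter 2 and is not reproduced here.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Dataflow graph: each node (one instruction) maps to the nodes it depends on.
# The arcs are the data dependencies.
GRAPH = {
    "a": set(),
    "b": set(),
    "add": {"a", "b"},
    "sub": {"a", "b"},
    "mul": {"add", "sub"},
}

# Hypothetical, hand-chosen assignment of nodes to two subcompact processes.
ASSIGNMENT = {"a": 0, "b": 0, "add": 0, "sub": 1, "mul": 0}


def build_sps(graph, assignment):
    """Group nodes by SP, listing each SP's nodes in a dependency-respecting order."""
    order = list(TopologicalSorter(graph).static_order())
    sps = {}
    for node in order:
        sps.setdefault(assignment[node], []).append(node)
    return sps


def cross_sp_arcs(graph, assignment):
    """Return arcs (producer, consumer) whose endpoints lie in different SPs.

    These are the values that would have to be communicated between processes.
    """
    return sorted(
        (src, dst)
        for dst, sources in graph.items()
        for src in sources
        if assignment[src] != assignment[dst]
    )


if __name__ == "__main__":
    print(build_sps(GRAPH, ASSIGNMENT))
    print(cross_sp_arcs(GRAPH, ASSIGNMENT))
```

In this toy setting, the arcs that stay inside one SP cost nothing beyond sequential execution, while every arc returned by `cross_sp_arcs` stands for a message between processes. That trade-off is what any partitioning of the graph has to weigh.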
This dissertation describes the partitioning method used to form the SPs, the SP
distribution criteria, the logical implementation of PODS, the remote caching scheme used,
and the results of experiments with an event-driven, instruction-level simulator. The
dissertation is organized as follows:
• Chapter 1 Background - an overview of the pertinent basic concepts. This
includes discussions on parallel programming, distributed memory MIMD architectures,
the ID dataflow language, and the previous work on PODS. Knowledgeable readers may
skip any or all of this chapter.
• Chapter 2 PODS Partitioning and Distribution Model - a detailed discussion of
the inner workings of the partitioning of programs into SPs and their distribution.
• Chapter 3 PODS Logical Implementation - a discussion of the tasks necessary to
make the PODS system work. The array caching scheme is presented along with a
discussion of the special PODS instructions. This is followed up with a description of the
PE architecture and the necessary support software.
• Chapter 4 Simulations - a presentation of the experiments using Matrix Multiply
and SIMPLE. The simulation approach is discussed and the results are examined.
• Chapter 5 Conclusions - a discussion of the findings about PODS. Future
research and related work are also discussed.
1.1. Basic Issues in Parallel Processing
1.1.1. Parallel Programming
Parallel processing has been touted as the wave of the future for a number of years, yet its
use is not yet common. This is because parallel processing requires parallel programming.
For the average scientific programmer, highly intelligent but inexperienced, the task of
programming a parallel system can be daunting.
In [K&B88], Karp and Babb discuss the complications which arise when trying to
program in any one of twelve parallel FORTRAN dialects. They state that even trivial
examples frequently become a challenge. Programming parallel systems presents
complications not found in sequential programming. Often parallel programming
environments force the programmer to explicitly partition function and data according to the
constraints of the architecture, thus requiring the scientific programmer to become
knowledgeable about the particular computer architecture being used.
In debugging parallel programs, synchronization and timing are often the problem
[K&T88]. By requiring the programmer to explicitly state the communication and
synchronization points in a program, the system is opening itself to subtle timing errors.
The difficult thing about timing errors is their unpredictability. Often a timing error may
disappear based upon some seemingly unrelated fact (e.g. the load on the I/O network to
the host), and reappear at a later date.
In their 1989 report on supercomputers, the IEEE Scientific Supercomputer Subcommittee
cited the lack of software as the major problem in supercomputing today [IEEE89].
All of the above problems are addressed in PODS: the parallelization is implicit, not explicit;
the synchronization is handled automatically; and, due to the dataflow nature of PODS, the
special timing problems of parallel programs are non-existent.
1.1.2. Distributed Memory MIMD
Distributed-memory MIMD computers can be made massively parallel by adding PEs in a
modular fashion. This modularity allows dramatic increases in the theoretical maximum
speed. As an example, the latest supercomputer from Intel, called the Delta System, will
incorporate 528 i860 microprocessors and have a theoretical peak processing rate of 32
billion floating-point operations per second [Ins91]. The problem is exploiting all of this
parallelism.
1.2. Previous Research
1.2.1. Single Assignment Principle
The Single Assignment Principle simply states that no variable will be assigned a value
more than once. This would seem like a very limiting restriction, i.e. one may not even
write x = x + 1. However, researchers have found that a number of benefits can be derived
from using single assignment in combination with a functional language. A functional
language is one which is based on function application and is therefore free of side effects.
Some of the programming benefits of single assignment functional languages [Veg88] are:
• Programs can be written at a higher level. Time can be spent
concentrating on the algorithm rather than the program details.
• More algorithmic work can be expressed per line of code. This
is important because evidence suggests that the number of lines
of correct code per day is roughly a constant for a given
programmer, independent of the language used.
• Functional languages are free of side effects. This greatly
reduces unexpected modification of variables in other routines.
• Programs are easier to verify because proofs can be based upon
the concept of a function rather than some complex von
Neumann model.
• Functional programs can contain a great deal of implicit rather
than explicit parallelism. This is crucial to the PODS concept.
ID Nouveau is the single assignment functional language which PODS uses. Some of the
basic ID Nouveau principles are discussed in the next section.
PODS specifically uses the following abilities of single assignment functional languages: 
• Implicit Parallelism - the ability of a programmer to code a
parallel program without explicitly specifying the parallelism.
• Parallel Program Synchronization - single assignment
automatically synchronizes the data reads and writes of a
program, thus preventing innocuous timing bugs.
• Automatic Cache Coherency - single assignment allows remote
caching to avoid the cache coherency problem. Thus an efficient
implementation can be designed; see Section 3.3, Remote Array
Caching.
1.2.2. ID Nouveau Dataflow Language
ID (Irvine Dataflow) was born at the University of California, Irvine in a 1978 technical
report [AGP78]. This report laid the foundations for all further versions of ID. ID has
gone through many changes but still retains the basic dataflow ideas, the single assignment
concept, and the compiler approach outlined by Arvind. The latest version is being developed
at MIT and is called ID Nouveau. The ID Nouveau language environment, called
ID-World, is a complete parallel language simulator. There are over twenty sites using
ID-World and many more will be appearing as ID-World expands outside of the LISP
machine world and onto UNIX workstations.
The syntax of ID Nouveau and its functional nature lead to clean algorithms, which in turn
are easier to read and understand. Consider the quicksort code in Figure 1.2 below. Notice
that ID Nouveau allows standard list operations which are easy to understand.
def Quicksort A = 
{ 
Split L = 
{ startvalue = hd L; 
for v in L; 
if (v < startvalue) then cons Llist v; 
if (v == startvalue) then center = v; 
if (v > startvalue) then cons Rlist v; 
end for 
in 
Llist, center, Rlist 
} ; % Split 
in 
% Quicksort routine body 
if (length A < 2) 
then 
A 
else
{ 
L, Middle, R = Split(A)
in 
cons Quicksort(L) Middle Quicksort(R); 
} 
}; % Quicksort 
FIGURE 1.2. ID NOUVEAU QUICKSORT CODE. 
The split function is repeatedly called until each sublist has only one element in it. Then the
sublists are concatenated in order. This is a very clean and clear program for quicksort.
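For readers unfamiliar with ID Nouveau, the structure of Figure 1.2 can be mirrored in Python. This is only an illustrative sketch, not part of PODS: the names `split` and `quicksort` are chosen here, and the center is kept as a list so that repeated pivot values are preserved, which is an assumption that goes beyond the figure.

```python
def split(values):
    """Partition values around the first element, as Split does in Figure 1.2."""
    pivot = values[0]
    left = [v for v in values if v < pivot]
    center = [v for v in values if v == pivot]  # assumption: keep duplicates of the pivot
    right = [v for v in values if v > pivot]
    return left, center, right


def quicksort(values):
    """Sort by recursive splitting; no list is ever modified after it is built."""
    if len(values) < 2:
        return list(values)
    left, center, right = split(values)
    return quicksort(left) + center + quicksort(right)
```

Each name is bound exactly once per call, so the sketch stays within the single assignment style, and the two recursive calls share no data and could in principle be evaluated in parallel.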
Single Assignment Approach
The central issue for PODS in ID is its single assignment nature. All dataflow languages 
begin with single assignment, yet many diverge as further developments are made. ID has 
tried to stay true to its original single assignment concept: 
... a dataflow operation is purely functional and produces no side-effects
as a result of its execution.
This is the essence of single assignment; however, the issue of array handling is in conflict.
To provide arrays this constraint has to be relaxed. ID Nouveau arrays (called I-structures)
produce a side-effect, but are not allowed to be updated, to ensure determinacy. Yet, with
no update how useful is an array? The answer to this question is still being researched.
Arvind, Nikhil, and Pingali feel that they are very useful and that this is the best approach
[ANP87a]. They believe that an update operator is inadequate and over-specifies
algorithms in such a way that unnecessary copying of intermediate data structures and
substantial unnecessary sequentialization occur. They also feel that automatic detection is
not tractable in general, contrary to other researchers' beliefs [A&K87, A&N87, P&W86].
Iteration 
Iteration is a major source of parallelism. How a language handles iteration is going to
affect the ability of the programmer and compiler to exploit the parallelism in the loops. In
ID Nouveau the evaluation of loops and conditionals is not eager. This is the same as
VAL/SISAL for expressions [A&O85]. This forces the predicate to be fired before either
of the two branches of a conditional is fired.
As an example of iteration, consider the program below, taken from [Tra86]. It fills each
element of its argument array with a value and returns the sum of all the elements. The
loop body contains ordinary bindings (like the variable val), I-structure stores (for A[i]),
and some newified variable bindings. These newified variable bindings describe how to
compute the values the newified variables take on the next iteration of the loop, e.g. the
variable i is incremented each time through the loop. These newified variables must have
an initial binding outside the loop, otherwise they would have no value for the first iteration.
Newified variables do not make sense outside of loops and are not allowed there.
def fill_it A =
let
i = lower_bound A;
sum = 0;
in
while i <= upper_bound A;
val = (upper_bound A - lower_bound A) ^ 2 - i*i;
A[i] = val;
new sum = sum + val;
new i = i + 1;
return sum
FIGURE 1.3. ID NOUVEAU ITERATION EXAMPLE. 
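A rough Python rendering of the loop in Figure 1.3 may help show what the newified bindings do. It is a sketch under assumptions: the array is modeled as a dictionary, the bounds are passed explicitly, and the garbled operator in the figure is read as exponentiation. The simultaneous assignment stands in for `new sum` and `new i`, which both take effect on the next iteration.

```python
def fill_it(a, lower, upper):
    """Fill a[lower..upper] and return the sum of the stored values."""
    total = 0   # initial binding for the newified variable sum
    i = lower   # initial binding for the newified variable i
    while i <= upper:
        val = (upper - lower) ** 2 - i * i  # assumption: '^' in the figure means power
        a[i] = val                          # each element is written exactly once
        # 'new sum' and 'new i': values seen by the next iteration
        total, i = total + val, i + 1
    return total
```

For example, with bounds 1 and 3 the stored values are 3, 0 and -5, and the returned sum is -2.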
I-Structures
The basic array structure mechanism in ID Nouveau is the I-structure [ANP87a]. An 
I-structure is an incremental structure which obeys the single assignment rule. An 
I-structure is available as soon as it is allocated and the array elements are individually 
accessible. Consider the wavefront example below: 
{A = matrix ((1, m), (1, n));
{for i from 1 to m do
A[i, 1] = 1}
{for j from 2 to n do
A[1, j] = 1}
{for i from 2 to m do
{for j from 2 to n do
A[i, j] = A[i-1, j] + A[i-1, j-1] + A[i, j-1] } }
in
A}
FIGURE 1.4. ID NOUVEAU I-STRUCTURE EXAMPLE. 
Here a matrix has its upper and left borders filled with 1's, while its interior is filled with
the sum of the upper, left, and diagonal elements. The matrix A will be returned as the
value of the entire expression as soon as it is allocated. Meanwhile, all the loop bodies are
initiated in parallel, but some will be delayed until the loop bodies to the left and top
(in Cartesian coordinates) complete. Thus a "wavefront" of processes fills the matrix in
parallel.
To achieve this flexibility I-structures use a presence bit. Each cell of an I-structure has a
logical bit attached to it to determine if the cell's value is present. If a read occurs before
the cell is written, the read is enqueued by the I-structure. When a write occurs, all
pending reads are dequeued and processed. If a write occurs to a cell which has already
been written, then a run-time error occurs. This is an efficient way to enforce single
assignment.
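The presence-bit protocol just described can be sketched in a few lines of Python. The class name `IStructure` and the callback-style `read` are illustrative assumptions, not the PODS or ID Nouveau interface; a deferred read is modeled as a function that is called once the value arrives.

```python
class IStructure:
    """Write-once array with a presence bit and a deferred-read queue per cell."""

    def __init__(self, size):
        self._present = [False] * size             # presence bits
        self._values = [None] * size
        self._deferred = [[] for _ in range(size)]  # enqueued early reads

    def read(self, index, consumer):
        """Deliver the value now if present, otherwise enqueue the read."""
        if self._present[index]:
            consumer(self._values[index])
        else:
            self._deferred[index].append(consumer)

    def write(self, index, value):
        """Store the value once and release every pending read of the cell."""
        if self._present[index]:
            raise RuntimeError(f"cell {index} written twice")
        self._values[index] = value
        self._present[index] = True
        pending, self._deferred[index] = self._deferred[index], []
        for consumer in pending:
            consumer(value)
```

A read issued before the write simply waits in the queue, so a consumer never observes an empty cell, and a second write to the same cell is rejected at run time.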
I-structures do have a referential transparency problem. Referential transparency demands
that the values returned by two calls to the same constructor function with the same
arguments must never be distinguishable. Thus, in a functional language, one can never
alter a data structure once it has been created, and consequently one must specify the
contents of all elements of the structure at creation time (as in VAL/SISAL [A&O85] and
LUCID [W&A85]). Since ID Nouveau includes I-structures, and I-structures do not
specify the contents of all elements at creation, ID Nouveau is not a completely functional
language. Yet it is still single assignment and declarative.
Referential transparency can be given up but determinacy cannot. If a language possesses the
Church-Rosser property [Lan65], also called the confluence property, then overall program
determinacy is guaranteed even if the machine exhibits non-determinacy in instruction
scheduling. The Church-Rosser property requires that the answer computed by an
expression be unaffected by the choice of which subexpressions are evaluated first. Since
I-structures enqueue all early reads until the cell is written to, and each cell is single
assignment, I-structures have the Church-Rosser property. No matter how one interleaves
the execution of reads and writes, every fetch to a given I-structure element always returns
the same value.
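The determinacy argument can be illustrated with the wavefront matrix of Figure 1.4. The following Python sketch is a toy model, not the PODS scheduler: loop bodies are attempted in a random order chosen by a seed (a hypothetical stand-in for non-deterministic instruction scheduling), and a body that finds an input absent is simply retried later, playing the role of a deferred read.

```python
import random


def wavefront(m, n, seed):
    """Fill the Figure 1.4 matrix, trying interior cells in a random order."""
    rng = random.Random(seed)
    cells = {}  # a key is present exactly when the cell has been written
    for i in range(1, m + 1):
        cells[(i, 1)] = 1
    for j in range(2, n + 1):
        cells[(1, j)] = 1
    pending = [(i, j) for i in range(2, m + 1) for j in range(2, n + 1)]
    while pending:
        rng.shuffle(pending)  # arbitrary scheduling order
        still_blocked = []
        for (i, j) in pending:
            inputs = [(i - 1, j), (i - 1, j - 1), (i, j - 1)]
            if all(k in cells for k in inputs):
                assert (i, j) not in cells  # single assignment: one write per cell
                cells[(i, j)] = sum(cells[k] for k in inputs)
            else:
                still_blocked.append((i, j))  # read deferred until inputs exist
        pending = still_blocked
    return cells
```

Whatever seed is used, the resulting matrix is the same, because a cell is only computed after its three inputs have been written and no cell is ever rewritten.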
Discussion
ID Nouveau is a highly developed language system with many sites using its development
environment (ID-World). The ID Nouveau language reference manual [Nik87a] describes
a complete environment with a compiler, a context-sensitive editor, and simulators with
parallelism detectors.
In [A&E88] a convincing argument is made for single assignment programming of
scientific programs. In this technical report the SIMPLE hydrodynamics and heat
conduction problem is detailed, and an efficient ID Nouveau program is designed. This
design is then contrasted with a parallel version of the program in annotated FORTRAN
where each program does the same number of arithmetic, load, and store operations.
1.2.3. Hybrid Dataflow 
Since Dennis described the first dataflow execution model [Den75], many architecture
designers have attempted to apply the model to real systems. Dataflow is attractive because
all parallelism in a program is exposed for potential concurrent execution. In spite of the
elegance of the model, dataflow is not widely used after more than twenty years of
research. The focus has instead turned to the evolution of modern systems by extending
them with dataflow techniques. The results of research in this area include hybrid systems
using large-grain or macro dataflow [Bab84, B&E87, DFL89, Ian88, Kap86, L&G86,
S&H87].
Iannucci [Ian88] has reported on a hybrid dataflow / von Neumann architecture. This
approach is similar to PODS in its use of ID Nouveau as the input language and
split-phased structure access. However, the Iannucci approach uses a finer-grain scheduling
approach, called scheduling quanta (SQ). An SQ of two to three instructions is desirable
for Iannucci's approach, and each iteration of a loop is a new SQ. In PODS, however, the
natural decomposition of the program is used and SPs are allowed to run in place, thus
reducing overhead. Another difference is in data structure distribution. There is no
mechanism for spreading iterations of a single loop across processors in Iannucci's
approach. Combining data structure distribution with loop distribution is a central goal in
PODS. Finally, Iannucci's model requires a special-purpose architecture capable of fast
context switching among very small SQs. PODS tries to generate SPs large enough to
produce good computation-communication ratios on available distributed memory
multiprocessors. Certainly PODS would benefit from a tailored architecture, but the model
itself is not restricted to such.
In [G&H89], Goldberg and Hudak presented Alfalfa, a system similar at a high level to
PODS. They have implemented the ALFL functional programming language and run-time
system on an Intel iPSC hypercube using what they call serial combinators. Serial
combinators are similar to PODS SPs in that they are sequential threads that execute on a
von Neumann processor. The run-time system handles thread creation and distribution.
The main focus of their work is the study of the effects of dynamic scheduling (diffusion
scheduling) of parallel threads of execution. They show that diffusion scheduling works
well in many cases; however, they have not addressed the problem of distributing large
data structures such as arrays. This is illustrated by the relatively poor performance
achieved with the Matrix Multiply algorithm.
1.3. Overview of PODS Execution Model
The primary objective of PODS is to achieve an efficient execution model for dataflow 
programs by reducing the overhead associated with scheduling each instruction individually 
[Bic90]. The greatest deficiency of the pure dataflow model is the excessive 
communication and token matching overhead associated with passing data from one 
operation to another. These operations may lie on the same or different processors, thus 
potentially forcing token traffic over the processor interconnection network. 
Originally it was thought the normal communication overhead could be reduced by
grouping the instructions into threads. This was based on the observation that many
threads of instructions in the dataflow graph must be executed sequentially due to inherent
data dependencies. Grouping instructions in this manner is similar to Babb's Large Grain
Data Flow (LGDF) [Bab84]. However, it was found that this produced SPs which were too
small for the communication to computation ratio of typical distributed-memory machines.
1.3.1. Subcompact Processes (SP)
In order to overcome the small SP problem, a different approach was tried and found to be
sufficient. This approach uses the code-blocks inherent in the program. Each code-block
is a different SP, which will then be distributed by the Partitioner as necessary. This is
how PODS exploits the iteration level parallelism in a program.
The code fragment below in Figure 1.5 shows a simple nested loop. For this loop there are
three different program scopes which turn into SPs. The first takes care of initial actions,
mainly array allocation. The second handles the L level of the loop, and the third handles
the K level and the actual computations.
(initial A := < >; Y := < >; ZX := < > 
for L from 1 to LOOP do
new A := (initial X := < > 
for K from 1 to 1000 do 
new X[K] := Q + Y[K]*(R* ZX[K+lO]+T*ZX[K+ll]) 
return X) 
return A[l]) 
FIGURE 1.5. SUBCOMPACT PROCESS EXAMPLE CODE. 
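To make the three scopes concrete, the fragment in Figure 1.5 can be paraphrased in Python with one function per scope. This is only a sketch under assumptions: the function names, the 0-based indexing, and the sample values for Q, R, T, Y and ZX are invented here for illustration, since the figure does not give them.

```python
def sp3_k_loop(q, r, t, y, zx, n):
    """Innermost scope: build one fresh X from Y and ZX (the K loop)."""
    return [q + y[k] * (r * zx[k + 10] + t * zx[k + 11]) for k in range(n)]


def sp2_l_loop(loop, q, r, t, y, zx, n):
    """Middle scope: every L iteration binds A to a newly computed X."""
    a = None
    for _ in range(loop):
        a = sp3_k_loop(q, r, t, y, zx, n)
    return a


def sp1_initial(loop, n):
    """Outer scope: allocate the arrays, start the L loop, return A's first element."""
    y = [1.0] * n            # hypothetical input data
    zx = [2.0] * (n + 11)    # K+11 is the largest index used by the K loop
    a = sp2_l_loop(loop, q=0.5, r=2.0, t=3.0, y=y, zx=zx, n=n)
    return a[0]
```

The iterations of the K loop read only Y and ZX and write distinct elements of X, so there is no dependency between them; this is the kind of loop whose iterations can be spread across PEs.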
Figure 1.6 shows the code fragment as a dataflow diagram. The SPs are outlined in bold
lines. Notice that the SPs are grouped so that each one will be as independent from the
others as possible. This is where the parallelism is. SP1 allocates the arrays and then
passes that information on to SP2. There may be multiple versions of SP2 running (if it is
distributed), each executing only part of the L-loop. Each SP2 will then spawn SP3,
which will run in place (SP3 would never be distributed if SP2 were). In Chapter 2 the
algorithm for distributing SPs is discussed in detail.
FIGURE 1.6. PODS SUBCOMPACT PROCESSES EXAMPLE.
1.3.2. State Transitions
Once the static SPs are formed they will need to be scheduled for execution. Instead of
scheduling individual operators of a dataflow graph for execution, the level of granularity is
changed to that of an SP. An SP is passive as long as its first operator is disabled (i.e., it
is still missing some operands). A passive SP resides in program memory. When all
operands for the first operator have arrived, the SP becomes active. This is accomplished
by loading the SP into execution memory and creating a simple process control block
(PCB) for it. The PCB contains the following information:
• the starting address of the SP in execution memory 
• a program counter pointing to the current instruction 
• a status field indicating whether the process is running, ready, or blocked
The three states are defined as follows. An SP is said to be running when a PE is currently 
fetching and executing instructions from that sequence. An SP is ready when its current 
instruction is enabled (has all its operands), but the PE is not available to execute that SP. 
Finally, an SP is blocked when its current instruction is not enabled. 
[Diagram: states ready, running, and blocked; the transition from blocked to ready is labeled "current instruction gets last operand".]
FIGURE 1.7. PROCESS STATE TRANSITION DIAGRAM.
The possible state transitions are illustrated in Figure 1.7. Initially, an SP is loaded into
execution memory in the ready state. Whenever the PE becomes free, it begins executing
one of the ready SPs in its execution memory; at that time, the status of the selected SP
changes from ready to running. The PE continues executing the SP until it reaches the end
of the SP (at which time it is destroyed) or until it encounters an operator that does not yet
have all its operands present. In the latter case, the SP is blocked and the PE switches to
another ready SP. The blocked SP changes its status to ready as soon as the last operand
for the current instruction arrives.
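The ready, running, and blocked transitions can be sketched as a small Python model. The classes `SP` and `PE`, the `needs` list of missing operand counts, and the extra "done" status (standing in for a destroyed SP) are assumptions made for this illustration; the real PCB also records the starting address, which is omitted here.

```python
from collections import deque


class SP:
    """A sequential process; needs[k] counts operands instruction k still lacks."""

    def __init__(self, name, needs):
        self.name = name
        self.needs = list(needs)
        self.pc = 0              # program counter
        self.status = "passive"  # not yet loaded into execution memory


class PE:
    """A processing element that runs ready SPs until they block or finish."""

    def __init__(self):
        self.ready = deque()
        self.trace = []          # (sp name, instruction index) in execution order

    def load(self, sp):
        """Activate an SP whose first instruction is enabled."""
        sp.status = "ready"
        self.ready.append(sp)

    def deliver(self, sp, k):
        """An operand arrives for instruction k of sp."""
        sp.needs[k] -= 1
        if sp.status == "blocked" and k == sp.pc and sp.needs[k] == 0:
            sp.status = "ready"  # last operand for the current instruction
            self.ready.append(sp)

    def run(self):
        """Execute ready SPs; switch to another SP whenever one blocks."""
        while self.ready:
            sp = self.ready.popleft()
            sp.status = "running"
            while sp.pc < len(sp.needs) and sp.needs[sp.pc] == 0:
                self.trace.append((sp.name, sp.pc))
                sp.pc += 1
            sp.status = "done" if sp.pc == len(sp.needs) else "blocked"
```

Note that a context switch in this model only changes which SP object the PE is looking at; nothing else needs to be saved, because the program counter lives in the SP's own control block.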
This process-oriented viewpoint permits us to execute a dataflow program as a collection of
communicating SPs. A given dataflow program is transformed into one or more SPs,
which are mapped onto the available PEs. Each SP continues executing as long as it has all
the operands necessary to perform its current operation. When an operation produces a
result token destined for a subsequent operation within the same SP, it is passed directly to
the destination operand slot using a simple memory operation. Only when the token is
destined for a different SP must it travel through the dataflow routing network (within the
same PE or to another PE) and pass through the matching store. It is important to note that
the amount of resources needed for a particular SP is known at load time. With this
information the amount of parallelism can be reduced if necessary.
1.3.3. Distributed Memory Approach
In PODS, the memory is distributed as shown in Figure 1.8 below. The physical
separation between the PEs is recognized and exploited. Remote memory requests are
performed in a split-phase manner. This allows the CPU to continue processing during the
long remote memory latency. Local memory requests are handled instantly and do not
cause the CPU to context switch. This is one reason PODS is able to exploit the power of
massively parallel distributed-memory machines.
[Diagram: local accesses complete in constant time with no context switch; only remote memory accesses are split-phase.]
FIGURE 1.8. PODS MEMORY ACCESSING SCHEME.
1.3.4. Discussion
This model of execution has a number of advantages. Since it uses a program counter,
loops can be run in place efficiently. If necessary, due to dependencies, PODS can drop
into completely sequential execution. When a process blocks, the execution unit
performs a simple context switch (no register storage is necessary) and takes the next ready
SP off the ready list. Finally, array accesses are split-phased to allow the long memory latency
to be tolerated.
In summary, PODS uses a combination of dataflow and von Neumann models of
computation. It uses single assignment to reduce side-effects, which aids parallelism. The
declarative nature of ID, and its implicit programming of parallelism, allows the
programmer to ignore the architecture, which increases programmer productivity. For a
more detailed description of the execution model, the reader is referred to [Bic87, Bic90].
1.4. Contributions of this Research
This research has made contributions on many levels. It extends the existing models (the
PODS Execution Model and ID Instruction Set). It presents new principles and algorithms
(for partitioning and distribution). It exploits the abilities of old concepts in new ways
(Remote Array Caching). It explains how all of these can work together in a logical
manner (Logical Architecture). And it shows that this approach is efficient and scalable
(the simulations).
1.4.1. Execution Model Extensions
The PODS Execution Model was extended to allow iteration level parallelism. The
previous model, based on the concept of sequential threads, produced SPs which were too
small. The extension to iteration level parallelism allows larger SPs which are more easily
distributed.
1.4.2. Partitioning and Distribution Model
The new PODS Partitioning and Distribution Model is based upon two existing and three
new principles of parallel execution. The existing principles (the Equal Distribution
Principle and the Centralization Principle) are well known and are continually pushing in
opposite directions. The new principles (the Grouping Principle, the Virtual Sources
Principle, and the Collector Writes Principle) explain ways in which the two existing
principles can be managed.
From these five principles, two partitioning and distribution algorithms were derived. The
first shows how data should be partitioned and distributed to balance work load and speed
up accesses. The second describes how code should be partitioned and distributed to
balance parallel execution with communication costs.
Three primary and two secondary mechanisms were devised to make these algorithms
work. The first primary mechanism is a distributed array allocate operator which
distributes data. The second is a distributed L operator, which spawns processes across the PEs
to distribute code. The third is an index range filter for restricting the indices for different
PEs. These form the basis for PODS distributed processing. The secondary mechanisms
are an APPLY operator for functional distribution and remote array caching for efficient
array accesses. Together these provide an efficient means of applying the new partitioning
and distribution algorithms.
1.4.3. Remote Array Caching
Remote Array Caching is a new approach similar to the concept of virtual memory and
based upon the Virtual Sources Principle. This allows arrays to be accessed as if they
were local to every PE. The locality-of-reference of computer programs is heavily
exploited in Remote Array Caching.
1.4.4. Logical Architecture
A description of how all of these new concepts and approaches are implemented is
contained in the Logical Architecture. The functional units in a PODS PE are: the
Execution Unit, the Matching Store, the Routing Unit, the Array Manager, and the Memory
Manager. Each of these is designed to run in parallel with the others.
Extensions to the ID instruction set were necessary to allow PODS to execute on a von
Neumann CPU. Some of these extensions involve the addition of a program counter to
each instruction's semantics. Others involve extensive modifications of existing
instructions (e.g. the L operator), and finally others involve totally new instructions to
support the PODS Range Filters (e.g. INTERVAL_COUNT).
1.4.5. Simulations
The PODS Translator, Partitioner, and Simulator were designed and written to test PODS
concepts. The simulations were necessary to test the logical architecture for correctness
and efficiency. These simulations have shown PODS to be an efficient and viable
approach.
CHAPTER 2
PODS Partitioning and Distribution Model 
The performance of PODS comes from its ability to map the inherent granularity of a
program onto a given architecture. The inherent granularity of a program comes from its
block structure. The larger (smaller) the loops and procedures, the larger (smaller) the
granularity. This granularity controls the size of the PODS SPs. The partitioning and
distribution model allows the hybrid nature of PODS to be exploited: sequential code is run
on an efficient von Neumann processor, and parallel code is distributed such that
communication costs are not prohibitively high. This is not to say that all programs will
run well on PODS; bad code can be written for any computer system. The aim of this
model is to handle the large majority of code which will be executed on distributed memory
MIMD machines and to flag code which is poorly written.
The key elements of PODS partitioning and distribution are:
1. array partitioning, which uses a simple page grouping scheme to
allow equal load across the PEs;
2. array distribution, which follows the partitioning such that each
PE produces only those elements for which it is responsible;
3. loop distribution, which considers data dependencies when
distributing;
4. functional distribution, which attempts to off-load functions if
the calling PE is overloaded.
Chapter 2 is organized as follows: (1) a quick overview of the model; (2) presentation and
discussion of the underlying principles; (3) a detailed discussion of PODS instructions and
processes; (4) a discussion of array partition and distribution; (5) an in-depth examination
of process distribution; (6) a discussion of functional distribution; and finally (7) a
discussion of deadlock handling.
2.1. Overview
In order to exploit a program's parallelism, the program must be partitioned, an activity that
has been the subject of much research. Because optimal partitioning is NP-complete, these
partitioning techniques strive for near-optimality, usually through the use of heuristics or
programmer-supplied directives. PODS performs partitioning automatically using the
decomposition implied by the program structure. Programs are broken into code-blocks
by the ID Nouveau compiler and replicated on each PE, making all processors homogeneous
with respect to code. The key problem with partitioning and distribution in PODS is
that of determining where to send tokens that activate SPs. Since the PEs are
homogeneous, an instance of a specific SP can be executed anywhere simply by routing the
initial activating tokens to a specific PE. Because each PE is aware only of its own state,
this routing decision is binary: should an SP execute locally or remotely? PODS decides
which SPs will be distributed and which will run locally at compile time. At run-time
PODS decides where the distributed SPs will be executed. The exact methods for this
distribution are explained in this chapter.
Simply put, the PODS partitioning and distribution uses data distribution to control 
execution distribution. There are two basic conceptual steps to achieve this. 
1. Using a simple global algorithm, partition the data and allocate 
each partition to a PE. 
2. Execute the program such that the owner of a particular array
element will write that element.
By using a simple global algorithm for array partitioning, each PE can easily calculate
where a particular array element is located during execution. This additional checking costs
29% more cycles for each array read or write, but allows arrays to be accessed in parallel
with little or no communication and without context switching.
In order to realize the above, the following tasks are performed:
1. Arrays are cut up into pages of fixed size X, where X is
determined by the hardware architecture.
2. Arrays are grouped into superpages which are assigned to PEs
sequentially.
3. Execution follows the array partitioning and distribution if it is
executing loop code which has no Loop-Carried Dependencies
(LCDs).
4. For code with LCDs, the execution will stay on the current PE
unless a function call is made.
5. When a function call is made the execution may move to another
PE depending upon the length of the current PE's task list.
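The first two tasks amount to a closed-form mapping from an array index to its owning PE, which is why every PE can locate an element without communication. The Python sketch below shows one such mapping under stated assumptions: indices are 0-based, a superpage is a run of consecutive pages, and superpage p is assigned to PE p. The function name `owner_of` and this particular grouping rule are hypothetical; the actual PODS rule is given later in the chapter.

```python
import math


def owner_of(index, array_len, page_size, num_pes):
    """Return the PE that owns array element `index` (0-based)."""
    num_pages = math.ceil(array_len / page_size)
    # Assumption: pages are grouped evenly, one superpage per PE, in PE order.
    pages_per_superpage = math.ceil(num_pages / num_pes)
    page = index // page_size
    return page // pages_per_superpage
```

With 100 elements, pages of 10 elements, and 4 PEs, each superpage holds 3 pages, so elements 0 to 29 belong to PE 0, elements 30 to 59 to PE 1, and so on, with the last PE receiving the short remainder.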
There are three primary mechanisms for achieving data parallelism. These mechanisms are:
1. The ALLOCATE Operator: used to distribute data (data
parallelism).
2. The DIST-L Operator: used to spawn processes on all PEs.
3. The RANGE-FILTER Operator: used to restrict loop index
ranges for different PEs.
The basic approach to distribute code for data parallelism is to:
1. distribute the arrays;
2. decide which level of the nested loop to distribute;
3. give this level the RANGE-FILTER operator and its parent the
DIST-L operators.
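The effect of a range filter can be sketched in Python: every PE runs the same loop, but each keeps only the part of the index range whose elements it owns. The sketch assumes 0-based indices and a contiguous block of pages per PE, assigned in PE order; the name `range_filter` and this ownership rule are illustrative assumptions rather than the definition of the PODS RANGE-FILTER operator.

```python
import math


def range_filter(lo, hi, pe, array_len, page_size, num_pes):
    """Return the part of the inclusive range lo..hi whose elements `pe` owns."""
    num_pages = math.ceil(array_len / page_size)
    # Assumption: each PE owns one contiguous block of whole pages.
    span = math.ceil(num_pages / num_pes) * page_size  # elements per PE
    first = max(lo, pe * span)
    last = min(hi, (pe + 1) * span - 1, array_len - 1)
    return range(first, last + 1)  # empty when this PE owns nothing in lo..hi
```

For a loop over indices 5 to 94 of a 100-element array with pages of 10 elements and 4 PEs, the four PEs receive 5..29, 30..59, 60..89 and 90..94; together they cover the original range exactly once, and each PE writes only elements it owns.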
The mechanism for functional parallelism:
1. The APPLY Operator: used to spawn function calls on a single
remote PE (functional parallelism).
In this way the work load is partitioned at compile time and distributed using an efficient
run-time algorithm without the programmer's explicit instructions.
2.2. Underlying Principles
There are two basic principles which apply to any parallel system. They are:
1. The Equal Distribution Principle
2. The Centralization Principle
These two are supplemented by three PODS-specific principles. These principles show
ways in which the two basic principles can be reconciled somewhat. The PODS-specific
principles are:
1. The Grouping Principle
2. The Virtual Sources Principle
3. The Collector Writes Principle
By using each of these principles, PODS is able to provide efficient execution of scientific
programs on MIMD machines. Each principle is explained below.
2. 2. l. Basic Principies 
For any assignment to be accomplished, the RHS calculations must be performed and the 
writing of the element must occur. Consider the simple assignment below: 
A[i] = sqrt(B[i+1] + C[i]) * exp(D[m+i])
FIGURE 2.1. SIMPLE ARRAY ASSIGNMENT.
In this statement B[i+1], C[i], and D[m+i] are data sources which need to be collected
together so that the calculations can be performed. Once they are performed the assignment
can occur. The diagram below illustrates how these three agents interact. Note that
each data source, the data collector, and the data storage could be on different PEs.
Data Sources Data Collector Data Storage 
FIGURE 2.2. EQUAL DISTRIBUTION PRINCIPLE. 
In order for the data sources to respond to multiple data collectors simultaneously they
should be spread over all the available PEs. Since the access patterns are not known a
priori, each PE should get an equal number of data sources. This is the Equal Distribution
Principle. More concisely,
Definition: Equal Distribution Principle
In order to allow maximum parallel access, data sources, data collectors,
and data storage should be distributed equally among the available PEs.
This principle is implemented in PODS by partitioning each array and distributing the
pieces equally among the PEs.
The Centralization Principle concerns the cost of communication and the overloading of the
interconnect network. Once the agents are widely distributed a problem occurs: the
communication costs become extremely high. In order to reduce the effects of
communication delays, all of the items (data sources, data collectors, and data storage)
should be kept together (i.e. centralized). This is the Centralization Principle which states:
Definition: Centralization Principle
In order to reduce communication costs and network overloading, data
sources, data collectors, and data storage should be centralized on one PE.
These two principles are obviously in conflict. The PODS specific principles below show 
how the balance can be tilted in favor of distribution. 
2.2.2. PODS Specific Principles
Grouping Principle
In order to reduce the effects of communication delays without completely centralizing, the
data sources should be grouped together until some size, x, is reached. The diagram below
shows how the number of communication lines is reduced by grouping.
Grouped Data Sources Data Collector Data Storage
FIGURE 2.3. GROUPING PRINCIPLE.
This is the Grouping Principle which states the following.
Definition: Grouping Principle
In order to reduce communication over the network, data sources should be
grouped together until some reasonable size is reached.
This principle fights against the Equal Distribution Principle, so a balance between them must
be maintained. In PODS this is achieved by grouping the arrays into pages of a fixed size
which is only dependent on the hardware architecture.
Virtual Sources Principle
One aspect of single assignment is that data sources never need to be updated. This can be 
exploited by moving copies of the data sources into the collector for easy access. Locality 
of reference implies that the grouped data sources should be moved in toto when one of the 
data sources is needed. The diagram below shows how the amount of communication can
be reduced by caching the data source in the collector without any cache coherency
problems; the dashed lines are truly one way.
Grouped Data Sources Data Collector with Virtual Sources Data Storage
FIGURE 2.4. VIRTUAL SOURCES PRINCIPLE.
This is the Virtual Sources Principle which states the following.
Definition: Virtual Sources Principle
Since each data source will never need to be updated, a copy should be
moved into the data collector when any one of the grouped data sources is
needed. The Virtual Sources Principle states that a single assignment
system should cache data sources in its local memory to form a virtual
source to reduce communication.
This principle allows remote reads to be reduced in PODS, and is implemented by remote
access caching.
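The idea of remote access caching can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not taken from PODS; the sketch also ignores presence bits and assumes every element of a fetched page has already been written. It only shows that a whole page is moved on the first miss and that, under single assignment, the cached copy never needs to be invalidated.

```python
class RemoteArray:
    """Stands in for array pages owned by other PEs (hypothetical helper)."""

    def __init__(self, data, page_size):
        self.data = list(data)
        self.page_size = page_size
        self.remote_fetches = 0  # counts simulated network requests

    def fetch_page(self, page_no):
        # One network request returns a whole page of elements.
        self.remote_fetches += 1
        start = page_no * self.page_size
        return self.data[start:start + self.page_size]


class VirtualSourceCache:
    """Per-PE cache of remote pages; never invalidated under single assignment."""

    def __init__(self, remote):
        self.remote = remote
        self.pages = {}

    def read(self, index):
        page_no, offset = divmod(index, self.remote.page_size)
        if page_no not in self.pages:
            # Miss: move the whole group of data sources in one transfer.
            self.pages[page_no] = self.remote.fetch_page(page_no)
        return self.pages[page_no][offset]


remote = RemoteArray(range(100), page_size=32)
cache = VirtualSourceCache(remote)
values = [cache.read(i) for i in range(40)]  # touches pages 0 and 1 only
```

In this sketch forty reads cost two page transfers; every later read of those pages is local.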
Collector Writes Principle
In a single assignment system there will be only one write to a particular array element.
The thick black arrow in the diagrams above represents this write. Since there is only one 
collector and one write, these two should be on the same PE. The diagram below shows 
this. 
Grouped Data Sources Data Collector with Virtual Sources and Storage
FIGURE 2.5. COLLECTOR WRITES PRINCIPLE. 
The producer of an array element is the PE which collects the RHS calculations needed for
the formation of a LHS value. This PE, the collector, becomes the writer by executing the
WRITE_ARRAY instruction which assigns that array element a value. Since the
single-assignment principle is in force, there will be one writer. This is the Collector
Writes Principle which states the following.
Definition: Collector Writes Principle
The Collector Writes Principle states that the system should map an array
element such that the PE which holds that array element in its local memory
(the owner) shall be the collector of the RHS data sources, and shall also be
the writer of that array element.
This principle, in collaboration with the other principles, forces the execution to follow the 
data distribution. In PODS this is called Data Distributed Execution. 
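A small sketch may help to show what "execution follows the data distribution" means for a loop such as A[i] = f(...). It assumes, purely for illustration, that the array is split into equal contiguous blocks; the function names are hypothetical and the real PODS mapping uses pages and superpages as described in Section 2.4.

```python
def owner_of(index, elements_per_pe):
    """PE that holds element `index` under equal contiguous blocks (assumed layout)."""
    return index // elements_per_pe


def iterations_for(pe, n, elements_per_pe):
    """Loop indices i of `A[i] = ...` that PE `pe` executes as collector and writer."""
    return [i for i in range(n) if owner_of(i, elements_per_pe) == pe]


# 12 elements over 3 PEs, 4 elements each.
schedule = {pe: iterations_for(pe, 12, 4) for pe in range(3)}
```

Each PE ends up executing exactly those iterations whose left-hand-side element it owns, so every write is local.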
2.3. PODS Instructions and Processes
The basic concept of a dataflow operator has not changed, only the implementation of that
concept. In PODS dataflow operators are implemented using PODS instructions. The
basic dataflow concept (shown below) allows the dataflow graph to execute cleanly,
without leaving tokens unconsumed.
input token (data, tag) -> OPERATOR -> output tokens (data, tag)
FIGURE 2.6. BASIC DATAFLOW OPERATOR. 
The standard dataflow implementation of this concept performs the following steps when a
token arrives:
1. consume input tokens
2. compute new data value
3. compute new tag
4. form new output tokens
5. send output tokens to destination operators
For PODS this implementation needs to be modified to contain the concept of an SP's state.
An SP's state is basically a PODS activity name, which is discussed next in Section 2.3.1.
2.3.1. Activity Names
An activity name is the colored tag which identifies a token's complete context. What is
presented below is a logical implementation; a physical implementation would use unique
frame IDs. Logically, activity names consist of two parts: (1) the static part which is
known at compile time; and (2) the dynamic part which is built as the token moves from
context to context. Figure 2.7 below shows the make-up of an activity name.
Activity Name
Dynamic Part: context | iteration    Static Part: sp | instruction | port
FIGURE 2.7. ACTIVITY NAME COMPONENTS. 
The static part is known by the compiler from the dataflow graph once the SPs are built.
The dynamic part is based upon the incoming token's activity name and is only affected by
the context manipulating functions: D and D_INVERSE, L and L_INVERSE, A and
A_INVERSE. The activity name is also known as the tag. The individual subparts are listed
below, along with their function.
• context: holds the pointer to past activity names, affected by L
and L_INVERSE, A and A_INVERSE. The context holds a token's
tag in a linked list. This list represents all of the execution
scopes through which a given token has passed. This
information is necessary for PODS to know how to move a
token from one execution scope (i.e. SP) to another.
• iteration: holds the current iteration number, affected by D and
D_INVERSE.
• sp: holds the SP number, based on the partitioned dataflow graph.
• instruction: holds the instruction number within this SP.
• port: holds the port number within this instruction, usually 0 or
1.
2.3.2. PODS Instruction Format
There are three types of PODS instructions. These types indicate how the instruction was 
derived from the output of the ID Nouveau compiler. The first type is formed from a
simple mapping from TTDA instructions to PODS instructions. These are the basic
instructions such as ADD and ARRAY_READ. The second type actually disappears when
the output is translated. These are the IDENT instructions which are used for
synchronization. These are not needed because the sequential nature of SPs synchronizes
instructions automatically. The third type is composed of new instructions which are added
or modified to accomplish the distribution. These are the SWITCH, FORKJUMP, D and
D_INVERSE, L and L_INVERSE (in both dist and local forms), A and A_INVERSE, and
ALLOCATE. Each of these will be explained as they are encountered in this chapter.
PODS instructions have the following fields (see Figure 2.10 for an example): 
1. Op Code - operation to be performed.
2. Number Arguments - the number of arguments this operation
needs before it is ready to fire.
3. Operand List - slots for values of operands. Initially some of
the operands are constants which are set at compile time. Each
constant is represented by the pair (value, port). Other operand
ports are flagged with a special "sticky bit" (STKY) which means
that once a token is received on that port, it is then held there
and does not need to be replenished for the instruction to fire.
4. Local Destination List - output value destinations which are
within this SP. Each destination is represented by the pair
[instruction number, port].
5. Route ID - ID of route to be used when output tokens are to be
sent to other SPs. This is not a list because the routing
information is stored in the Routing Unit and not in the
Execution Unit. A route ID is simply a short-hand for: [SP ID,
instruction number, port] [SP ID, instruction number, port] [SP
ID, instruction number, port] ..., see Chapter 3 for complete
details.
6. Comments - variable names from the source code, shown in
brackets, "{ }".
Values can be sent using any of the following paths:
1. Using the local destination list. This is the way almost all of the
operators communicate. Only L and A operators can send tokens
to other SPs.
2. Using the route list. This is performed in one of three ways
depending on the type of L or A operator. Only L or A operators
have routes.
(1) the DIST-L operator sends tokens to SPs on every PE.
(2) the LOCAL-L operator sends the token to a different SP
on the same PE.
(3) the A operator sends the token to a different SP on some
PE. Which PE is decided by a hash function.
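The instruction fields listed above, and in particular the behaviour of sticky operand ports, can be sketched as a small data structure. This is only an illustration: the class, its method names, and the COMPARE op code are hypothetical, and the firing rule is a simplified reading of the description (an instruction is ready when constants plus received tokens fill the required number of argument slots).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Instruction:
    number: int
    op_code: str
    num_args: int
    constants: dict = field(default_factory=dict)     # port -> compile-time value
    sticky_ports: set = field(default_factory=set)    # ports flagged STKY
    local_dests: list = field(default_factory=list)   # [(instruction, port), ...]
    route_id: Optional[int] = None
    comment: str = ""
    received: dict = field(default_factory=dict)      # port -> token value

    def receive(self, port, value):
        self.received[port] = value

    def ready(self):
        # An operand slot is filled by a constant or by a received token.
        filled = set(self.constants) | set(self.received)
        return len(filled) >= self.num_args

    def fire(self):
        operands = {**self.constants, **self.received}
        # Sticky operands stay in place; all other received tokens are consumed.
        self.received = {p: v for p, v in self.received.items()
                         if p in self.sticky_ports}
        return operands


# Hypothetical two-argument instruction whose port 1 is sticky.
ins = Instruction(number=9, op_code="COMPARE", num_args=2,
                  sticky_ports={1}, local_dests=[(10, 0)])
ins.receive(1, 8.0)   # held on the sticky port from now on
ins.receive(0, 1.0)
first = ins.fire()
```

After firing, only a new token on port 0 is needed before the instruction can fire again, since the sticky operand is not replenished.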
2.3.3. PODS Dataflow Operator Implementation
In PODS, an SP contains code and a state. The code represents the operations to be
performed and the state holds the status of these operations.
CODE    STATE: context | iteration number | SP ID | program counter
FIGURE 2.8. SP COMPONENTS.
When a token arrives at a PODS operator the state of the SP is used to decide the steps to 
execute this operator. All of the original ID operators which are not special operators are 
called basic PODS operators. All of the special operators are discussed individually after a 
discussion of the basic PODS operator implementation. 
The basic PODS operator implementation performs the following steps when a token 
arrives: 
1. Consume input tokens. 
2. Compute new data value. 
3. Compute new tag. 
4. If the context and SP ID are the same, then no tokens are 
formed, only data is stored into destination instruction and port. 
If either of these has changed, then form new output tokens and 
route them using the routes specified for this operator. 
5. Increment the program counter. 
This implementation is the same as the basic dataflow version in Steps 1 - 3. Step 4
however now checks the SP state to see how to deal with the output data, whether to store
it locally within this SP or to form a token and route it to another SP. Notice that Step 4
does not check the iteration number of the tag. This is because the iteration number can
only be changed by a D operator, and D operators do not change SP. Step 5 has been
added to increment the program counter. There are a couple of operators (the D and
FORKJUMP operators) which set the program counter to a value rather than just
incrementing it. All other operators follow these steps exactly. What follows is a
description of the new PODS instructions, and why these implement the same semantics as
the original ID operators.
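Steps 4 and 5 of the basic PODS operator can be sketched as follows. The names SPState, Token and finish_operator, and the use of a dictionary and a list to stand in for the SP's operand slots and the Routing Unit, are assumptions made for this illustration only.

```python
from collections import namedtuple

# Logical SP state, following Figure 2.8 (field names are this sketch's own).
SPState = namedtuple("SPState", "context iteration sp_id pc")
Token = namedtuple("Token", "context iteration sp_id instruction port value")


def finish_operator(state, out_context, out_sp_id, dest, value, slots, routed):
    """Apply Steps 4 and 5 for one output value; dest is (instruction, port)."""
    instruction, port = dest
    if out_context == state.context and out_sp_id == state.sp_id:
        # Same context and SP: no token is formed, the value is stored directly.
        slots[(instruction, port)] = value
    else:
        # Context or SP changed: form a token and hand it to the router.
        routed.append(Token(out_context, state.iteration, out_sp_id,
                            instruction, port, value))
    # Step 5: advance to the next instruction.
    return state._replace(pc=state.pc + 1)


state = SPState(context=("c0",), iteration=0, sp_id=3, pc=5)
slots, routed = {}, []
state = finish_operator(state, ("c0",), 3, (6, 0), 42, slots, routed)  # local store
state = finish_operator(state, ("c0",), 4, (0, 1), 7, slots, routed)   # routed token
```

Note that the iteration number plays no part in the test, matching the observation that only D operators change it and they never change SP.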
In order to show that the semantics of the original ID operators have not changed, each
operation type will be addressed. It is quite simple to understand the way in which PODS
implements the semantics of ID. The original ID had the following fields in its tag: context
c, procedure p, statement number s, and iteration i. As explained above in the section on
activity names, PODS uses a context c, an SP ID sp, an instruction number si, and an
iteration i. PODS uses the context and iteration exactly the same; it is only the procedure
and statement number which differ.
Basically the procedure cuts the dataflow program into subsets, and the statement number
identifies the operator within the subset. PODS uses the same approach but just cuts the
collection into smaller subsets. In Figure 2.9 below, the set of all operators is cut into
procedures Proc1 - Proc4 (in bold lines), while the SPs are just subsets sp1 - sp8. In this
way the combination of the two fields holds exactly the same information, i.e. the "address"
of a particular operator. Also note that since each procedure cut is also an SP cut, then
when a procedure change is made an SP change is also made.
SET of ALL OPERATORS: Proc2 Proc3 Proc4
FIGURE 2.9. ID VS PODS STATEMENT "ADDRESSING".
Arithmetic and Logical Operators
The vast majority of ID operators fit into the class of arithmetic and logical operators.
In the original ID these operators only changed the statement number and the value of the
token. This can be expressed by:
ID Arithmetic & Logical      c, p, s, i, v -> c, p, s', i, v'
In PODS exactly the same value calculation is performed, and the instruction number is
changed. Expressing this in a similar format to the above:
PODS Arithmetic & Logical    c, sp, si, i, v -> c, sp, si', i, v'
Notice that the "address" (sp, si') for the output token specifies the receiving operator just
as is done in ID with (p, s').
The switch operator falls into this class and is discussed along with a new instruction
(forkjump) below.
SWITCH and FORKJUMP
The SWITCH and FORKJUMP work in conjunction to form a branch type of operation. The
PODS SWITCH is much like the original ID SWITCH with the following exception: once
tokens are passed along, the program counter is modified by a true or false relative offset.
The original ID SWITCH performed the following:
ID SWITCH      c, p, s, i, v -> c, p, s', i, v
PODS performs the following, which is exactly the same except for the addressing differences,
which are equivalent.
PODS SWITCH    c, sp, si, i, v -> c, sp, si', i, v
In order to execute a PODS SWITCH Steps 4 and 5 of the basic implementation need to be
replaced. The new Steps 4 and 5 are:
4. If the predicate is true, then store output values into true
destination instructions. If the predicate is false, then store the
output values into the false destination instructions.
5. If the predicate is true, then increment the program counter by
the true relative jump. If the predicate is false, then increment
the program counter by the false relative jump.
Once the input tokens are present the SWITCH fires, sending tokens to either the true or
false branch and jumping to the next instruction to execute. The PODS instructions below
were taken from Matrix Multiply. As described previously, the fields have the following
meanings: (1) instruction number; (2) op code; (3) number of arguments; (4) operand slots;
(5) destinations; and (6) a comment. For SWITCH the number of arguments is always five:
port 0 is the predicate, port 1 is the value, port 2 is the true relative offset, port 3 is the false
relative offset, and port 4 is the number of true destinations. The destinations are ordered
such that the false destinations are last. The FORKJUMP always takes two arguments: one
is the value to be passed (port 0), the other is the relative offset (port 1).
10 SWITCH   5 (1.00,2) (11.00,3) (2.00,4) -> [18,0] [19,0] [21,0] {I}
18 FORKJUMP 2 (-17.00,0) -> [1,0] [2,1]
FIGURE 2.10. PODS SWITCH AND FORKJUMP INSTRUCTION EXAMPLES. 
To form a simple branch the SWITCH and FORKJUMP are used together as shown in Figure 
2.11 below. The true relative jump of the SWITCH is set to 1, the false relative jump is set 
such that the program counter will jump to the first false instruction on a false predicate.
The FORKJUMP is used to skip the false instructions, its relative jump is set to go to the 
beginning of the unbranched instructions. 
Switch
First T Instruction
Second T Instruction
Third T Instruction
...
Forkjump
First F Instruction
Second F Instruction
Third F Instruction
...
Last F Instruction
Beginning of Unbranched Instructions
FIGURE 2.11. PODS BRANCH. 
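The program counter movement of the branch in Figure 2.11 can be traced with a short sketch. The instruction numbering (SWITCH at 0, true instructions directly after it, then the FORKJUMP, then the false instructions) and the function names are assumptions chosen for illustration.

```python
def switch_next_pc(pc, predicate, true_offset, false_offset):
    # Replacement Step 5 for SWITCH: relative jump chosen by the predicate.
    return pc + (true_offset if predicate else false_offset)


def forkjump_next_pc(pc, offset):
    # FORKJUMP moves the program counter by its relative offset.
    return pc + offset


def run_branch(predicate, n_true, n_false):
    """Return the instruction numbers executed for the assumed layout."""
    forkjump_at = n_true + 1
    first_false = forkjump_at + 1
    join = first_false + n_false          # beginning of unbranched instructions
    trace = [0]
    pc = switch_next_pc(0, predicate, 1, first_false)
    while pc < join:
        trace.append(pc)
        if pc == forkjump_at:
            pc = forkjump_next_pc(pc, join - forkjump_at)
        else:
            pc += 1
    trace.append(join)
    return trace


true_path = run_branch(True, n_true=3, n_false=2)
false_path = run_branch(False, n_true=3, n_false=2)
```

On a true predicate the trace runs through the true instructions and the FORKJUMP, then skips the false instructions; on a false predicate it jumps straight to the first false instruction and falls through to the unbranched code.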
D and D_INVERSE
The D and D_INVERSE operators work in conjunction with the SWITCH to execute loops.
The PODS D and D_INVERSE operators differ slightly from the original ID operators
because of the relative jump capability and because the activity names are different in
PODS.
The D operator takes a token and performs two operations: (1) it increments the iteration
number of the token's tag in the outer-most context, and (2) it performs a relative jump.
Usually this relative jump is negative, and sends the program counter to an earlier
instruction. The semantics of the ID D and D_INVERSE are:
ID D              c, p, s, i, v -> c, p, s', i+1, v
ID D_INVERSE      c, p, s, i, v -> c, p, s', 0, v
For PODS the implementation performs something very similar. As for arithmetic and
logical operators, the new "address" of the output token will be (sp, si') rather than the ID
(p, s'). Otherwise PODS does exactly the same as ID.
PODS D            c, sp, si, i, v -> c, sp, si', i+1, v
PODS D_INVERSE    c, sp, si, i, v -> c, sp, si', 0, v
In order to execute a PODS D instruction Steps 4 and 5 of the basic implementation need to
be replaced. The new Steps 4 and 5 are:
4. Increment the iteration number, i, and store output values into
destination instruction and port.
5. Increment the program counter by the relative jump.
The D_INVERSE operator implementation is very similar to the D operator's. It merely
resets the iteration number to zero rather than incrementing it. Specifically, Step 4 of the
basic implementation should read:
4. Set the iteration number, i, to 0 and store output values into
destination instruction and port.
In order to produce a loop, the SWITCH takes the iteration variable and passes it into the
loop body on a true predicate. Inside the loop body the iteration variable is modified
(usually just incremented by one), and the D operator is placed at the end; see the code
fragment from Matrix Multiply below. The D operator feeds both the predicate and the
switch so that the loop test can be performed. In the example below the relative offset of
the D operator is -11, which will cause the program counter to be set to 9 (20 - 11 = 9) after
the D operator is executed. The loop body is from instruction 11 to instruction 19. The
D_INVERSE will reset the iteration number once the loop has exited. The loop will be exited
from the SWITCH on a false predicate. Note that the SWITCH at instruction number 10 has
a false relative offset of 11 and the last destination's offset is to instruction 21 (21 = 10 +
11).
 9 LE             2 (STKY,1) -> [10,0]
10 SWITCH         5 (1.00,2) (11.00,3) (2.00,4) -> [18,0] [19,0] [21,0]
11 DIST_LOPERATOR 1 (STKY,0) -> (12)
12 DIST_LOPERATOR 1 (STKY,0) -> (14)
13 DIST_LOPERATOR 1 (STKY,0) -> (15)
14 DIST_LOPERATOR 1 (STKY,0) -> (10)
15 DIST_LOPERATOR 1 (STKY,0) -> (11)
16 DIST_LOPERATOR 1 (STKY,0) -> (13)
17 DIST_LOPERATOR 1 (STKY,0) -> (16)
18 DIST_LOPERATOR 1 -> (1)
19 PLUS           2 (1.00,1) -> [20,0] {NEXT-I}
20 D              2 (-11.00,1) -> [9,0] [10,1] {I}
21 DINV           1 ->
FIGURE 2.12. PODS CODE FRAGMENT FOR A LOOP.
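The effect of D and D_INVERSE on the tag and on the program counter can be written down as a minimal sketch. The Tag tuple and the function names are hypothetical; the data value is left out because neither operator changes it. The numbers used are those of instruction 20 in the loop fragment above.

```python
from collections import namedtuple

# Logical PODS tag: context, SP, instruction, iteration.
Tag = namedtuple("Tag", "c sp si i")


def d_operator(tag, dest_si, pc, rel_jump):
    """D: increment the iteration number, retarget the token, jump relatively."""
    return tag._replace(si=dest_si, i=tag.i + 1), pc + rel_jump


def d_inverse(tag, dest_si, pc):
    """D_INVERSE: reset the iteration number; the PC is simply incremented."""
    return tag._replace(si=dest_si, i=0), pc + 1


# Instruction 20: D with relative offset -11, sending a token to instruction 9.
tag = Tag(c=("root",), sp=1, si=20, i=4)
new_tag, new_pc = d_operator(tag, dest_si=9, pc=20, rel_jump=-11)

# After the loop exits, D_INVERSE clears the iteration number again.
exit_tag, exit_pc = d_inverse(new_tag, dest_si=22, pc=21)
```

The context and SP fields are untouched by both operators, which is why the basic operator's Step 4 never needs to compare iteration numbers.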
L and L_INVERSE
In order to perform code distribution the original ID L operators need to be changed from
their original implementation. In PODS L and L_INVERSE are used to route tokens between
SPs. There are also two versions of each operator: a DISTRIBUTE version and a LOCAL
version.
In the original ID, L operators were for entering and exiting loops. This is still true;
however, in PODS entering and exiting loops means entering and exiting an SP. In the
original ID the procedure p of a tag does not change as the token passes through the L and
L_INVERSE; however, a new and unique context c is created. The new context is the
concatenation of the old context, statement number, and iteration. This is shown below:
ID L              c, p, s, i, v -> (c|s|i), p, s', 0, v
ID L_INVERSE      (c|s|i), p, s', i', v -> c, p, s, i, v
In PODS the implementation is as follows:
PODS L            c, sp, si, i, v -> (c|sp|i), sp', si', 0, v
PODS L_INVERSE    (c|sp|i), sp', si', i, v -> c, sp, si, i, v
This implementation also generates a new, unique context c. This stored context is then
used in the L_INVERSE for returning to the previous context. The only real difference is
that the change in SP must be recorded in the tag. Referring back to Figure 2.9, L
operators move the scope from one SP to another within the same procedure (e.g. from sp1
to sp2). Since the output token no longer has the same context, it will be sent to the
Routing Unit to be routed to the receiving SP.
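The context push and pop performed by the PODS L and L_INVERSE can be sketched with nested tuples standing in for the linked list of past activity names. The names below are hypothetical. The sketch assumes that the destination instruction of L_INVERSE comes from its statically known route (it is passed in as return_si) and that the outer iteration number is restored from the saved context.

```python
from collections import namedtuple

Tag = namedtuple("Tag", "c sp si i")


def pods_l(tag, dest_sp, dest_si):
    """L: save (old context, sp, iteration) as the new context and reset i."""
    new_context = (tag.c, tag.sp, tag.i)
    return Tag(new_context, dest_sp, dest_si, 0)


def pods_l_inverse(tag, return_si):
    """L_INVERSE: pop the saved context to restore the outer sp and iteration."""
    old_c, old_sp, old_i = tag.c
    return Tag(old_c, old_sp, return_si, old_i)


outer = Tag(c=None, sp=1, si=7, i=3)
inner = pods_l(outer, dest_sp=2, dest_si=0)

# The inner loop runs for a while; its own si and i change, its context does not.
inner_later = inner._replace(si=12, i=5)
back = pods_l_inverse(inner_later, return_si=8)
```

Because the saved context records the SP as well as the iteration, the Routing Unit has what it needs to send the returning token back to the SP it came from.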
L and L_INVERSE operators perform routing by referencing a particular route list. The
figure below shows two L type operators from Matrix Multiply. The LOCAL_LOPERATOR
is using route list 7 while the LOCAL_LINV operator is using route list 9. A route list is a list
of destination addresses, each consisting of an SP, an instruction, and a port. This
information is static and known at compile time. By duplicating this route table in every
PE, each Routing Unit can find a particular instance of an SP.
20 LOCAL_LOPERATOR 1 -> (7)
12 LOCAL_LINV      1 -> (9)
FIGURE 2.13. EXAMPLE L OPERATORS.
The LOCAL and DISTRIBUTE versions of each operator tell the Routing Unit to (1) send the
token only to its own PE, or (2) distribute copies of this token to all PEs. Tokens are
distributed when the receiving SP is distributed. This way all of the PEs are given the
needed tokens to start their part of a loop. The decision whether to distribute or not is
made in the PODS Partitioner and the LOCAL or DISTRIBUTE version of the L operator is
used. This is the way parallel processes get spawned, as discussed later in Section 2.5,
Distributing Processes.
A and A_INVERSE
The A and A_INVERSE operators (also known as APPLY and INV_APPLY) are the
mechanism PODS uses for procedure calls. In this logical implementation the APPLY
operator collects the argument tokens until all are present, as compared to sending the
tokens off as soon as they are ready. This may be changed in the future to support eager
function evaluation.
The A and A_INVERSE implementations are equivalent to, but somewhat different from, the
original ID versions. In ID, A and A_INVERSE perform:
ID A              c, p, s, i, v -> (c|p|st|i), p', s', 0, v
ID A_INVERSE      (c|p|st|i), p', s', i', v -> c, p, st, i, v
where (p, st) is the address of the instruction to return to. In PODS the A produces two
tokens rather than one.
PODS A            c, sp, si, i, v -> (c|sp|i), sp', si', 0, v and (c|sp|i), sp', ai', 0, st
where (sp', ai') is the address of the a_inverse instruction and (sp, st) is the address of the
instruction to return to. In this way the A_INVERSE can use the return address to build the
appropriate tag as follows:
PODS A_INVERSE    (c|sp|i), sp', ai', 0, v and
                  (c|sp|i), sp', ai', 0, st -> c, sp, st, i, v
This is a simple and efficient method for calling procedures and is somewhat akin to the
fastcall apply used by Iannucci [Ian88]. The instructions below were taken from
SIMPLE, and form a function call to and return from the procedure TLU. APPLY operators
take a variable number of arguments: one for the return instruction (port 0), one for the
number of parameters to pass (port 1), and then one for each parameter (ports 2 to n+1).
The INV_APPLY takes two arguments: one for the return value (port 0), and one for the
instruction number to return to (port 1).
from CONDUCTION-3.pods
 9 APPLY     6 (10.00,0) (4.00,1) (STKY,2) (3.00,5) -> (121) {TLU}
from TLU.pods
18 INV_APPLY 2 -> (121)
FIGURE 2.14. EXAMPLE APPLY AND INV_APPLY OPERATORS.
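The two-token form of the PODS A operator and the way A_INVERSE rebuilds the caller's tag can be illustrated as follows. Token, pods_a and pods_a_inverse are hypothetical names, only a single argument is passed, and the concrete instruction numbers are chosen for illustration rather than taken from SIMPLE.

```python
from collections import namedtuple

# A token here is the tag (c, sp, si, i) together with its value v.
Token = namedtuple("Token", "c sp si i v")


def pods_a(tok, callee_sp, entry_si, ainv_si, return_si):
    """A: emit the argument token and a second token carrying the return address."""
    ctx = (tok.c, tok.sp, tok.i)
    arg = Token(ctx, callee_sp, entry_si, 0, tok.v)
    ret = Token(ctx, callee_sp, ainv_si, 0, return_si)
    return arg, ret


def pods_a_inverse(result_tok, ret_tok):
    """A_INVERSE: combine the result with the return address to rebuild the tag."""
    c, sp, i = ret_tok.c
    return Token(c, sp, ret_tok.v, i, result_tok.v)


caller = Token(c="c0", sp=3, si=9, i=2, v=1.5)
arg, ret = pods_a(caller, callee_sp=7, entry_si=0, ainv_si=18, return_si=10)

# The callee eventually delivers its result to the A_INVERSE instruction.
result = Token(arg.c, 7, 18, 0, 99.0)
returned = pods_a_inverse(result, ret)
```

The returned token carries the caller's original context, SP and iteration, with the instruction number taken from the return-address token.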
2.4. Array Partitioning and Distribution
In scientific code a number of large arrays are used. It is critical that access to these arrays
be efficient. This is the idea vector processors are based upon [H&B84]. In PODS,
modified I-structures form the basis for array operations. I-structures are data structures
which can be resized as necessary and enforce the single assignment principle with
presence bits [ANP87a, ANP89]. PODS also uses presence bits, but arrays are of a fixed
size which is determined at allocation time.
The single assignment principle guarantees that only one instruction will ever write to an
array element; it is the producer of that data. PODS exploits this fact by attempting to map
each array element onto the same PE as its producer instruction; this is how PODS uses the
Collector Writes Principle. However, it is not always possible, nor efficient, for the
collector to be the owner, as is explained below. By locality of reference, the statements
which read an array element will be "close" to the writer. Thus having the writer and
owner the same will allow most array reads to be local rather than remote. Having local
array reads is important, since once the array element is written it can be read many
more times. Making these array reads efficient is central to PODS.
In order to make the array reads efficient, the array caching scheme detailed in Chapter 3 is
used. This simple scheme produces excellent results [BNR89b] as long as the array is
accessed in the same direction as it is partitioned. For two dimensional arrays this means
that arrays accessed in a row-major manner should be partitioned row-major. Generalizing
to multiple dimensions, this means that first-major (last-major) code should be used to
access first-major (last-major) arrays.
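Why the access direction matters can be seen by counting page fetches for the two traversal orders. The sketch below assumes a row-major paged array and a deliberately pessimistic cache that holds only one page; the array shape, page size and function name are illustrative assumptions and do not describe the actual caching scheme of Chapter 3.

```python
def page_loads(order, rows=8, cols=256, page_size=32):
    """Count page fetches with a one-page cache over a row-major paged array."""
    if order == "row":
        accesses = ((i, j) for i in range(rows) for j in range(cols))
    else:
        accesses = ((i, j) for j in range(cols) for i in range(rows))
    loads, current = 0, None
    for i, j in accesses:
        page = (i * cols + j) // page_size   # row-major linearization
        if page != current:
            loads, current = loads + 1, page
    return loads


row_loads = page_loads("row")
column_loads = page_loads("column")
```

Under these assumptions a row-major sweep fetches each page once, while a column-major sweep changes page on every single access.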
One approach to ensure that the direction is correct is to analyze each array's accesses and
estimate which direction would be more efficient. Analyzing the one filling algorithm
(there usually will be only one due to the single-assignment principle) could be done, but
the reads matter more because there are many more of them. Analyzing the reads would
require that the entire execution trace of the program be known at compile time, which is
not possible. To see some of the difficulties, consider a matrix-multiply function which
takes arrays A and B as arguments. In ID Nouveau the code would be:
Def mm A B = { (l1,u1), (l2,u2) = 2D_bounds A;
               C = i_matrix ((l1,u1), (l2,u2));
               In
                 { For i <- l1 To u1 Do
                     { For j <- l2 To u2 Do
                         s = 0;
                         C[i,j] =
                           { For k <- l1 To u1 Do
                               Next s = s + A[i,k] * B[k,j];
                             Finally s
                           };
                     };
                   Finally C
                 }
             }
FIGURE 2.15. MATRIX MULTIPLY ID NOUVEAU SOURCE CODE. 
By examining this code it is easily seen that array A should be row-major and array B
should be column-major based on the reads. However, an array is partitioned at allocation
time and stays that way for its entire lifetime. So if the Matrix Multiply function was called
with MM X Y, array X should be row-major and array Y should be column-major, and if
called with MM Y X then the reverse is true. However, the binding between A (B) and X
(Y) is dynamic and hence PODS cannot take advantage of it. This late binding also
prevents the proper direction for each array from being used every time.
A better approach is to pick a direction and use it, letting the programmer know which
direction is appropriate. This is the approach used by many popular languages today. For
example, 'C' is row-major and FORTRAN is column-major. PODS uses row-major
partitioning.
In order to better understand this partitioning, consider the following example. A two
dimensional array which is 8 x 256 is to be partitioned and distributed over 20 PEs. For
the iPSC/2 and the simulations herein, the best page size is 32 elements or approximately 2
kilobytes. Previous studies have shown that this is not a critical parameter [BNR89b].
Following the simple array partitioning algorithm, each array is divided into pages of 32
elements in row-major fashion.
Once the array is cut into pages (linearly, in row-major), the pages are grouped together 
sequentially to form superpages; one superpage per PE, see Figure 2.16 below. The 
algorithm for achieving this is as follows: 
1. calculate the number of pages,
#pgs = floor(number of elements / page size)
2. calculate the number of pages per PE,
#ppp = floor(#pgs / number of PEs)
3. each PE gets #ppp pages
4. the extra elements left over from step 1 are assigned to the last
PE
5. the extra pages from step 2 are assigned, one to each PE,
starting with the second to last PE and continuing to the first PE
Often a superpage will wrap around the logical array limits. This only needs to be handled
properly when the array is accessed. It is also the case that sometimes a few PEs will end
up with one more page in their superpages than the others. Both of these situations are
handled by the boundary table. The handling of these cases will be explained in detail in
Chapter 3, PODS Logical Implementation. For the example, PE #0 through PE #15 have 3
pages, while PE #16 through PE #19 have 4 pages.
Two dimensional array (8 x 256 with 20 PEs); each page = 32 elements. Pages are grouped
to form superpages (PE 0, PE 1, PE 2, ..., PE 19). Sometimes superpages will wrap around
logical array limits; some PEs will end up with 1 more page.
FIGURE 2.16. PODS PARTITIONING OF A 2-D ARRAY. 
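The page arithmetic of steps 1 and 2 can be checked for the 8 x 256 example with a few lines of code. The function names and the 0-based indexing are assumptions of this sketch; it computes only the counts and the page of a given element, and does not decide which PEs receive the extra pages.

```python
def page_counts(rows, cols, page_size, num_pes):
    """Return (#pgs, #ppp, extra pages, leftover elements) for a 2-D array."""
    elements = rows * cols
    pages, leftover_elements = divmod(elements, page_size)   # step 1
    pages_per_pe, extra_pages = divmod(pages, num_pes)       # step 2
    return pages, pages_per_pe, extra_pages, leftover_elements


def page_of(i, j, cols, page_size):
    """Page holding element (i, j) under row-major linearization (0-based)."""
    return (i * cols + j) // page_size


counts = page_counts(rows=8, cols=256, page_size=32, num_pes=20)
last_page_of_row_0 = page_of(0, 255, cols=256, page_size=32)
first_page_of_row_1 = page_of(1, 0, cols=256, page_size=32)
```

For the example this gives 64 pages, 3 pages per PE with 4 pages left over and no leftover elements, so 16 PEs hold 3 pages and 4 PEs hold 4. With 3 pages per superpage, pages 6 to 8 form one superpage, which spans the end of row 0 and the start of row 1.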
One key concept of this approach is that it is known globally and requires limited
information to use. It is the ALLOCATE instruction which performs this data distribution.
Each ALLOCATE works with a FORKJUMP and performs the following:
1. The ALLOCATE requests an array ID from the local Array
Manager (see Chapter 3).
2. The SP continues executing until the ALLOCATE's companion
FORKJUMP (placed directly after the ALLOCATE). The SP will
either block until the Array Manager responds with an array ID,
or will continue executing if the value has already returned.
3. When the Array Manager receives the allocate request, it will
allocate the necessary space, build the array header, build the
boundary table, send the array ID to the requesting SP, and then
send a remote allocation request on to all of the other PEs with
the array ID attached. In this way all of the PEs have the same
ID for the same array. The PE which executes the ALLOCATE is
called the host PE; this PE number is also sent as part of the
request.
4. The remote PEs will receive the remote allocate request and build
the header and tables, and allocate the appropriate space.
For a two dimensional array PODS stores the following array header information in each 
PE: 
Field Name               Description
beginning_offset         start of this PE's responsibility
ending_offset            end of responsibility
number_of_dimensions     2
size_dim1                size of first dimension
size_dim2                size of second dimension
ELEMENT_SPACE            space allocated for this array on this PE
beginning_range1_dim1    start of first range interval in dim 1
ending_range1_dim1       end of first range interval in dim 1
beginning_range1_dim2    start of first range interval in dim 2
ending_range1_dim2       end of first range interval in dim 2
beginning_range2_dim1    start of second range interval in dim 1
ending_range2_dim1       end of second range interval in dim 1
beginning_range2_dim2    start of second range interval in dim 2
ending_range2_dim2       end of second range interval in dim 2
NULL
TABLE 2.1. PODS ARRAY HEADER INFORMATION.
The beginning_offset and ending_offset are the starting and stopping points of this PE's
area-of-responsibility expressed in the row-major linearized version of the array. The
number_of_dimensions, size_dim1, and size_dim2 fields hold the number of dimensions
and the size of each for this array. The ELEMENT_SPACE is where the actual data is stored,
excluding the cache. The beginning_rangeX_dimY and ending_rangeX_dimY fields hold
the starting and stopping points for each range interval of this array. Superpages can wrap
around an array dimension, like PE #2 in Figure 2.16 above; this causes multiple range
intervals in the boundary table. The beginning_range and ending_range fields make up the
boundary table for this array on a given PE. Boundary tables will be discussed in detail in
the section on range filters.
The header is similar for arrays of other dimensions. For example, for a three dimensional
array the number_of_dimensions would be 3, there would be an extra dimension size field,
size_dim3, and there would be an additional beginning_range and ending_range for each
segment. Notice that the header size is fixed at allocation time and will not grow.
Continuing with the two dimensional array example in Figure 2.16, the header for PE #2 
would be: 
Field Name               Value
beginning_offset         192
ending_offset            287
number_of_dimensions     2
size_dim1                8
size_dim2                256
ELEMENT_SPACE            space allocated for this array on this PE
beginning_range1_dim1    0
ending_range1_dim1       0
beginning_range1_dim2    192
ending_range1_dim2       255
beginning_range2_dim1    0
ending_range2_dim1       0
beginning_range2_dim2    0
ending_range2_dim2       31
NULL
TABLE 2.2. 2-D ARRAY EXAMPLE HEADER.
To perform a two dimensional read, the offset into the array must be calculated first. Then
the beginning and ending offsets must be checked. If the offset is not within the bounds
then the read is remote and a message must be sent to the owning PE. If the read is local,
the presence bit must be checked. If it is not present then the read must be enqueued, as in
I-structures. If the value is present then the memory location is read. The pseudo-code for
performing the read is:
offset = size_dim2 * i + j
if (offset < beginning_offset) goto REMOTE_READ
if (offset ≥ ending_offset) goto REMOTE_READ
if (element not present) goto ENQUEUE_READ
value = array[offset]
FIGURE 2.17. 2-D ARRAY READ PSEUDO-CODE.
Continuing with the above example, assume the expression below is being executed on PE
#2.
result = A[0,10] + A[1,10];
Assuming both elements have already been written, the first array read, A[0,10], would
perform the following read calculations.
offset = size_dim2 * i + j
       = 256 * 0 + 10
       = 10
if (offset < beginning_offset) goto REMOTE_READ
   10 < 192
   goto REMOTE_READ
FIGURE 2.18. EXAMPLE 2-D ARRAY REMOTE READ.
The REMOTE_READ sends a message to the owning PE (PE #0, whose three pages hold
offsets 0 through 95), which will respond with
A[0,10]. PE #2 will continue on and encounter the second array read, A[1,10]. Note that
PE #2 did not block this SP when the read was determined to be remote. Only when the
instruction which consumes the result is reached will the SP block. By that time A[0,10]
may have been received. The second array read calculations would be:
offset = size_dim2 * i + j
       = 256 * 1 + 10
       = 266
if (offset < beginning_offset) goto REMOTE_READ
   266 < 192 is false
if (offset ≥ ending_offset) goto REMOTE_READ
   266 ≥ 287 is false
if (element not present) goto ENQUEUE_READ
   present
value = array[offset]
      = A[266]
FIGURE 2.19. EXAMPLE 2-D ARRAY LOCAL READ.
The value of array A at offset 266 would be stored in the consuming instruction. When
the consuming instruction was reached the SP would block if A[0,10] had not yet arrived,
and PE #2 would start executing the next SP from the task ready list.
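The two reads above can be replayed with a minimal sketch of the checks in Figure 2.17. The class, the `present` set standing in for presence bits, and the `LOCAL_READ` label are assumptions made for illustration. The comparisons mirror the pseudo-code as printed and are not taken from the PODS sources.

```python
from dataclasses import dataclass, field

@dataclass
class ArrayHeader:
    """Subset of the header fields from Table 2.2 (names follow the table)."""
    beginning_offset: int
    ending_offset: int
    size_dim2: int
    present: set = field(default_factory=set)   # offsets already written

def classify_read(header, i, j):
    """Mirror the checks of Figure 2.17 and report which branch is taken."""
    offset = header.size_dim2 * i + j
    if offset < header.beginning_offset:
        return offset, "REMOTE_READ"
    if offset >= header.ending_offset:
        return offset, "REMOTE_READ"
    if offset not in header.present:
        return offset, "ENQUEUE_READ"
    return offset, "LOCAL_READ"

pe2 = ArrayHeader(beginning_offset=192, ending_offset=287, size_dim2=256,
                  present={266})
print(classify_read(pe2, 0, 10))   # first read, A[0,10]
print(classify_read(pe2, 1, 10))   # second read, A[1,10]
print(classify_read(pe2, 1, 11))   # an element not yet written
```

With the header values of Table 2.2 the first read is classified as remote and the second as local, matching Figures 2.18 and 2.19.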
Array caching complicates this somewhat, but it is independent of the PODS partitioning
and distribution. In Chapter 3 array caching is examined. On a typical RISC processor
(MIPS R3000) the caching version would take 22 cycles while a regular two dimensional
read would take 17 cycles. This 29% additional overhead is well worth it.
Note that it is not necessary to distribute all arrays. In the future more analysis may show
that certain arrays should be kept local and others distributed; this is an area of current
research.
2.5. Distributing Processes
Distributing code (i.e., processes) is the key issue in parallel processing. In PODS this is
accomplished by following an execution distribution principle which tries to map the
calculation of an array element to its owner as much as possible (i.e., Collector Writes
Principle). The PODS implementation of the Collector Writes Principle is called Data
Distributed Execution (DDE).
2.5.1. Data-Distributed Execution Principle
The central concept in PODS code distribution is to follow the data distribution as much as
possible. Placing the execution of an operation on the same PE as the location of its data
will reduce communication costs and context switches. A system performs DDE when it
moves execution to the PE where the data resides.
Consider an n-dimensional index space, where the dimensions are ordered by the levels of
nesting. Say this multiple nested loop has index levels i_1, i_2, ..., i_n, and that there is an
array write at the inner-most level (A[i_1, i_2, ..., i_n] = x). The goal is to distribute the
computations evenly across the PEs using DDE. This is achieved by picking one of the
levels of the nest, say i_a, and cutting up the index space along i_a into number_of_PE
ranges. The levels previous to i_a are executed on one PE, while levels after i_a are executed
on every PE. Since the array write needs the value of every index, all of the previous
indices (i_1, i_2, ..., i_(a-1)) must be broadcast to every PE, and every following index
(i_(a+1), i_(a+2), ..., i_n) must be generated locally; it is the i_a level which is used to
partition the iteration space. However, the data distribution is still followed.
To better visualize this, consider a 2-dimensional iteration space with indices i and j. Figure
2.20 (a) shows the data partitioning of an array where the superpage assigned to each PE
does not reach the end of the array dimension. Figure 2.20 (d) shows the data partitioning
of a larger array, where the superpage is larger than the dimension. When the superpage
just happens to match the array dimension size, the partitioning acts just as if it were smaller
than the array dimension. Figures 2.20 (b), (c), (e) and (f) show the iteration space
partitioning when i or j is used for i_a.
[Figure: six panels shaded by PE (pe 1 through pe 5). Data Partitioning: (a) and (d); Distributing in i: (b) and (e); Distributing in j: (c) and (f).]
FIGURE 2.20. PARTITIONING A 2D ITERATION SPACE.
In order to ensure single assignment, the iteration space cannot exactly follow the data
partitioning in every case. When any level other than the last level is used to partition on,
the remaining levels cannot be partitioned and must be assigned based upon the upper
levels. Figure 2.20 (b) and (e) show on which PE the calculations will be performed if the
iterations were partitioned along i. This assignment is achieved by simply assigning
iteration space areas based upon the first element in each row. This causes some interesting
situations. In case (b) PE #3 has no iterations to run, while in case (e) PE #1 has two full
rows to calculate. Notice that there can be some remote writes, e.g. PE #1 writes to some
of PE #2's elements.
When the last level is used to partition on, the mapping is exact. This is because all of i_1, i_2,
..., i_n are available and each PE can completely decide which iterations to perform.
To generalize this to multiple dimensions consider the figure below. In general, the data
partitioning, case (a) below, will not exactly match any dimension size. When a level is
picked to distribute, all levels below it will use this level's partitioning. Case (b) shows the
planes of iteration space responsibility when the i-th level is used. Case (c) shows the
iteration spaces if the j-th level were used to distribute the iterations. If the k-th level were
used the iteration space partitioning would exactly match the data partitioning, case (a).
[Figure: three panels, (a) through (c), showing a 3-D iteration space divided among PE #1 through PE #4.]
FIGURE 2.21. PARTITIONING A 3D ITERATION SPACE.
This would seem to indicate that the lower the level the better the partitioning. However,
the upper levels must communicate their values all the way down to the inner-most level.
This causes excessive communication. While distributing at the outer-most level can cause
mismatches, this can be overcome via array caching. Seeing that the sooner the iterations
are distributed the fewer the number of broadcasts necessary, PODS always distributes as
soon as possible.
These give rise to the distribution scheme below.
1. Given an array A:
partition and distribute as described in Section 2.4, Array
Partitioning and Distribution, above.
2. Given a loop L:
if L does not contain an array write, then do not distribute
else distribute the outer-most level of the nest possible.
3. Once the level has been chosen, use the first element in that level
to determine the iteration space partitioning.
Whether a certain level of nesting can be distributed depends on the loop-carried
dependencies (LCDs) at that level. This is explained in detail in the LCD Effects section below.
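Rule 2 of the scheme can be sketched as a small selection function. This is an illustration under stated assumptions: the function name and the list of per-level LCD flags are hypothetical, and "possible" is read here as "the outer-most level that carries no LCD".

```python
def choose_distributed_level(has_array_write, level_has_lcd):
    """Return the 1-based nest level to distribute, or None to keep the loop local.

    level_has_lcd[k] is True when nest level k+1 carries a loop-carried
    dependency (hypothetical input; PODS derives this at compile time).
    """
    if not has_array_write:
        return None                      # rule 2: no array write, do not distribute
    for level, has_lcd in enumerate(level_has_lcd, start=1):
        if not has_lcd:
            return level                 # outer-most level that can be distributed
    return None                          # every level carries an LCD

print(choose_distributed_level(True, [False, False]))
print(choose_distributed_level(True, [True, False]))
print(choose_distributed_level(False, [False, False]))
```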
DDE can be greatly increased by array caching. In PODS, once a page is read into local
memory from a remote PE it is held in a software cache which is replaced using a
Least-Recently-Used algorithm. Array caching is explained in detail in Chapter 3.
DDE of for-loops is achieved in PODS by generating only those loop variables which make
the array accesses local. This is performed by range filters. The operation of range filters
is explained in detail in the next section.
2.5.2. Range Filters
This section explains the concept of range filters in detail, including how each PE
restricts loop execution to its own portion of an array.
Objective and Usage
The objective of the range filter construct is to control which iterations of a distributed loop
are to be executed by a given PE. The diagram below shows a simplified dataflow of the
simple array filling loop in the upper right hand corner. Contrast this with the diagram in
Figure 2.23: the same loop after the range filter has been inserted. In PODS the loop nest
level in which the range filter is inserted is defined to be the distributed loop.
A dataflow diagram with a 2-dimensional range filter is shown in Figure 2.23. The items
added to Figure 2.22 are bolded. The range filter replaces the predicate and needs the array
A and the outer index i from the i-loop to determine which j's a given PE is responsible for.
The range filter takes these and the current index j, and produces the next index for which
this PE is responsible. Also notice that the L operators in the i-loop are now DIST-L
operators.
[Figure: simplified dataflow graph of the array filling loop below, showing the loop SPs, the array A, the loop bounds (1, 50 and 1, 10), and the return to the outer scope.]
A = matrix(50, 10);
for i = 1 to 50
for j = 1 to 10
A[i,j] = f(i,j);
FIGURE 2.22. SIMPLE 2-D ARRAY FILL.
[Figure: the dataflow graph of Figure 2.22 with a range filter added, which takes the boundary table info as an additional input; the loop SPs, the array A, the loop bounds (1, 50 and 1, 10), and the return to the outer scope are shown as before.]
A = matrix(50, 10);
for i = 1 to 50
for j = 1 to 10
A[i,j] = f(i,j);
FIGURE 2.23. 2-D ARRAY FILL WITH RANGE FILTER.
Boundary Table
Boundary tables are generated at allocation time and referenced by the range filter to
determine the boundaries of its area-of-responsibility. In PODS, grouped ranges are used
because they generate fewer superpage boundaries than interleaved ranges in general.
In the table below, an array header for PE #1 with an 8 x 4 array (page size of 6) is shown.
The values beginning_rangeX_dim1 and beginning_rangeX_dim2 are the beginning values
for a given range interval in each of the two dimensions; similarly for ending_rangeX_dim1
and ending_rangeX_dim2. A range interval is the area-of-responsibility for a given PE
and in a given dimension; there is one range interval for each entry in a boundary table.
For example, range interval 1 runs from 1 to 1 in the i direction, and from 1 to 4 in the j
direction.
Field Name               Value
beginning_offset         1
ending_offset            6
number_of_dimensions     2
size_dim1                8
size_dim2                4
ELEMENT_SPACE            space allocated for this array on this PE
beginning_range1_dim1    1
ending_range1_dim1       1
beginning_range1_dim2    1
ending_range1_dim2       4
beginning_range2_dim1    2
ending_range2_dim1       2
beginning_range2_dim2    1
ending_range2_dim2       2
NULL
TABLE 2.3. EXAMPLE BOUNDARY TABLE FOR A GIVEN PE.
The boundary table consists of the beginning_range and ending_range fields. For different
numbers of PEs (four in this example) different distributions are produced. The page size
comes into play because pages are used in caching and remote accesses. In this example the
page size of 6 splits the array into a non-rectangular area for PE #1.
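How a contiguous run of row-major offsets turns into range intervals can be sketched as follows. The function is hypothetical. Merging consecutive full rows into one interval is an assumption, and only the PE #1 values (offsets 1 through 6) are taken from Table 2.3.

```python
def range_intervals(begin, end, n_cols):
    """Split the inclusive, 1-based offsets begin..end of a row-major array into
    (begin_dim1, end_dim1, begin_dim2, end_dim2) range intervals."""
    r0, c0 = divmod(begin - 1, n_cols)
    r1, c1 = divmod(end - 1, n_cols)
    r0, c0, r1, c1 = r0 + 1, c0 + 1, r1 + 1, c1 + 1   # back to 1-based
    if r0 == r1:
        return [(r0, r0, c0, c1)]
    intervals, tail = [], None
    if c0 != 1:                       # leading partial row
        intervals.append((r0, r0, c0, n_cols))
        r0 += 1
    if c1 != n_cols:                  # trailing partial row
        tail = (r1, r1, 1, c1)
        r1 -= 1
    if r0 <= r1:                      # block of full rows, if any
        intervals.append((r0, r1, 1, n_cols))
    if tail:
        intervals.append(tail)
    return intervals

print(range_intervals(1, 6, 4))      # PE #1 of the 8 x 4 example
print(range_intervals(7, 12, 4))     # a superpage starting mid-row
```

For offsets 1 through 6 this reproduces the two range intervals listed in Table 2.3.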
Master Array
In Figures 2.22 and 2.23 above only one array is being written into inside the loop. 
However there can be more than one. In PODS, only one array, the master array, controls 
the partitioning for that loop. Currently the first array written into is chosen as the master 
array. Later on a more intelligent algorithm could be used, but this approach has produced 
acceptable results. 
Algorithm
The algorithm for the range filter is fairly straightforward. It is important to note,
however, that the general algorithm is parameterized. The general algorithm functions by
repeatedly extracting range intervals from the array boundary table. While within the
range, the filter passes indices for elements within that range. The filter also keeps the loop
alive by sending a continue token to the loop switch until all ranges have been exhausted.
In the figure below, m is just some variable used to count the intervals; i and j are the loop
indices, and continue is the signal to the loop body telling it whether to continue or not.
There are three new PODS instructions required to implement a RANGE_FILTER:
INTERVAL_COUNT (retrieve the number of range intervals for this array); and B_RANGE
/E_RANGE (retrieve the beginning and ending values for the specified range interval).
These new instructions simply read entries from the array header (generated at allocation
time). With RANGE_FILTER, each PE has the same code; only the local boundary tables are
different.
1  m = interval count of master array
2  if m < 0 then exit
3  if (Ci*i+Ki) is not in interval m then decrement m and goto 2
4  set j to the minimum of the loop end and (end of the interval - Kj)/Cj
5  if (Cj*j+Kj) is not in the interval or the first element
   of this dimension is not owned then decrement m and goto 2
6  if j is within the loop bounds then set continue to TRUE
   and send j and continue into the loop body
   else decrement m and goto 2
7  if continue is TRUE do the loop body else goto 10
8  true part of loop body
9  if new j is within loop bounds set continue to TRUE,
   send j and continue into the loop body, and goto 5
   else set continue to FALSE, send j and continue into the
   loop body, and goto 7 (with j set to new j)
10 false part of loop body
FIGURE 2.24. ALGORITHM FOR SECOND LEVEL, DESCENDING RANGE FILTER FOR
A[Ci*I+Ki, Cj*J+Kj].
The algorithm shown in Figure 2.24 is for a descending loop with a stepsize of 1 writing
into array A[Ci*i+Ki, Cj*j+Kj]. Array A is the master array in this case. Step #7 above is the
SWITCH which selects between the true and false parts of the loop body.
For different levels of distribution (distribute the first level of nested loop vs. other levels)
or directions (ascending vs. descending), different range filters are used; see Appendix A:
Range Filter Algorithms. The selection of the algorithm is done at compile time, so no
more run-time overhead is used than necessary.
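The core idea of a range filter can be sketched with a generator. This is a deliberately simplified, ascending variant with Ci = Cj = 1 and Ki = Kj = 0. It is not the algorithm of Figure 2.24 and omits the continue token. The function name is hypothetical, and the boundary table is the one from Table 2.3.

```python
def range_filter(boundary_table, i, j_lo, j_hi):
    """Yield, in ascending order, the j values in [j_lo, j_hi] that this PE owns
    for a fixed outer index i.

    boundary_table is a list of (begin_dim1, end_dim1, begin_dim2, end_dim2)
    range intervals, all inclusive, as in Table 2.3.
    """
    for b1, e1, b2, e2 in boundary_table:
        if not (b1 <= i <= e1):
            continue                     # interval does not cover this row
        for j in range(max(j_lo, b2), min(j_hi, e2) + 1):
            yield j

pe1_table = [(1, 1, 1, 4), (2, 2, 1, 2)]      # range intervals from Table 2.3
for i in (1, 2, 3):
    print(i, list(range_filter(pe1_table, i, 1, 4)))
```

Every PE would run the same code; only the boundary table passed in differs.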
In the case where the distribution is done a level above the lowest level, the RANGE_FILTER
checks only the first element in a range interval to see if that element belongs to it. This
prevents other PEs with range intervals in the same index (e.g., PEs #1 & #2 for i = 2
below) from both trying to execute a particular iteration. The figure below shows the
partitioning for an 8 x 4 array, page size of 6 (same as the boundary table example above).
[Figure: array A, with rows i = 1 to 8 and columns j = 1 to 4, divided among PE 1 through PE 4.]
FIGURE 2.25. NON-RECTANGULAR ARRAY PARTITIONING EXAMPLE.
If the RANGE_FILTER were at the i level, then each PE would be responsible for distinct
rows of i, i.e. PE #1 has rows 1 and 2, PE #2 has only row 3, PE #3 has rows 4, 5, and
6, and finally PE #4 has rows 7 and 8.
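The first-element rule can be checked with a short sketch. The helper is hypothetical. Of the offsets below, only those of PE #1 (1 through 6) come from Table 2.3; the superpages of PE #2 through PE #4 are assumptions inferred from the row assignment described above.

```python
def rows_by_first_element(pe_offsets, n_rows, n_cols):
    """Assign each row to the PE that owns the row's first element.

    pe_offsets maps a PE number to its (beginning_offset, ending_offset),
    1-based and inclusive in the row-major linearized array.
    """
    rows = {pe: [] for pe in pe_offsets}
    for row in range(1, n_rows + 1):
        first = (row - 1) * n_cols + 1          # offset of element (row, 1)
        for pe, (begin, end) in pe_offsets.items():
            if begin <= first <= end:
                rows[pe].append(row)
                break
    return rows

# Assumed superpages for the 8 x 4 array with a page size of 6 on four PEs.
offsets = {1: (1, 6), 2: (7, 12), 3: (13, 24), 4: (25, 32)}
print(rows_by_first_element(offsets, n_rows=8, n_cols=4))
```

Row 2 goes to PE #1 even though PE #2 stores its last two elements, which is where remote writes can arise.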
2.5.3. LCD Effects
LCDs have a major effect on the policies for code distribution. This section discusses
those effects.
If a for-loop performs a reduction it will have LCDs. If the for-loop fills an array it may
have loop-carried dependencies. These LCDs prevent iterations from running in parallel.
In PODS these LCD for-loops are executed in place just as they would be on a sequential
processor. This is the case where PODS degenerates into a sequential machine for the
sequential code. The reason for this is the extreme cost of communication on distributed
memory machines.
For distributed memory MIMD machines the ratio of the communication time to execution
time can be as great as 400, as in the iPSC/2 [iPSC89]. This means that the LCD distance,
D, times the number of overlapping instructions, N, must be at least 400, i.e.
D * N ≥ communication time / execution time.
A distance 4 LCD means that iteration i must wait for iteration i-4. In order to see this
better, consider a loop body with 100 overlapping instructions. If D is less than 4 then it
is better to execute the for-loop on one PE rather than distribute the loop. If it were to be
distributed, the for-loop iterations would be grouped and assigned to PEs via DDE. For
example, iterations 1 through 4 to PE #1, 5 through 8 to PE #2, etc. For-loops with larger
LCD distances or larger instruction overlap may be able to perform better when distributed;
this is a current topic of research.
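The threshold can be written as a one-line test. The function name is hypothetical. The ratio of 400 is the figure quoted above for the iPSC/2, and other machines would supply their own value.

```python
def worth_distributing(lcd_distance, overlapping_instructions, comm_to_exec_ratio):
    """Apply the rule D * N >= communication time / execution time."""
    return lcd_distance * overlapping_instructions >= comm_to_exec_ratio

RATIO = 400   # ratio quoted in the text for the iPSC/2
for d in (3, 4, 8):
    print(d, worth_distributing(d, 100, RATIO))
```

With 100 overlapping instructions, a distance of 3 falls below the threshold while a distance of 4 just meets it.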
In order to see how communication delay and overlapping execution interact, consider
Figure 2.26. In the first case (Non-Distributed) the loop is executed on one PE, causing no
communication delay. The second case (Distributed with Fast Communication) performs
the best. Its completion time (indicated by the dark horizontal lines) is the earliest of the
three. Notice how the amount of overlapping instructions must be comparable to the delay
time for any benefit to occur. The third and final case (Distributed with Slow
Communication) shows what would happen if loops with LCDs were distributed when
communication is costly, e.g. the iPSC/2.
[Figure: execution timelines for PE #1 through PE #4 in three cases: Non-Distributed, Distributed with Fast Communication, and Distributed with Slow Communication. Legend: sequential execution, overlapping execution, communication delay, network message.]
FIGURE 2.26. EFFECTS OF COMMUNICATION SPEED ON OVERLAPPING ITERATIONS.
Considering the abstract case shown in Figure 2.20, given LCDs in all of i_1 through i_k, PODS
distributes i_(k+1). To understand why this is better, consider the three different shapes of
arrays: rectangular, long and narrow, and short and wide. When the array is rectangular,
like in Figure 2.20 (e) and (f), PE_j and PE_(j+1) can pipeline. This overlapping execution will
increase as the work at each element increases. If there is very little work at each element
then it is run sequentially. Note that this usually only occurs when i_(k+1) is the innermost
level of the nested loop. When the array is long and narrow, the same rules apply except
even more work is needed at each element for distribution to show a gain. Finally, if the
array is short and wide, like Figure 2.20 (b) and (c), multiple wavefronts occur, thus
providing some parallelism.
So, for the scope of this discussion, the communication costs of distributed memory
architectures are too high for for-loops with LCDs to be distributed. The communication
cost overwhelms any efficiency gains from the overlapping iterations. Thus PODS does
not distribute loops with LCDs. One outcome from this is that all distributed loops are
array filling loops.
Based on the above, PODS needs a compiler which will reduce the number of LCDs so that
the loops can be distributed. Scalar expansion is one optimization which does this. Any
state-of-the-art vectorizing compiler will have scalar expansion (see Padua and Wolfe
[P&W86]). Since the ID Nouveau compiler is not intended for a machine which is aided by
scalar expansion (GITA), it does not scalar expand. PODS, on the other hand, is aided by
scalar expansion and would have this and other optimizations (e.g., loop interchange and
loop fission).
PODS also needs an LCD Detector, which will detect when LCDs occur. The LCD Detector
performs two major phases. The first phase finds the loop bodies and the second traces
these looking for array writes (I-structure STOREs) which use values from the same array
(I-structure FETCHs). The first phase performs the following steps:
1. Find all D operators.
2. Search backward from each D until the same D or another D is
found. Do not search beyond the SWITCH operator.
3. If the D found is the same, then the path searched forms the loop
variable path, else it forms the loop body path.
The loop bodies are now traced using the following:
1. Find all I-structure STORE operators.
2. Trace up the data dependency arcs from each to find all
I-structure FETCH operators which feed this I-structure
STORE.
3. Trace up the data dependency arcs from each index input to find
the source of the index value.
4. If any I-structure FETCH uses the same array as the I-structure
STORE and their index input paths differ from each other, then
there is an LCD.
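The final test of the second phase can be sketched on a flattened representation. The `Store` and `Fetch` records and the use of strings for index expressions are assumptions made for illustration. The actual detector traces data dependency arcs in the dataflow graph, and this sketch stands in for the result of that tracing.

```python
from collections import namedtuple

# Hypothetical, simplified records: `index` is a tuple of index expressions
# written as strings, e.g. ("i", "j-1").
Store = namedtuple("Store", "array index fetches")
Fetch = namedtuple("Fetch", "array index")

def has_lcd(stores):
    """Report an LCD when a STORE is fed by a FETCH from the same array
    through a different index expression (step 4 of the second phase)."""
    for store in stores:
        for fetch in store.fetches:
            if fetch.array == store.array and fetch.index != store.index:
                return True
    return False

no_lcd = [Store("A", ("i", "j"), [Fetch("B", ("i", "j"))])]
with_lcd = [Store("A", ("i", "j"), [Fetch("A", ("i-1", "j"))])]
print(has_lcd(no_lcd), has_lcd(with_lcd))
```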
2.5.4. Remote Array Accesses
Remote array accesses will occur due to the distribution of array data. No new reads nor
writes are added to the original ID program. Remote reads and writes are discussed in this
section. Array caching affects remote reads, but this is not part of the model and is therefore
discussed in Chapter 3, PODS Logical Architecture. Also in Chapter 3 is a discussion of
the Array Manager which actually performs these operations.
Remote Reads
As a simple example of remote array reading, consider a multiprocessor with 4 PEs, using
a page size of 32 elements, and 3 arrays A, B, and C, each of size 100. PE 0, 1, and 2 will
each contain a single page of each array. PE 3 will contain a partial page (4 elements) of
each array. Consider the following loop:
For i <- 1 To 100 Do
{
A[i] = B[101-i] + C[i]
}
FIGURE 2.27. REMOTE READ CODE EXAMPLE.
All four processors begin executing simultaneously: PE 0 fills A(1..32), PE 1 fills
A(33..64), PE 2 fills A(65..96), and PE 3 fills A(97..100). Note that for most of the loop,
each processor must access elements of array B that lie on a different processor than the
executing processor. Each one of these remote accesses entails a transfer of data from the
producing PE to the consuming PE, an operation that is relatively expensive on all current
distributed memory multiprocessors. It will never be possible to remove the need for
remote accesses from distributed computations, so PODS must instead use a technique for
diminishing their effect on the overall computation time. The technique PODS uses is
called remote access caching.
Remote access caching takes advantage of the fact that in PODS, no array element may be
written to multiple times. As a result of this, PEs can cache data that has been recently
accessed without considering cache coherency problems. In the partitioning scheme
defined above, each PE contains some number of pages of each array. To accomplish
remote access caching, PODS defines a remote access as the retrieval and local storing of
the entire page containing the remote datum. That is, when a particular element is fetched
from a remote PE, the entire page containing that element is sent back to the requesting PE
when the requested element becomes available. Due to locality of reference in many
algorithms, it is likely that the same PE will need another element from that page in the near
future, so if the cache is checked first a remote access will often be avoided. Of course, if
the next requested element was not available at the time the page was cached, then another
remote access, transferring the same page, will be required. Note that the term "cache"
used here does not refer to a specialized hardware device, used to reduce access time to
main memory. Rather, it is a "software cache" used to reduce access time to remote
memory modules.
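The effect of remote access caching on the loop of Figure 2.27 can be estimated with a small simulation. The class and function names, the cache capacity of two pages, and the assumption that every element of B has already been written when its page is transferred are all hypothetical. The sketch counts transfers and does not model the PODS Array Manager.

```python
from collections import OrderedDict

PAGE_SIZE, N_PES, N = 32, 4, 100

def owner(index):
    """PE that owns 1-based element `index` (one page per PE in this example)."""
    return (index - 1) // PAGE_SIZE

class PageCache:
    """Least-recently-used software cache of remote pages (illustrative only)."""
    def __init__(self, capacity):
        self.capacity, self.pages, self.remote_fetches = capacity, OrderedDict(), 0

    def read(self, index):
        page = (index - 1) // PAGE_SIZE
        if page in self.pages:
            self.pages.move_to_end(page)          # mark as most recently used
            return
        self.remote_fetches += 1                  # whole page is transferred
        self.pages[page] = True
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)        # evict least recently used

def remote_reads_of_b(pe, use_cache):
    """Count remote transfers PE `pe` needs for B[101-i] over its share of i."""
    cache, uncached = PageCache(capacity=2), 0
    for i in range(1, N + 1):
        if owner(i) != pe:
            continue                              # another PE fills A[i]
        b_index = 101 - i
        if owner(b_index) == pe:
            continue                              # local read
        uncached += 1
        cache.read(b_index)
    return cache.remote_fetches if use_cache else uncached

for pe in range(N_PES):
    print(pe, remote_reads_of_b(pe, False), remote_reads_of_b(pe, True))
```

Under these assumptions, each PE's element-by-element remote reads collapse into one or two page transfers.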
Remote Writes 
In the previous discussion, it was mentioned that occasionally the program structure makes
remote writes unavoidable. A remote write is an array write where the data collector is on a
different PE than data storage. When this occurs a message must be sent to the remote PE
with the value, the array ID, and the indices. As an example of why this might happen,
examine the following code segment:
For i <- 1 To 100 Do
{
A[i] = B[i]
C[i+10] = B[i+5] * 2
}
FIGURE 2.28. REMOTE WRITE CODE EXAMPLE.
To see why the PODS data partitioning method causes remote writes in this case, consider
that a write to C may occur at a location not owned by the PE executing the loop. For
example, suppose i is 25. PE #0 is responsible for writing A(25), however PE #1 is
responsible for writing C(35). Without loop fission, it is necessary either for PE #0 to
remotely write to C(35) or for PE #1 to remotely write to A(25). This is not a single
assignment violation, but it is inefficient. In this case loop fission could solve the problem;
however, in general, there is no simple solution. To avoid using different mapping
functions for A and C, PODS allows remote writes instead.
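The iterations that produce remote writes in Figure 2.28 can be enumerated directly. This sketch assumes the page size of 32 from the remote read example, one page per PE, and that array C is large enough to hold index 110; none of these are stated for this figure in the text.

```python
PAGE_SIZE = 32

def owner(index):
    """PE owning 1-based element `index`, assuming one 32-element page per PE."""
    return (index - 1) // PAGE_SIZE

# Iterations whose write to C[i+10] lands on a PE other than the one
# that executes iteration i (the owner of A[i]).
remote = [i for i in range(1, 101) if owner(i) != owner(i + 10)]
print(len(remote), remote[:3], owner(25), owner(35))
```

Iteration 25 is among them: A(25) is owned by PE #0 while C(35) is owned by PE #1.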
Remote writes are also necessary for another reason. In ideal circumstances all data is
written to locally, but program structure can sometimes cause remote array writes. Note
that it is not always possible to determine, at compile time, which element is being updated
by an assignment statement. Consider the loop below:
For k <- 1 To n Do
{
A[f(k)] = B[g(k)]
}
FIGURE 2.29. IMPOSSIBLE COLLECTOR WRITES.
The functions f and g make it impossible to determine which element a given k will be
assigning at compile time. In this case each PE must calculate f(k) for all k's to determine
if that element of A is inside its area-of-responsibility. It should also be noted that arrays
are single assignment and that f(k) must be well behaved (one-to-one) over the range of
k, otherwise a single assignment run-time error will occur.
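The per-PE scan described above can be sketched as follows. The function name, the example mapping `reverse`, and the place where the single assignment check is performed are assumptions; in PODS the violation would surface as a run-time error rather than through this explicit set.

```python
def iterations_for_pe(f, n, begin, end):
    """Return the k in 1..n whose target f(k) lies in this PE's inclusive
    area-of-responsibility [begin, end]; every PE scans all k."""
    written, mine = set(), []
    for k in range(1, n + 1):
        target = f(k)
        if target in written:
            raise RuntimeError(f"single assignment violated at A[{target}]")
        written.add(target)
        if begin <= target <= end:
            mine.append(k)
    return mine

reverse = lambda k: 9 - k                 # hypothetical one-to-one f over 1..8
print(iterations_for_pe(reverse, 8, 1, 4))
```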
2.5.5. For-Loop Distribution Algorithm
Now that DDE has been introduced, the effects of LCDs have been discussed, and the
mechanisms for distribution have been explained, the actual for-loop distribution algorithm
can be presented.
There are three primary mechanisms for achieving distribution. The data distribution
mechanism (ALLOCATE operator) has already been discussed. For PODS to distribute SPs
they need to be spawned on multiple PEs. It is the DIST-L operator which performs this.
When PODS determines that a certain level of a nested loop is to be distributed, its parent
SP gets DIST-L operators, and it gets the third primary mechanism: the RANGE_FILTER.
At compile time the program is analyzed to determine which for-loops will be distributed.
Those for-loops which are distributed will be augmented with RANGE_FILTER code. The
task of the RANGE_FILTER is to produce only those loop variables which make the array
accesses local. At load time, each PE will be given a copy of the entire program (all PEs
are homogeneous). At run-time tokens are sent to different PEs to start execution of a
particular for-loop SP.
Since arrays are partitioned row-major the code will be row-major as well. If it is not
row-major it will run, but inefficiencies will occur. In order to efficiently execute a
row-major nested loop the outer-most level of the nest should be distributed. This reduces
communication cost and context switching, and allows the array caching to operate
efficiently. Given these observations and the previous principles, the algorithm for for-loop
distribution determination is presented below.
Algorithm: Loop Distribution
1. Starting with the outer-most code-block, repeat the following
until all sets of nested loops are marked (depth-first traversal) as
either distributed or local.
a. Consider the next inner code-block. If this code-block does
not have any LCDs, then mark it, and all descendent SPs will be
local.
b. If this inner SP has an LCD, then goto step 2.
c. If this is the inner-most SP, then consider the next
unmarked SP (depth-first) and goto step 2.
2. In each marked SP a range filter replaces the predicate.
3. In the parent of each marked SP the L operators are changed into
DIST_L operators.
The outer-most SP of an entire program cannot be distributed. This is because every
program must start somewhere; i.e., there is a single first instruction in every program. If
it is desirable, due to LCDs, to distribute this outer-most SP, then a dummy SP is set up to
drive the original SP.
2.5.6. Examples 
LCD Examples 
In a two-level nested loop there are four basic cases which involve LCDs: (1) no LCDs in
either i or j; (2) LCD in i; (3) LCD in j; and (4) LCDs in both. PODS handles each of
these cases efficiently.
In the following examples the same array and filling loop will be used; however, the filling
function (FUNC) will be changed to add or remove LCDs as necessary. Consider a simple
nested loop which fills an 8 x 4 array A.
For i <- 1 To 8 Do
  For j <- 1 To 4 Do
  {
    A[i,j] = FUNC(x)
  }
FIGURE 2.30. SIMPLE ARRAY FILLING EXAMPLE CODE.
For the above code there would be two SPs, one for the i loop and one for the j loop.
Since there are no LCDs, either level can be distributed. Assume the array is partitioned as
shown in Figure 2.31 below, and assume that the communication delay is a short five time
units, and that a context switch is one time unit.
[Figure: array A with rows i = 1 to 8 and columns j = 1 to 4, partitioned row-major into
four regions assigned to PE 1, PE 2, PE 3, and PE 4.]
FIGURE 2.31. SIMPLE ROW-MAJOR ARRAY PARTITIONING.
Case 1: No LCDs
FUNC has no LCDs, e.g. A[i,j] = B[i,j]. If i were distributed the execution would be as
shown in Table 2.4. The pairs of numbers in the table show when A[i,j] is being written;
this only occurs in the j SP. The gen i operations are for the i SP. This assumes PE 1
starts out generating the i's needed and then broadcasts them to all of the PEs (including
itself). When a PE gets an i value it starts the j SP. There are times when there is nothing
in this PE's area-of-responsibility; thus the for i=x: 0. The initial bc comes from the
parent SP which contains the DIST_L operators; this is how all of the initial parameters get
broadcast. Note that i is not broadcast in this case.
time  PE 1          PE 2          PE 3          PE 4
0     initial bc
1     context sw    commdelay     commdelay     commdelay
2     gen i=1       commdelay     commdelay     commdelay
3     gen i=2       commdelay     commdelay     commdelay
4     context sw    gen i=3       gen i=4       gen i=7
5     1,1           context sw    gen i=5       gen i=8
6     1,2           3,1           gen i=6       context sw
7     1,3           3,2           context sw    7,1
8     1,4           3,3           4,1           7,2
9     context sw    3,4           4,2           7,3
10    2,1                         4,3           7,4
11    2,2                         4,4           context sw
12    2,3                         context sw    8,1
13    2,4                         5,1           8,2
14                                5,2           8,3
15                                5,3           8,4
16                                5,4
17                                context sw
18                                6,1
19                                6,2
20                                6,3
21                                6,4
TABLE 2.4. EFFECTS OF OUTER LOOP DISTRIBUTION WITH NO LCDS.
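
The assignment of i values to PEs in Table 2.4 can be reproduced with a small range-filter sketch. The partition sizes below are an assumption, inferred from which PE writes each element in Table 2.5 (the figure itself does not give them numerically), and the function names are hypothetical. The filter keeps an i value on a PE when that PE owns the first element of row i.

```python
ROWS, COLS = 8, 4

# Assumed number of elements per PE in row-major order (inferred from Table 2.5,
# not stated in the text): PE 1 and PE 2 hold 6 each, PE 3 holds 12, PE 4 holds 8.
PARTITION = {1: 6, 2: 6, 3: 12, 4: 8}

def owner(i, j):
    """PE that holds A[i,j] (1-based indices) under the assumed partition."""
    offset = (i - 1) * COLS + (j - 1)
    for pe, size in PARTITION.items():
        if offset < size:
            return pe
        offset -= size
    raise IndexError((i, j))

def range_filter(pe):
    """Values of i this PE generates when the i loop is distributed.

    Responsibility for a whole row follows the owner of its first element.
    """
    return [i for i in range(1, ROWS + 1) if owner(i, 1) == pe]

for pe in PARTITION:
    print(pe, range_filter(pe))
# 1 [1, 2]
# 2 [3]
# 3 [4, 5, 6]
# 4 [7, 8]
```

Under this assumption `owner(2, 3)` is PE 2, yet row 2 is generated entirely on PE 1, which is the extension of the area-of-responsibility that the text describes.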
Notice that PE #1 will take over elements 2,3 and 2,4 as the iteration space partitioning
extends the area-of-responsibility based upon the first element. The communication delays
will overlap with the operations and context switches so that the multiple i-loop
communication delays do not delay the execution multiple times. Now if j were distributed
the execution would be as follows in Table 2.5. The parent SP does not have DIST_L
operators like above; it has regular L operators (which do not broadcast). Here the i's are
broadcast from the i SP (the gen i operations).
time  PE 1          PE 2          PE 3          PE 4
0     parent sp
1     context sw
2     gen i=1
3     gen i=2       commdelay     commdelay     commdelay
4     gen i=3       commdelay     commdelay     commdelay
5     gen i=4       commdelay     commdelay     commdelay
6     gen i=5       for i=1: 0    for i=1: 0    for i=1: 0
7     gen i=6       context sw    context sw    context sw
8     gen i=7       2,3           for i=2: 0    for i=2: 0
9     gen i=8       2,4           context sw    context sw
10    1,1           context sw    for i=3: 0    for i=3: 0
11    1,2           3,1           context sw    context sw
12    1,3           3,2           4,1           for i=4: 0
13    1,4           3,3           4,2           context sw
14    context sw    3,4           4,3           for i=5: 0
15    2,1           context sw    4,4           context sw
16    2,2           for i=4: 0    context sw    for i=6: 0
17    context sw    context sw    5,1           context sw
18    for i=3: 0    for i=5: 0    5,2           7,1
19    context sw    context sw    5,3           7,2
20    for i=4: 0    for i=6: 0    5,4           7,3
21    context sw    context sw    context sw    7,4
22    for i=5: 0    for i=7: 0    6,1           context sw
23    context sw    context sw    6,2           8,1
24    for i=6: 0    for i=8: 0    6,3           8,2
25    context sw    context sw    6,4           8,3
26    for i=7: 0                  context sw    8,4
27    context sw                  for i=7: 0
28    for i=8: 0                  context sw
29                                for i=8: 0
TABLE 2.5. EFFECTS OF INNER LOOP DISTRIBUTION WITH NO LCDS.
Note that every PE is doing something, thus distributing additional levels of the nest would
do nothing to speed up execution. In this case the j loop distribution must wait for each i to
be generated. After the initial communication delays each PE will start checking the i
values it receives. If i is in the range (as determined by the RANGE_FILTER) then j values
will be generated; if not, the SP completes. This example graphically shows that outer
level distribution is better than inner level (execution time of 21 vs. 29), as described in
Section 2.6.1 above.
Case 2: LCD in i
FUNC uses i in such a way that there is a LCD, e.g. A[i,j] = A[i-1, j]. In this case PODS
would not allow i to be distributed, and the RANGE_FILTER would go in the jth level (i.e.
be distributed). As in Case 1 when j was distributed, the iterations must wait for i to be
generated. Since the LCD is in i the loop would execute as shown in Table 2.6 (execution
time of 45). Note that Table 2.6 obeys the LCD restriction on i. The blocks in Table 2.6
mean that the necessary array elements have not yet been written.
time  PE 1          PE 2          PE 3          PE 4
0     parent sp
1     context sw
2     gen i=1
3     gen i=2       commdelay     commdelay     commdelay
4     gen i=3       commdelay     commdelay     commdelay
5     gen i=4       commdelay     commdelay     commdelay
6     gen i=5       for i=1: 0    for i=1: 0    for i=1: 0
7     gen i=6       context sw    context sw    context sw
8     gen i=7       block         for i=2: 0    for i=2: 0
9     gen i=8       block         context sw    context sw
10    1,1           block         for i=3: 0    for i=3: 0
11    1,2           block         context sw    context sw
12    1,3           block         block         for i=4: 0
13    1,4           commdelay     block         context sw
14    context sw    commdelay     block         for i=5: 0
15    2,1           commdelay     block         context sw
16    2,2           2,3           block         for i=6: 0
17    context sw    2,4           block         context sw
18    for i=4: 0    context sw    block         block
19    context sw    3,1           block         block
20    for i=5: 0    3,2           commdelay     block
21    context sw    3,3           commdelay     block
22    for i=6: 0    3,4           commdelay     block
23    context sw    context sw    4,1           block
24    for i=7: 0    for i=4: 0    4,2           block
25    context sw    context sw    4,3           block
26    for i=8: 0    for i=5: 0    4,4           block
27                  context sw    context sw    block
28                  for i=6: 0    5,1           block
29                  context sw    5,2           block
30                  for i=7: 0    5,3           block
31                  context sw    5,4           block
32                  for i=8: 0    context sw    block
33                                6,1           block
34                                6,2           commdelay
35                                6,3           commdelay
36                                6,4           commdelay
37                                context sw    7,1
38                                for i=7: 0    7,2
39                                context sw    7,3
40                                for i=8: 0    7,4
41                                              context sw
42                                              8,1
43                                              8,2
44                                              8,3
45                                              8,4
TABLE 2.6. EFFECTS OF INNER LOOP DISTRIBUTION WITH LCDS.
Comparing this to Table 2.7 below, the execution time is shorter with inner loop
distribution than with no distribution (sequentially). As the work per element grows or the
array dimensions increase, this advantage will grow.
Case 3: LCD in j
FUNC uses j in such a way that there is a LCD, e.g. A[i,j] = A[i, j-1]. In this case PODS
would distribute i, and the RANGE_FILTER would go in the ith level. As in Case 1 when i
was distributed, the iterations will run in parallel right away. And since the LCD is in j the
loop would execute exactly like the first part of Case 1 (execution time of 21). Note that
Table 2.4 obeys the LCD restriction on j.
Case 4: LCD in i and j
In this case FUNC would be something like A[i,j] = A[i-1, j-1]. Since there are LCDs in
each level, PODS would not distribute this loop at all and the execution would be as shown
below in Table 2.7 (total execution time of 49). The load balance in this case is also very
poor.
time  PE 1          PE 2          PE 3          PE 4
0     parent sp
1     context sw
2     gen i=1
3     gen i=2
4     gen i=3
5     gen i=4
6     gen i=5
7     gen i=6
8     gen i=7
9     gen i=8
10    context sw
11    1,1
12    1,2
13    1,3
14    1,4
15    context sw
16    2,1
17    2,2
18    2,3
19    2,4
20    context sw
...
43    7,3
44    7,4
45    context sw
46    8,1
47    8,2
48    8,3
49    8,4
TABLE 2.7. EFFECTS OF NO DISTRIBUTION DUE TO LCDS.
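
The finish times of 21 (Table 2.4) and 49 (Table 2.7) follow from a simple per-PE timing model, sketched here in Python. The model is read off the tables rather than stated in the text, so it is an assumption: a PE spends one time unit generating each i, then one context switch plus one time unit per element for each row; PE 1 starts generating at time 2 and the remote PEs at time 4. The function name is hypothetical, and the model does not cover the inner-loop schedules of Tables 2.5 and 2.6.

```python
def finish_time(first_gen, rows, cols=4):
    """Last busy time unit of a PE that starts generating i values at time
    `first_gen` and then fills `rows` rows of `cols` elements each."""
    gen = rows                # one time unit per generated i value
    per_row = 1 + cols        # one context switch, then `cols` element writes
    return first_gen + gen + rows * per_row - 1

# Outer loop distributed (Table 2.4): rows per PE are 2, 1, 3, and 2.
outer = {1: finish_time(2, 2), 2: finish_time(4, 1),
         3: finish_time(4, 3), 4: finish_time(4, 2)}
print(outer, max(outer.values()))   # {1: 13, 2: 9, 3: 21, 4: 15} 21

# No distribution (Table 2.7): PE 1 fills all 8 rows.
print(finish_time(2, 8))            # 49
```

In this model the total time is set by the PE holding the most rows, which is why the uneven partition leaves PE 3 as the last to finish in Table 2.4.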
It is interesting to note that Cases 2 and 3 are still executed in parallel by PODS even with
the LCDs. In general Case 2 could generate multiple diagonal wavefronts, while Case 1
would execute with a horizontal sweep. The diagrams below illustrate the different
execution patterns. The numbers in each cell are the time each cell would be filled. By
tracing lines through equal times the wavefront can be seen.
[Figure: two panels, "Two Simultaneous Diagonal Wavefronts" and "One Horizontal Sweep".]
FIGURE 2.32. LCD EXECUTION WAVEFRONTS.
Matrix Multiply 
To better understand the distribution algorithm reconsider the Matrix Multiply code shown
previously in Figure 2.15. There are three code-blocks in this function which turn into
SPs: one for the i-loop (MM-0); one for the j-loop (MM-1); and one for the k-loop (MM-2). This
function has no LCDs in the i or j loops, only in the k-loop. Using the loop distribution
determination algorithm above, the outer-most code-block (the i-loop) cannot be distributed
without setting up a dummy parent. The next inner code-block (the j-loop) has no LCD
and will thus be distributed. All descendent code-blocks (only the k-loop) will be local.
The files below are the exact inputs that were used to run Matrix Multiply on the PODS
Simulator.
MM-0
#   opcode            #args  args (value, port) -> dest [i, p] route (r) {e}
0   PROMPT            0      -> [12,0] {B}
1   PROMPT            0      -> [13,0] [7,0] [2,0] [5,0] [4,0] [3,0] {A}
2   UPPER BOUND       2      (0.00,1) -> [6,2] [9,1] [11,0]
3   LOWER BOUND       2      (1.00,1) -> [6,3] [14,0]
4   UPPER BOUND       2      (1.00,1) -> [6,4] [15,0]
5   LOWER BOUND       2      (0.00,1) -> [6,1] [16,0]
6   ALLOCATE          5      (2.00,0) -> [8,0]
7   LOWER BOUND       2      (0.00,1) -> [9,0] [10,1]
8   FORKJUMP          2      (1.00,1) -> [17,0]
9   LE                2      (STKY,1) -> [10,0]
10  SWITCH            5      (1.00,2) (11.00,3) (2.00,4) -> [18,0] [19,0] [21,0] {I}
11  DIST LOPERATOR    1      (STKY,0) -> (12)
12  DIST LOPERATOR    1      (STKY,0) -> (14)
13  DIST LOPERATOR    1      (STKY,0) -> (15)
14  DIST LOPERATOR    1      (STKY,0) -> (10)
15  DIST LOPERATOR    1      (STKY,0) -> (11)
16  DIST LOPERATOR    1      (STKY,0) -> (13)
17  DIST LOPERATOR    1      (STKY,0) -> (16)
18  DIST LOPERATOR    1      -> (1)
19  PLUS              2      (1.00,1) -> [20,0] {NEXT-I}
20  D                 2      (-11.00,1) -> [9,0] [10,1] {I}
21  DINV              1      ->
In SP MM-0 the PROMPT instructions acquire the A and B matrices used in the Matrix
Multiply. The UPPER_BOUND and LOWER_BOUND instructions access the array
headers to set up the loop boundaries. ALLOCATE then remotely distributes the C array
and feeds a FORKJUMP operator. This FORKJUMP is necessary for the array manager
to have a place to return the array identifier it just allocated. The LE, SWITCH, PLUS, D,
and DINV are the standard dataflow operators. The new PODS operator is the
DIST_LOPERATOR, which performs the standard L operator dataflow operations, but
also sends its tokens to all PEs. This is how i gets distributed.
In SP MM-1 below, there is the local equivalent of the DIST_LOPERATOR, the
LOCAL_LOPERATOR, which sends its tokens only to itself. LOCAL_LOPERATORs are
only used when the operations have already been distributed, and more distribution would
just create network overhead without additional parallelism. MM-1 also has a range filter
inserted into it, from instruction 0 to 18 and 29 to 30.
MM-1
#   opcode            #args  args (value, port) -> dest [i, p] route (r) {e}
0   INTERVAL COUNT    1      (STKY,0) -> [1,1]
1   LT                2      (0.00,0) (STKY,1) -> [2,0]
2   SWITCH            5      (0.00,1) (1.00,2) (29.00,3) (3.00,4) -> [3,0] [5,0] [8,1] [31,0]
3   B RANGE           3      (STKY,1) (0.00,2) -> [4,1]
4   GE                2      (STKY,0) -> [7,0]
5   E RANGE           3      (STKY,1) (0.00,2) -> [6,0]
6   GE                2      (STKY,1) -> [7,1]
7   AND               2      -> [8,0]
8   SWITCH            5      (1.00,2) (9.00,3) (3.00,4) -> [9,0] [10,0] [16,1] [17,0]
9   E RANGE           3      (STKY,1) (1.00,2) -> [11,1]
10  B RANGE           3      (STKY,1) (1.00,2) -> [11,0] [12,1]
11  LE                2      (STKY,1) -> [12,0] [16,0]
12  SWITCH            5      (1.00,2) (4.00,3) (3.00,4) -> [13,1] [15,0] [19,1]
13  LE                2      (STKY,0) -> [14,0]
14  SWITCH            5      (STKY,1) (1.00,2) (-3.00,3) (0.00,4) -> [11,0] [12,1]
15  LE                2      (STKY,1) -> [16,0] [19,0]
16  SWITCH            5      (STKY,0) (STKY,1) (3.00,2) (1.00,3) (0.00,4) -> [17,0]
17  PLUS              2      (1.00,1) -> [18,0]
18  FORKJUMP          2      (-17.00,1) -> [1,0] [2,1]
19  SWITCH            5      (1.00,2) (12.00,3) (3.00,4) -> [20,0] [26,0] [27,3] [31,0] {J}
20  LOCAL LOPERATOR   1      -> (7)
21  LOCAL LOPERATOR   1      (STKY,0) -> (2)
22  LOCAL LOPERATOR   1      (STKY,0) -> (3)
23  LOCAL LOPERATOR   1      (STKY,0) -> (4)
24  LOCAL LOPERATOR   1      (STKY,0) -> (5)
25  LOCAL LOPERATOR   1      (STKY,0) -> (6)
26  PLUS              2      (1.00,1) -> [28,0] {NEXT-J}
27  WRITE ARRAY       4      (STKY,1) (STKY,2) ->
28  D                 2      (1.00,1) -> [11,0] [12,1] [29,1] {J}
29  GE                2      (STKY,0) -> [30,0]
30  SWITCH            5      (0.00,1) (-11.00,2) (-19.00,3) -> [19,0]
31  DINV              1      ->
SP MM-2 is a simple local loop which performs a reduction-like operation. MM-2's LCD 
causes it to be run on one PE and not distributed. The LOCAL_LINV operator routes the 
sum (S) back to its parent SP which is on the same PE since it is a local operation. This 
route uses route list 9 which is programmed into every Routing Unit. 
MM-2
#   opcode            #args  args (value, port) -> dest [i, p] route (r) {e}
0   LE                2      (STKY,1) -> [2,0] [1,0]
1   SWITCH            5      (1.00,2) (1.00,3) (3.00,4) -> [3,2] [5,1] [4,0] [10,0] {TRIGGER}
2   SWITCH            5      (0.00,1) (1.00,2) (8.00,3) (1.00,4) -> [7,0] [11,0] {S}
3   READ ARRAY        3      (STKY,0) (STKY,1) -> [6,0]
4   PLUS              2      (1.00,1) -> [8,0] {NEXT-K}
5   READ ARRAY        3      (STKY,0) (STKY,2) -> [6,1]
6   MULT              2      -> [7,1]
7   PLUS              2      -> [9,0] {NEXT-S}
8   D                 2      (1.00,1) -> [0,0] [1,1] {K}
9   D                 2      (-9.00,1) -> [2,1] {S}
10  DINV              1      ->
11  DINV              1      -> [12,0]
12  LOCAL LINV        1      -> (9)
The routing file below is the "program" that the Routing Unit follows for sending tokens to
different SPs. Notice that route list 9, used by MM-2, sends the sum to MM-1, instruction
27, port 0. Checking MM-1 we see that instruction 27 is the WRITE_ARRAY instruction
which is filling array C.
DISPLAYING ROUTES
#   destinations [sp, inst, port]
1   -> [1,25,0] [1,27,2] [1,4,0] [1,6,1]
2   -> [2,0,0] [2,1,1]
3   -> [2,0,1]
4   -> [2,5,0]
5   -> [2,3,0]
6   -> [2,3,1]
7   -> [2,5,2]
9   -> [1,27,0]
10  -> [1,13,0] [1,14,1]
11  -> [1,15,1] [1,29,0]
12  -> [1,22,0]
13  -> [1,21,0]
14  -> [1,23,0]
15  -> [1,24,0]
16  -> [1,27,1] [1,0,0] [1,3,1] [1,5,1] [1,9,1] [1,10,1]
Figure 2.33 illustrates the distribution of the three Matrix Multiply SPs across four PEs.
The curved lines represent broadcasts, the straight lines represent execution time, and the
bold lines correspond to the comments on the right-hand side of the figure. For this
example assume the Matrix Multiply starts out on PE #2. There SP 0 begins execution,
and encounters the "ALLOCATE C" instruction. This instruction initiates a broadcast
message to the other PEs. Upon receipt of this message, each PE allocates its portion of
the array. Next, SP 0 generates and broadcasts the first value for i. Note that SP 0 does
not have a range filter, thus it will generate all i-indices.
[Figure: execution trace across PE #1, PE #2, PE #3, and PE #4. SP 0 executes "allocate C"
and broadcasts i = 0; each PE starts SP 1; the responsible PE runs SP 1 for j = 0, the
SP 2 k-loop, and then SP 1 for j = 1. Comments in the figure:
• C is distributed
• i = 0 is broadcast, starts j loops
• only PE #3 has responsibility when i is 0
• PE #3 begins j loop
• all other SPs stop
• k-loop for (0,0) runs locally and reports sum back to SP 1]
FIGURE 2.33. EXAMPLE EXECUTION TRACE FOR MATRIX MULTIPLY ON 4 PES.
Each remote PE that receives an activating token (value 0) instantiates SP 1. SP 1 does
have a range filter, so it will process only those indices for which the current PE is
responsible. Thus a number of PEs quickly execute essentially empty SPs because they
have no elements for which they are responsible when i is 0. In this case, PE #3 is the only
PE with operations to perform when i is 0. PE #3 executes SP 1, which spawns the
k-loop locally (the fact that the loop is local was determined at compile time). The k-loop is
a simple loop that generates a vector dot product and returns the result to its parent SP. The
j-loop may now continue with the j values for which it is responsible when i is 0.
In parallel with the execution of the first iteration of the i-loop, the original SP 0 continues
generating and broadcasting successive values for i. This will cause new ready SPs to
queue up in remote PEs. As other SPs block waiting for tokens, these new SPs will be
selected for execution by the scheduler.
Once the k-loop starts, it will access remote pages from different PEs as necessary. This is 
where the existence of the remote access cache becomes important - a large number of 
reads will access the local array cache rather than causing a remote read. 
Thus the SPs are efficiently distributed across the PEs. The distribution of Matrix Multiply 
across the set of PEs is efficient and uses little overhead. 
2.6. Functional Distribution
In PODS, functional distribution is not a primary concern. The APPLY operator is used to
spawn function calls on a single remote PE, and the INVERSE_APPLY is used to report any
answers. Both of these operators are similar to the original ID operators, as described
previously.
PODS distributes functions at run time. Since all communication into and out of a function
goes through the calling SP, this decision does not have to be broadcast to the other SPs.
Functional distribution occurs in two steps: the first step is to determine whether to
distribute the function or not, and the second is to determine where to send it if it is to be
distributed.
Currently all function calls in PODS are distributed. In the future the loading of the PE and
the size of the function may be used to determine whether functions should be local or
distributed.
Once it has been determined that a function will be distributed, where it will be sent must
be decided. In order to randomly distribute the work load the simple hash function below
is used to generate the ID of the target PE.
Target PE ID= (iteration + SP ID+ Calling PE ID) mod (number PEs) 
This will place different iterations on different PEs, which is necessary for calls inside of
loops. By using the calling PE's ID the same functions called from different PEs will not all
end up on the same PEs. Finally, the SP number adds to the randomness, particularly at the
beginning of a program.
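
As a minimal sketch, the hash function above can be written directly in Python. The function and parameter names are illustrative, and zero-based PE IDs are an assumption; the text does not say how PEs are numbered.

```python
def target_pe(iteration, sp_id, calling_pe, num_pes):
    """Target PE ID = (iteration + SP ID + Calling PE ID) mod (number PEs)."""
    return (iteration + sp_id + calling_pe) % num_pes

# Four PEs; SP 1 on PE 2 calls a function in four successive loop iterations.
print([target_pe(it, sp_id=1, calling_pe=2, num_pes=4) for it in range(4)])
# [3, 0, 1, 2]

# The same call made from PE 3 instead lands on a shifted set of PEs.
print([target_pe(it, sp_id=1, calling_pe=3, num_pes=4) for it in range(4)])
# [0, 1, 2, 3]
```

Successive iterations cycle through the PEs, and changing the calling PE shifts the cycle, which is the spreading effect the text describes.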
This approach provides a fairly random distribution, which in turn tends to generate a
balanced work load. Given more information, a more complex and possibly better
distribution function may be used, but the simple approach achieves acceptable results
without wasting interconnect bandwidth in order to maintain global state information.
2.7. Deadlock Handling
Once SPs are formed they are checked for deadlock. Deadlock can occur when dynamic
arcs are present in such a way that the actual instruction execution order depends on the
indices used.
Iannucci [Ian88] handles deadlock in such a way that the execution of very small SPs must
be efficient. This is not possible on currently available distributed memory MIMD
machines. PODS instead produces a partitioning then checks it for deadlock. If it is
deadlock-free then it will run efficiently. If it has deadlocks then the programmer is given
the choice to either change the offending code, or have the partitioner split the SP to remove
the deadlocks.
PODS uses a combination of deadlock avoidance and detection. In PODS, unnecessary
deadlocks occur only when an array read is placed before its array write. In order to
understand simple deadlock, consider the SP fragment below. The READ will request an
I-structure read and the value would be sent to the '+ 1'. But since the WRITE has not yet
occurred (if A[i] already has a value then a single-assignment violation will occur), the
'+ 1' will block and will never unblock - causing deadlock.
0 reg0 <- read(A, i)
1 reg0 <- reg0 + 1
8 reg0 <- somevalue
9 write(reg0, A, i)
In order to avoid this PODS places array writes before any reads of the same array. This is
only limited by the static data dependencies. If A[i] = A[i+1] + 1 (a LCD), then the array
read of A[i+1] will be before the array write into A[i]. This is not a problem. In the
example above, PODS would avoid the deadlock by ordering the instructions as follows.
0 reg0 <- somevalue
1 write(reg0, A, i)
2 reg0 <- read(A, i)
3 reg0 <- reg0 + 1
However, this will not always work. If another array write to the same array occurs in the
same SP then deadlock can occur. Once this has been detected, PODS splits the SP just
after the array write to avoid the possible deadlock. This will avoid the deadlock because
array writes do not have an output dependency arc. Thus, putting instructions after an
array write adds an implicit dependency arc out of the array write. Splitting just after the
array write will remove this added arc, thus returning the dataflow graph to its original,
deadlock-free state.
Consider the example below. This is the example Iannucci used to describe MDS. What
follows is how PODS would handle it.
a = vector(0,2);
a[0] = 0;
a[1] = a[i] + 1;
a[2] = a[j] - 2;
in a[1] - a[2];
FIGURE 2.34. ID NOUVEAU DEADLOCK CODE EXAMPLE.
The unchecked PODS SP would look something like:
SP 0
0 write(0, A, 0)
1 reg0 <- read(A, i)
2 reg0 <- reg0 + 1
3 write(reg0, A, 1)
4 reg0 <- read(A, j)
5 reg0 <- reg0 - 2
6 write(reg0, A, 2)
7 reg0 <- read(A, 1)
8 reg1 <- read(A, 2)
9 return(reg0 - reg1)
This can cause an unnecessary deadlock if i = 2 and j = 0. In the code above (with i = 2
and j = 0), instruction #2 blocks awaiting the read from instruction #1. This deadlock is
unnecessary because a different ordering would not deadlock. Moving instructions
#4 - #6 above to positions #1 - #3 forms the ordering below, in which i = 2 and j = 0 would
not cause a deadlock. However, the code below would block on i = 0 and j = 1, where the
code above would not. Both of these orderings cause unnecessary deadlocks because they
can be removed; a necessary deadlock would occur if i = 1 or j = 2
(see Figure 2.34 above).
SP 0
0 write(0, A, 0)
1 reg0 <- read(A, j)
2 reg0 <- reg0 - 2
3 write(reg0, A, 2)
4 reg0 <- read(A, i)
5 reg0 <- reg0 + 1
6 write(reg0, A, 1)
7 reg0 <- read(A, 1)
8 reg1 <- read(A, 2)
9 return(reg0 - reg1)
PODS would recognize that there are three array writes to the same array in the same SP. 
Therefore, the SP must be split after every write. This will form the following SPs. 
SP 0
0 write(0, A, 0)
SP 1
0 reg0 <- read(A, i)
1 reg0 <- reg0 + 1
2 write(reg0, A, 1)
SP 2
0 reg0 <- read(A, j)
1 reg0 <- reg0 - 2
2 write(reg0, A, 2)
SP 3
0 reg0 <- read(A, 1)
1 reg1 <- read(A, 2)
2 return(reg0 - reg1)
This will remove the dynamic arcs caused by placing instructions after an array write; array
writes do not have output dependency arcs. These unnecessary dependency arcs are what
cause the deadlock. These types of situations are possible but unlikely. In none of the
Livermore Loops, nor Matrix Multiply, nor in any of SIMPLE does code like this occur.
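
The effect of splitting can be checked with a toy simulation, sketched here in Python. This is not the PODS partitioner or simulator: the `run` function and the step encoding are hypothetical, each SP is reduced to its ordered array reads and writes, and an SP is taken to block at a read of an unwritten cell (in PODS the block occurs at the instruction that consumes the value, which gives the same outcome for these listings).

```python
def run(sps):
    """Run SPs over one I-structure array; return True if every SP completes.

    Each SP is a list of ("read", index) or ("write", index) steps executed in
    order. A read of an unwritten cell blocks that SP until the cell is written.
    """
    written = set()
    pc = [0] * len(sps)
    progress = True
    while progress:                       # stop once a full pass changes nothing
        progress = False
        for n, sp in enumerate(sps):
            while pc[n] < len(sp):
                op, idx = sp[pc[n]]
                if op == "read" and idx not in written:
                    break                 # this SP blocks here
                if op == "write":
                    written.add(idx)
                pc[n] += 1
                progress = True
    return all(pc[n] == len(sp) for n, sp in enumerate(sps))

def unchecked(i, j):
    """The single unchecked SP: a[0]; a[1] = a[i] + 1; a[2] = a[j] - 2; result."""
    return [[("write", 0), ("read", i), ("write", 1), ("read", j), ("write", 2),
             ("read", 1), ("read", 2)]]

def split(i, j):
    """The same work split after every array write, as in SP 0 to SP 3 above."""
    return [[("write", 0)],
            [("read", i), ("write", 1)],
            [("read", j), ("write", 2)],
            [("read", 1), ("read", 2)]]

print(run(unchecked(2, 0)))   # False: unnecessary deadlock
print(run(split(2, 0)))       # True: splitting removes it
print(run(split(1, 0)))       # False: a[1] = a[1] + 1 is a necessary deadlock
```

With i = 2 and j = 0 the single SP never reaches the write to a[2], while the split SPs complete in the order SP 0, SP 2, SP 1, SP 3.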
Iannucci has designed a completely safe system; however, it cannot run efficiently without
special purpose hardware. PODS has been designed for the most likely cases (scientific
code), but can still operate on the abnormal cases (though not as efficiently as regular
code). A detection method more complex than the simple array write test is currently being
investigated. It is based upon the LCD algorithm. This would allow PODS to create even
larger SPs.
CHAPTER 3
PODS Logical Implementation
This chapter discusses the logical implementation of a Process-Oriented Dataflow System.
The logical implementation describes the functional units and their tasks in PODS. The
remote array caching scheme is also presented. Once these are covered the logical
architecture is examined. Finally, the supporting software suite is presented.
3.1. System Overview
The driving force behind the PODS logical implementation design was the desire to support
the programmer with automatic, but efficient, parallelization of his code. To achieve this
the logical implementation had to execute the partitioned code with as little global
information as possible. Global information is the root cause of many computational
bottlenecks. And since PODS is to be used on MIMD machines with relatively slow
communications, communications over the network have to be kept to a minimum.
With the above goals in mind, the logical PE design was constrained to contain a
conventional von Neumann CPU at its core. The support units would provide additional
power to perform specialized tasks. It is envisioned that these units would be placed on a
single circuit board to form a complete PE. Over time the support units changed in number
and functionality until the complete set below was finalized.
• Execution Unit - main unit, performs all ALU functions, a
standard microprocessor (e.g., Intel 80386).
• Matching Store - support unit, handles matching of incoming
tokens.
• Routing Unit - support unit, processes messages between PEs,
similar to the Direct-Connect Module in the iPSC/2.
• Array Manager - support unit, handles array manipulation
requests and remote caching.
• Memory Manager - support unit, manages SP memory and
loads SPs.
In order to produce partitioned code, a system software suite was built. The suite consists
of the ID World Compiler, the PODS Translator, the PODS Partitioner, and the PODS
Simulator. The ID World Compiler was graciously supplied by MIT [Nik87b] and the
other three programs were designed and built here at UC, Irvine. A parallel programmer
would write in ID Nouveau, compile the program, run it through the translator, and then
the partitioner to produce PODS code. In the future a PODS compiler is envisioned that
would replace the first three programs, and would be tailored for a specific MIMD
architecture, like the iPSC/2.
The PODS instruction set is designed along the lines outlined in Bic's original paper
[Bic87]. It was designed to perform the required tasks (internal and external token
passing) as efficiently as possible. Though it was not tailored to a specific von Neumann
CPU, the tasks required are not beyond the standard von Neumann CPU.
Underlying all of the instructions is the remote array caching scheme. This is a software
caching scheme designed to exploit the locality of reference in most programs. This is
critical for slow communication MIMD systems.
3.2. Logical PE Architecture
The logical implementation describes the functional units and their tasks necessary in PODS
[Roy90]. The design was constrained to contain a conventional von Neumann CPU at its
core. The support units would provide additional power to perform specialized tasks. It is
envisioned that these units would be placed on a single circuit board to form a complete PE.
This logical implementation is currently being modified to run directly on an iPSC/2. The
way in which the tasks are performed is changing, but the tasks are still the same.
[Figure: block diagram of a PE. The Matching Store holds entries keyed by
context.sp.*.port.iteration that refer to SCPs; the Array Manager holds queued memory
requests and a cache, handles remote values and remote requests, and serves local reads;
values, dynamic addresses, requests, and output tokens (with offsets) pass between the units.]
FIGURE 3.1. LOGICAL UNITS OF A PODS PE.
Figure 3.1 shows how the functional units within a PE interact. When an input token
arrives it is run through the Matching Store. When the required tokens are present the
Memory Manager will load the SP from the Program Memory into Execution Memory.
Once in Execution Memory the Execution Unit will begin operating on it as it percolates to
the top of the ready list. The key is to keep the Execution Unit operating as much as
possible and to keep the number of context switches to a minimum. In order to support
this the Execution Unit calls upon the Array Manager and the Routing Unit to handle
specialized tasks.
Each of the tasks of the functional units is explained below.
3.2.1. Execution Unit
The Execution Unit is a simple von Neumann machine which automatically blocks the
executing process when a necessary operand is not available. This unit is the most heavily
used and is the most complex. PODS is designed such that this unit can be a standard
off-the-shelf microprocessor, e.g. Intel 80386. This will allow PODS to make use of
advancements in microprocessor technology, e.g. Intel i860.
The Execution Unit uses the state transitions described in Chapter 1. In order to execute
one PODS instruction the following tasks need to be performed:
1. check if all operands are available - if not block
2. perform basic instruction to produce value
3. pass value internally to needy instructions
4. if necessary, send message to Routing Unit with route list and
value.
5. increment or set program counter as directed by instruction
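
The five tasks above can be sketched as a single step function in Python. This is an illustration only: the `Instruction` fields, the `step` function, and the list standing in for the Routing Unit are hypothetical and do not reflect the actual PODS instruction encoding.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Instruction:
    """Hypothetical instruction record for the sketch."""
    op: Callable                                   # computes the value from operands
    nargs: int                                     # number of operand ports
    operands: dict = field(default_factory=dict)   # port -> value received so far
    dests: list = field(default_factory=list)      # internal (instruction, port) targets
    route: Optional[int] = None                    # route list ID, if any
    jump: Optional[int] = None                     # explicit next pc, if any

def step(sp, pc, routing_unit):
    """Execute one instruction of SP `sp`; return next pc, or None if blocked."""
    inst = sp[pc]
    if len(inst.operands) < inst.nargs:            # 1. missing operand: block
        return None
    value = inst.op(*(inst.operands[p] for p in range(inst.nargs)))   # 2. compute
    for target, port in inst.dests:                # 3. pass value internally
        sp[target].operands[port] = value
    if inst.route is not None:                     # 4. message to the Routing Unit
        routing_unit.append((inst.route, value))
    return inst.jump if inst.jump is not None else pc + 1             # 5. next pc

# Instruction 0 produces 3 for port 0 of instruction 1, which adds it to 4
# and sends the sum out on route list 9.
sp = [Instruction(op=lambda: 3, nargs=0, dests=[(1, 0)]),
      Instruction(op=lambda a, b: a + b, nargs=2, operands={1: 4}, route=9)]
outbox = []
print(step(sp, 1, outbox))   # None: instruction 1 is still missing port 0
print(step(sp, 0, outbox))   # 1
print(step(sp, 1, outbox))   # 2
print(outbox)                # [(9, 7)]
```

An instruction whose operands all come from earlier instructions in the same SP never takes the blocking branch, which is one of the optimizations the text mentions.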
These steps can easily be performed by an off-the-shelf microprocessor, and many
optimizations can be performed. For example, many instructions will never block since all
of their operands are generated locally within the SP. Most instructions do not have routes
attached, only internal offsets for value passing. Value passing is performed by using
registers. See Bic [Bic90] for a detailed discussion of the Execution Unit's functions.
3.2.2. Routing Unit
The Routing Unit is loosely based upon the Direct-Connect Module in the iPSC/2.
However, it must perform a number of tasks other than just making the connection. All of
these tasks involve the use of the Routing Table.
The Routing Table is built at compile time and holds the static information needed to send a
token from one SP to another. The figure below shows the structure of a Routing Table
(note that it is not limited to only 3 entries as shown). The Routing Table is only dependent
upon the program, and is built by the PODS Translator.
unique route ID 1 (sp inst port) (sp inst port) (sp inst port) 
unique route ID 2 (sp inst port) (sp inst port) NULL 
unique route ID 3 (sp inst port) NULL NULL 
unique route ID 4 (sp inst port) (sp inst port) (sp inst port) 
FIGURE 3.2. ROUTING TABLE. 
Each PE has a copy of the Routing Table. It is of a fixed size because it only holds static
information; the dynamic information will be in the token's tag. To send a route the
Execution Unit simply sends a local message to the Routing Unit. This message contains
the route ID, the token's value, the token's tag, and whether this is to be a distributed or
local or hashed route. This is shown below in Figure 3.3.
[Figure: Routing Unit block diagram. The message from the EU contains the route ID,
value, tag, and a distribute/local/hashed flag. The route ID selects an entry in the
Routing Table, which is combined with the value to form a new token. The new token is
then sent according to the flag:
local route: send new token to the Matching Unit of this PE
distributed route: send new token to all PEs
hashed route: send new token to selected PE]
FIGURE 3.3. ROUTING UNIT BLOCK DIAGRAM.
If the route is local the destination PE is this one, and the network need not be accessed. In
the future, the Execution Unit may take on this local responsibility, but that would put more
burden on the Execution Unit.
If the route is a hashed route, then the Routing Unit must take the token's context, combine
it with the destination (sp inst port) from the Routing Table, and run it through the hash
function to determine where this particular SP is located. It is possible that this SP will be
on this PE, but the Routing Unit is the only unit which can determine that.
If the route is to be distributed, then each PE is sent a message with the token in it. This is
how an SP is distributed. Its parent SP calls the Routing Unit with a token and calls for it
to be distributed. This will cause every PE to receive a copy of the token, and every PE
will start up the appropriate SP. These distributed SPs have range filters which limit the
indices which are actually generated.
As an example, consider a token with the following: value = 1, context = (2,3), iteration
number = 4, and route ID = 5. If this token were to be distributed, and route ID 5
contained (1, 0, 0) (1, 1, 1), then every PE would get two messages. The first message
would be destined for context (2,3), iteration number 4, SP 1, instruction 0, port 0 and
have the value 1. The second would be for context (2,3), iteration number 4, SP 1,
instruction 1, port 1 and have the value 1.
In an actual implementation these messages would be batched together to reduce
communication costs. 
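The three kinds of route can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the PODS implementation: the names ROUTING_TABLE, Token, route, and hash_pe are hypothetical, and the hash function in particular is a placeholder, since its definition is not given in this section. Route 5 mirrors the worked example.

```python
from dataclasses import dataclass

# Static Routing Table: route ID -> list of (sp, inst, port) destinations.
# Route 5 mirrors the worked example in the text.
ROUTING_TABLE = {5: [(1, 0, 0), (1, 1, 1)]}

@dataclass(frozen=True)
class Token:
    value: int
    context: tuple
    iteration: int
    sp: int
    inst: int
    port: int

def hash_pe(context, sp, num_pes):
    # Placeholder hash (assumption): the actual PODS hash function is not given here.
    return (sum(context) + sp) % num_pes

def route(route_id, value, context, iteration, mode, this_pe, num_pes):
    """Return a list of (destination PE, new token) pairs for one route request."""
    out = []
    for sp, inst, port in ROUTING_TABLE[route_id]:
        tok = Token(value, context, iteration, sp, inst, port)
        if mode == "local":
            dests = [this_pe]                  # no network access needed
        elif mode == "distributed":
            dests = list(range(num_pes))       # every PE receives a copy
        elif mode == "hashed":
            dests = [hash_pe(context, sp, num_pes)]
        else:
            raise ValueError(mode)
        out.extend((pe, tok) for pe in dests)
    return out

# The example token: value 1, context (2,3), iteration 4, route ID 5, distributed.
msgs = route(5, 1, (2, 3), 4, "distributed", this_pe=0, num_pes=4)
```

With four PEs, each PE appears twice in msgs, once for each destination entry of route 5.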
3.2.3. Array Manager
The Array Manager handles all array accesses, except local array reads. The Execution
Unit will issue a request to the Array Manager to read, write, or allocate an array. This will
not cause a context switch; the Execution Unit will keep on processing until a needed value
is not available. This causes a shadow to occur between the time the value is requested and
the time it is needed. In the future this shadow can be exploited to execute as many
instructions as possible before reaching the instruction that needs the value.
When a request for an array read is received, the Array Manager determines whether the
element is:
1. cached and present - return value
2. local or cached, and not present - enqueue request
3. remote - send remote request to routing unit
If the element was local and present then the Execution Unit would have read it directly. To
enqueue a request a flag is set in the memory location of the cell to indicate that there are
requests which will need to be serviced when the cell is written. This is much like
Arvind's I-structures [ANP87a, ANP89].
When a remote read is needed, the Routing Unit will send the request to the appropriate PE
(based upon the global partitioning). If the value is present then the entire page is returned.
This page is then cached in the PE's software cache for that array. In this way the remote
caching scheme is implemented, and further reads by this SP will most likely have some
locality-of-reference. The single assignment restriction prevents writes from needing to be
replicated across the network, and this allows a simple caching mechanism to operate
without cache coherency problems.
When an array write is requested, the Array Manager performs a similar set of tests, but the
cache is never directly written. The cache will be updated when the page is brought over
from the remote PE. When the value is actually written into the cell the queued read
requests are dequeued and the value is sent to all of the requesting SPs, be they local or
remote.
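The interplay of early reads and the single write can be illustrated with a small Python sketch of one array cell. It is a simplified model under stated assumptions rather than the Array Manager itself: the class name Cell and the deliver callback are hypothetical, and the queue is an ordinary list instead of a flag plus queue in the cell's memory location.

```python
class Cell:
    """Single-assignment memory cell with a queue of deferred reads."""

    def __init__(self):
        self.defined = False
        self.value = None
        self.pending = []          # requesters waiting for the write

    def read(self, requester, deliver):
        if self.defined:
            deliver(requester, self.value)
        else:
            self.pending.append(requester)   # enqueue the early read

    def write(self, value, deliver):
        if self.defined:
            raise RuntimeError("single assignment violated: cell already written")
        self.defined, self.value = True, value
        waiting, self.pending = self.pending, []
        for requester in waiting:            # service every queued read
            deliver(requester, value)

delivered = []
record = lambda requester, value: delivered.append((requester, value))

cell = Cell()
cell.read("SP-a", record)    # early read: queued, nothing delivered yet
cell.write(42, record)       # the write releases SP-a
cell.read("SP-b", record)    # late read: served immediately
```

A second write to the same cell raises an error, which corresponds to the run-time error that single assignment imposes.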
To allocate an array, every PE needs to know that space should be reserved. To do this the
Array Manager on the PE where the ALLOCATE operator is fired, called the host PE, will
assign the array a unique ID. This ID is then sent to all of the other PEs so that they will
reserve the requested space and use the same ID. This ID is then returned to the requesting
SP so that it can be used to reference the array from any PE.
3.2.4. Memory Manager
The Memory Manager is quite simple. It has only one task: to load SPs from program
memory to execution memory. In an actual implementation, this would simply be an SP
frame manager with no copying of instructions, and would probably be part of the
Matching Unit.
SPs are loaded as soon as all of the tokens for the first instruction are present in the
Matching Unit. There is no reason to load the SP earlier, since the SP cannot start 
executing until then. There is also no reason to load it any later, as the second instruction 
may be fed by the first. 
3.2.5. Matching Unit
The task of the Matching Unit is to receive tokens and determine which SP they are
destined for. Logically two tokens match if their dynamic parts and SP numbers match.
This will uniquely identify a specific SP. In an actual implementation this is implemented
as a hash table lookup based upon the SP ID and the frame pointer. This hash table can be
handled by a small, quick microprocessor like the AMD 29000.
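A hash-table matching store of this kind might look as follows in Python. This is a sketch under assumptions: the class MatchingStore, the ports_needed table, and the use of (SP, context, iteration) as the key are hypothetical stand-ins for the SP ID and frame pointer mentioned above.

```python
class MatchingStore:
    """Collects tokens per SP instance until the first instruction can fire."""

    def __init__(self, ports_needed):
        self.ports_needed = ports_needed   # sp -> number of input tokens required
        self.waiting = {}                  # (sp, context, iteration) -> {port: value}

    def receive(self, sp, context, iteration, port, value):
        key = (sp, context, iteration)     # SP number plus the dynamic part of the tag
        slots = self.waiting.setdefault(key, {})
        slots[port] = value
        if len(slots) == self.ports_needed[sp]:
            return self.waiting.pop(key)   # all operands present: the SP can be loaded
        return None                        # still waiting for a partner token

store = MatchingStore({7: 2})              # hypothetical SP 7 needs two tokens
first = store.receive(7, (2, 3), 4, port=0, value=10)
second = store.receive(7, (2, 3), 4, port=1, value=20)
```

The first call returns nothing because only one operand has arrived; the second call completes the match and returns both operands.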
3.3. Remote Array Caching
This remote array caching scheme was presented previously in [BNR89b]. For that paper
the Livermore Loops benchmark programs were run and a cache size equal to 5% of the
array size was found to be sufficient. This scheme has not changed significantly since that
time.
Single assignment is essential to this remote array caching scheme, and a little explanation
is in order. Single assignment principles allow the implementation of a simple automatic
synchronization mechanism. Each memory cell has two states: undefined or defined. If a
cell is undefined, it may also have a queue of read requests associated with it. Hardware
enforces the write-before-read requirement. Some examples of architectures that have this
type of write-once/read-many memory access mechanism include HEP [Smi85, Smi81]
and I-structure memory in dataflow [A&C86, ANP87a, ANP89].
Prior to execution, an array is either undefined or filled with initialization data (if specified
in the program). Each PE may write only into undefined array cells. Race conditions are
avoided by this single assignment policy. There will never be a race condition for writes to
a memory cell, since only one PE may write to any particular cell and writing more than once
results in a run-time error.
Thus the single assignment rule automatically enforces synchronization in a distributed
manner; no explicit synchronization mechanisms are necessary, which is a major issue in other
programming paradigms.
In PODS remote writes are kept to a minimum by the partitioning described in Chapter 2.
However, remote reads can still occur quite often, since any instruction may read any data
item. If data is mapped onto the reading PE, the access is local; otherwise it is remote and the
PE must request the value from the responsible PE by sending a message. Remote reads
are synchronized just like local reads: if the data item is not available, the request is
queued, and if the data item is available, the page containing that item is sent back. During
this remote read the requesting PE can perform other useful work. The requesting PE may
resume executing this SP when the page arrives. This is where the benefits of array
caching come in, and array caching is greatly simplified because of the single assignment
principle.
Since the central idea in single assignment programming is to permit only one write to any
element, by requiring single assignment we can guarantee that a page fetched from a remote
PE and cached locally will not need any further updates during the lifetime of the array,
ignoring for now the possibility of partially filled pages. Given this, each PE may safely
cache a remotely fetched page in a local data cache, avoiding future remote accesses to the
same page. The cache used will be of fixed size and a least-recently-used (LRU)
replacement strategy is employed.
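A fixed-size LRU page cache of this sort can be sketched in Python as follows. This is a minimal illustration and not the PODS software cache: the class PageCache, the fetch_page callback, and the small page size used in the example are assumptions, and partially filled pages are ignored, as in the discussion above.

```python
from collections import OrderedDict

class PageCache:
    """Fixed-size LRU cache of remotely fetched array pages."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()   # page number -> list of element values
        self.remote_fetches = 0

    def read(self, index, page_size, fetch_page):
        page_no, offset = divmod(index, page_size)
        page = self.pages.get(page_no)
        if page is not None:
            self.pages.move_to_end(page_no)      # mark as most recently used
        else:
            self.remote_fetches += 1             # one remote request per miss
            page = fetch_page(page_no)
            self.pages[page_no] = page
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)   # evict the least recently used page
        return page[offset]

# Hypothetical remote array of 16 elements, 4 elements per page, 2 cached pages.
remote = list(range(100, 116))
fetch = lambda page_no: remote[page_no * 4:(page_no + 1) * 4]

cache = PageCache(capacity_pages=2)
values = [cache.read(i, 4, fetch) for i in range(4)]   # four reads, one page
fetches_after_first_page = cache.remote_fetches
```

Because a written element never changes under single assignment, the cached copy needs no invalidation; only capacity forces a page out.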
Without single assignment, partitioning data among PEs is possible, but it would require
excessive communication overhead to allow any instruction to write to any location of an
array. In addition, array caching would be nearly useless as each write performed would
require the update of all remote caches containing the modified page. The machine could
broadcast or multicast these updates to avoid the inefficiencies of individual messages, but
the broadcasts would still strain the network facilities. Not only that, but without single
assignment the caches would be inconsistent for the duration of the page modification
broadcast (cache coherency problem). If no cache approach is taken, no page modification
broadcasts will be necessary, and there will be no inconsistency problems. But the use of
caching leads to considerable decreases in total remote accesses performed.
It has been shown [BNR89b] that a software cache size of 5% of the array size is sufficient
to reduce the number of remote reads significantly. Tests with scientific code have shown
that the percentage of remote reads can be reduced to less than 10% of the total number of
reads in most cases. Figure 3.4 below shows the effects of different size caches on
percentage of remote reads for a number of the Livermore Loops scientific benchmark
programs [LLL83]. Notice that nearly all of the kernels drop below 10% when caching is
used. The only exception is Matrix Multiply; this is because it reads one entire column of
one matrix and one entire row of the other in order to write one element. PODS uses a 5%
array cache.
[Figure 3.4: plot of percent remote reads (0% to 100%) against cache size (0, 5, 10, 50,
and 100% of array size), one curve per Livermore Loops kernel: Hydro Fragment (1),
Tri-Diagonal Elimination (5), General Linear Recurrence (6), Equation of State Fragment (7),
A.D.I. Integration (8), Integrate Predictors - column (9), First Sum (11), First Difference (12),
1-D Particle in a Cell (14), Casual Fortran (15), 2-D Explicit Hydro. Frag. (18), Discrete
Ordinates Transport (20), Matrix*Matrix Product (21), Planckian Distribution (22).]
FIGURE 3.4. EFFECTS OF CACHE SIZE ON PERCENTAGE OF REMOTE READS.
As can be seen in Figure 3.5 below, the percentages of remote accesses are usually less
than 5% when a 5% cache size is used, independent of the number of PEs. This caching
can have anywhere from a minimal effect to an extremely large effect. Large reductions,
such as to 1/20th of the original remote reads, have been observed. Scientific code
demonstrates significant reductions (see [BNR89b]).
[Figure 3.5: plot of percent remote reads (0% to 60%) against number of PEs (4, 8, 16, 32,
64), one curve per Livermore Loops kernel.]
FIGURE 3.5. REMOTE READS FOR THE LIVERMORE LOOPS USING REMOTE CACHING.
At a high level this approach is similar to that taken by Callahan and Kennedy in [C&K88].
They describe a number of the software oriented issues involved in distributing arrays 
across distributed memories. Unlike this approach, they allow a completely general 
distribution function for allocating array elements. This is very powerful, but forces the 
programmer to explicitly program in the decomposition and can lead to expensive run-time 
calculations. This differs from the automatic parallelization goal of PODS. 
3.4. Software Support
In order to actually use PODS a number of support programs are necessary. These are 
shown in Figure 3.6 below. 
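[Figure 3.6: flow diagram of the PODS programming system; the recoverable labels are the
.id file input, the PODS Partitioner, and the PODS Simulator.]
FIGURE 3.6. PODS PROGRAMMING SYSTEM.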
3.4.1. ID World and GITA Compiler
ID World is a software environment written at MIT [Nik87b] in LISP. As a part of the
environment there is a GITA compiler which can produce dataflow graphs for the GITA.
The compiler itself [Tra86] makes use of peephole and other optimizations upon the code.
The idea here was to leverage previous work in the field until the needs of PODS were 
better understood. In the future a direct PODS compiler is in order. 
3.4.2. Translator
The PODS Translator takes a set of .graf files which make up a program and converts the
GITA code into PODS code. This is usually a one-to-one translation. In order for PODS
to properly execute the dataflow graphs they must be ordered. 
SPs, being small segments of sequential code, have to worry about supplying tokens. An
operator should only send tokens to instructions which come later in the SP. The exception
to this rule is the D operator, which sends data back to the beginning of a loop. As
Iannucci has pointed out [Ian88], it is not always possible to properly order a dataflow
program so that the instructions are in a set, correct order. This is due to the dynamic arcs
which can occur. In Chapter 2 this is discussed in the context of deadlock avoidance, and
the PODS Partitioner is the program which ensures this.
Specifically the tasks of the PODS Translator are:
1. Instruction Translation - most GITA instructions get converted
directly over to a PODS instruction one-to-one. Sometimes
groups of GITA instructions make one PODS instruction. This
is a format change only.
2. Removal of Unnecessary Instructions - for GITA a number of
IDENT instructions are inserted for synchronization purposes;
these are unnecessary in PODS because of the synchronization
imposed by a program counter.
3. Building of Routing Table - for every dependency arc which
goes from one .graf file to another, an entry into the Routing
Table is needed.
4. Ordering Instructions - by following the dependency arcs the
PODS instructions are placed in order such that no instruction
depends upon the input from a later instruction. This handles the
static dependency problem; the dynamic dependency problem is
handled in the PODS Partitioner (deadlock avoidance).
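The instruction ordering task amounts to a topological sort over the static dependency arcs. The Python sketch below uses Kahn's algorithm for this; it is one standard way to obtain such an order and is not necessarily the method the PODS Translator uses. The function name order_instructions and the integer instruction numbering are hypothetical.

```python
from collections import deque

def order_instructions(num_insts, arcs):
    """Order instructions so each one follows all of its producers.

    arcs is a list of (producer, consumer) pairs; loop-back arcs such as
    those from the D operator must be left out by the caller.
    """
    consumers = {i: [] for i in range(num_insts)}
    indegree = [0] * num_insts
    for src, dst in arcs:
        consumers[src].append(dst)
        indegree[dst] += 1
    ready = deque(i for i in range(num_insts) if indegree[i] == 0)
    order = []
    while ready:
        inst = ready.popleft()
        order.append(inst)
        for dst in consumers[inst]:
            indegree[dst] -= 1
            if indegree[dst] == 0:       # all producers already placed
                ready.append(dst)
    if len(order) != num_insts:
        raise ValueError("cyclic static dependency: no valid order exists")
    return order

# Hypothetical SP with four instructions: 2 feeds 0 and 3, which both feed 1.
arcs = [(2, 0), (0, 1), (2, 3), (3, 1)]
order = order_instructions(4, arcs)
```

In the resulting order every arc points from an earlier position to a later one, which is the property the SP's program counter relies on.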
3.4.3. Partitioner
The PODS Partitioner breaks apart the program into static SPs. It is primarily responsible 
for implementing the distribution scheme discussed in Chapter 2. 
To break apart the dataflow graph the Partitioner starts with the code-blocks generated by
the GITA compiler. From there deadlock detection is used and the SPs are split as
necessary. Once it has been determined that an SP will be distributed, the Partitioner then
adds the range filters and the DISTRIBUTE versions of the L operators. The .pods files are
produced and the program is now ready to be run or simulated. Figure 3.7 shows the
Partitioner Block Diagram.
[Figure 3.7: block diagram of the Partitioner. Recoverable blocks: Deadlock Detector, SP
Splitter, LCD Detector, Distribution Algorithm, and Distribution Code Inserter; recoverable
file labels: .trans files, .graf files, and .pods files.]
FIGURE 3.7. PODS PARTITIONER BLOCK DIAGRAM.
The Deadlock Detector uses the scheme described in Chapter 2 and informs the SP Splitter
where deadlocks may occur. The SP Splitter breaks up the SPs as directed. This deadlock
prevention is not necessary very often; it was not necessary anywhere in SIMPLE or
Matrix Multiply. The LCD Detector feeds the Distribution Algorithm Unit the loop-carried
dependency status of each code-block. The Distribution Algorithm Unit then executes the 
algorithm discussed in Chapter 2. Finally the Distribution Code Inserter places the 
appropriate range filters into the code and annotates the L operators with either DISTRIBUTE 
or LOCAL. 
The LCD Detector is simple because of the dataflow nature of ID Nouveau (see Section
2.5.3, LCD Effects) and the .graf files it generates. The LCD Detector is written in 'C' and
follows the algorithm outlined in Chapter 2. The SP Splitter simply breaks a given SP up
after every write to the problem array; the problem array is specified by the LCD Detector.
Specifically the tasks of the PODS Partitioner are:
1. Deadlock Detection and Avoidance - performed by the
Deadlock Detector and SP Splitter; uses algorithm discussed in
Section 2.7, Deadlock Handling.
2. LCD Detection - performed by the LCD Detector; uses
algorithm discussed in Section 2.5.3, LCD Effects.
3. SP Distribution Determination - uses output from LCD
Detector to apply distribution algorithm discussed in Section
2.5.5, For-Loop Distribution Algorithm.
4. Distribution Code Insertion - inserts proper range filter and
DISTRIBUTE or LOCAL versions of L operators; uses approach
outlined in Section 2.5.2, Range Filters.
3.4.4. Simulator
The PODS Simulator is the subject of Chapter 4. 
CHAPTER 4
PODS Simulations 
In this chapter the PODS Simulator and two example programs, Matrix Multiply and 
SIMPLE, are examined. The results of multiple test cases are analyzed and discussed. 
In the PODS Programming System, the simulator is the last program in the support 
software suite. The PODS Simulator was designed and built to test the logical
implementation of PODS. Each PE is simulated down to the instruction level, with
different functional units operating in parallel (see Chapter 3 for a description of the 
functional units). The PODS Simulator takes in a program and executes it step by step as if 
the program were actually running on PODS. In this manner the system can be measured 
and monitored as if running real programs. 
In order to compare the results of PODS simulations to the outside world, the PODS 
Simulator is set-up as if it were executing on Intel 386 microprocessors in a hypercube 
configuration. This is not an exact simulation of Intel's iPSC/2, but timing comparisons to 
programs on iPSC/2 systems are valid. The major, real-world program described herein is 
the SIMPLE benchmark [CH&R] developed by Lawrence Livermore Laboratory. This 
code is indicative of the large scale scientific code which is executed on supercomputers 
today. 
4.1. Overview
4.1.1. Simulator Approach
The PODS Simulator is an event-driven simulator which uses SMPL at its core.
MacDougall has written an excellent book [Mac87] which describes SMPL and its proper
usage. In the PODS Simulator, as in any simulation, certain assumptions are necessary.
These have been kept to a minimum and are based on known or measured statistics.
There is a hardware configuration file which holds the following hardware
parameters:
• NUMBER_OF_PE - the number of processing elements to
simulate
• PAGE_SIZE - the size of an array page (set at 32 array
elements)
• BROADCAST_NET - whether a broadcast type of message is
available or not (set to true)
• CACHE_PERCENT - the size of the software cache for each
array (set at 5%)
The hardware configuration file also holds the following timing parameters:
• NETWORK_TIME - the time for a message to propagate over
the network
• SINGLE_ROUTE_TIME - the time to build a single message
token into a batch inside the Routing Unit
• MS_SETUP_TIME - the time for the Matching Unit to find if a
token has a match
• MM_SETUP_TIME - the time for the Memory Manager to
wake up when a new SP is to be loaded
• SINGLE_CONTEXT_SWITCH_TIME - the time for a fast
context switch
The values for these, and other timing parameters, are discussed in the next section.
4.1.2. Timing Assumptions
In order to estimate the amount of time a CISC would take to perform a given operation,
the PODS Simulator is sized to the Intel iPSC/2's PEs. These are Intel 80386/80387
CPUs at 16 MHz with Direct-Connect Modules for communication. All timing is done
in µseconds. Each functional unit's timing is described below.
Execution Unit 
This is the ALU and associated units. Its timing is based upon three calculations: (1) the
time it takes to perform a fast context switch; (2) the time to perform a local array read; and
(3) the time of each normal operation. Time for each normal operation was measured on
the iPSC/2 with the following results:
iPSC/2 Instruction              Execution time (µsec)
integer add                       0.300
integer subtraction               0.300
bitwise logical                   0.558
floating point negate             0.555
floating point compare            5.803
floating point power             96.418
floating point abs               12.626
floating point square root       18.929
floating point multiply           7.217
floating point division          10.707
floating point addition           6.753
floating point subtraction        6.757
TABLE 4.1. MEASURED TIMES OF OPERATIONS ON IPSC/2.
The time for a local array read is based on the pseudo code in Figure 4.1 below. 
offset = size.dim2 * i + j
if (offset < beginning offset) goto REMOTE READ
if (offset >= ending offset) goto REMOTE READ
if (element not present) goto ENQUEUE READ
value = array[offset]
FIGURE 4.1. 2-D ARRAY READ PSEUDO-CODE.
The time for a local array read (assuming the value is present) is: 1 integer multiply + 1 
integer add + 3 integer compares + 1 local read. This works out to be 2.7 µseconds.
The time for a fast context switch is based on the 80386 CALL ptr16:32 instruction. This
is a full 32 bit indirect procedure call. The worst case for this is 21 clock cycles or 1.312
µseconds at 16 MHz.
Array Manager
The Array Manager handles all array operations except local array reads (which are
performed by the Execution Unit). The Array Manager handles the following tasks in the 
indicated times. 
• FreeArray: number_arrays * memory_read_time
• ArrayWrite: memory_write_time + number_queued_reads *
message_time
• CachedRead: memory_read_time + message_time if not present
• RemoteRead: memory_read_time + enqueued_read_time or
message_time
• ReceivePage: page_size * memory_write_time
• SendPage: page_size * memory_read_time + message_time
• AllocateArray: 100.0 µseconds + message_time
where
• memory_read_time is the time for a local read = 0.3 µseconds
• memory_write_time is the time for a local write = 0.4 µseconds
• message_time is the time for a signal from one functional unit to
another on the same PE = 1.0 µseconds
• enqueued_read_time is the time to push an early read onto a
stack = 3 * memory_read_time + 5 * memory_write_time = 2.9
µseconds
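Several of these Array Manager timings can be written out as a small Python cost model, which makes it easy to check the arithmetic. The constants are the values quoted above (with the page size of 32 elements taken from the hardware configuration); the function names are hypothetical and this is only a sketch, not the simulator's code.

```python
MEMORY_READ_TIME = 0.3    # µs, local read
MEMORY_WRITE_TIME = 0.4   # µs, local write
MESSAGE_TIME = 1.0        # µs, signal between functional units on one PE
PAGE_SIZE = 32            # array elements per page (hardware configuration)

def enqueued_read_time():
    # Pushing an early read: 3 memory reads plus 5 memory writes.
    return 3 * MEMORY_READ_TIME + 5 * MEMORY_WRITE_TIME

def array_write_time(number_queued_reads):
    # One local write, then one message per queued read that is released.
    return MEMORY_WRITE_TIME + number_queued_reads * MESSAGE_TIME

def send_page_time(page_size=PAGE_SIZE):
    # Read every element of the page, then hand it to the Routing Unit.
    return page_size * MEMORY_READ_TIME + MESSAGE_TIME

def receive_page_time(page_size=PAGE_SIZE):
    # Write every element of the incoming page into the local cache.
    return page_size * MEMORY_WRITE_TIME

def allocate_array_time():
    return 100.0 + MESSAGE_TIME
```

With a 32-element page, sending a page costs about 10.6 µseconds and receiving one about 12.8 µseconds inside the Array Manager, before any network time is added.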
Routing Unit
This is basically the Direct-Connect Module with some extra operations. This unit is
responsible for taking a token, forming the message, and sending it over the network to the
correct PE and SP. Dunigan [Dun88] has done some extensive testing of the iPSC/2 and
found that the communication can effectively be expressed using the following equations:
if (message_length <= 100 bytes) then 390 µsec
if (message_length > 100 bytes) then 697 + 0.4 * message_length µsec
The extra operations calculate the SP and PE to which the token will be sent. When the
Routing Unit receives a token to route, a simple table look-up is used to find the destination
SPs. This is then used in a hash function to find the destination PE. Since tokens are less
than 100 bytes, and they are batched together in groups of 20, the simulation uses an
estimate of 19.5 µseconds for each token added to a batch.
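The message-time model and the per-token estimate can be expressed as a short Python sketch. The two-piece formula is the one quoted from [Dun88]; reading the 19.5 µsecond figure as one short-message cost (390 µseconds) spread over a batch of 20 tokens is an interpretation, and the function names are hypothetical.

```python
TOKENS_PER_BATCH = 20

def message_time_us(message_length_bytes):
    """iPSC/2 message time in microseconds, per the two-piece model from [Dun88]."""
    if message_length_bytes <= 100:
        return 390.0
    return 697.0 + 0.4 * message_length_bytes

def per_token_time_us(tokens_per_batch=TOKENS_PER_BATCH):
    # Assumed reading: one short-message cost amortized over a full batch.
    return message_time_us(100) / tokens_per_batch

short = message_time_us(64)      # any message up to 100 bytes
longer = message_time_us(200)    # a 200-byte message
per_token = per_token_time_us()
```

Under this reading, batching reduces the routing cost charged to each token by a factor of 20 compared with sending every token as its own message.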
Memory Manager
The Memory Manager simply grabs execution memory frames from free memory. This is a
list manager, with one list for free SP frames and one for used SP frames. To perform its
operations the Memory Manager need only add to or delete from a linked list. This is a
constant time operation which takes approximately 3 memory references or 0.9 µseconds.
Matching Store
The Matching Store must search the hash table for the appropriate SP. This is a simple 
hash search which takes 15 µseconds. 
Network 
The Network is simply the physical propagation time. The Routing Unit handles all of the
transmission setup. The iPSC/2 has a theoretical 100 Mbyte per second bandwidth.
Assuming each message is approximately 100 bytes, the time for 1 hop is 1 µsecond. The
network time is set to 2.5 µseconds, simulating an average of 2.5 hops. The Network can
only handle so many messages at a time; this is estimated to be half the number of PEs.
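The derived timings for the context switch, the Memory Manager, and the Network follow from simple arithmetic, reproduced here as a Python sketch so the figures can be checked. The helper names are hypothetical, and a Mbyte is taken as 10**6 bytes, which is an assumption.

```python
CLOCK_MHZ = 16.0

def cycles_to_us(cycles, clock_mhz=CLOCK_MHZ):
    # cycles / (cycles per microsecond)
    return cycles / clock_mhz

def hop_time_us(message_bytes, bandwidth_mbytes_per_s):
    # bytes / (Mbyte/s) gives microseconds directly when 1 Mbyte = 10**6 bytes
    return message_bytes / bandwidth_mbytes_per_s

context_switch_us = cycles_to_us(21)             # 80386 CALL ptr16:32, worst case
memory_manager_us = 3 * 0.3                      # three memory references at 0.3 µs
network_time_us = 2.5 * hop_time_us(100, 100)    # average of 2.5 hops
```

The context switch works out to 1.3125 µseconds, which the text quotes as 1.312.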
4.2. Measures of Effectiveness (MOEs)
The motivation behind the following Measures of Effectiveness (MOEs) is to gauge how
well PODS will perform on a real system with real-world problems, and how this
compares to what is available today.
Functional Unit Balance - how well balanced are the functional units which make up
the PE? This is measured by SMPL as the fraction of the time which a given facility is
busy, i.e. the utilization. Since PODS PEs contain parallel functional units, the balance
between the units is important. If one of the support units, e.g. the Routing Unit, were
very heavily loaded then the Execution Unit may be waiting for it. This would point to
possible improvements in the logical architecture design.
Execution Unit Utilization - what percent of the time are the Execution Units
operating? Do some PEs sit idle awaiting the outcome of other PEs? Ideally utilization
should be 100% for each PE, although this is never actually possible. This is measured by
SMPL as the accumulated busy time of each Execution Unit, divided by the total run time.
Execution Unit Load Balance - how equally distributed is the work load? Ideally
each PE will put in the same amount of work. This shows if there are any "hot-spots"
where some PEs are doing all the work while others are idle.
Parallelization Overhead - how many of the instructions executed are "work"
instructions and how many are due to parallelization? This shows how much additional
overhead there is in the parallel version of the program. In the PODS Simulator the
dynamic work instructions as well as the total dynamic instructions are counted. Work
instructions are those which must be executed no matter how many PEs are used, i.e. all
instructions except the range filter instructions.
Efficiency Comparison - how efficient is the parallel version on one PE when
compared to a real sequential version (usually 'C' or FORTRAN)? Usually the parallel
version will be less efficient because of the additional tasks which must be performed for
multiple PEs even though there is only one operating. Also, commercial systems have
additional optimizations which research systems do not. If this comparison shows that the
parallel system is within 100% of the sequential version on one PE, then the parallel system
is not grossly inefficient, and the scalability results can be considered to have a valid base
time. For Matrix Multiply and SIMPLE, 'C' versions were compiled using the Intel
supplied compiler and timed on the iPSC/2 host.
Scalability - how much do problems speed-up as the number of PEs is increased? 
Ideally linear speed-up is possible. However, overhead and program dependencies prevent 
this from being achieved. This can be seen by plotting the number of PEs vs the speed-up, 
where speed-up is defined to be the time of a single PE run divided by the time of the 
multiple PE run. This is the most important measure of effectiveness of a parallel 
processing system. 
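The measures above reduce to a few ratios, sketched here in Python. Utilization and speed-up follow the definitions given in the text; the overhead fraction (taken relative to the total dynamic instruction count) and the load imbalance ratio are assumptions about how one might quantify those measures, and all function names and the sample numbers are hypothetical.

```python
def utilization(busy_time, total_run_time):
    # Accumulated busy time of a unit divided by the total run time.
    return busy_time / total_run_time

def speed_up(single_pe_time, multi_pe_time):
    # Time of a single PE run divided by the time of the multiple PE run.
    return single_pe_time / multi_pe_time

def overhead_fraction(work_instructions, total_instructions):
    # Assumption: overhead is measured relative to the total dynamic count.
    return (total_instructions - work_instructions) / total_instructions

def load_imbalance(busy_times):
    # Assumption: ratio of the busiest PE to the average PE (1.0 = perfectly even).
    mean = sum(busy_times) / len(busy_times)
    return max(busy_times) / mean

# Hypothetical numbers, for illustration only.
u = utilization(3.0, 4.0)
s = speed_up(8.0, 2.0)
o = overhead_fraction(900, 1000)
b = load_imbalance([2.0, 2.0, 2.0, 2.0])
```

Linear speed-up corresponds to speed_up returning the number of PEs used; a load imbalance well above 1.0 would indicate the "hot-spots" described above.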
4.3. Example Programs
The results presented here are for two different programs. The first program is for matrix
multiplication and is discussed in detail in Chapter 2. The second program is SIMPLE, a
benchmark program written by Crowley et al. [CH&R] at Lawrence Livermore
Laboratory. This benchmark was designed to test computer system performance on the
type of large scientific programs which the laboratory runs. It is used here to show how
well PODS executes large scientific programs. For more detail on SIMPLE see [P&R90].
4.3.1. Matrix Multiply
A detailed discussion of the Matrix Multiply example is contained in Chapter 2. However, 
a brief discussion here is also in order.
Discussion 
Consider the implementation of Matrix Multiply in ID Nouveau shown in Figure 4.2. The
code follows the basic sequential Matrix Multiply algorithm below very closely.
$$C[i,j] = \sum_{k=1}^{n} A[i,k] \cdot B[k,j]$$
The use of Next s in line #9 creates an LCD while performing a reduction operation. The
array write into C in line #7 controls the partitioning, i.e., array C is the master array.
%%% Matrix Multiply
 1 Def mm A B = { (l1, u1), (l2, u2) = 2D_bounds A;
 2                C = i_matrix ((l1,u1), (l2,u2));
 3   In
 4     { For i <- l1 To u1 Do
 5       { For j <- l2 To u2 Do
 6           s = 0;
 7           C[i,j] =
 8             { For k <- l1 To u1 Do
 9                 Next s = s + A[i,k] * B[k,j];
10               Finally s
11             }
12       };
13       Finally C
14     }
15 };
FIGURE 4.2. MATRIX MULTIPLY ID NOUVEAU SOURCE CODE.
This function contains a number of items worth noting: (1) there are three different SPs
(one for each for-loop nest level); (2) a new array, C, must be dynamically allocated and
distributed efficiently; (3) there is a loop-carried dependency in the innermost loop (the sum
variable, s); (4) the two input arrays, A and B, have different access patterns; and (5) the
sizes of the input arrays are not known at compile time. These attributes make the Matrix
Multiply algorithm an interesting test case.
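For readers unfamiliar with ID Nouveau, the same loop structure can be written sequentially in Python. This sketch is only an illustration of the points listed above and says nothing about how PODS partitions the loops; the use of None to mark an unwritten element is an assumption made to mimic single assignment.

```python
def mm(A, B):
    """Sequential matrix multiply mirroring the three-level loop nest of Figure 4.2."""
    n_rows, n_inner, n_cols = len(A), len(B), len(B[0])
    # Sizes are only known at run time; C is allocated dynamically.
    C = [[None] * n_cols for _ in range(n_rows)]   # None marks an unwritten cell
    for i in range(n_rows):            # outermost loop nest level
        for j in range(n_cols):        # middle loop nest level
            s = 0
            for k in range(n_inner):   # innermost level: s is loop-carried
                s = s + A[i][k] * B[k][j]   # A is read by row, B by column
            assert C[i][j] is None     # each element of C is written exactly once
            C[i][j] = s
    return C

C = mm([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The running sum s is the only value passed from one inner iteration to the next, which is the loop-carried dependency noted in item (3).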
Results 
Functional Unit Balance. Figure 4.3 below shows the average utilization for the
different functional units when the 16 x 16 case is run. Notice that none of the support units
is heavily utilized. Thus the Execution Unit is not being slowed by the support
units. This shows that the support units are truly operating in a support function and are
not performing extensive operations. This bodes well for HyperPODS, where all PE
functions will be performed by one CPU.
[Figure 4.3: chart of utilization (0% to 100%) against number of PEs (1, 2, 4, 8, 16, 32),
with one series per functional unit: EU, MS, RU, AM, and MM.]
FIGURE 4.3. UTILIZATION FOR EACH FUNCTIONAL UNIT (16 X 16 MM).
The Execution Unit has the highest utilization until the parallelism drops below that
necessary to keep all of the PEs active. The important case above is the 8 PE situation.
This is where the problem size meets the available PEs. In this case the Execution Unit
utilization is more than double that of the most loaded support unit (78% vs 35% for the
Matching Store).
Execution Unit Utilization. Since the Execution Unit is the major unit doing the work
done by a PE, as shown above, its utilization is critical. For Matrix Multiply the Execution
Unit utilization increases as the problem size increases. This is true in general and is due to
the increase in the parallelism in larger problems. As Figure 4.4 shows below, PODS is
only able to spread the available parallelism so far, and as more PEs are made available
PODS is unable to fully utilize all of them.
[Figure 4.4: chart of average Execution Unit utilization (0% to 100%) against number of
PEs (1, 2, 4, 8, 16, 32), with one series per problem size: 10 x 10, 16 x 16, and 32 x 32.]
FIGURE 4.4. AVERAGE EXECUTION UNIT UTILIZATION FOR MATRIX MULTIPLY.
This inability to work all of the Execution Units fully will show up in the scalability of the
program. When the average utilization nears 80% this is usually the end of the speed-up.
For the 10 x 10 case this occurs at 4 PEs, for 16 x 16 at 8 PEs, and for 32 x 32 at 16
PEs. The scalability results below bear this out. This 80% number is only indicative of
Matrix Multiply-like problems. SIMPLE, being much more complex, does not exhibit this
problem.
Execution Unit Load Balance. Load balance is more of an issue when Execution Unit
utilization is less than 80%. For utilizations greater than 80%, most of the PEs must be
working about the same or the utilization would be lower. For this reason it is more
interesting to consider the load balance for the medium sized problem, 16 x 16 arrays, than
the large problem.
Figure 4.5 shows each Execution Unit's utilization for the 16 x 16 case on 8 PEs. Contrast
this to Figure 4.6 where most of the work is being performed on only half of the PEs.
This is where the iteration level parallelism is completely used up. This is what causes the
flat speed-up curve from 8 PEs on up to 32 PEs for the 16 x 16 Matrix Multiply (see
Figure 4.7 below).
[Figure 4.5: chart of Execution Unit utilization (10% to 80%) for each PE, numbered 0
through 7.]
FIGURE 4.5. UTILIZATION FOR EACH EXECUTION UNIT (16 X 16 MM ON 8 PES).
[Figure 4.6: chart of Execution Unit utilization (0% to 80%) for each PE, numbered 0
through 15.]
FIGURE 4.6. UTILIZATION FOR EACH EXECUTION UNIT (16 X 16 MM ON 16 PES).
Parallelization Overhead. For Matrix Multiply the amount of overhead due to
parallelization decreases as the problem size increases. The table below shows dynamic
instruction counts for different problem sizes. All of these counts are for the 32 PE system
(worst case).
Problem Size   Work Instructions   Total Instructions   Percent Overhead
10 x 10              10,851            [illegible]         [illegible]
16 x 16              43,083              50,460            [illegible]
32 x 32             336,011             362,028            [illegible]
TABLE 4.2. PERCENT OVERHEAD INSTRUCTIONS FOR MATRIX MULTIPLY.
This indicates that, for Matrix Multiply-like algorithms, the amount of parallelization
overhead in PODS is acceptable at large input sizes. This is one reason that speed-up
increases (see scalability below) as the problem size increases.
Efficiency Comparison. A 16 x 16 Matrix Multiply, written in 'C' and compiled for a
single iPSC/2 PE, takes 0.1 seconds to execute. The PODS Simulator estimates that the
program would run in 0.190 seconds. This is within 100% of the commercial 'C' version,
and shows that PODS is not grossly inefficient, even on one PE.
It is also interesting to compare the number of dynamic work instructions the two systems
execute. The standard C compiler on the iPSC/2 produces code which executes 51,893
instructions, while PODS executes 43,083. This ratio of about 1.2:1 holds true for all of
the Matrix Multiply cases. This means that PODS executes about the same number of the
same size instructions as a commercial system. PODS is slower on one PE
because of the multiple PE tasks it is performing.
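The two comparisons above reduce to simple ratios. The following is a minimal sketch in Python (not a language used in this work) that recomputes them from the figures quoted in the text; the function names are illustrative only and are not part of PODS.

```python
def slowdown_percent(reference_seconds, measured_seconds):
    """Percent by which the measured time exceeds the reference time."""
    return 100.0 * (measured_seconds - reference_seconds) / reference_seconds


def instruction_ratio(c_instructions, pods_instructions):
    """Dynamic instruction count of the C version relative to PODS."""
    return c_instructions / pods_instructions


# Figures quoted in the text for the 16 x 16 Matrix Multiply on one PE.
mm_slowdown = slowdown_percent(0.1, 0.190)
mm_ratio = instruction_ratio(51_893, 43_083)

print(f"PODS estimate is {mm_slowdown:.0f}% slower than the C version")
print(f"C : PODS dynamic instruction ratio is {mm_ratio:.2f} : 1")
```

Read this way, "within 100%" means the simulated PODS time is less than twice the measured C time.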
Scalability. Figure 4.7 shows the speed-up of different size Matrix Multiply runs. For 
comparison the speed-up predicted for Iannucci's hybrid system is plotted [Ian88]. 
[Figure 4.7: speed-up (0 to 35) versus number of PEs (0 to 32); curves for Linear, 10 x 10, 16 x 16, 32 x 32, and 10 x 10 - Iannucci.]
FIGURE 4.7. SPEED-UP OF MATRIX MULTIPLY.
Iannucci's machine is finer grain and is able to exploit more of the parallelism in the small
10 x 10 problem. PODS does not reach this type of performance until the 32 x 32 problem
is run. Since Iannucci's machine requires a new CPU design and system architecture, it is
impossible to know how well it compares to a commercial system, which leaves open the
question of absolute run times and true scalability. It will be interesting to see how cost
effective the system will be once it is built.
4.3.2. SIMPLE 
Simulating the execution of all of SIMPLE on the PODS Simulator is not possible due to
memory limitations, so SIMPLE was broken up into its component routines. The major
routines were run through the translator, then through the partitioner, and finally simulated
on the PODS Simulator. These major routines are: VELOCITY_POSITION,
HYDRODYNAMICS, and CONDUCTION. All of the other procedures are either run only once
(e.g. GENERATOR) or are called by one of the above. This breaking up of SIMPLE is
appropriate because the routines feed each other in a sequential fashion. There may be
some parallelism which is not being exploited, but it is minimal.
The most important routine is CONDUCTION; both VELOCITY_POSITION and
HYDRODYNAMICS are much easier to parallelize. VELOCITY_POSITION has no LCDs, no
function calls, and runs in parallel very well. HYDRODYNAMICS has only 5 SPs
(CONDUCTION has 15 SPs) and is basically one big nested loop; it is not nearly as complex
as CONDUCTION. CONDUCTION is the most difficult to parallelize because of: (1) the sweep
phases where every element is recalculated twice, based upon its neighbors; (2) the
complexity of 15 SPs plus multiple function calls; and (3) the large number of LCDs with
both incrementing and decrementing for-loops. These LCDs prevent iteration level
parallelism from being distributed efficiently. For these reasons CONDUCTION is examined
in detail in the discussion section, while the final results for all of SIMPLE added together are
presented below in the results section.
Discussion 
The original ID code for SIMPLE was written at MIT based upon the Lawrence Livermore
version. This original ID code was then updated to ID Nouveau. CONDUCTION is a
complex routine with multiple function calls and code blocks.
The sweep operations in CONDUCTION cause LCDs to occur in the inner-most nest of the
loops. Figure 4.8 shows one of the sweep blocks (there are two nearly identical sweeps)
inside of CONDUCTION. Notice that the local arrays a and b are allocated at the outer level
(lines #3 and #4), filled in the next inner nest (lines #11 - #13), and then used to fill the
theta_bar arrays (lines #16 and #17). Both of these last two loops have LCDs. In lines
#12 and #13 b[l-1] is used to produce b[l], generating an LCD with distance 1. In lines #16
and #17 theta_bar[k,l+1] is used to produce theta_bar[k,l]. This generates an LCD with
distance 1 because the for-loop is decrementing (see downto in line #15).
    % Alternating direction sweeps
    % z sweep
 1  theta_bar = i_matrix ((kmn+1, kmx), (lmn+1, lmx+1));
 2  {for k <- kmn+1 to kmx do
 3     a = i_array (lmn,lmx);
 4     b = i_array (lmn,lmx);
       % a[lmn],b[lmn] are not used because cbb[*,lmn]=0
 5     a[lmn] = 0.0;
 6     b[lmn] = 0.0;
 7
 8     {for l <- lmn+1 to lmx do
 9        y = sigma[k,l]+cbb[k-1,l]
10            +cbb[k-1,l-1]*(1-a[l-1]);
11        a[l] = cbb[k-1,l]/y;
12        b[l] = (sigma[k,l]*theta_hat[k,l]
13               +cbb[k-1,l-1]*b[l-1])/y
14     };
       %%% back substitution
       theta_bar[k,lmx+1] = 0;
       % theta[k,lmx+1] is not used because a[lmx]=0
15     {for l <- lmx downto lmn+1 do
16        theta_bar[k,l] = a[l]*theta_bar[k,l+1]
17                         + b[l]
18     };
19  };
FIGURE 4.8. SWEEP FOR-LOOPS IN CONDUCTION CODE. 
These sweep operations can severely limit parallelism in some systems. In PODS the outer
nest of the for-loop (either k or l) is distributed across the available PEs. Once this is done
then no further distribution is necessary.
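The reason the outer nest can be distributed is that each outer iteration touches only its own a, b, and theta_bar row. The following is a minimal Python sketch of the z sweep (Python stands in for ID Nouveau here; the helper names `sweep_row` and `make_grid` and the sample data are hypothetical). It shows that the inner loops carry the dependencies while the outer iterations can be evaluated in any order.

```python
def sweep_row(k, sigma, cbb, theta_hat, lmn, lmx):
    """Forward elimination and back substitution for one value of k.

    Both inner loops carry a dependency of distance 1 on l, so they run
    sequentially; nothing here reads data produced by another k iteration.
    """
    a = {lmn: 0.0}
    b = {lmn: 0.0}
    for l in range(lmn + 1, lmx + 1):      # incrementing loop: LCD on a and b
        y = sigma[k][l] + cbb[k - 1][l] + cbb[k - 1][l - 1] * (1 - a[l - 1])
        a[l] = cbb[k - 1][l] / y
        b[l] = (sigma[k][l] * theta_hat[k][l] + cbb[k - 1][l - 1] * b[l - 1]) / y
    row = {lmx + 1: 0.0}
    for l in range(lmx, lmn, -1):          # decrementing loop: LCD on theta_bar
        row[l] = a[l] * row[l + 1] + b[l]
    return row


def make_grid(kmx, lmx, f):
    """Build a [k][l] table large enough for indices k <= kmx and l <= lmx + 1."""
    return [[f(k, l) for l in range(lmx + 2)] for k in range(kmx + 1)]


# Made-up sample data, chosen only so that y stays positive.
kmn, kmx, lmn, lmx = 0, 4, 0, 4
sigma = make_grid(kmx, lmx, lambda k, l: 1.0 + 0.1 * k + 0.01 * l)
cbb = make_grid(kmx, lmx, lambda k, l: 0.5 + 0.05 * l)
theta_hat = make_grid(kmx, lmx, lambda k, l: float(k + l))

ks = range(kmn + 1, kmx + 1)
in_order = {k: sweep_row(k, sigma, cbb, theta_hat, lmn, lmx) for k in ks}
reversed_order = {k: sweep_row(k, sigma, cbb, theta_hat, lmn, lmx) for k in reversed(ks)}
same_result = in_order == reversed_order
```

Because the k iterations are independent, assigning different values of k to different PEs does not change the result; only the l loops inside each iteration must stay sequential.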
In another part of CONDUCTION there is a nested for-loop with LCDs at all levels: lines #30 
- #32 for the outer level, and lines #20 and #21 for the inner level. This for-loop is shown 
in Figure 4.9. This for-loop would be modified by a scalar expanding compiler. 
 1  delta_theta_max, internal_eps =
 2   {
 3     delta_theta = 0; internal_eps = 0;
 4    in
 5     {for k <- kmn+1 to kmx do
 6        y, col_internal_eps =
 7         {
 8           delta_theta_col = 0; col_internal_eps = 0;
 9          in
10           {for l <- lmn+1 to lmx do
11              i = table_look_up
12                    theta_table theta_transp[l,k]
13                    index1[k,l] 3;
14              j = index2[k,l];
15              last_index1[k,l] = i;
16              eps_k_l = polynomial theta_transp[l,k]
17                          rho[k,l] i j T_Coefficients;
18              p[k,l] = polynomial theta_transp[l,k]
19                          rho[k,l] i j P_Coefficients;
20              next col_internal_eps =
21                   col_internal_eps + mass[k,l]*eps_k_l;
22              eps[k,l] = eps_k_l;
23              y = abs(theta_hat[k,l] -
24                    theta_transp[l,k])/theta_transp[l,k];
25              next delta_theta_col =
26                   if y > delta_theta_col
27                   then y else delta_theta_col
28            finally delta_theta_col, col_internal_eps}
29         };
30        next delta_theta = if y > delta_theta then y
31                           else delta_theta;
32        next internal_eps = internal_eps + col_internal_eps
33      finally delta_theta, internal_eps }
34   };
FIGURE 4.9. ORIGINAL CONDUCTION CODE WITH MULTIPLE LCDS. 
The above code was replaced with the code in Figure 4.10. The lines which were added
or modified (set in bold in the original figure) are #1, #2, #28, #29, and #31 - #42.
    %%% changed by jmar
 1  vect_cie = i_array (kmn, kmx);
 2  vect_y = i_array (kmn, kmx);
 3  {for k <- kmn+1 to kmx do
 4     y, col_internal_eps =
 5      {
 6        delta_theta_col = 0; col_internal_eps = 0;
 7       in
 8        {for l <- lmn+1 to lmx do
 9           i = table_look_up
10                 theta_table theta_transp[l,k]
11                 index1[k,l] 3;
12           j = index2[k,l];
13           last_index1[k,l] = i;
14           eps_k_l = polynomial theta_transp[l,k]
15                       rho[k,l] i j T_Coefficients;
16           p[k,l] = polynomial theta_transp[l,k]
17                       rho[k,l] i j P_Coefficients;
18           next col_internal_eps = col_internal_eps
19                                   + mass[k,l]*eps_k_l;
20           eps[k,l] = eps_k_l;
21           y = abs(theta_hat[k,l] -
22                 theta_transp[l,k])/theta_transp[l,k];
23           next delta_theta_col =
24                if y > delta_theta_col
25                then y else delta_theta_col
26         finally delta_theta_col, col_internal_eps}
27      };
28     vect_y[k] = y;
29     vect_cie[k] = col_internal_eps;
30  };
    %%% added by jmar
31  delta_theta_max, internal_eps =
32   {
33     delta_theta = 0; internal_eps = 0;
34    in
35     {for k <- kmn+1 to kmx do
36        next delta_theta = if vect_y[k] > delta_theta
37                           then vect_y[k]
38                           else delta_theta;
39        next internal_eps = internal_eps +
40                            vect_cie[k];
41      finally delta_theta, internal_eps}
42   };
FIGURE 4.10. SCALAR EXPANDED CONDUCTION CODE FRAGMENT. 
This scalar expansion does not change the output in any way and is a standard compiler
optimization.
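The effect of the transformation can be sketched in a few lines of Python (again standing in for ID Nouveau). In this hedged illustration the helper `column` is a hypothetical stand-in for the inner l loop, and the sample data is made up; only the loop structure mirrors Figures 4.9 and 4.10.

```python
def column(k, data):
    """Stand-in for the inner l loop: returns (column maximum, column sum)."""
    delta_theta_col, col_internal_eps = 0.0, 0.0
    for value in data[k]:
        col_internal_eps += value
        if abs(value) > delta_theta_col:
            delta_theta_col = abs(value)
    return delta_theta_col, col_internal_eps


def original(data):
    """Structure of Figure 4.9: the outer loop carries both running scalars."""
    delta_theta, internal_eps = 0.0, 0.0
    for k in range(len(data)):             # LCD: each k needs the previous totals
        y, col_internal_eps = column(k, data)
        delta_theta = y if y > delta_theta else delta_theta
        internal_eps = internal_eps + col_internal_eps
    return delta_theta, internal_eps


def scalar_expanded(data):
    """Structure of Figure 4.10: per-k results go into vectors first."""
    n = len(data)
    vect_y = [0.0] * n
    vect_cie = [0.0] * n
    for k in reversed(range(n)):           # no LCD: any order, or any PE, works
        vect_y[k], vect_cie[k] = column(k, data)
    delta_theta, internal_eps = 0.0, 0.0
    for k in range(n):                     # short sequential loop keeps the LCD
        delta_theta = vect_y[k] if vect_y[k] > delta_theta else delta_theta
        internal_eps = internal_eps + vect_cie[k]
    return delta_theta, internal_eps


sample = [[1.0, -2.5, 0.5], [3.0, 0.25, -4.0], [0.75, 0.5, 0.125]]
unchanged = original(sample) == scalar_expanded(sample)
```

The expensive work stays in the first loop, which no longer has a loop-carried dependency, while the remaining reduction loop does only a comparison and an addition per iteration.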
Another interesting point is that three different subroutines are called: POLYNOMIAL,
TABLE_LOOK_UP, and BOUNDARY_HEAT_FLOW, with POLYNOMIAL and
TABLE_LOOK_UP being called many times inside the inner for-loop. These function calls
are spun off onto other processors to allow more parallelism to be exploited.
Once the scalar expansion is done, all of the for-loops, except the one added by the
expansion (lines #31 - #42), are distributed at the first level of the nest. This allows
CONDUCTION iterations to be spread across all available PEs, thus producing excellent
speed-up.
The 22 SPs which PODS forms for CONDUCTION are shown in Table 4.3 below along
with some statistics for each SP.
SP                      Statistic     Distribution Comments
[illegible]             [illegible]   Main SP, drives others
[illegible]             [illegible]   LCD prevents distribution, added by scalar expansion
CONDUCTION-1.pods       39            Distributed For-Loop SP
CONDUCTION-1-0.pods     12            Local For-Loop SP
CONDUCTION-1-1.pods     26            Local For-Loop SP
CONDUCTION-2.pods       40            Distributed For-Loop SP
CONDUCTION-2-0.pods     12            Local For-Loop SP
CONDUCTION-2-1.pods     26            Local For-Loop SP
CONDUCTION-3.pods       29            Distributed For-Loop SP
CONDUCTION-4.pods       37            Distributed For-Loop SP
CONDUCTION-4-0.pods     38            Local For-Loop SP
CONDUCTION-5.pods       31            Distributed For-Loop SP
CONDUCTION-5-0.pods     27            Local For-Loop SP
CONDUCTION-6.pods       28            Distributed For-Loop SP
CONDUCTION-6-0.pods     27            Local For-Loop SP
BHF.pods                22            Main SP of Procedure
BHF-0.pods              20            Small SP, local to BHF
BHF-1.pods              20            Small SP, local to BHF
TLU.pods                19            Main SP of Procedure
TLU-1.pods              9             Small SP, local to TLU
TLU-0.pods              10            Small SP, local to TLU
POLY.pods               40            Procedure SP
TABLE 4.3. SP STATISTICS FOR CONDUCTION. 
Results 
These results are for all of the SIMPLE routines added together. This is valid because each
of the routines feeds the next one. If there is some iteration level parallelism available
between routines, then the results will be better than shown here. This was necessary due
to the performance limitations of the PODS simulator.
Functional Unit Balance. Smaller problem sizes stress the distribution of work
between functional units more than larger ones. This is because larger problems have more
available parallelism and an unbalanced PE may not show a drop in utilization. The worst
case utilization, for 16 x 16, is shown in Figure 4.11.
[Figure 4.11: utilization (0% to 70%) of each functional unit (EU, MS, RU, AM, MM) versus number of PEs (1, 2, 4, 8, 16, 32).]
FIGURE 4.11. UTILIZATION FOR EACH FUNCTIONAL UNIT (16 X 16 SIMPLE).
The support units once again act in a support role, never reaching any significant utilization
until the available parallelism has been used up, at around 8 PEs. Even at 32 PEs the
support units do not have any bottlenecks; the only change is that the Execution Unit
utilization has dropped to a level comparable to the support units.
Execution Unit Utilization. For a 64 x 64 SIMPLE the utilization starts out at
approximately 70% for 1 PE and goes down to 50% for 32 PEs (see Figure 4.12). Once
again on small problems (16 x 16) the Execution Unit utilization is much lower than on
large problems (64 x 64).
[Figure 4.12: Execution Unit utilization (0% to 70%) versus number of PEs (1, 2, 4, 8, 16, 32) for the 16 x 16, 32 x 32, and 64 x 64 problem sizes.]
FIGURE 4.12. EXECUTION UNIT UTILIZATION FOR SIMPLE.
It is interesting that SIMPLE continues to speed up even with Execution Units which are
50% idle (see Figure 4.16 below). This differs from the Matrix Multiply example above,
which stopped speeding up when utilization dropped below 80%. This is due to the
complexity difference between SIMPLE and Matrix Multiply.
Execution Unit Load Balance. SIMPLE, being much more complex than Matrix
Multiply, spreads its load much better. Even in the worst case (16 x 16 on 32 PEs), where
little speed-up is being gained, every PE contributes to the final solution (see Figure 4.13).
[Figure 4.13: utilization (0% to 50%) of each Execution Unit, PE numbers 0 through 31.]
FIGURE 4.13. EXECUTION UNIT UTILIZATION (16 X 16 SIMPLE ON 32 PES).
When a medium sized problem is run the load balance is better; see Figure 4.14.
[Figure 4.14: utilization (0% to 50%) of each Execution Unit, PE numbers 0 through 31.]
FIGURE 4.14. EXECUTION UNIT UTILIZATION (32 X 32 SIMPLE ON 32 PES).
Finally, when a large problem is run the load balance is quite flat on 32 PEs (see Figure 4.15). This is a
much more realistic problem size for scientific programs.
[Figure 4.15: utilization (0% to 60%) of each Execution Unit, PE numbers 0 through 31.]
FIGURE 4.15. EXECUTION UNIT UTILIZATION (64 X 64 SIMPLE ON 32 PES).
Parallelization Overhead. The table below shows dynamic instruction counts for 
different problem sizes. All of these counts are for the 32 PE system (worst case). 
Work Instructions   Total Instructions   Percent Overhead
   [illegible]          [illegible]      [illegible; ends in .54%]
     215,546              240,288              10.30%
     907,711              993,322               8.62%
TABLE 4.4. PERCENT OVERHEAD INSTRUCTIONS FOR SIMPLE.
The percentage of overhead in SIMPLE is smaller than for Matrix Multiply. This is due to
the size of the for-loop bodies being larger in SIMPLE (see CONDUCTION code above).
Keeping the parallelization overhead low is central to efficient parallel processing.
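The text does not state how the overhead percentage is computed. The two legible rows of Table 4.4 are consistent with taking the non-work instructions as a fraction of the total instructions, so the following Python sketch assumes that definition; the function name is illustrative.

```python
def percent_overhead(work_instructions, total_instructions):
    """Overhead instructions as a percentage of all instructions executed.

    Assumed definition: it reproduces the legible rows of Table 4.4 but is
    not spelled out in the text.
    """
    overhead = total_instructions - work_instructions
    return 100.0 * overhead / total_instructions


# (work, total) pairs taken from the legible rows of Table 4.4.
rows = [(215_546, 240_288), (907_711, 993_322)]
for work, total in rows:
    print(f"{work:>9,} {total:>9,} {percent_overhead(work, total):6.2f}%")
```

Under the same assumption, a larger for-loop body raises the work count without adding much to the overhead count, which is why the percentage falls as the problem grows.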
Efficiency Comparison. For a 32 x 32 input, CONDUCTION takes 0.9 seconds on a
single iPSC/2 PE. This was measured by compiling the standard 'C' version of SIMPLE,
then running one iteration of the main loop, and subtracting the setup time (mainly the
GENERATE routines). CONDUCTION is used here rather than the total SIMPLE because of
the function calls and other operations between the major routines which do not appear in
the total. This would cause the single iPSC/2 PE time to be inflated compared to the PODS
time. However, the PODS Simulator still estimates that the program would run in 1.72
seconds. This is once again within 100% of the commercial version, and shows that
PODS is not grossly inefficient. This has been found to be true on all of the test cases.
Scalability. This is the true test of a parallel system: how well does it speed up for
real-world type problems? Figure 4.16 shows the speed-up of different size SIMPLE runs.
For comparison the speed-up Pingali and Rogers obtained for a 64 x 64 run is also plotted
[P&R90].
[Figure 4.16: speed-up (0 to 35) versus number of PEs (0 to 32); curves for Linear, 16 x 16, 32 x 32, 64 x 64, and 64 x 64 - P&R.]
FIGURE 4.16. SPEED-UP OF SIMPLE.
For the small 16 x 16 case, PODS tops out at a speed-up of 8.1. Eventually the
parallelization overhead would cause this small problem to run even slower as the number
of PEs increased. There is not yet a way for PODS to determine when a problem is so
small that it should not be spread across all of the available PEs. PODS either runs the SP
in place or distributes it across all PEs.
For the 32 x 32 case, speed-up tops out at 12.4. This order-of-magnitude speed-up is quite 
acceptable and is comparable to the speed-up obtained by Pingali & Rogers on the 64 x 64 
case. 
The 64 x 64 problem size is much more likely for scientific coding and is thus a better
gauge for the success of PODS in parallelizing scientific code. For the 64 x 64 case, PODS
is able to spread the work efficiently across all of the PEs, achieving a speed-up of 18.9 on
32 PEs. It is unlikely that a greater speed-up would occur on 64 PEs since the average
Execution Unit utilization is 44% and, based upon the 16 x 16 case, once the Execution
Unit utilization drops below 40% little speed-up is possible for SIMPLE. This speed-up is
better than Pingali & Rogers' 64 x 64 result.
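One conventional way to read the 18.9 figure is as parallel efficiency, that is, speed-up divided by the number of PEs. This metric is not used in the text and is distinct from the Execution Unit utilization reported above; the Python sketch below is only a worked calculation under that standard definition.

```python
def parallel_efficiency(speed_up, num_pes):
    """Fraction of ideal (linear) speed-up actually obtained."""
    return speed_up / num_pes


# Speed-up quoted in the text for the 64 x 64 SIMPLE run on 32 PEs.
efficiency_64 = parallel_efficiency(18.9, 32)
print(f"64 x 64 SIMPLE on 32 PEs: {efficiency_64:.0%} of linear speed-up")
```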
The reason PODS performs better is the remote caching in PODS. Pingali & Rogers
send the data values to the PE where they are needed. This causes a large number of
individual messages to be sent, thus their extreme interest in batching messages. In PODS
the individual messages are also batched; however, array data is passed a page at a time.
The remote caching allows PEs to access array elements as if they were local. Using this
locality of reference, PODS is able to read over 187,000 data elements from caches in the
CONDUCTION routine alone. This concept is heavily supported by the single assignment
nature of ID Nouveau [Roy90]. Single assignment allows PODS to ignore cache
coherency problems and to efficiently partition the arrays.
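The idea of page-at-a-time remote caching can be illustrated with a small Python sketch. This is not the PODS implementation: the class name, the page size of 16 elements, and the `fetch_page` callback are all hypothetical, and the sketch assumes a page is completely written before any PE fetches it.

```python
class RemoteArrayCache:
    """Per-PE cache of remote array pages (hypothetical structure)."""

    def __init__(self, fetch_page, page_size):
        self._fetch_page = fetch_page      # callable: page number -> list of elements
        self._page_size = page_size
        self._pages = {}                   # page number -> cached copy
        self.remote_fetches = 0
        self.cache_hits = 0

    def read(self, index):
        page_number, offset = divmod(index, self._page_size)
        if page_number in self._pages:
            self.cache_hits += 1
        else:
            # Single assignment: an element never changes once written, so a
            # cached page never has to be invalidated.
            self._pages[page_number] = self._fetch_page(page_number)
            self.remote_fetches += 1
        return self._pages[page_number][offset]


# A 100-element array held by another PE, modeled here as a plain list.
PAGE_SIZE = 16                             # assumed value, for illustration only
remote_array = list(range(100))
cache = RemoteArrayCache(
    lambda page: remote_array[page * PAGE_SIZE:(page + 1) * PAGE_SIZE],
    PAGE_SIZE,
)
values = [cache.read(i) for i in range(100)]
```

In this example a sequential scan of 100 elements needs only one remote request per page; every other read is served locally, which is the locality of reference the text describes.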
4.4. Summary 
This chapter discussed the PODS Simulator and some results of interesting benchmark
problems. The simulation is event-driven and is based on SMPL. The timing assumptions
were based on the iPSC/2 computer system. The simulator is like an emulator in that it
actually executes the code at the instruction level. Each different type of instruction takes
a different amount of simulated time. Thus a reasonable estimate of the actual run-time was
achieved.
Different measures of effectiveness were used to evaluate PODS on the classic Matrix
Multiply problem and on the more complex SIMPLE hydrodynamics problem. In all cases
the parallelization overhead was low and the support units did not slow down the Execution
Unit. It is important to note that the single PE time for PODS was not grossly inefficient
when compared to commercial 'C' systems when run on the same size CPU. This gives
the speed-up computations a solid base execution time from which to work.
For Matrix Multiply (a small problem) the Execution Unit utilization was high (80% and
greater) until the available iteration level parallelism was used up. When this occurred the
load balance was driven way down: half the PEs were being utilized at 80% and half at
less than 3%. This unequal load balance caused the speed-up to end abruptly. As the
problem size was increased this unequal load balance was staved off until greater and greater
numbers of PEs were made available. This points to a future enhancement: PODS needs
to know how many PEs to distribute a problem across. Currently PODS decides only whether to
distribute or not to distribute; there is no algorithm for gauging when a problem is so small
that all of the available PEs should not be used.
For comparison purposes the 10 x 10 Matrix Multiply speed-up predicted for Iannucci's
Hybrid Architecture is included. Iannucci is able to achieve impressive speed-up on small
problems because of the finer grain. PODS is designed to exploit iteration level
parallelism, and there is not that much available on the 10 x 10 Matrix Multiply. Iannucci's
system requires new hardware components while PODS is designed for off-the-shelf
components. It will be interesting to see how cost effective it is once it is built.
The more complex SIMPLE hydrodynamics program showed how well PODS performs
on scientific programs. Being much more complex, SIMPLE contains much of the
iteration level parallelism PODS is designed to exploit. The Execution Unit utilization was
not as high for SIMPLE as it was for Matrix Multiply. This is to be expected: the simple,
regular nature of Matrix Multiply is much easier to distribute evenly. However, with
SIMPLE there is not the abrupt load imbalance that Matrix Multiply encounters. The
complexity in SIMPLE allows speed-ups to continue rising even though the Execution
Unit utilization is only 50%.
Pingali and Rogers have run the ID version of SIMPLE on an iPSC/2. Their results were
quite good, but PODS is able to achieve an even greater speed-up. This is due to the
individual message passing which they use. Pingali and Rogers' static scheduling allows
one PE to know when another PE needs the value just calculated. They then send this
value to the needy PE. Recognizing early on that this would cause numerous messages,
they batched messages together in order to reduce communication costs. In PODS,
individual messages are also batched; however, array references are handled differently.
The remote caching of array values allows the locality of reference to be exploited. This
can be a major source of speed-up: on the larger SIMPLE runs over 187,000 cached array
reads occur out of the 210,000 total reads. This, in conjunction with the efficient
distribution of work, allows PODS to achieve even greater speed-ups.
CHAPTER 5
Conclusions 
This chapter presents the related research projects at other universities and some of the
advantages and disadvantages of single assignment, followed by a summary of the
conclusions found in this research. The areas for future research are discussed as well.
5.1. Related Work
All of these research projects recognize the need to integrate the Dataflow and von Neumann
models of computation. Different compiler technology and hardware are used with various
levels of success.
5.1.1. Iannucci's Hybrid Architecture
The Dataflow / von Neumann Hybrid Architecture proposed by Iannucci [Ian88] differs
from PODS in that it requires a new CPU specifically designed for the architecture, where
PODS uses off-the-shelf components. A compiler is used to partition the program into
scheduling quantums [Ian88]. Scheduling quantums are collections of dataflow
instructions subject to sequential execution. The Method of Dependency Sets is used to
generate these scheduling quantums without deadlock.
Like PODS this approach executes only one thread at a time, while blocking others which
are awaiting values. Given that the scheduling quantums are usually less than five
instructions long, the need for a fast context switch is high. In PODS the average SP is
over 25 instructions long. Iannucci's model predicts that 23,569 instructions would be
executed for a 10 x 10 Matrix Multiply [Ian88]. For the same program PODS only
executes 15,072 instructions; thus each PODS instruction does about 1.5 times the work.
Together these reduce the need for a fast context switch significantly.
Iannucci's ability to exploit a fair amount of parallelism from a 10 x 10 Matrix Multiply
(nearly 20 times speed-up) is impressive. It will be interesting to see how cost effective the
new architecture with its new CPU will be.
5.1.2. Gao's Hybrid Machine
At McGill University Guang Gao has been working on a hybrid machine which basically
adds control-flow to dataflow [Gao90]. This is achieved with a signal graph which is
similar to the PODS routing table. However, Gao does not use the concept of sequential
threads. Instead his granularity is a single instruction. He makes use of the pipelined
architectures available for von Neumann execution, but the next instruction is not
necessarily stored right after the present one. A signal graph indicates which instruction
will be loaded next. This has advantages and disadvantages.
The flexibility of this approach is very high. Depending upon the signal graph the system
will function as a dataflow machine or as a von Neumann machine. This can change back
and forth from instruction to instruction. The amount of overhead this incurs is unknown.
There is also the problem of a completely new hardware architecture, which may make this
approach intractable from a cost standpoint.
Another difference from PODS is the use of SISAL [MSS85] rather than ID Nouveau.
SISAL has a number of good concepts; however, any parallel architecture will have a
difficult time supporting the dynamic arrays, the update operator, and the recursive function
calling required. These force the memory manager to be highly efficient at allocating and
deallocating space. Additionally, the overall machine performance depends on a careful
layout of these dynamic arrays to reduce memory contention, a difficult problem at best.
5.1.3. Alfalfa
The Alfalfa system [G&H89] is mainly concerned with different dynamic scheduling
techniques and does not address the problem of distributing large data structures, such as
arrays. They achieve some impressive results for problems involving little to no data
communication; however, for Matrix Multiply, they see poor speed-up results. They claim
that this is due to the slow message passing time of the iPSC, but PODS shows that a data
cache combined with simple scheduling can overcome the long latencies associated with
accessing remote data.
5.1.4. Decoupled Multilevel Dataflow Model
The Decoupled Multilevel Dataflow Model at USC [E&G90] is a macro-dataflow project
aimed at the exploitation of virtual space, multilevel memory hierarchies, and RISC design
principles. The variable resolution (different size macro operators) allows programs to be
matched with the system. With vector and larger operators the standard von Neumann
optimizations can be used.
This system uses SISAL as Gao does and will have some of the same difficulties. The
problem is compounded by the need for vector extensions to SISAL so that the
programmer can tell the system what to vectorize. This places the additional burden of
specifying parallelism on the programmer.
The amount of overhead the system incurs, and the cost effectiveness of building a new
CPU, have yet to be determined. It is possible that this variable resolution will be very
effective at matching a program's inherent parallelism to the processor's capabilities.
5.1.5. Dynamic Structured Dataflow
The Dynamic Structured Dataflow project [Got90] at the Israel Academy of Sciences
Foundation for Basic Research is working on an execution model with arbitrarily fine
granularity. This approach is similar to the original PODS concept of SCSs, but here the
scheduling and resource allocation is done dynamically, where the original PODS attempted
to determine the best groupings at compile time. The current PODS uses iteration level
parallelism rather than SCS threads.
In Dynamic Structured Dataflow, the need for a fast context switch is very high, and a fair
amount of effort has been put into the Parallel Work Conveyor [G&K80] which satisfies
this requirement. Currently the project is working on an architectural specification and
simulator. It will be interesting to see how large a granularity the system produces and
how well the Parallel Work Conveyor operates. These will be very important to efficient
execution.
5.1.6. Pingali and Rogers' Compiler
At Cornell, Pingali and Rogers have been working on a compiler [P&R90, R&P88,
R&P89] which will take ID Nouveau and compile it into 'C' for execution on an Intel
iPSC/2. Their language (ID Nouveau) and architecture (iPSC/2) are the same as the current
PODS; however, from there on the approaches differ significantly.
In PODS there is an underlying execution model which is very different from that used in
standard von Neumann processors. Pingali and Rogers have stayed with the standard von
Neumann model. This places PODS closer to true dataflow and, as such, PODS is better able to
exploit irregular parallelism.
Pingali and Rogers exploit the locality of data reference in large programs; however, they do
not have anything analogous to the remote array caching in PODS. It is conceivable that
this could be added to their compiler. It is unclear how this would affect their speed-up.
One of the most critical elements of their work is the batching of messages. In their
system a PE knows when and where to send a data value to another PE. This would create
a large number of messages if it were not for the batching which is used. It would be
interesting to incorporate some of their ideas into PODS.
Their performance on an iPSC/2 running SIMPLE is quite good. This seems to be due to 
the clear and concise nature of a compiler which takes ID Nouveau and produces 'C' code 
for a parallel machine. This approach has stimulated the desire to build such a compiler for 
PODS. 
5.2. Advantages and Disadvantages of Single Assignment
The proper use of single assignment is central to PODS. The main advantage of single
assignment is its ability to implicitly expose parallelism. With single assignment only the
definitional data dependencies restrict parallelism. There are no extra dependencies based
upon storage location naming. This is critical for parallel program synchronization;
otherwise innocuous timing bugs can occur.
Exposing this much parallelism can cause resource overloading. The reality of physical
machines requires that the parallelism be throttled by the operating system. This throttling
can take significant overhead. This disadvantage is minimized in PODS by the large
granularity of the SPs.
An oft criticized feature of implicit parallelism is the inability of a programmer to override
the synchronization when he knows a better way. This lack of control is unsettling to
many parallel programmers. This is because the current state of the parallel programming art
requires the programmer to take control or take his chances. See [Kar87] for a look at
parallel programming today.
Another danger is too much copying of intermediate array elements. If an update or replace
array operator is available, grossly inefficient programs can be written. Aids for detecting
this type of inefficiency are needed.
In an architectural sense single assignment has some problems. The fact that memory is
finite means that memory locations will have to be written over, i.e. a variable's definition
(its one and only assignment) will not exist forever. This presents the problem of knowing
when a variable is no longer needed by any of the processors.
The final factor is the ease (or difficulty) of programming in a single assignment language. See
[ANP87a] for a convincing argument as to the ease of single assignment programming.
The combination of single assignment, areas-of-responsibility, and caching leads to low
communication overhead and well-balanced loads when applied to the majority of the
Livermore Loops [BNR89b, LLL83], Matrix Multiply, and SIMPLE [CH&R]. Single
assignment permits the exploitation of large numbers of PEs automatically.
Synchronization problems are solved through the adoption of the single assignment policy.
By segmenting array writes using the area-of-responsibility concept, all PEs perform
roughly the same number of remote accesses. These two concepts allow caching to be
implemented without extensive communication, and caching is central to reducing remote
array accesses.
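The segmentation of array writes can be pictured with a short Python sketch. The even block split used here is an assumption made for illustration; the actual partitioning rule and range filter used by PODS are defined elsewhere in this dissertation, and the function names below are hypothetical.

```python
def area_of_responsibility(lo, hi, pe, num_pes):
    """Inclusive index range of [lo, hi] assigned to `pe` (block split, assumed)."""
    count = hi - lo + 1
    base, extra = divmod(count, num_pes)
    start = lo + pe * base + min(pe, extra)
    size = base + (1 if pe < extra else 0)
    return start, start + size - 1         # size 0 gives an empty range (end < start)


def filtered_range(loop_lo, loop_hi, area):
    """Iterations of a for-loop that fall inside one PE's area."""
    first = max(loop_lo, area[0])
    last = min(loop_hi, area[1])
    return range(first, last + 1)          # empty when first > last


# Hypothetical example: a loop over 1..16 writing a 16-element array on 5 PEs.
num_pes = 5
writes_per_pe = [
    list(filtered_range(1, 16, area_of_responsibility(1, 16, pe, num_pes)))
    for pe in range(num_pes)
]
```

Each element is written by exactly one PE, and the per-PE counts differ by at most one, which is the kind of balance the paragraph above attributes to the area-of-responsibility concept.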
5.3. Summary
This dissertation has discussed the Process-Oriented Dataflow System and its suitability for
running scientific programs on distributed-memory MIMD machines. The partitioning and
distribution algorithms, along with their underlying principles, have been examined and
discussed. The logical implementation which was used in the simulations has been
presented along with the support software suite. The remote array caching scheme used
has been described. The event-driven simulation was explained and the results of
experiments with Matrix Multiply and SIMPLE were examined.
1s1 
It has been found that PODS can achieve speed-ups of nearly 20 times on large versions of 
SIMPLE. This surpasses the speed-up of other approaches on similar architectures. This 
speed-up is sufficient to warrant recoding of large scientific programs from FORTRAN or 
C to ID Nouveau; usually a 10 times speed-up is considered large enough. When large 
scientific programs are written, they are usually written by scientists, not computer 
programmers. ID Nouveau will be easier for scientists to use because of its declarative 
nature. Combined with the automatic parallelization in PODS, this approach is much
more productive for parallel scientific programming.
The basic PODS model of execution, with its ability to "degenerate" to a von Neumann
machine as necessary, has the following advantages:
• the number of tokens through the matching store and across the
  routing network in general is reduced due to the use of SPs.
• instruction fetch/execution is as efficient as in a typical von
  Neumann architecture, especially when loops run in-place.
• programmers may ignore such parameters as the number of
  available PEs - the automatic partitioning allows a higher level
  of abstraction.
• SPs are long and execute an average of 70 instructions before a
  context switch - reducing context switches greatly increases the
  efficiency and scalability of the system.
The mechanism for distributing arrays in PODS not only allows for larger arrays than
normally available in such machines, but it also takes advantage of locality of reference.
The remote array caching scheme further enhances the locality.
Both SIMPLE and Matrix Multiply have been used as performance measures. Matrix
Multiply is a good measure because it has several interesting properties:
• there are multiple code-blocks
• a new array must be dynamically allocated and distributed
• there is a loop-carried dependency in the innermost loop
• the two input arrays, A and B, have different access patterns
• the sizes of the input arrays are not known at compile time
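Several of these properties can be seen in a plain sequential version of the kernel. The sketch below is illustrative only: it is written in Python rather than ID Nouveau, it is not taken from the PODS sources, and the function name `matmul` and the list-of-lists layout are assumptions made for the example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows (illustrative layout)."""
    n, m, p = len(a), len(b), len(b[0])   # sizes are only known at run time
    c = [[0] * p for _ in range(n)]       # the result array is allocated dynamically
    for i in range(n):
        for j in range(p):
            acc = 0
            for k in range(m):
                # Loop-carried dependency: each iteration needs the previous acc.
                # A is read along row i, while B is read down column j.
                acc += a[i][k] * b[k][j]
            c[i][j] = acc                 # each element of C is written exactly once
    return c
```

For example, `matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19, 22], [43, 50]]`.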
Matrix multiply also forms the basis for many important scientific algorithms such as LU
decomposition, convolution, and the Fast Fourier Transform. SIMPLE is a good measure
because of its size (nearly 1000 FORTRAN instructions) and complexity (numerous SPs
and function calls with many dynamic arrays). SIMPLE was also designed as a benchmark
program by one of the largest users of supercomputers, Lawrence Livermore Laboratory.
In summary, PODS allows MIMD machines to exploit vector and data parallelism
efficiently, while still providing the flexibility of distributed-memory MIMD machines.
5.4. Future Research
This is the first step in the development of a new approach to parallel processing. To
further understand the advantages and disadvantages of this approach, a variety of issues
need to be examined:
• Reduction operators are not fully exploited. How can vector to
  scalar operations be implemented? Current ideas include a
  mechanism to allow collection of subrange results.
• How well can scientific programmers use ID Nouveau and
  PODS?
• How well does PODS execute non-scientific code?
• Should the programmer be able to specify any partitioning
  parameters?
• How well does PODS run on real hardware?
To investigate these issues two major projects are in the works: the first is HyperPODS, an
implementation of PODS on an Intel iPSC/2; the second is a PODS compiler which would
take ID Nouveau and compile it directly for a particular implementation of PODS (e.g.,
HyperPODS).
5.4.1. HyperPODS
HyperPODS is currently being built using the logical implementation described herein. So
far the logical implementation has served well, but changes will undoubtedly be necessary.
The issues below will have to be addressed:
• Register Allocation - the passing of tokens internal to an SP
  will be done through registers.
• Presence Bits - these are not supported in the hardware, but are
  necessary for I-structures.
• Blocking of an SP - this will have to be done at certain
  instructions and not others. The efficiency of this is important to
  context switch times.
• Matching Store - this support unit is the most utilized. It must
  be efficient.
• Routing Unit - the batching of messages will have to be done
  in the CPU and the interaction with the Direct-Connect Module
  is critical to the scalability of the system.
• Array Manager - the enqueuing and dequeuing of reads will
  require dynamic memory allocation.
• Resource Limitations - PODS may have to be throttled down to
  prevent deadlock. The exact implementation of this is unclear.
These are just a few of the research issues which the PODS team will be addressing in
HyperPODS.
5.4.2. PODS Compiler
The development of HyperPODS and the success of Pingali and Rogers have led to
renewed interest in a PODS Compiler. The GITA Compiler currently in use is written in
LISP and takes up a large amount of memory when executing. More importantly, there are
optimizations which PODS can use which are not in the GITA Compiler (e.g., scalar
expansion). The PODS Compiler would replace the GITA Compiler and the PODS
Translator. The PODS Partitioner could be incorporated, but this is not necessary.
Once HyperPODS and a PODS Compiler for it are finished, the complete PODS system
can be sent to beta-test sites at facilities which have an iPSC/2 and are interested in
parallelizing more of their scientific programs. This will be the true test of the PODS concept.
CHAPTER 6
References
[A&O85] S. Allan, R. Oldehoeft. HEP SISAL: Parallel Functional Programming. In
Parallel MIMD Computation: the HEP Supercomputer and Its Applications.
J. S. Kowalik, Eds. (MIT Press, Cambridge, MA, 1985), pp. 123-150.
[A&K87] R. Allen, K. Kennedy. Automatic Translation of FORTRAN Programs to
Vector Form. ACM Trans. Prog. Lang. and Sys. V9, n4 (1987), pp. 491-542.
[A&C86] Arvind, D. E. Culler. Dataflow Architectures. Annual Reviews in
Computer Science V1, (1986), pp. 225-253.
[A&E88] Arvind, K. Ekanadham. Future Scientific Programming on Parallel
Machines. J. Parallel Dist. Comp. V5, n5 (1988), pp. 460-493.
[AGP78] Arvind, K. P. Gostelow, W. Plouffe. An Asynchronous Programming
Language and Computing Machine. Technical Report 114a (December
1978), Department of Information and Computer Science, University of
California, Irvine.
[A&N87] Arvind, R. S. Nikhil. Executing a Program on the MIT Tagged-Token
Dataflow Architecture. MIT Technical Report Computation Structures
Group Memo 271 (March 1987), Laboratory for Computer Science, MIT.
[ANP87a] Arvind, R. S. Nikhil, K. K. Pingali. I-Structures: Data Structures for
Parallel Computing. MIT Technical Report Computation Structures Group
Memo 269 (February 1987), Laboratory for Computer Science, MIT.
[ANP87b] Arvind, R. S. Nikhil, K. K. Pingali. ID Nouveau Reference Manual Part
II: Operational Semantics. MIT Technical Report (April 1987), Laboratory
for Computer Science, MIT.
[ANP89] Arvind, R. S. Nikhil, K. K. Pingali. I-Structures: Data Structures for
Parallel Computing. ACM TOPLAS V11, n4 (1989), pp. 598-632.
[Bab84] R. G. Babb. Parallel Processing with Large-Grain Data Flow Techniques.
IEEE Computer (July 1984), pp. 55-61.
[Bic87] L. Bic. A Process-Oriented Model for Efficient Execution of Dataflow
Programs. Proc. 7th Int'l Conf. on Distributed Computing Systems
(1987), pp. 744-758.
[Bic90] L. Bic. A Process-Oriented Model for Efficient Execution of Dataflow
Programs. Journal of Dist. and Parallel Computing (March 1990), pp.
15-38.
[BNR89a] L. Bic, M. D. Nagel, J. M. A. Roy. Automatic Data/Program Partitioning
Using the Single Assignment Principle. Technical Report #89-09 (January
1989), University of California, Irvine.
[BNR89b] L. Bic, M. D. Nagel, J. M. A. Roy. Automatic Data/Program Partitioning
Using the Single Assignment Principle. Supercomputing '89 (1989), pp.
551-556.
[BNR90a] L. Bic, M. D. Nagel, J. M. A. Roy. Executing Matrix Multiply on a
Process Oriented Dataflow Machine. Technical Report 90-08 (April 1990),
Department of ICS, University of California, Irvine.
[BNR90b] L. Bic, M. D. Nagel, J. M. A. Roy. On Array Partitioning in PODS. In
Advanced Topics in Data-Flow Computing. J. L. Gaudiot, L. Bic, Eds.
(Prentice Hall, Englewood Cliffs, New Jersey, 1990), pp. 305-325.
[B&E87] R. Buehrer, K. Ekanadham. Incorporating Data Flow Ideas into von
Neumann Processors for Parallel Execution. IEEE Trans. Comp. VC-36,
n12 (1987), pp. 1515-1521.
[Bur88] D. Burns. Loop-Based Concurrency Identified as Best at Exploiting
Parallelism. Computer Technology Review (Winter 1988), pp. 19-23.
[C&K88] D. Callahan, K. Kennedy. Compiling Programs for Distributed-Memory
Multiprocessors. Jour. of Supercomputing V2, (1988), pp. 151-169.
[CH&R] W. P. Crowley, C. P. Henderson, T. E. Rudy. The SIMPLE Code.
UCID 17715 (February 1978), Lawrence Livermore Laboratory.
[DFL89] D. DeForest, A. Faustini, R. Lee. Hyperflow. The Third Conference on
Hypercube Concurrent Computers and Applications (1989), pp. 482-488.
[Den75] J. B. Dennis. First Version of a Dataflow Procedure Language. Machine
Tech. Memorandum 61 Cambridge, MA. M.I.T.
[Dun88] T. H. Dunigan. Performance of a Second Generation Hypercube.
Technical Report ORNL/TM-10881 (November 1988), Oak Ridge National
Laboratory.
[E&G90] P. Evripidou, J. L. Gaudiot. The USC Decoupled Multilevel Data-Flow
Execution Model. In Advanced Topics in Data-Flow Computing. J. L.
Gaudiot, L. Bic, Eds. (Prentice Hall, Englewood Cliffs, New Jersey,
1990), pp. 347-380.
[Gao90] G. R. Gao. A Flexible Architecture Model for Hybrid Data-Flow and
Control-Flow Evaluation. In Advanced Topics in Data-Flow Computing.
J. L. Gaudiot, L. Bic, Eds. (Prentice Hall, Englewood Cliffs, New Jersey,
1990), pp. 327-346.
[G&H89] B. Goldberg, P. Hudak. Implementing Functional Programs on a
Hypercube Multiprocessor. The Third Conference on Hypercube
Concurrent Computers and Applications (1989), pp. 489-504.
[G&K80] A. Gottlieb, C. P. Kruskal. A Data Motion Algorithm. Technical Report
Ultracomputer Note 7 (January 1980), Courant Institute of Mathematical
Sciences.
[Got90] I. Gottlieb. Work Distribution in the DSDF Architecture. In Advanced
Topics in Data-Flow Computing. J. L. Gaudiot, L. Bic, Eds. (Prentice
Hall, Englewood Cliffs, New Jersey, 1990), pp. 381-409.
[H&B84] K. Hwang, F. A. Briggs. Computer Architecture and Parallel Processing,
McGraw-Hill, New York, New York, 1984.
[Ian88] R. A. Iannucci. A Dataflow/von Neumann Hybrid Architecture.
Dissertation (1988), MIT.
[IEEE89] IEEE. The Computer Spectrum: A Perspective on the Evolution of
Computing. IEEE Computer (November 1989), pp. 57-68.
[Ins91] IEEE. Intel's Newest Supercomputer. In The Institute, Eds., 1991, pp. 6.
[iPSC89] iPSC User's Guide, Intel, Portland, Oregon, 1989.
[K&T88] M. Kallstrom, S. S. Thakkar. Programming Three Parallel Computers.
IEEE Software (January 1988), pp. 11-22.
[Kap86] I. Kaplan. A Large-Grain Dataflow Architecture. Workshop on Future
Directions in Computer Architecture and Software (1986), pp. 131-138.
[Kar87] A. H. Karp. Programming for Parallelism. IEEE Computer (May 1987),
pp. 43-57.
[K&B88] A. H. Karp, R. G. Babb II. A Comparison of 12 Parallel Fortran Dialects.
IEEE Software (September 1988), pp. 52-67.
[LLL83] Lawrence Livermore National Laboratory. FORTRAN KERNELS: MFLOPS,
V221DEC!86 m/328 (Regents of the University of California, Livermore, CA,
1983).
[Lan65] P. J. Landin. A Correspondence Between ALGOL 60 and Church's
Lambda-Notation: Part I. Comm. ACM V8, n2 (1965), pp. 89-101.
[L&G86] J. W. Liu, A. Grimshaw. A Distributed System Architecture Based on
Macro Dataflow Model. Workshop on Future Directions in Computer
Architecture and Software (1986), pp. 155-162.
[Mac87] M. H. MacDougall. Simulating Computer Systems: Techniques and Tools,
MIT Press, Cambridge, MA, 1987.
[MSS85] J. R. McGraw, et al. SISAL, Streams and Iteration in a Single Assignment
Language. Language Reference Manual, Ver. 1.2 M-146 (1985),
Lawrence Livermore National Laboratory.
[Nik87a] R. S. Nikhil. ID Nouveau Reference Manual Part I: Syntax. MIT
Technical Report (April 1987), Laboratory for Computer Science, MIT.
[Nik87b] R. S. Nikhil. ID World Reference Manual (for Lisp Machines). MIT
Technical Report (April 1987), Laboratory for Computer Science, MIT.
[Nik88] R. S. Nikhil. ID Reference Manual - Version 88.1. MIT Technical Report
Computation Structures Group Memo 284 (August 1988), Laboratory for
Computer Science, MIT.
[N&A89] R. S. Nikhil, Arvind. Can Dataflow Subsume von Neumann Computing?
16th Int'l Computer Architecture Conference (1989), pp. 262-272.
[P&W86] D. A. Padua, M. J. Wolfe. Advanced Compiler Optimizations for
Supercomputers. Comm. of ACM V29, n12 (1986), pp. 1184-1201.
[P&B90] C. M. Pancake, D. Bergmark. Do Parallel Languages Respond to the
Needs of Scientific Programmers? IEEE Computer (December 1990), pp.
13-23.
[Pap88] G. M. Papadopoulos. Implementation of a General-Purpose Dataflow
Multiprocessor. Technical Report TR-432 (August 1988), MIT Laboratory
for Computer Science.
[P&R90] K. Pingali, A. Rogers. Compiler Parallelization of SIMPLE for a
Distributed Memory Machine. TR 90-1084 (January 1990), Department of
Computer Science, Cornell University.
[R&P88] A. Rogers, K. Pingali. Process Decomposition Through Locality of
Reference. Technical Report TR 88-935 (August 1988), Department of
Computer Science, Cornell University.
[R&P89] A. Rogers, K. Pingali. Compiling Programs for Distributed Memory
Architectures. 4th Hypercube Concurrent Computers & Applications
Conference (1989), pp. 529-542.
[Roy90] J. M. A. Roy, M. D. Nagel, L. Bic. Partitioning Declarative Programs into
Communicating Processes. Supercomputing '90 (1990), pp. 846-855.
[S&H87] B. Shirazi, A. R. Hurson. A Large/Fine Grain Parallel Dataflow Model and
its Performance Evaluation. 1987 National Computer Conference (1987),
pp. 119-126.
[Smi85] B. Smith. The Architecture of HEP. In Parallel MIMD Computation: the
HEP Supercomputer and Its Applications. J. S. Kowalik, Eds. (MIT
Press, Cambridge, MA, 1985), pp. 41-58.
[Smi81] B. J. Smith. Architecture and Applications of the HEP Multiprocessor
Computer System. Society of Photo-Optical Instrumentation Engineers
V298 Real-Time Signal Processing IV, (1981), pp. 241-248.
[Tra86] K. R. Traub. A Compiler for the MIT Tagged-Token Dataflow
Architecture. MIT Technical Report (August 1986), Laboratory for
Computer Science, MIT.
[U&Z89] T. Ungerer, E. Zehendner. A Parallel Programming Language Directed
Towards Top-Down Software Development. International Conference on
Parallel Processing (1989), pp. 122-125.
[Veg88] S. R. Vegdahl. Architectures That Support Functional Programming
Languages. In Computer Architecture: Concepts and Systems. V. M.
Milutinovic, Eds. (North-Holland, New York, NY, 1988), pp. 405-453.
[W&A85] W. W. Wadge, E. A. Ashcroft. Lucid, the Dataflow Programming
Language, Academic Press, London, 1985.
[Z&U87] E. Zehendner, T. Ungerer. The ASTOR Architecture. 7th International
Conference on Distributed Computing Systems (1987), pp. 424-430.
Appendix A: Range Filter Algorithms 
This appendix presents the different range filter algorithms used in PODS. There are three
base algorithms and three parameterizations used to generate a specific range filter.
When a level of a nest (say i_a) is distributed, the range filter needs to consider all of the
indices above it, i_1 to i_(a-1). This produces three different base algorithms in the current
PODS. The first base algorithm is the most common, and uses only the first level index;
see Figure A.1 below. This range filter is the most common because PODS will distribute
the outermost level whenever possible.
1   m = 0
2   if m > interval count of master array then exit
3   set i to the maximum of the beginning of the interval
      and the loop beginning
4   if i is not in the interval or the first element of this
      dimension is not owned then increment m and goto 2
5   if i is within the loop bounds then set continue to TRUE
      and send i and continue into the loop body
      else increment m and goto 2
6   if continue is TRUE do the loop body else goto 9
7   true part of loop body
8   if new_i is within loop bounds set continue to TRUE,
      send i and continue into the loop body, and goto 4
      else set continue to FALSE, send i and continue into the
      loop body, and goto 6 (with i set to new_i)
9   false part of loop body
FIGURE A.1. BASE RANGE FILTER ALGORITHM FOR OUTERMOST LEVEL
DISTRIBUTION.
The general algorithm functions by repeatedly extracting ranges from the array boundary
table. While within the range, the filter passes indices for elements within that range. The
filter also keeps the loop alive by sending a continue token to the loop switch until all
ranges have been exhausted. In the figure above, m is just some variable used to count the
intervals; i is the loop index, and continue is the signal to the loop body telling it whether
to continue or not.
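The behavior described above can be approximated in a few lines of Python. This is a simplified sketch, not the PODS implementation: the array boundary table is modeled as a hypothetical list of inclusive (begin, end) pairs owned by one PE, the ownership test of line 4 is omitted, and the continue token is represented by the generator simply running out of values.

```python
def range_filter(intervals, loop_lo, loop_hi):
    """Yield, in ascending order, the loop indices that lie in owned intervals.

    intervals: hypothetical stand-in for one PE's entries in the array
    boundary table, given as inclusive (begin, end) pairs in ascending order.
    """
    for begin, end in intervals:          # m counts through the intervals
        i = max(begin, loop_lo)           # line 3: clamp to the loop beginning
        while i <= min(end, loop_hi):     # lines 4-5: inside interval and loop bounds
            yield i                       # send i into the loop body
            i += 1                        # line 8: compute new_i
    # Exhausting the generator plays the role of continue = FALSE.
```

With intervals `[(0, 3), (8, 11)]` and a loop from 2 to 9, the filter passes 2, 3, 8, and 9.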
The next base algorithm is for loops which are distributed at the second outermost level. In
this case the range filter must consider two indices, i and j. Figure A.2 below shows this
algorithm. Notice that it is only slightly different from the first case; line #3 is added and
j rather than i is checked in lines #4 - #10.
1   m = 0
2   if m > interval count of master array then exit
3   if i is not in interval m then increment m and goto 2
4   set j to the maximum of the beginning of the interval
      and the loop beginning
5   if j is not in the interval or the first element of this
      dimension is not owned then increment m and goto 2
6   if j is within the loop bounds then set continue to TRUE
      and send j and continue into the loop body
      else increment m and goto 2
7   if continue is TRUE do the loop body else goto 10
8   true part of loop body
9   if new_j is within loop bounds set continue to TRUE,
      send j and continue into the loop body, and goto 5
      else set continue to FALSE, send j and continue into the
      loop body, and goto 7 (with j set to new_j)
10  false part of loop body
FIGURE A.2. BASE RANGE FILTER ALGORITHM FOR SECOND OUTERMOST LEVEL
DISTRIBUTION.
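A rough Python rendering of the second-level case follows. It is a sketch under assumptions: each interval is modeled as a hypothetical pair of inclusive ranges, one for i and one for j, and the ownership test is again left out. The only difference from the outermost-level filter is the extra check on the enclosing index i.

```python
def range_filter_level2(intervals, i, loop_lo, loop_hi):
    """Yield the j indices owned by this PE for a fixed outer index i.

    intervals: hypothetical list of ((i_begin, i_end), (j_begin, j_end))
    inclusive pairs, in ascending order.
    """
    for (i_begin, i_end), (j_begin, j_end) in intervals:
        if not (i_begin <= i <= i_end):   # line 3: skip intervals not containing i
            continue
        j = max(j_begin, loop_lo)         # line 4: clamp to the loop beginning
        while j <= min(j_end, loop_hi):   # lines 5-6: inside interval and loop bounds
            yield j
            j += 1                        # line 9: compute new_j
```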
The final case handles the situation when the third level of a nest is distributed. Once again 
this is a simple extension of the first case: adding additional lines to check the additional 
levels (lines #3 and #4) and checking k rather than i. This algorithm can easily be extended 
to handle further levels once PODS handles arrays with more than three dimensions. 
1   m = 0
2   if m > interval count of master array then exit
3   if i is not in interval m then increment m and goto 2
4   if j is not in interval m then increment m and goto 2
5   set k to the maximum of the beginning of the interval
      and the loop beginning
6   if k is not in the interval or the first element of this
      dimension is not owned then increment m and goto 2
7   if k is within the loop bounds then set continue to TRUE
      and send k and continue into the loop body
      else increment m and goto 2
8   if continue is TRUE do the loop body else goto 11
9   true part of loop body
10  if new_k is within loop bounds set continue to TRUE,
      send k and continue into the loop body, and goto 6
      else set continue to FALSE, send k and continue into the
      loop body, and goto 8 (with k set to new_k)
11  false part of loop body
FIGURE A.3. BASE RANGE FILTER ALGORITHM FOR THIRD OUTERMOST LEVEL
DISTRIBUTION.
Once the base algorithm is selected the three parameterizations are applied. These are:
1. Loop direction parameterization, 1 to n vs. n downto 1.
2. Indices parameterization, A[i, j] vs. A[c_i*i+k_i, c_j*j+k_j].
3. Stepsize parameterization, step by 1 vs. step by c.
These parameterizations are independent of each other. The first, loop direction
parameterization, is quite simple. Lines #1, #2, and #3 need to be replaced as shown in
bold in Figure A.4 below. In this way the intervals are accessed in descending order.
Note that the interval counter m is decremented rather than incremented.
1   m = interval count of master array
2   if m < 0 then exit
3   set i to the minimum of the end of the interval
      and the loop end
4   if i is not in the interval or the first element of this
      dimension is not owned then decrement m and goto 2
5   if i is within the loop bounds then set continue to TRUE
      and send i and continue into the loop body
      else decrement m and goto 2
6   if continue is TRUE do the loop body else goto 9
7   true part of loop body
8   if new_i is within loop bounds set continue to TRUE,
      send i and continue into the loop body, and goto 4
      else set continue to FALSE, send i and continue into the
      loop body, and goto 6 (with i set to new_i)
9   false part of loop body
FIGURE A.4. RANGE FILTER ALGORITHM FOR STEPSIZE -1.
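The descending variant can be sketched the same way. As before, this is an illustration under assumptions rather than PODS code: intervals are a hypothetical list of inclusive (begin, end) pairs stored in ascending order, and the ownership test is omitted. The intervals are visited from last to first and the index starts at the smaller of the interval end and the loop end.

```python
def range_filter_descending(intervals, loop_hi, loop_lo):
    """Yield owned loop indices for a loop running from loop_hi down to loop_lo."""
    for begin, end in reversed(intervals):   # m is decremented, not incremented
        i = min(end, loop_hi)                # line 3: clamp to the loop end
        while i >= max(begin, loop_lo):      # still inside interval and loop bounds
            yield i
            i -= 1                           # stepsize -1
```

With intervals `[(0, 3), (8, 11)]` and a loop from 9 down to 2, the filter passes 9, 8, 3, and 2.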
The second parameterization is for complex indices like A[c_i*i+k_i, c_j*j+k_j]. The range
filter for this situation needs different index check conditions. Figure A.5 shows the
algorithm for a second level distribution (along j) writing into A[c_i*i+k_i, c_j*j+k_j].
1   m = 0
2   if m > interval count of master array then exit
3   if (c_i*i+k_i) is not in interval m then increment
      m and goto 2
4   set j to the maximum of the loop beginning and
      (beginning of the interval - k_j) / c_j
5   if (c_j*j+k_j) is not in the interval or the first
      element of this dimension is not owned then
      increment m and goto 2
6   if j is within the loop bounds then set continue to TRUE
      and send j and continue into the loop body
      else increment m and goto 2
7   if continue is TRUE do the loop body else goto 10
8   true part of loop body
9   if new_j is within loop bounds set continue to TRUE,
      send j and continue into the loop body, and goto 5
      else set continue to FALSE, send j and continue into the
      loop body, and goto 7 (with j set to new_j)
10  false part of loop body
FIGURE A.5. SECOND LEVEL DISTRIBUTION RANGE FILTER FOR A[c_i*i+k_i, c_j*j+k_j].
The lines in bold (lines #3 - #5) have different check conditions than those in Figure A.2;
this is the only change.
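The starting value in line #4 can be illustrated for a single dimension. The sketch below makes two assumptions that the figure does not state: the coefficient c is positive, and the division is rounded up so that c*j+k never falls before the beginning of the interval. The function names and the (begin, end) interval layout are hypothetical.

```python
def first_affine_index(begin, loop_lo, c, k):
    """Smallest j >= loop_lo with c*j + k >= begin, assuming c > 0."""
    j0 = -((k - begin) // c)      # ceiling of (begin - k) / c in integer arithmetic
    return max(loop_lo, j0)


def affine_range_filter(intervals, loop_lo, loop_hi, c, k):
    """Yield the j indices whose target element c*j + k lies in an owned interval."""
    for begin, end in intervals:
        j = first_affine_index(begin, loop_lo, c, k)
        while j <= loop_hi and c * j + k <= end:   # line 5 check on c*j + k
            yield j
            j += 1
```

For writes into A[2*j+1] with an owned interval from 4 to 9 and a loop from 0 to 10, the filter passes j = 2, 3, and 4, which touch elements 5, 7, and 9.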
The third parameterization is also quite simple. This handles the case where the stepsize is
neither 1 nor -1, but some constant c. Note that this stepsize is important only on the level of
the nest which is distributed. Figure A.6 shows the algorithm for a third level distribution
with stepsize c. Note that line #5, in bold, is the only modification.
1   m = 0
2   if m > interval count of master array then exit
3   if i is not in interval m then increment m and goto 2
4   if j is not in interval m then increment m and goto 2
5   set k to the (first multiple of c + start of
      loop) >= start of interval m
6   if k is not in the interval or the first element of this
      dimension is not owned then increment m and goto 2
7   if k is within the loop bounds then set continue to TRUE
      and send k and continue into the loop body
      else increment m and goto 2
8   if continue is TRUE do the loop body else goto 11
9   true part of loop body
10  if new_k is within loop bounds set continue to TRUE,
      send k and continue into the loop body, and goto 6
      else set continue to FALSE, send k and continue into the
      loop body, and goto 8 (with k set to new_k)
11  false part of loop body
FIGURE A.6. RANGE FILTER FOR THIRD LEVEL DISTRIBUTION WITH STEPSIZE C.
As an example consider the loop range: for k = 2 to 30 stepsize 3. Valid values of k are: 2,
5, 8, 11, 14, 17, 20, 23, 26, and 29. If an interval m, for a given PE, ran from 6 to 16
inclusive, then k would start out at 8, and stop at 14.
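This example can be checked with a short Python sketch. It is illustrative only: the function names are hypothetical, a positive stepsize is assumed, and the interval is given as an inclusive (begin, end) pair.

```python
def first_stepped_index(loop_start, step, interval_start):
    """Smallest loop_start + n*step (n >= 0) that is >= interval_start, for step > 0."""
    if interval_start <= loop_start:
        return loop_start
    n = -((loop_start - interval_start) // step)   # ceiling division
    return loop_start + n * step


def stepped_range_filter(loop_start, loop_end, step, interval):
    """Yield the loop values that fall inside one owned interval."""
    begin, end = interval
    k = first_stepped_index(loop_start, step, begin)
    while k <= min(end, loop_end):
        yield k
        k += step
```

For the loop from 2 to 30 with stepsize 3 and the interval from 6 to 16, the filter passes 8, 11, and 14.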
The three basic algorithms plus the three parameterizations allow PODS to insert the proper 
range filter at compile time. 
