Automatic parallelisation for a class of URE problems by Chen, Xian
THE UNIVERSITY OF NEWCASTLE UPON TYNE
DEPARTMENT OF COMPUTING SCIENCE
Automatic Parallelisation
for a Class of URE Problems
by
Xian Chen
PhD Thesis
November 1995
Abstract
This thesis deals with the methodology and software of automatic parallelisation for
numerical supercomputing and supercomputers. We focus on the problem of
Uniform Recurrence Equations (UREs), which arise widely in numerical computations.
We propose a complete methodology for the automatic generation of parallel programs for
regular array designs. The methodology starts with the introduction of a set of canonical
dependencies, which provides a general model of the various URE problems. Based
on these canonical dependencies, partitioning and mapping methods are developed which
give the foundation of a universal design process. Using the theoretical results, we
propose the structures of parallel programs and eventually generate, automatically, parallel
codes which run correctly and efficiently on a transputer array.
The achievements presented in this thesis can be regarded as significant progress
in the area of automatic generation of parallel codes and regular (systolic) array design.
This methodology is integrated and self-contained, and may be the only practical working
package in this area.
Acknowledgements
I sincerely thank my supervisor, Dr. Graham Megson, for suggesting this area of research
and for his guidance over the course of my work.
I would also like to express my appreciation to Prof. P. Lee and Dr. J. K. Wright, both
members of my thesis committee. My thanks also go to the members of the Algorithm
Engineering Research Group; their efforts were greatly appreciated.
Financial support for this research was provided by the Research Committee of the
University of Newcastle upon Tyne, grant No. F406/RC/Ol, with additional funds from the CVCP
Overseas Research Students Awards Scheme, ORS/92029017.
Finally, I thank most of all my mother, my sister and my wife for their support and
understanding during my studies, which was the essential condition for me to complete the
work.
Declaration
I certify that no part of the material offered here in this thesis has been previously
submitted by me for a degree or other qualification in this or any other university.
Xian Chen
Dedicated to My Father
Terminology
Below is a list of common abbreviations and assumptions. Local variables are specified
as necessary in the text.
Abbreviations
EVPA: Enlarged Virtual Processor Array.
SNF: Smith Normal Form of matrix.
HNF: Hermite Normal Form of matrix.
HSNF: Half SNF of matrix.
DF: Diagonalisation Form of matrix.
DC: Data Flow Cube.
DV: Data Flow Vector.
LSGP: Local Sequential and Global Parallel.
LPGS: Local Parallel and Global Sequential.
N-D: N dimensional.
General Assumptions
Unless otherwise stated, the following are assumed:
GCD stands for the greatest common divisor, and gcd(·) for the GCD of a vector.
det(·) stands for the determinant of a square matrix.
An integral matrix M is said to be a unimodular matrix if $|\det(M)| = 1$.
$M = \mathrm{diag}(x_0, \ldots, x_n)$ means that M is a diagonal matrix with $x_0, \ldots, x_n$ as its
diagonal entries.
A boldface small letter, say a, stands for a column vector.
An overbar indicates a row vector. For instance, $\bar{a} = [a_1, a_2, a_3]$.
A bold capital letter, say M, stands for a matrix.
Generally $b_i \in B$ means that $b_i$ is the i-th column vector of matrix B.
Generally $b_i \in b$ means that $b_i$ is the i-th entry of vector b.
Generally $b_{i,j} \in B$ means that $b_{i,j}$ is the (i,j)-th entry of matrix B.
Superscript "i" means in processor-time domain.
Superscript "K" means in the K-D time domain.
Superscript "$(p_0 \ldots p_i \ldots p_M)$", e.g., (2310), means a permutation.
Superscript "q" means in the $S^q$ domain.
Superscript "q'" means in the $S^{q'}$ domain under basis E (see below).
Superscript "s" means in the supernode domain.
Superscripts "u" and "d" stand for the upper bound and the lower bound, respectively.
Subscript "SUC" means simple uni-directional connected mesh.
Subscript "SBC" means simple bi-directional connected mesh.
Terms for Global usage
A is the activity matrix of LSGP.
$A_{m,N}$ is the coefficient matrix of the system of the m inequalities of the N-dimensional
polyhedron.
$A^p$ is the positive coefficient matrix expressing D (see below) under the basis E (see
below).
$A_d$ is a positive coefficient matrix expressing D; its entries are between 0 and 1.
$A_M$ is the M-D hypercube processor domain.
a is the index vector of a processor.
$a_{max}$ indicates the minimum sizes of the partitioning parallelepiped required to achieve
the canonical dependencies.
B is obtained from B", and defines the partitioning parallelepiped and is also the trans-
formation from the original space to the supernode space.
$B_d$ is the basis by which D can be expressed by $A_d$.
BB is the partitioning parallelepiped by which a computation polyhedron can be
mapped within the processor array while satisfying the canonical dependencies.
C, stands for a condition gj - dj ~ q" to define an area for data flows in a supernode,
see 9j , q" later, ~' is the dependency in sq' space.
C1iK stands for a convex hull.
c is the constant vector in the system of inequalities representing a polyhedron.
c" = [cg, ... ,C~-lV indicates the extra sizes of a local supernode memory space.
c is the determinant of T (see below), and also the whole compression factor.
D is the original dependency matrix.
$DC^{d^s}_{d^p,d^t}$, or $DC^{d^s}$, is a data flow cube with a processor dependency $d^p$, time delay
$d^t$ and supernode dependency $d^s$. Note that for brevity $d^t$ and $d^p$ are sometimes
omitted, so that DC takes the shorter form $DC^{d^s}$ where the meaning is clear.
d stands for a dependency vector.
$D^p$ is the dependency matrix in the processor domain.
$d^p$ stands for a dependency vector in the processor domain.
$D^A$ is the dependency matrix in an LSGP processor domain.
$d^A$ stands for a dependency vector in the LSGP processor domain.
$d^s_{110}$ stands for the supernode dependency vector $[1, 1, 0]^T$.
$d^t$ stands for a time delay.
E is a Positive Expressing Basis (PEB).
F is the scaling matrix to scale E to B,;
Fa is a scaling matrix to scale E such that the polyhedron can be mapped into a
processor array.
$F_d$ is a scaling matrix to scale E to $B_d$.
$f^{K-r}$ and $l^{K-r}$ are the first and the last computation nodes of polyhedron $P^{K-r}$.
G is a diagonal matrix with the elements of g as its diagonal entries.
$g = [\ldots, g_i, \ldots]^T$ indicates the sizes in each direction of a supernode.
$H_k$ is the HNF of A, with only k as its diagonal entries.
I is the identity matrix.
i stands for a computation node, i.e. an iteration in a set of nested loops.
$IB_{w_i}$ and $OB_{w_i}$ are the input and the output buffers in a processor, respectively,
carried out at the ordinal number $w_i$ (see below).
$IP_p^{w_i}$ and $OP_p^{w_i}$ are the input and the output packs of communications of a processor,
respectively, carried out at the ordinal number $w_i$ (see below) along the interconnection
primitive p (see below).
J is the antidiagonal identity matrix.
j is the index vector in processor-time domain.
$l = [l_0, \ldots, l_{M-1}]^T$ is the length vector of a processor array.
$k = [k_0, \ldots, k_{M-1}]^T$ is the LSGP compression vector. For the (N-1)-D SBC case,
$k = k_0 = \cdots = k_{M-1}$.
M is the number of dimensions of the processor array.
$m_0, \ldots, m_{N-1}$ indicate the sizes of the local supernode memory space.
m is the number of inequalities of the polyhedron.
N is the number of dimensions of the original nested loops.
$N_{links}$ is the number of communication links in a processor.
$N_{DC}$ is the number of DC's.
$n_d$ is the number of dependencies.
$n_v$ is the number of vertices of the computational polyhedron.
K = N - M is the dimension in a lower dimensional mapping.
o stands for a matrix filled with "0".
o stands for a vector filled with "0".
P is a matrix representing the interconnection primitives of the architecture.
Pp is the permuting matrix.
p is an interconnection primitive, also a column vector of P.
P is a polyhedron.
Q is the LSGP same-time matrix.
$Q_{d^p}$ is a collection of data associated with processor dependency $d^p$.
$Q_{i,j}$ is a collection of nodes associated with $d_i$ and $d_j$.
q stands for the quasi-supernode.
q' is the index vector in the space $S^{q'}$.
$q'' = [\ldots, q''_i, \ldots]^T = q' \bmod g$.
$RB_p$ stands for the relay buffer along the interconnection primitive p.
$S^{q'}$ is a space under basis E.
S is the space projection matrix of K x N.
$\mathcal{S}$ is the N-D original space.
SP is a set of polyhedra.
sP is the N-D space projection vector.
s stands for a supernode.
T is the space-time transformation matrix.
Tb is the mapping matrix from the N-D time domain to the supernode domain in
LSGP case.
T' is the lower-dimensional space-time transformation matrix.
T is a set of t.
t is a timing vector.
t is the index vector in the N-D time domain for the LSGP case.
U is the image of the vertices in processor array domain mapped by S.
US and rS are the mapping matrix and the model vector for the SNF independent
partitioning.
UH and rH are the mapping matrix and the model vector for the HSNF independent
partitioning.
UD and rD are the mapping matrix and the model vector for the DF independent
partitioning.
V denotes the vertices of the computational polyhedron.
$V^e$ denotes the vertices of the enlarged supernode polyhedron.
v stands for a vertex.
W stands for the vertices of the polyhedron under basis E.
w' indicates the location of a supernode in a processor after LSGP partitioning.
$w_i$ indicates the ordinal number of computing each w'.
$\Pi$ is the timing projection matrix.
1 is a vector filled only with "1".
L, is a sample (or choosing) vector such that its i-th element is 1, and zero otherwise.
$0_i = 0 \ldots 0$, total 2i "0"s in a line; $1_i = 1 \ldots 1$, total 2i "1"s in a line.
$[x_0, y_0) \times \cdots \times [x_{N-1}, y_{N-1})$ indicates a cubic area in the supernode of dimension N.
Contents
Acknowledgements
Declaration
Terminology
1 Introduction
1.1 Automatic Parallelisation
1.2 Structures of Processor Array
1.3 Nested Loops and Data Dependencies
1.4 Space-Time Mapping
1.5 About the Thesis
2 Survey and Analysis of Partitioning and Mapping
2.1 The Problems
2.2 Independent Partitioning
2.3 General Partitioning and Projecting For Space-Time Mapping
2.3.1 Projecting-Partitioning Methods
2.3.2 Partitioning-Projecting Methods
2.3.3 Comparison
2.4 Lower-Dimensional Mapping
2.5 Summary
3 Improvements of Some Existing Partitioning and Mapping Methods
3.1 On the Maximal Independent Partitioning
3.1.1 Introduction
3.1.2 Improved Methods of Independent Partitioning
3.1.3 Producing HSNF
3.1.4 Algorithm Generation
3.2 A Synthesis Method using LSGP Partitioning for Given-Shape Regular Arrays
3.2.1 Introduction
3.2.2 A New LSGP Synthesis Method
3.2.3 The Conditions for Valid Q and t
3.2.4 An Example
3.2.5 A Particular Area of Application
3.3 Bouncing LPGS Method
3.3.1 Introduction
3.3.2 Bouncing LPGS
3.4 Summary
4 A Methodology of Partitioning and Mapping for Given (N-1)-D Regular Arrays
4.1 A New Methodology
4.2 A Transformation for Canonical Dependencies
4.2.1 A Cone Including Dependencies
4.2.2 Forming Canonical Dependencies in Supernode Space
4.3 Selection of Space Projection and Timing Vector
4.3.1 S-T Transformation and Interconnection Primitives
4.3.2 Choosing S and t
4.3.3 Permuting the Space Projection Matrix
4.4 Further LSGP Partitioning for SBC
4.5 Scaling the Supernode Parallelepiped and Optimisation
4.5.1 SUC Cases
4.5.2 SBC Case
4.5.3 Optimising
4.5.4 Integralization of the Quasi-supernode Transformation Matrix
4.6 Examples and Discussions
4.6.1 A 3-D Example
4.6.2 More Results and Discussions
4.7 Summary
5 Optimal Mapping Onto Lower-Dimensional Regular Arrays
5.1 A Methodology for Partitioning and Mapping onto Lower Dimensional Array
5.2 The Transformation into K-D Time Domain and M-D Processor Array
5.2.1 Selecting a Family of T
5.2.2 Scaling the Supernode Parallelepiped
5.3 Maximum Local K-D Time Domain
5.3.1 Intersecting a Polyhedron with a Hyperplane
5.3.2 Maximum Local Supernode Domain
5.3.3 Mapping to K-D Time Domain
5.4 Valid Minimum Projecting Vector
5.5 General Methodology for Valid Minimum Projecting Vector
5.5.1 The First and Last Nodes of Polyhedrons
5.5.2 Deriving p
5.6 Optimisation
5.6.1 Method
5.6.2 An Example
5.7 Special Example: Partitioning and Mapping a Knapsack Problem onto a Linear Array
5.7.1 Description of Computational Structure
5.7.2 Supernode Polyhedron
5.7.3 Transformation onto a Time-Processor Domain
5.8 Summary
6 The Structure of Parallel Programs
6.1 Introduction
6.2 On Supernodes
6.2.1 Boundary of the Supernode Polyhedron
6.2.2 Vertices of the Enlarged Quasi-Supernode Polyhedron
6.2.3 Boundary of a Single Supernode
6.3 Data Flows
6.3.1 Outgoing Data of a Single Supernode
6.3.2 Outgoing Data Packets of a Processor
6.4 Algorithm Generation For A Lower Dimensional Processor Array
6.4.1 K-D Parallel Algorithms
6.4.2 K-D Time Domain to 1-D Time Domain
6.5 Algorithm Generation For LSGP Case
6.5.1 Rectangular Boundary in Virtual Processor-Time Domain
6.5.2 Algorithm Generation Involving LSGP
6.5.3 Outgoing Data after LSGP
6.5.4 Example
6.5.5 Algorithm Generation for (N-1)-D SUC Case
6.6 From Inequalities To Boundaries
6.6.1 Finding All Possible Lower and Upper Bounds
6.6.2 Deleting Redundant Bounds
6.7 Summary
7 Parallel Code Generation
7.1 Introduction
7.2 Processor Array
7.2.1 Multi-Processor Machines
7.2.2 Building an Array and Creating Communication Meshes
7.3 Supernode Storage
7.3.1 From Supernode Domain to Local Supernode Domains
7.3.2 Supernode Storage and in/out Data Storage
7.4 Data Flow and Relay
7.4.1 LSGP Case
7.4.2 Non LSGP Case and Direct Data Flows
7.5 Outline of the Parallel Code
7.5.1 Parallel Codes for Non-LSGP
7.5.2 Parallel Codes with LSGP
7.6 Summary
8 Experimental Results and Discussions
8.1 Experimental Results
8.1.1 Test of the Correctness
8.1.2 Measuring Performance
8.2 Discussions on Factors Affecting Efficiency
8.2.1 Space and Time Fitness of the Mapping
8.2.2 Data Communication
8.2.3 Single Supernode Loops
8.2.4 The Effect of Granularity
8.3 Summary
9 Conclusion
9.1 Summary Of The Whole Design Procedure
9.2 Evaluations and Comparisons
9.2.1 Theory
9.2.2 Practice
9.2.3 Results
9.3 Closing Remarks
Bibliography
A Scaling SBC
A.1 Initial wi's and w~'s
A.2 SF as a Function of Jk
A.2.1 Determining fa with a Known Set of wi and w!
A.2.2 Determining the Turning Points
A.2.3 Selecting wi and w! from Multi-Candidates
A.3 Delimit jj, According to Dependencies
B Collection of Algorithms for (N-1)-D Partitioning and Mapping
B.1 Pre-compilation Work
B.2 Compiling Work
C Collection of Algorithms for Lower-Dimensional Mapping
D Parallel Algorithm for Pure LSGP Method
E Generating Data Flows and Relays for LSGP
F The Collection of Experimental Results
G Examples of Automatically Generated Parallel Codes
G.1 Parallel Codes for Non-LSGP
G.1.1 h.file
G.1.2 Parallel Code
G.2 Parallel Codes for LSGP
G.2.1 h.file
G.2.2 Parallel Codes
List of Figures
1.1 Full Bi-directional Connected Regular Array
1.2 The Patterns of Interconnections
1.3 A Computational Graph
1.4 Chart of the SARACEN Project
2.1 Two Natures of Independent Partitioning
2.2 Partitioning of Moldovan's Method
3.1 The Endless Cylinder for Computational Domain with One Infinite Index
3.2 Demonstration of LSGP and LPGS
3.3 LPGS and Bouncing LPGS
3.4 One time hyperplane of 2-D Bouncing LPGS
4.1 The Basic Idea of Our Partitioning and Mapping Method
4.2 The Cone of Dependency Vectors
4.3 Supernode Partitioning and Canonical Dependencies
4.4 A Conceptual Chart of the Partitioning and Mapping Method
5.1 The Basic Idea of Lower dimensional Partitioning and Mapping Methods
5.2 The Intersection of Polyhedron with a Hyperplane
5.3 2-D Time Polyhedron
5.4 The layouts of a 3-D Polyhedron
5.5 Finding $f^{K-r}$ for $P^{K-r}$
5.6 Two Maximum Local 2-D Time Polyhedra. (a) is P; defined by V~ and (b) P? defined by V~
5.7 Knapsack Data dependency Graph
5.8 The m processor linear array implementing the Knapsack problem
5.9 A Conceptual Chart of Lower dimensional Partitioning and Mapping Methods
6.1 2-D quasi-supernode polyhedron and the corresponding supernode polyhedron
6.2 Dependencies in the quasi-supernode domain and in the supernode domain
6.3 LSGP Partitioning layout and dependencies
6.4 Chart of Program Transformation
7.1 Memory layout of separate supernodes
7.2 Local-supernodes memory space layout
7.3 Marking w' with ordinal number w
7.4 The Conceptual Chart of Creating IP's and OP's
7.5 LSGP dependencies and data flows
7.6 Data flows and relays
7.7 Indirect and direct transference of data
7.8 Processor Array and Parallel Codes
7.9 The flow chart of the template of a Parallel Codes
7.10 TRANSPORT and Communication channels in the case of 1-D and SBC
7.11 Chart of Parallel Code Generation
8.1 The basic cubic polyhedron and the transformed actual polyhedron
9.1 Chart of the Whole Procedure
List of Tables
4.1 4-D examples
4.2 5-D examples
9.1 Comparison of three methods in theory
Chapter 1
Introduction
1.1 Automatic Parallelisation
This thesis deals with the methodology and software of automatic parallelisation for
numerical supercomputing and supercomputers. Many scientists and engineers have tasks
which require millions or billions of floating point operations, such as computer
vision [13], ocean circulation studies [41], electromagnetic field studies [38], VLSI circuit
simulation [102], fluid and reservoir simulation [79] and various other numerical
computations.
To solve computational problems quickly, these scientists and engineers use the fastest
computing equipment they can find or afford. Over the last forty years, hardware
technology has undergone rapid transformation. For a single processor, the processing speed
has increased dramatically: it is said to have doubled every 3-4 years over the last two or
three decades. Unfortunately this increase cannot continue forever. Hardware development
is approaching the physical limits of current micro-electronic technology. (The line width
of the latest VLSI chips is about half a micrometre, which is near the wavelength, a few
thousand angstroms, of the visible light used in photo-etching. In addition, placing
wires closer together causes mutual inductance, and the length of wires cannot be altered
by miniaturisation. This is why, unless the technology changes fundamentally, further
speed-up is not possible.) Thus there is a limit beyond which computational tasks become so
heavy that no single processor can carry them out in an acceptable time. As a result,
today's supercomputers have been designed to cope with enormous computing tasks. Most
supercomputers, no matter what kind of architecture they use, typically have some kind of
array facility. Therefore, the users of such machines face the problem of exploiting the
parallelism in a particular problem and automatically generating parallel codes which can
run on a supercomputer.
A basic question is: for a scientist or engineer not specialising in parallel computing,
how easy is it to write parallel codes for computing tasks in their own area? Usually, the
task can be achieved, but only after quite a long period, maybe a few months for a
beginner, spent studying the theory of parallelism, the parallel language and the structure
of the supercomputer. Testing and debugging parallel codes is then often an extremely
difficult task, much more difficult than for the corresponding sequential codes. To some
extent, we can say that the difficulty and length of time required to write an efficient
parallel program inhibit the actual application of supercomputing.
As a result, automatic parallelisation has been of concern for a long time. An obvious
solution is to augment compilers to generate parallelism. Indeed, since the early 1970's,
many researchers have attempted to exploit the possibility
of automatically generating parallel codes from a sequential algorithm. A great amount
of work has been done and a huge number of papers have appeared in the literature. In
the early stages, people focussed on numerical problems which were oriented towards
For-loop structures and discovered some basic ideas [56] [4]. From the middle of the 1980's,
attention focussed primarily on the problem of data dependency and the description of
recurrence equations, as well as on partitioning and mapping methods [84]
[88] [87] [73]. During the 1990's, more aspects of the problem have been explored and
theories have matured to a usable state, especially in the case of Uniform Recurrence
Equations (UREs) [93] [24] [94]. We may divide the research into the following topics, for
which we list only the literature which makes significant contributions.
In the early days, Lamport's work [56] is the landmark. The concept of space-time
mapping (hyperplane transformation) is still in use today. This concept is also referred
to as a wavefront transformation [55]. The data dependency that exists in a loop nest is
also important in this area. The general concepts were introduced in the 1970's (such as
[4]). Quinton [84] gave a comprehensive description of the so-called URE problem, though
the concept of URE was suggested very early [50]. Rajopadhye [88] [87] discussed both
URE and ARE (Affine Recurrence Equation) problems and discovered that some kinds
of ARE problems can be transformed into equivalent URE problems (see the definitions of
URE and ARE later). A commonly used notation for representing general dependences
is the "direction vector", proposed in [4], [5] and [109], while [15] presented the
concept of reduced dependency. [107] gave an interesting description of the wavefront
transformation and data dependency from another viewpoint.
At the end of the 1970's, Padua [78] raised interest in independent partitioning, which
divides the computation of a loop nest into a number of independent blocks when possible.
Some researchers contributed to this problem in the 1980's, and the work of Shang and
Fortes [93] at the beginning of the 1990's made the latest fundamental progress on this
topic.
Partitioning an algorithm to fit into a fixed-size array is essential to many applications.
Moldovan and Fortes, in 1986, were early researchers to study the general partitioning
problem, and proposed [73] a well-known technique, Local Parallel and Global Sequential
(LPGS) partitioning, although research on partitioning some specific algorithms
had started earlier [43]. Since the 1980's, this problem has received considerable attention,
and many methods have been proposed. The supernode partitioning introduced in [46] is
an important idea for partitioning, while Darte's work [24] at the beginning of the 1990's
marks another fundamental contribution to the theory of partitioning. Some partitioning
methods come from more specific and limited viewpoints, e.g., [81] [91]. A common case
is that an array of lower dimension is given to carry out the computation of a set of
nested loops, which requires a so-called lower-dimensional mapping that differs
from Lamport's ordinary space-time mapping. In spite of some early methods for specific
algorithms, the first important steps toward a formal solution were made in [58] and [59]
in the late 1980's and the early 1990's. The theory was raised to a higher level in [94],
although there still remain serious problems.
From the middle of the 1980's, based on some of the theoretical achievements, people
began to try to design software packages and tools for the task of automatic parallel code
generation [74] [33]. Although these were not very successful, further progress has been
made, and the prospect of building a practical software tool and achieving real applications
now exists.
1.2 Structures of Processor Array
Before continuing, it is essential to give a brief review of the resources of parallel computing.
There are several classifications of the structures of parallel machines [28] [29]
[95]. The most widely used classification [29] is based on instruction and data streams, that is,
Single-Instruction & Single-Data (SISD), Single-Instruction & Multiple-Data (SIMD),
Multiple-Instruction & Single-Data (MISD) and Multiple-Instruction & Multiple-Data
(MIMD). Obviously, SISD is a conventional sequential machine, while examples of MISD
are difficult to find. In SIMD, multiple processors run under the control of a single
instruction stream, while in MIMD, every processor is controlled by its own instruction
stream.
In the SIMD class, a sub-classification can be given. Array Processors use a control unit
to instruct a number of independent processors; Vector Processors manipulate vectors as
operands instead of scalars; and Pipelined Processors operate on data as it flows through
a pipeline of processing elements [97]. The MIMD class can be divided further into
Multicomputer Systems and Multiprocessor Systems. The former consists of a number of
autonomous computers which may or may not communicate with each other, while the
latter is characterised by interaction between processors at the process, data set and data
element levels.
Another classification is concerned with memory, i.e., shared-memory machines and
distributed-memory machines, although strictly both fall into the Multiprocessor System
class [97]. In a shared-memory machine, all processors share a global memory from
which they can read and to which they can write data. The advantage is significant: there
is no need for data to flow between processors, since a simple write or read to memory is
sufficient. An obvious disadvantage is the bottleneck effect of controlling consistent access
to the global memory. Many expensive hardware components, such as buses and cache
memories, have been used to improve the memory bandwidth, but it is difficult to overcome
the inherent shortcoming. In contrast, the distributed-memory scheme allows each processor
to have its own local memory, forming a node of an array. This allows cheap, flexible and
expandable hardware, but also suffers from the overheads of the explicit data flows
required between processors, which make programming difficult.
The concept of a systolic array was introduced in [53]. Generally speaking, a pure
systolic array is a regular set of interconnected cells, each capable of performing some
simple fixed operation upon data which flows through the array at regular beats. The
array acts something like a multi-dimensional complex pipelined array. The principle is
to alleviate the shared memory bottleneck by bringing data from the global memory once
and re-using it many times as it is pumped around the processors with low-overhead
communication. The concept of the systolic array has been widely applied. In general,
we may term a distributed-memory processor array, each of whose processors performs
an identical operation at regular beats, a "software systolic array". The attractive point
of a systolic array is that every processor performs the same operations and that the
operations of the whole array are regular. A full description of structures and types of
parallel machines can be found in [44].
Among all the types of parallel computing machines, we will focus on the "software
systolic array". The main reasons are:
1 Its hardware is cheap and easily available, with fewer restrictions on applications.
2 Programming for it is difficult, so it needs automatic parallelisation more urgently
than more general purpose systems.
3 The same technology of automatic parallelisation for the "software systolic array"
can be applied directly to the automatic design of hardware systolic arrays.
Unless there are specific requirements, we assume that the multiprocessor array takes the
shape of a rectangular parallelepiped, termed a regular array. Each processor is a node
of the array and is physically linked with its nearest neighbours. A processor is able
to access its local memory quickly, but needs a relatively long time to open a port for data
transfer. The direction of a link from one processor to another is indicated by a
so-called interconnection primitive p, giving rise to the following definition.
Figure 1.1: Full Bi-directional Connected Regular Array
Definition 1.2.1 A mesh connected regular array is a tuple $(A_M, P)$ where $A_M$ is an M-D
rectangular parallelepiped (hypercube) of size $l_0 \times \cdots \times l_{M-1}$ (let $l = [l_0, \ldots, l_{M-1}]^T$) in
which every integral point is a processor. $A_M$ is referred to as the processor array. P is
an integral matrix of interconnection primitives whose columns indicate the directed links
from one processor to another.
For example, the processor array in Figure 1.1 is a 3 x 3 ($l = [3, 3]^T$) regular array with
interconnection primitives
P = [Po, ... , P7] = [~ =~ ~ 1 ~ 1 ~ ~ 1 ~ ~ 1 ].
For instance, considering processor 11, $p_0 = [1, 1]^T$ stands for the link from processor 11
to processor 20 and $p_7 = [-1, 0]^T$ for the link from processor 11 to processor 12.
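As an illustration of Definition 1.2.1, the following minimal Python sketch treats each column of P as an offset added to a processor index, and lists the processors reachable in one hop. The function names, the zero-based processor coordinates and the ordering of the primitives are assumptions made here for illustration; they are not taken from Figure 1.1.

```python
from itertools import product

def full_bidirectional_primitives(m):
    """All nonzero vectors in {-1, 0, 1}^m, one per directed link.

    The ordering is whatever itertools.product yields (an assumption);
    it need not match the column order p_0, ..., p_7 of the text.
    """
    return [p for p in product((-1, 0, 1), repeat=m) if any(p)]

def neighbours(a, lengths, primitives):
    """Processors reachable from processor a in one hop, inside the array."""
    result = []
    for p in primitives:
        b = tuple(x + d for x, d in zip(a, p))
        # Keep only integral points of the l_0 x ... x l_{M-1} parallelepiped.
        if all(0 <= x < l for x, l in zip(b, lengths)):
            result.append(b)
    return result

lengths = (3, 3)                      # a 3 x 3 regular array
P = full_bidirectional_primitives(2)  # 8 primitives in the 2-D case
centre = neighbours((1, 1), lengths, P)
corner = neighbours((0, 0), lengths, P)
```

Under these assumptions the centre processor of the 3 x 3 array has eight neighbours, while a corner processor has only three, since links leaving the parallelepiped are discarded.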
Two other easily available interconnection models are shown in Figure 1.2 (for example,
in a 2-D array). According to their interconnection primitives, Figure 1.2(a) and (b)
are termed the SUC and SBC models, respectively. Because they are the most available
meshes for parallel computing machines, we build our methodology based on them in the
following chapters.
Figure 1.2: The Patterns of Interconnections. (a) Simple Uni-direction Connected (SUC).
(b) Simple Bi-direction Connected (SBC).
1.3 Nested Loops and Data Dependencies
Nested loop structures are the most time-consuming part of most scientific computational
algorithms. It is hard to write a program which runs for a significantly long time without
using loops at some point. Consequently a great deal of research has focused on exploiting
the parallelism in such loop structures. The computations of N nested loops can be
considered as the body of a polyhedron in an N-D Euclidean space, usually called the
computation polyhedron. Each point, or node, in the computation polyhedron is an
iteration of the loop body, and is referred to by its iteration vector, an integral N-tuple
$i = [i_0, \ldots, i_{N-1}]^T$. In the process of computation, each node executes the operations
associated with it when its required data is available.
Iterations are not generally isolated from each other. For example, to compute one
iteration, we may need some data whose results are computed by other iterations, that
is, some nodes are dependent on others. Generally, the nested loops have the form [107]
Loops 1.3.1 General For-Loops
FOR $i_0 := \lceil \max(l_0^0, l_0^1, \dots) \rceil$ TO $\lfloor \min(u_0^0, u_0^1, \dots) \rfloor$
...................
where $l_i^j = \bar{l}_i^j i + c_i^j$ and $\bar{l}_i^j = [l_{i,0}^j, \dots, l_{i,i-1}^j, 0, \dots, 0]$, $u_i^j = \bar{u}_i^j i + c_i^j$ and $\bar{u}_i^j = [u_{i,0}^j, \dots, u_{i,i-1}^j, 0, \dots, 0]$
(apparently both $l_i^j$ and $u_i^j$ are linear functions, defined by $\bar{l}_i^j$ or $\bar{u}_i^j$, of the indices
of the outer level loops). The "max" operation means that for each loop there may be a
number of candidates for the lower bound and only the largest one is chosen from all the
possible ones as the lower bound of the loop. The "min" has the opposite meaning.
Collecting the expressions in all max( ... ) and min( ... ) in the nested loops above yields a
system of inequalities, having the form
(1.1)
for j = 0, ... , m - 1, or in matrix form
(1.2)
where m is the number of inequalities. The system of inequalities in eqn (1.2) defines
an N-D polyhedron P. A polyhedron can also be described by a finite set of vertices. If
the $\bar{l}_i^j$'s and $\bar{u}_i^j$'s are integral vectors, all vertices are integral.
Notice that $f(i) = g(f(i_0), \dots, f(i_{n_d-1}))$ is defined as a Recurrence Equation [87] over
P, where $i \in P$; $i_j \in P$ for $j = 0, \dots, n_d - 1$, and $g$ is a single-valued function of the $i_j$'s. In
general, the computational body may consist of a system of recurrence equations instead
of just one, but this makes no basic change in the concept or application of our technique.
Thus, for brevity, we restrict ourselves to the case of a single recurrence equation.
A recurrence equation is called a Uniform Recurrence Equation (URE) iff $\forall j = 0, \dots, n_d - 1$,
$i_j = i - d_j$, where the $d_j$'s are N-D constant vectors. In a URE problem, all
dependencies are straightforwardly defined by a constant distance in each dimension. Such
dependencies are expressed by a dependency measure, termed the dependency vector $d$,
pointing from one node to another. The matrix $D$ (with each column a dependency vector)
describes all dependency relations among the nodes in a given computation polyhedron.
A recurrence equation is called an Affine Recurrence Equation (ARE) iff $\forall j = 0, \dots, n_d - 1$,
$i_j = A_j i - d_j$, where the $A_j$'s are N x N constant matrices.
In ARE problems, the dependencies are implicitly expressed by linear transformations
of the variable indices. If the $A_j$'s are all identity matrices, the problem degenerates
naturally into a URE. Alternatively, if $\det(A_j) = 0$ for some dependency, we can transform
it to a URE by using pipelining methods [30]. Data is pipelined along a constant vector
which is a non-null solution of the equation $A_j x = 0$ (0 is a null vector). If neither of the
above cases applies, the dependency can be rewritten as $i_2 = i_1 + q(i_1)$, where $q(i_1)$ denotes
a transformation. But because $q(i_1)$ is not a constant vector, routing [33] must be used
to produce UREs instead, which needs additional mechanisms to control the flow of data
in the array.
If the relation from $i$ to the $i_j$'s does not fall into the URE or ARE category, it is defined
quite generally as $i_j = m_j(i)$, where $m_j$ stands for a general mapping from an N-D integral
space to another N-D integral space. Sometimes the dependency relations cannot be
expressed by vectors, but only by symbols representing dependency directions [109]; the
distances are then not well defined.
Figure 1.3: A Computational Graph
We restrict ourselves to the UREs and AREs which can be transformed into UREs
systematically. This is not only because the URE is the simplest case to study in this
field, but in fact, URE covers a wide range of numerical computational problems, from
the simple cases of matrix multiplications, AR and ARMA processes to the more complex
Knapsack Problem [71]. Indeed, many complex numerical problems are expressed as
iterative processes where the computation of an iteration relies on previous iterations
with fixed access distances. This mathematical background gives the URE a sound basis
for applications. Furthermore, research on the URE case will pave the way to
more complex situations. Some members of the Algorithm Engineering Research Group
(AERG) at Newcastle University are working on transforming more general recurrences
to UREs, as mentioned above, so that the techniques being derived for UREs can be used
directly, and on uniform mappings and routings for general ARE problems.
As regards the more complex dependency representations, which arise mainly
in logic reasoning problems, the theory of space-time scheduling is unfortunately still
far from the application stage and is not considered in our project. For UREs, we
define the computational graph as follows.
Definition 1.3.1 A Computational Graph is defined as a tuple $(V, D)$, where $V$ is a finite
set of vertices $\in Z^N$, which defines a computation polyhedron, and $D$ is the dependency
matrix. The computational graph is acyclic.
In fact, here $V$ is an $N \times n_v$ matrix, where $n_v$ is the number of vertices, and $D$ is an $N \times n_d$
matrix, where $n_d$ is the number of dependency vectors.
For example, the computational graph of the following simple For-loop program
FOR $i_0$ := 0 TO 5
FOR $i_1$ := $i_0$ TO $i_0 + 6$
$A(i_0, i_1) := 3A(i_0, i_1 - 1) + A(i_0 - 1, i_1 - 1) \times A(i_0 - 1, i_1)$ $^1$
is illustrated in Figure 1.3. From the figure, we can collect the vertices of the graph and
the dependencies as (henceforth assume that the outermost loops are listed first in a column
dependency vector)
\[ V = \begin{bmatrix} 0 & 0 & 5 & 5 \\ 0 & 6 & 5 & 11 \end{bmatrix}, \qquad D = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix} \]
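As a hedged illustration of how the graph of this small example could be enumerated, the following Python sketch lists the nodes of the loop nest and the predecessors of a node under the three dependency vectors read off the statement. The function names are hypothetical and are not part of the thesis software; the out-of-range rule follows the footnote (elements outside the polyhedron contribute no dependency).

```python
def iteration_nodes():
    """Enumerate the iteration vectors (i0, i1) of the example loop nest."""
    return [(i0, i1) for i0 in range(0, 6) for i1 in range(i0, i0 + 7)]

# Dependency vectors d = i - i_j read off the statement
# A(i0, i1) := 3A(i0, i1-1) + A(i0-1, i1-1) x A(i0-1, i1),
# listed in the column order of D used in the text.
DEPENDENCIES = [(1, 0), (0, 1), (1, 1)]

def predecessors(node, nodes):
    """Return the nodes inside the polyhedron on which `node` depends."""
    inside = set(nodes)
    candidates = [(node[0] - d0, node[1] - d1) for d0, d1 in DEPENDENCIES]
    return [p for p in candidates if p in inside]

nodes = iteration_nodes()
```

Calling `predecessors((1, 1), nodes)` returns only the two predecessors that lie inside the polyhedron, since the node (1, 0) is outside the loop bounds.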
Usually, extracting the dependencies of a URE is easy. However, it is sometimes
quite difficult to determine all the vertices of a polyhedron defined by a set of inequalities,
a classic problem in computational geometry. It is not a difficult problem in principle.
In fact, a vertex $v$ is the solution of N linearly independent equations from the system
$A_{m,N} x = c$ and satisfies the condition $A_{m,N} v \le c$ [90]. However, because $A_{m,N}$ contains
potentially $C_m^N$ sets of linearly independent equations, we have to solve the $C_m^N$ sets of
linear equations, and then test the potential vertices by $A_{m,N} v \le c$ to obtain the final
results. When $m \gg N$, this is time-consuming work. We do not consider this problem
explicitly in the thesis, although the reader is referred to [90] for the necessary mathematics.
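The brute-force procedure just described can be sketched in a few lines of Python. This is only a minimal illustration, assuming the polyhedron is written as A x <= c; the helper names `solve` and `vertices` are hypothetical, and exact rational arithmetic is used so that the feasibility test is not disturbed by rounding.

```python
from fractions import Fraction
from itertools import combinations

def solve(rows, rhs):
    """Solve a square linear system exactly; return None if it is singular."""
    n = len(rows)
    m = [[Fraction(x) for x in row] + [Fraction(b)] for row, b in zip(rows, rhs)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        scale = m[col][col]
        m[col] = [x / scale for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return tuple(row[n] for row in m)

def vertices(A, c):
    """Try every choice of N rows of A x = c and keep the feasible solutions."""
    n = len(A[0])
    found = set()
    for subset in combinations(range(len(A)), n):
        v = solve([A[k] for k in subset], [c[k] for k in subset])
        if v is None:
            continue  # the chosen rows are not linearly independent
        if all(sum(a * x for a, x in zip(row, v)) <= bound
               for row, bound in zip(A, c)):
            found.add(v)
    return sorted(found)

# Bounds 0 <= i0 <= 5 and i0 <= i1 <= i0 + 6 written as A x <= c.
A = [[-1, 0], [1, 0], [1, -1], [-1, 1]]
c = [0, 5, 0, 6]
```

The number of subsets grows as the binomial coefficient of m over N, which is why the text calls the procedure time-consuming when m is much larger than N.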
1.4 Space-Time Mapping
As early as 1974, Lamport [56] introduced the idea of a linear mapping of a computational
graph onto an (N-1)-D space (or processor array) domain and a 1-D time domain by a space-
time transformation defined by an integral matrix
\[ T = \begin{bmatrix} t \\ S \end{bmatrix} \tag{1.3} \]
where $t = [t_0, \dots, t_{N-1}]$ is the timing, or scheduling, vector, and
\[ S = \begin{bmatrix} s_{0,0} & \cdots & s_{0,N-1} \\ \vdots & & \vdots \\ s_{N-2,0} & \cdots & s_{N-2,N-1} \end{bmatrix} \tag{1.4} \]
is the space mapping, an (N-1) x N matrix. Since then this concept has been adopted
by many researchers, e.g., [84], [73], [109], [81].
$^1$ If the addressed element of the array A is outside the index polyhedron, it is taken to be zero.
In this method, $t$ is used to form a number of parallel hyperplanes in the Euclidean
space. All the nodes on one hyperplane are independent of each other and can, therefore,
be computed concurrently. $S$ is used to map each node to a processor in an (N-1)-D array.
The mapping of a polyhedron onto an (N-1)-D space by pre-multiplying by $S$ can be thought of
as a projection of the polyhedron along a space projection vector $s^p \in Z^N$ which is
perpendicular to every row vector of the space mapping, i.e., $S s^p = 0$.
It is well known that $t$ and $s^p$ must satisfy the two conditions:
\[ t d_i > 0, \ \forall i, \ 0 \le i < n_d, \quad \text{or} \quad tD > 0 \quad \text{(dependency constraint)} \tag{1.5} \]
\[ c = t s^p > 0 \quad \text{(no conflict in a processor)} \tag{1.6} \]
where $1/c$ is the computational efficiency of the mapping and the $d_i$ are the dependency vectors in
$D$. For the first condition, notice that if, for instance, $i_2 = i_1 + d_i$, we have $t i_2 = t i_1 + t d_i$.
This indicates that the execution time of $i_2$ is later than that of $i_1$ if $t d_i > 0$, which satisfies
the necessary condition for $i_2$ depending on $i_1$. For the second condition, notice that
$i_k = i_1 + k s^p$ for all $k \in Z$ will be mapped to the same processor by $S$ due to $S s^p = 0$, so
no two of them may be executed at the same time. Because $i_k$ is executed at time
$t i_k = t i_1 + k t s^p = t i_1 + kc$, all the $i_k$'s are guaranteed to be executed at different times
by eqn (1.6).
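The two conditions are easy to test mechanically. The sketch below is a minimal illustration in Python, not the thesis implementation; `check_mapping` is a hypothetical helper, and the example values of t, S and the projection vector are assumptions chosen for the dependency matrix of Section 1.3.

```python
def dot(u, v):
    """Inner product of two integer vectors."""
    return sum(a * b for a, b in zip(u, v))

def check_mapping(t, S, sp, dependencies):
    """Return c = t.sp if S.sp = 0 and conditions (1.5), (1.6) hold, else None."""
    if any(dot(row, sp) != 0 for row in S):
        return None  # sp is not perpendicular to every row of S
    if any(dot(t, d) <= 0 for d in dependencies):
        return None  # dependency constraint (1.5) violated
    c = dot(t, sp)
    if c <= 0:
        return None  # conflict condition (1.6) violated
    return c

# Assumed example: the columns of D from Section 1.3, a schedule t = [1, 1]
# and a space mapping S = [0 1] with projection vector sp = [1, 0].
deps = [(1, 0), (0, 1), (1, 1)]
c = check_mapping([1, 1], [[0, 1]], [1, 0], deps)
```

For this choice c equals 1, so the efficiency 1/c is 1; a schedule such as t = [1, -1] is rejected because it gives t d = -1 for the dependency (0, 1).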
More generally, we can use
(1.7)
to map the N-D polyhedron onto an M-D space domain and a K-D time domain [104].
Here, $S$ is an $M \times N$ matrix and $\Pi$ is a $K \times N$ timing projection matrix ($K = N - M$).
It should be pointed out that there is a constraint on the dependency vectors. In a
sequential loop nest, the nodes are executed in lexicographical order; thus dependencies
extracted from such a loop nest must be "lexicographically positive" [107].
Definition 1.4.1 A vector $d$ is lexicographically positive, written $d \succ 0$, if $\exists i$: ($d_i > 0$
and $\forall j < i$: $d_j \ge 0$), where $d_i$ is the i-th component of $d$.
That is to say there is at least one positive component before any negative component of
the vector. There must also be at least one positive component.
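A small Python sketch of this test follows. It relies on the observation that the definition is equivalent to requiring the first non-zero component to be positive; the function name is hypothetical.

```python
def is_lex_positive(d):
    """True if the first non-zero component of d is positive."""
    for component in d:
        if component > 0:
            return True
        if component < 0:
            return False
    return False  # the null vector is not lexicographically positive
```

For example, `is_lex_positive([0, 1, -3])` holds, while `[0, -1, 5]` and the null vector are rejected.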
1.5 About the Thesis
The research project Systolic And Regular Array Computation Environments (SARACEN),
a high-level CAD tool for parallel code generation, began at the University of Newcastle in
1990, see Figure 1.4. The main procedure is summarised as:
Step 1 Semantic analysis of a high-level description of a computational algorithm and
extraction of the computational graph.
Step 2 Partitioning and mapping the computational graph onto a given regular array.
Step 3 Determination of the structure, boundaries and data flows of the parallel program,
and generation of parallel codes.
Work on the SARACEN system is divided into self-contained themes, allowing work to
progress in each area simultaneously. Data is exchanged between sections of the software
via a common intermediate format. Consequently, in this thesis we can assume that our
input will be a system of uniform recurrence equations. The problem of transforming
non-uniform systems of equations to uniform ones is not addressed in this work. Indeed,
it is nontrivial and is the subject of a joint project with Manchester University (EPSRC
REFLEX); the interested reader is referred to the thesis [89] by L. Rapanotti, which
discusses transformational issues in detail (Step 1 above).
This text is based on my own work as part of AERG which works on the area of
automatic parallelization. In particular, and for a limited class of recurrences, this thesis
proposes a theory of mapping algorithms to parallel architectures and develops software
which automatically generates parallel programs.
Step 2 of the project is concerned with the major theoretical work of this thesis.
Although the concept of space-time mapping was introduced very early, it is still quite
an open topic especially in the area of mapping onto a given regular processor array,
particularly where constraints on connections are concerned. Traditionally the algorithm
mapping is performed with the aid of a design tool (e.g., a visualization system).
Figure 1.4: Chart of the SARACEN Project. The chart shows the stages High Level Description, Sequential Algorithm, Extracting Computational Graph, Transforming Non-URE Dependency into URE, Partitioning and Mapping, and Building Structure, Boundaries, Data Flows of Parallel Program, with a region marked "The work in this thesis".
The rest of the thesis is organised as follows. Chapter 2, Chapter 3, Chapter 4 and
Chapter 5 consider the mapping problem. Chapter 2 gives a survey and analysis of the
partitioning and mapping problem; Chapter 3 presents improvements to some existing
partitioning techniques; Chapter 4 proposes our basic methodology of partitioning
and mapping onto fixed-sized (N-1)-D processor arrays; Chapter 5 deals with the
lower-dimensional mapping problem, i.e., mapping onto an M-D array where M < N-1. Step
3 is quite a new research area since, so far, few researchers have managed to convert theory
into practice effectively and some aspects have hardly been addressed at all. Chapter
6 considers the problems of determining the structure, boundaries and data flows of the
parallel program, given the theory of the previous chapters. In Chapter 7, we present
automatically generated parallel codes which can run correctly and efficiently on given arrays.
The experimental results and some discussion can be found in Chapter 8, while Chapter
9 provides a summary, evaluation, and comparison for our methodology. Finally, some
samples of the automatically generated parallel codes are presented in Appendix G. Some
of the algorithms are also collected in other appendices.
Chapter 2
Survey and Analysis of Partitioning and Mapping
In this chapter, a survey is made of the current methods of partitioning and mapping
for parallelisation. A number of algorithms of independent partitioning are analysed
and evaluated. For the general partitioning and mapping problem, a classification is
proposed. Under the classification, analysis is made to exploit the advantages of two
classes of partitioning and mapping methods, as well as to expose their weaknesses. Such
criteria are helpful in developing better methodologies for partitioning and mapping. A
brief survey and analysis is also given for the lower-dimensional mapping problem.
2.1 The Problems
To speed up program computation, it is hoped to compute all nodes in the computation
polyhedron simultaneously. Unfortunately such a requirement is generally impossible,
because firstly the results of some nodes must be calculated before others, and secondly, we
usually do not have enough processors to match the number of nodes in the computation
polyhedron. As we will see later, in some situations the computation
polyhedron can be partitioned into a number of independent subsets, where no data
transference is needed between the subsets. Consequently the computations associated with
the subsets can be assigned to individual processors, and carried out in parallel.
In many cases, such an independent partitioning does not arise naturally, and the well-
known technique of space-time mapping is applied to expose and exploit the parallelism
in the computation graph. By means of both the time mapping and the space mapping,
each node in the polyhedron is assigned to a distinct processor at a distinct moment;
that is, the mapping is a linear injective transformation from the original computational
space onto a processor-time space. However, such a transformation only works well for
applications involving regular computations and where the number of processors is
dependent on the loop bounds of the original problem.
There are three problems to be concerned with: (1) whether the transformed polyhedron
in the processor-time space can be fitted into an actual processor array of given size;
(2) whether the resulting dependencies can be implemented by the given interconnection
primitives of the actual array; (3) whether the data transference of any communication
is carried out efficiently. A partitioning procedure is necessary to solve these three
problems. A naive and simple approach is to group a number of nodes together and assign
them to one processor. The volume of the transformed polyhedron is then obviously reduced
to fit the size of the physical array, and much of the inter-processor data transference is
removed, as are the accompanying communication overheads. However, a good partitioning
strategy should also help in mapping data dependencies efficiently onto
a given interconnection pattern.
Finally, in addition to the three problems above, there is one key point left: whether
the N-D computational task can be carried out with a given lower-dimensional processor
array. In the well-known space-time scheduling concept, the time domain is a 1-D space,
leaving (N-1) dimensions to the space domain. In many cases, especially when N is large,
the available processor array has only M dimensions, where M is less than N-1. Squashing
the polyhedron into the correct number of dimensions poses serious difficulties that we
address later.
2.2 Independent Partitioning
The best and simplest way of parallelizing nested loops is to partition them into several
independent partitions, each of which can be completed in one processor. In this way,
apart from the task loading, there is no need for inter-processor communications.
Independent partitioning is possible under two circumstances. Firstly, if r = rank(D)
is less than N, then, because r is the number of linearly independent vectors in D, there
are not enough linearly independent dependency vectors in the available directions to
connect all nodes in the space together as a whole; the space spanned by these
vectors is therefore only a subspace of the original one. Each such subspace can be grouped into
an independent partition. Secondly, if the distances associated with the dependency
vectors are so "long" that they jump over the neighbouring nodes to some node far away,
nodes have no dependent relationship with their neighbours. Therefore we can further
make jumping partitionings, each of which consists only of the nodes connected by "long"
dependency vectors.
Figure 2.1: Two Natures of Independent Partitioning (rows of nodes numbered 0 to 9 along the $i_1$ axis)
Consider Figure 2.1. First, because rank(D) = 1, we can see that there is no
dependency in the $i_0$ dimension, therefore each row of the polyhedron can be taken as
an independent partition. Secondly, it can be seen that in each row the nodes 0, 3,
6, 9 have a dependent relationship but are independent of the other nodes, thus they can be
further assigned to a group. The same holds for the nodes 1, 4, 7 and the
nodes 2, 5, 8. Thus we have 3 independent partitions in each row.
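The grouping within one row can be reproduced by a short Python sketch. It assumes, as the node numbers in the example suggest, a single dependency of distance 3 along the row; the function name is hypothetical.

```python
def jumping_partitions(row_length, distance):
    """Group node indices of one row by residue modulo the dependency distance."""
    groups = {}
    for node in range(row_length):
        groups.setdefault(node % distance, []).append(node)
    return [groups[r] for r in sorted(groups)]

row_groups = jumping_partitions(10, 3)
```

Nodes in different groups are never linked by the distance-3 dependency, so each group can be computed independently of the others.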
Several methods for independent partitioning have been established. The method
of coarsest granularity, proposed in [107], is available only for the first case, i.e., r =
rank(D) < N. First we find N-r vectors $v$ such that $vD = 0$, then extend them to a unimodular
matrix $T$ with the $v$'s as its first N-r rows, see [107]. $T$ is the transformation matrix. It is
proved in [107] that the entries in the first N-r rows of the transformed dependency matrix,
$TD$, are zero. This means that there is no dependency in the transformed N-r outermost
loops, so they can be carried out as DOALL loops. For instance, consider a given set of nested
loops with one statement A(i,j) = A(i - 2,j + 3) - A(i - 4,j + 6), and dependency
matrix $D = \begin{bmatrix} 2 & 4 \\ -3 & -6 \end{bmatrix}$, where $\det(D) = 0$. We can define the unimodular matrix
$T = \begin{bmatrix} 3 & 2 \\ -1 & -1 \end{bmatrix}$ and let $i' = Ti$; then $D' = TD = \begin{bmatrix} 0 & 0 \\ 1 & 2 \end{bmatrix}$. The original statement
can be transformed to A(i',j') = A(i',j'-1) - A(i',j'-2), where there is no dependency with
respect to the first index, allowing it to be carried out with DOALL.
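The arithmetic of this example can be checked with a few lines of Python. The sketch below is only a verification aid with hypothetical helper names; the columns of D are the distance vectors [2, -3] and [4, -6] of the two terms of the statement, in that order.

```python
def matmul(A, B):
    """Product of two integer matrices given as lists of rows."""
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

def det2(M):
    """Determinant of a 2 x 2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

D = [[2, 4], [-3, -6]]      # dependency vectors as columns
T = [[3, 2], [-1, -1]]      # first row v = [3, 2] satisfies vD = 0
TD = matmul(T, D)           # transformed dependency matrix
```

The zero first row of `TD` is what allows the outer transformed loop to run as a DOALL loop, and `det2(T)` being -1 confirms that T is unimodular.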
The Greatest Common Divisor method [78] employs the greatest common divisor
(GCD) of every row of $D$ to build a diagonal matrix as the partitioning matrix $D'$. In
many cases this method does not achieve the maximal partitioning (but we should not
be over-critical as it is now quite an old method). For example, for a given set of nested
loops with one statement A(i,j) = A(i-3,j-8) - A(i-5,j-18), and dependency matrix $D$, the
corresponding partitioning matrix $D'$ is
\[ D = \begin{bmatrix} 3 & 5 \\ 8 & 18 \end{bmatrix} \longrightarrow D' = \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix} \]
indicating that only two independent sets can be partitioned with $D'$. In fact, notice that
there is a "1" in the 0-th row of $D'$; this means that independent partitioning cannot
be achieved along the 0-th dimension. But since the only non-zero element in the 1st row
of $D'$ is "2", we can divide the whole space into two independent sets: $S_0 = \{i : i_0 \in Z$
and $i_1 \bmod 2 = 0\}$ and $S_1 = \{i : i_0 \in Z$ and $i_1 \bmod 2 = 1\}$. It can be checked that $S_0$
and $S_1$ are self-contained with respect to any additions of the columns of $D$. This means
that $S_0$ and $S_1$ are independent of each other.
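A minimal Python sketch of the row-GCD construction and the resulting labelling is given below. The function names are hypothetical, and the treatment of an all-zero row (every index value forms its own partition) is an assumption added for completeness.

```python
from functools import reduce
from math import gcd

def gcd_diagonal(D):
    """Diagonal of the partitioning matrix: the GCD of each row of D."""
    return [reduce(gcd, (abs(x) for x in row)) for row in D]

def gcd_label(node, diagonal):
    """Partition label of a node: each index reduced modulo its row GCD."""
    return tuple(i % g if g > 0 else i for i, g in zip(node, diagonal))

D = [[3, 5], [8, 18]]
diagonal = gcd_diagonal(D)
```

Adding any column of D to a node leaves `gcd_label` unchanged, which is the self-containment property stated in the text.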
The Minimum Distance method introduced in [81] came from the simple observation that
if there is a dependent relationship between two nodes $i_1$ and $i_2$ there must be an integral
solution $a$ of the equation $i_1 = i_2 + Da$. However, it is usually quite difficult to find
an integral solution of such an equation, so $D$ is transformed to an upper triangular
matrix $D'$ by means of a linear mapping $K$ which is an integer matrix, that is, $D' = DK$.
The independent partitioning begins by finding a set of initial points $i_{k0}$, where
$k = 0, \dots, \det(D') - 1$. Then, for a node $i$, if there is an integral solution $a$ of the equation
$i = i_{k0} + D'a$, the node $i$ is grouped into partition $k$. It was claimed that this method
can achieve the maximal independent partitioning. Unfortunately the disadvantages of
this method cannot be ignored. It is not very clear how to determine the set of initial
points. Furthermore, finding an integral solution of $i = i_{k0} + D'a$ for every node is still
a very costly procedure, even if $D'$ is a triangular matrix.
In [93] the Partitioning Vector Method (PVM) is proposed. Unfortunately, this method
cannot guarantee the maximal partitioning. In the same paper, the Smith Normal Form
(SNF) method was also derived. In the SNF method the maximal partitioning is guaranteed.
Any matrix can be transformed to its unique SNF, that is, $D = U^{-1} D_d V^{-1}$, where $U$
and $V$ are unimodular matrices, and $D_d$ is a diagonal matrix $\mathrm{diag}(d_0, d_1, \dots, d_{r-1}, 0, \dots)$
such that $d_0 \mid d_1 \mid \dots \mid d_{r-1}$. Proceeding similarly to PVM, first make the N-r outermost
loops DOALL loops. Second, partition the inner blocks with r pairs of $p_i$ and $a_i$, where
$p_i$ is the i-th row vector of $U$, and $a_i$ is $d_i$. That is, group the nodes $i_1$ and $i_2$ ($i_1$ and
$i_2 \in J$) together if $\forall i$, $p_i \times i_1 \bmod a_i = p_i \times i_2 \bmod a_i$. See the reference for a proof. More
intuitively, we point out that because this method makes the partitioning in r different
directions (the maximum number), there is no further partitioning possible. Consequently
very little room is left for improvement.
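To make the labelling rule concrete, the following Python sketch applies it to the matrix D = [3 5; 8 18] used for the GCD method. It assumes the convention D_d = U D V with U and V unimodular; the particular U and V below were worked out by hand for this illustration and are not taken from [93], and the helper names are hypothetical.

```python
def matmul(A, B):
    """Product of two integer matrices given as lists of rows."""
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

D = [[3, 5], [8, 18]]
U = [[1, 0], [2, 1]]        # unimodular row operations (assumed, hand-derived)
V = [[2, -5], [-1, 3]]      # unimodular column operations (assumed, hand-derived)
Dd = matmul(matmul(U, D), V)

def snf_label(node, U, diagonal):
    """Label a node by p_i . node mod d_i for every diagonal entry d_i > 1."""
    return tuple(sum(p * x for p, x in zip(row, node)) % d
                 for row, d in zip(U, diagonal) if d > 1)

diagonal = [Dd[k][k] for k in range(len(Dd))]
labels = {snf_label((a, b), U, diagonal) for a in range(14) for b in range(14)}
```

Here `Dd` is diag(1, 14), so the rule yields 14 independent sets for this D, compared with the 2 sets found by the row-GCD construction.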
However, the SNF is an overly conservative condition for the maximal partitioning
problem, because the condition $d_0 \mid d_1 \mid \dots \mid d_{r-1}$ is unnecessary. In fact, a diagonalising
matrix is sufficient, at least in principle. In addition, it is inconvenient to collect nodes
one by one using $p \times i \pmod{a}$, because a computation for each node in the whole
polyhedron is required. Also the sequential relations among the nodes grouped into a
block may become unclear. A method is required to perform the transformation of the
nested loops such that all parallelism exploited can be expressed by a loop nest consisting
of a number of DOALL loops.
2.3 General Partitioning and Projecting For Space-Time Mapping
If the computation polyhedron cannot be divided into several independent blocks,
or the number of such blocks is much less than the number of available processors, a space-time
mapping has to be used to exploit parallelism in the computation. However, partitioning
still plays a key role.
Usually it is hoped that, by use of partitioning, the whole polyhedron can be allocated
within a physical array, and the inter-processor communications can be reduced. The
time consumed by communication consists of two parts. The first part is the time for
actual data flows, which is reduced by changing a remote communication to a local one
inside one partition allocated to one processor. The second part is the time for overhead
operations to prepare a communication, such as setting the parameters of the data flows at
the interface of a link. In some current processors (such as the transputer [45]), this
overhead is significant. By partitioning, the data output from a processor can be sent out
in one lump, so only one overhead operation is needed for the data transference.
Unfortunately, so far, there is no clear classification of the methods of
general partitioning. Partitioning is closely related to projection, where "projection"
means mapping an N-D space onto an M-D space along a projection direction (M < N).
We avoid using the term "mapping" deliberately, because it is often used to express the
allocation of nodes onto a processor array. Sometimes, projecting is equivalent to mapping
if the M-D space is the processor space. However, the relation between partitioning and
projecting can be taken as a criterion for classification. One category is projecting ahead
of partitioning, such as LPGS (Locally Parallel Globally Sequential) and LSGP (Locally
Sequential Globally Parallel). The second is the reverse, partitioning ahead of projecting,
for example, tiling [108].
2.3.1 Projecting-Partitioning Methods
The first class, the projecting-partitioning methods, takes mainly two forms: LPGS and
LSGP, so we discuss them in detail.
LPGS Method
The method proposed in [73] is the typical form of LPGS. Here the objective is to perform
partitioning and mapping so that the computation polyhedron can be allocated into a
fixed-sized array. For simplicity, we take a 3-D computation polyhedron (i.e. a nest of 3
loops) as an example henceforth. In this method,
\[ T = \begin{bmatrix} t \\ S_0 \\ S_1 \end{bmatrix} \]
is a transformation matrix, where $t$ is the timing vector, and $S_0$ and $S_1$ are space mapping
functions. For any node $i \in J$, if
\[ n_0 = S_0 \times i \pmod{N_0}, \qquad n_1 = S_1 \times i \pmod{N_1}, \]
node $i$ will be computed at processor $(n_0, n_1)$ at time $t \times i$. Obviously $N_0$ and $N_1$ can be
taken as the sizes of the given regular processor array. However, by this kind of projecting
and partitioning, a single time hyperplane can have more than one node mapped to
the same processor. In other words, conflicts take place and nodes compete for the
right to evaluate themselves! To avoid this problem, each time hyperplane is broken up
into a number of segments. The computation polyhedron is divided into a number of
bands which cross a time hyperplane to form the segments such that the nodes in one
processor belong to different segments of different time hyperplanes. On execution of the
computation, we complete each segment, one by one (hence the name globally sequential),
on one time hyperplane, then do the same for the next hyperplane, so conflicts are avoided,
see Figure 2.2(a).
(a) Bands Parallel to One Dependency Vector. (b) Counter Data Flow between Bands.
Figure 2.2: Partitioning of Moldovan's Method. $p_1$ and $p_2$ are the hyperplanes, $b_1$ and $b_2$
are the bands, $s_{11}, \dots, s_{22}$ are the segments formed by the hyperplanes and the bands. $d_1$
and $d_2$ are dependency vectors.
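The conflicts mentioned above are easy to exhibit numerically. The following Python sketch is a hedged illustration only: the timing vector, the space mappings, the 4 x 4 x 4 polyhedron and the 2 x 2 array are assumed values, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

def dot(u, v):
    """Inner product of two integer vectors."""
    return sum(a * b for a, b in zip(u, v))

def lpgs_map(node, t, S0, S1, N0, N1):
    """Return ((n0, n1), time) for a node under the modular LPGS mapping."""
    return (dot(S0, node) % N0, dot(S1, node) % N1), dot(t, node)

def conflicts(nodes, t, S0, S1, N0, N1):
    """Collect processor/time slots that receive more than one node."""
    slots = defaultdict(list)
    for node in nodes:
        slots[lpgs_map(node, t, S0, S1, N0, N1)].append(node)
    return {slot: hit for slot, hit in slots.items() if len(hit) > 1}

# Assumed example: a 4 x 4 x 4 polyhedron with t = [1, 1, 1],
# S0 = [1, 0, 0] and S1 = [0, 1, 0].
cube = list(product(range(4), repeat=3))
clash = conflicts(cube, (1, 1, 1), (1, 0, 0), (0, 1, 0), 2, 2)
no_clash = conflicts(cube, (1, 1, 1), (1, 0, 0), (0, 1, 0), 4, 4)
```

On the 2 x 2 array, nodes such as (0, 0, 2) and (2, 0, 0) fall on processor (0, 0) at time 2, which is the situation that the segments are introduced to resolve; on a 4 x 4 array the same mapping has no conflicts.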
An important contribution of [73] was a criterion for selecting the time vector and the
space mapping function according to the dependencies and the available physical links $P$.
The criterion can be summarised by the equation
\[ SD = PK \tag{2.1} \]
where $D$ is the dependency matrix and $K$ is an integral matrix such that $\forall k_{ji} \in K$, $k_{ji} \ge 0$ and
$\sum_j k_{ji} \le t \times d_i$. Notice that $SD$ represents the projected dependencies, which must be
implemented by a positive combination of the column vectors of $P$, and that the time for
implementing a projected dependency $S \times d_i$ by such a combination must be within the
time when it is needed, that is $t \times d_i$, $d_i \in D$.
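A candidate K can be checked against this criterion mechanically. The Python sketch below is a minimal illustration with hypothetical names; the example uses the 2-D dependency matrix of Section 1.3 mapped onto a 1-D array whose only link is assumed to be the unit step P = [1].

```python
def matmul(A, B):
    """Product of two integer matrices given as lists of rows."""
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

def satisfies_link_criterion(S, D, P, K, t):
    """Check SD = PK, K >= 0 and column sums of K bounded by t.d_i."""
    if matmul(S, D) != matmul(P, K):
        return False
    if any(k < 0 for row in K for k in row):
        return False
    for i in range(len(D[0])):
        hops = sum(K[j][i] for j in range(len(K)))
        available = sum(t[n] * D[n][i] for n in range(len(t)))
        if hops > available:
            return False
    return True

D = [[1, 0, 1], [0, 1, 1]]
t = [1, 1]
ok = satisfies_link_criterion([[0, 1]], D, [[1]], [[0, 1, 1]], t)
bad = satisfies_link_criterion([[0, -1]], D, [[1]], [[0, -1, -1]], t)
```

The second call fails because the projected dependencies are negative and cannot be built from a non-negative combination of the single positive link.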
The method has two advantages. Firstly it makes great use of the processor resource,
because on one time hyperplane, every processor is overloaded, and hence busy most of
the time. Secondly, on one time hyperplane there is no need for communications. So only
after a processor completes all of its assigned nodes on one hyperplane, does it need to
output some intermediate results to its neighbours. As a result, the output data can be
sent out as one large package.
However, the method also suffers from a number of disadvantages. Because the com-
putational polyhedron is divided into bands and these bands are mapped onto a processor
array in a way that ensures the left-hand sides of all the bands are mapped to one side
of the array while the right-hand sides to the other side, the problem of data transfer-
ence from the right-hand side of a band to the left-hand side of the band on its right
arises. Consequently, a set of First-In-First-Out buffers and long-distance links from one
side to the other side are needed along each edge of the processor array. Obviously, such
additional hardware is not always available and restricts the applications of the method.
In addition, when the given regular processor array offers only simple interconnection
primitives $P$ which have no negative entries, we may face serious difficulties in
choosing $S$. Considering eqn (2.1), because there are no negative elements in $PK$, we have
to look for an $S$ such that all the elements of $SD$ are non-negative for an arbitrary $D$
which may contain negative entries. Since the search takes place in an N x (N-1)-D
integral space, it is a computationally expensive procedure. Furthermore, even if
such an $S$ is obtained, the timing vector $t$ selected according to $SD = PK$ cannot guarantee
$\det(T) = c = 1$. Usually, we may have $c > 1$, even $c \gg 1$, i.e., very poor efficiency,
because the efficiency is less than or equal to $1/c$. In fact, only one of $c$ processors is active at
a moment, leaving the remaining $c - 1$ processors idle. These idle times correspond to
"invalid holes" in the processor-time domain. Trivially, one may suggest looking directly
for a $T$ under the conditions $SD = PK$ and $\det(T) = 1$. Then, the search space is enlarged
to N x N dimensions. There is no way to show the existence of such a solution for general
situations. Therefore, there is no guarantee for the performance of the LPGS method.
Finally we remark that in [25] this method was said to be limited to the case when there
is no counter data flow between bands. This limitation may not be true for Moldovan's
method itself: because it is the segments, not the bands, which are computed sequentially,
there is no restriction on data flow between the bands. It can be seen from Figure 2.2(b)
that there is counter data flow, because $d_1$ points from $b_1$ to $b_2$, and $d_2$ points
inversely. However, since the computations of the segments are carried out in the order $s_{11}$, $s_{12}$, $s_{21}$
and $s_{22}$, there is no conflict with respect to dependency relations. In fact, $s_{11}$ and $s_{12}$ are
on the same timing hyperplane $p_1$, so there is no dependency relation between them; and
the same holds for $s_{21}$ and $s_{22}$. Also, when evaluating $s_{21}$ and $s_{22}$, the data needed by them
are ready because $p_2$ is performed after $p_1$. However, if the condition of no counter data
flows between bands is imposed on Moldovan's method, every dependency vector must
be parallel to the bands. For this purpose, $d_2$ must not exist, and the bands have to be
modified so that they are parallel to $d_1$, as in Figure 2.2(a). The positive result of this is
that there is no longer any need for additional hardware. Therefore, only in this sense is the
restriction of no counter data flows sensible. However, this case exists only if $\det(D) = 0$,
which means that the partitioning degenerates to an independent one.
Other methods also exist, in particular, [76], [40] and [48] can be considered as the
extensions of the above method.
LSGP Method
Darte's method [24] is a good example of the LSGP subclass, and marks an important
theoretical advance. Two important concepts introduced in this method are a "same-
time" matrix $Q$ and an activity matrix $A$. $Q$, which is perpendicular to $t$, spans an (N-1)-D
subspace where all nodes are executed simultaneously, while $A = SQ$ spans a subspace,
in the cell space, where all cells are active simultaneously.
The method can be presented as a procedure of partitioning and mapping onto fixed-
size arrays. Since the LSGP approach will be used later in our method, we briefly describe
it as follows:
Step 1 Given $s^p$ and $c$, where $s^p$ is perpendicular to $S$, and $c$ is an overall compression
factor (from computational polyhedron to processor array).
Step 2 Determine $t$ by $c = t s^p$.
Step 3 Derive $S$ from $s^p$, and $Q$ from $t$.
Step 4 Compute the Hermite Normal Form (HNF) of $A = SQ$ and form a partitioning box
of size $k_0, \dots, k_{N-2}$, where $k_0, \dots, k_{N-2}$ are the diagonal elements of the HNF, and
$c = \prod_i k_i$. The quantities $k_i$ indicate the compression ratio in the i-th dimension. We
refer to $k = [k_0, \dots, k_{N-2}]^T$ as the compression vector indicating the compression
factors in each dimension.
Step 5 Project the computation polyhedron as a virtual array by $S$ along the projecting vector
$s^p$.
Step 6 Draw boxes on the virtual array with edges parallel to the axes and of size $k_0 \times \dots \times k_{N-2}$.
The nodes in one box are active at different times. Hence, they can be
clustered and mapped into one processor, so that no conflicts occur.
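The process is illustrated as follows:
\[
c \;\Rightarrow\; \left\{ \begin{array}{l} s^p \rightarrow S \\ t \rightarrow Q \end{array} \right\}
\;\Rightarrow\; A = SQ \;\xrightarrow{\ \mathrm{HNF}\ }\; k_0, \ldots, k_{N-2}
\]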
This diagram shows simply that the compression factor c is decomposed into $s^p$ and t,
which generate S and Q, respectively; A = SQ is transformed to its HNF, whose diagonal
entries determine the size of the conflict-free partitioning box.
Darte's method is very clever, but it still has some disadvantages. The first difficulty
is with the selection of S such that the computation polyhedron can be mapped onto
an actual array which is not only of fixed size overall but of fixed size in each dimension.
For instance, suppose we have a processor array of size $N_0 \times N_1$. We may be able to find
an S such that the polyhedron can be projected as a virtual array of size
$(n_0 N_0) \times (n_1 N_1)$, so the compression factor c should be $n_0 n_1$. However, the activity
matrix A resulting from S and Q can be transformed to its HNF with either 1 and $n_0 n_1$, or
$n_0$ and $n_1$, as the diagonal elements. If $n_0$ and $n_1$ are the diagonal elements, partitioning
with a parallelepiped of size $n_0 \times n_1$ can map the virtual array onto the actual array. If
this is not possible, partitioning with a parallelepiped of size $1 \times (n_0 n_1)$ fails to give a
suitable mapping, i.e., compression is too strong in one dimension but too weak in the
other. We can check that modification of Q has no effect on the HNF of A, but that
modifying S does. However, if S has to be changed to obtain a different HNF while keeping c
unchanged, the virtual array itself is changed. As a result, the previous compression
factor c is no longer valid, so everything has to be re-evaluated. Essentially the technique
is a high-dimensional search procedure in which multiple target functions are indirectly
modified to fit their N-1 targets, which is a difficult job.
Furthermore, the LSGP method does not consider the problem of predefined data
communications. Because both S and t are determined according to the requirement of
mapping onto a given regular array, no flexibility is left for satisfying SD = PK. For
arbitrary dependencies and simple interconnection primitives, there is no guarantee that
a pair of S and t satisfying so many conditions exists.
The final shortcoming of the method lies in the fact that the timing layers (hyperplanes)
are too thin and no further segmenting can be made on a layer, so that data produced in
one layer may be needed immediately by nodes on the next layer and in another box.
Therefore, there is no time left to collect the data belonging to a number of sequential
nodes in one processor into a single communication package. This is similar to the case of
no partitioning at all.
In [111] almost the same result as Darte's is obtained with a different aim and within a
different mathematical framework. The authors began with the efficiency problem of a
space-time mapping. As before, let the transformation matrix be $T = [t^T, S^T]^T$. It was
found that if det(T) = c' > 1, then on projecting a polyhedron as a virtual array only one
in every c' cells of the virtual array is active at any moment (one time hyperplane), and
the other c'-1 are idle, which is a great waste of processor resources. To improve
efficiency, c' cells can be clustered into one processor so that the processor is kept busy
all the time. This is a well-known fact.
However, the contribution is a clustering strategy such that there is no conflict within a
clustered block. It was found that no more than one cell is active at any moment if the
partitioning is made by a parallelepiped with edges whose lengths are factors of c'. In
fact, this c' is exactly the compression factor c in Darte's method. Thus the method is
essentially the same as Darte's, and shares the same advantages as well as disadvantages.
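The idle-cell effect and its removal by clustering can be checked numerically on a small case. The following is a minimal sketch under stated assumptions: the timing vector t = [1, 2], the space row S = [1, 0] and the 8 x 8 index domain are hypothetical choices made for illustration, not values taken from [111] or [24].

```python
from collections import defaultdict

# Hypothetical 2-D example (not from the thesis): timing vector t, space row S.
t = (1, 2)
S = (1, 0)
c = abs(t[0] * S[1] - t[1] * S[0])      # |det T| for T = [t; S]

n = 8                                    # assumed square index domain 0 <= i, j < n
nodes = [(i, j) for i in range(n) for j in range(n)]


def time_of(node):
    return t[0] * node[0] + t[1] * node[1]


def cell_of(node):
    return S[0] * node[0] + S[1] * node[1]


# Busy time steps of every virtual cell: each cell works only every c-th step.
busy = defaultdict(set)
for node in nodes:
    busy[cell_of(node)].add(time_of(node))
steps_between = {b - a for times in busy.values()
                 for a, b in zip(sorted(times), sorted(times)[1:])}

# Cluster c adjacent cells into one processor and count nodes per (processor, time).
load = defaultdict(int)
for node in nodes:
    load[(cell_of(node) // c, time_of(node))] += 1
max_load = max(load.values())

print("c' =", c, "gap between busy steps:", steps_between, "max load:", max_load)
```

In this toy case every virtual cell is busy only on every second step, and grouping two neighbouring cells gives a processor that never has to execute two nodes at the same time.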
In these two methods, LSGP and LPGS, projecting is done before partitioning. That
is, the polyhedron is first mapped onto an (N-1)-D hyperplane, and then the virtual array,
or projected nodes, is partitioned into blocks to be allocated to processors. The main
difference between their partitionings lies in the fact that the LPGS method collects the
projected nodes in a non-continuous way, "jumping" with a constant step, while in Darte's
approach the virtual array is partitioned in a continuous piecewise manner.
Other similar methods
A method was proposed in [96] which is similar to the ones above in the sense that it
partitions the projected nodes on a hyperplane. First, the timing vector is also taken as
the projection vector s. The computation polyhedron is projected onto a hyperplane
perpendicular to t, and so are the dependencies. Among the projected dependencies, the
shortest one is selected as a grouping vector g and another (for the case of a 3-D
polyhedron) as an auxiliary grouping vector a. Let k be an integer such that k x g is the
shortest integral vector. In partitioning, k projected nodes along g are grouped into one
block, and the first projected node in the block, say $i_s$, is taken as a "seed" to determine
the seeds of other blocks. That is, $l \times k \times g + i_s$ and $m \times a + i_s$ (l and m are integers)
are seeds for other blocks.
These blocks are clustered again, and are allocated to a hypercube structure by dividing
one block into two halves and proceeding recursively.
The main advantage of this method is the reduction in the amount of data flow.
Because the grouping vector is chosen as the shortest projected dependency vector, many
dependencies are limited inside a group. Because two projected dependency vectors are
used to create other blocks, they are also the interconnections between blocks, so physical
links can be assigned to them, reducing the need to relay data. The method has three
drawbacks, two of which are significant. Firstly, the way of allocating nodes was given
only for a fixed-size hypercube structure. If the hypercube structure is not used, it is
not clear how to map these blocks onto an actual array. The hypercube is one kind of
architecture, but it is not always available. The second problem is more serious. In one
block, there can be more than one node lying on the same time hyperplane. To avoid
conflicts, extra control or further segmentation is needed, but this is not discussed.
Finally, as in Darte's method, data flows cannot easily be combined into a single package.
Because Moldovan's method was thought to be invalid when there are counter data
flows between the bands, a modification in which the bands are chosen parallel to the
timing hyperplanes was proposed in [25]. This change is so significant that the method is
no longer LPGS. The computation polyhedron is sliced into bands, each of which consists of
c timing hyperplanes (c = det(T), the compression factor as above). (c-1) vectors
$v_1, \ldots, v_{c-1}$ are chosen such that the nodes $i, i + v_1, \ldots, i + v_{c-1}$ lie in different
timing hyperplanes of a band. These nodes are active at different instants, and therefore
they can be assigned to one processor. Hence within a band, the method is similar to LSGP.
However, these bands are carried out in sequence, so the method can be thought of as
"locally sequential", "partially parallel" and "globally sequential". Unfortunately, how
to select the (c-1) vectors remains a serious problem. Because there is no systematic way
to select them and no rules to determine their positions, it is possible to merge remote
nodes into one processor, resulting in non-local data flows.
The problem of mapping piecewise regular algorithms onto piecewise regular arrays
with a given number of processors by means of applying the above LPGS and LSGP
methods is considered in [99].
2.3.2 Partitioning-Projecting Methods
Tiling [108], [46] is another class of partitioning, which is usually carried out on the
original computation polyhedron. First, adjacent nodes are clustered into a block (referred
to as a supernode or a tile) according to some criteria. The original computation space is
thereby transformed into a supernode space, and the original polyhedron into a supernode
polyhedron. Then the supernode polyhedron is mapped onto a processor array. An obvious
advantage of tiling over the previous methods is that, because a supernode sends its output
data just once after all computations in it are finished, the data can easily be
transferred as a single package.
As an example of tiling, consider the method in [46]. Here a set of hyperplanes
$h_i x = k$ is introduced, where $i = 1, \ldots, N$; $x \in R^N$; $k \in Z$, which slices the
computation polyhedron in N directions. The nodes enclosed in one section formed by these
hyperplanes are assigned to one supernode. The method is relatively straightforward, the
major contribution being the derivation of a number of conditions for the selection of the
$h_i$. Unfortunately, the problem of how to map the supernodes onto a fixed-connection and
fixed-size processor array is not considered.
King, Chou and Ni [52] considered the problem of partitioning according to dependency
vectors in the 2-D case. They found that among 2-D dependency vectors it is always
possible to find two, say $d_1$ and $d_2$, such that every other dependency can be expressed as
a non-negative combination of the two vectors, i.e., $c_i d_i = a_i d_1 + b_i d_2$, where
$d_i \in D$, $c_i > 0$, and $a_i, b_i \geq 0$. Let $l_1 = \max_i (a_i / c_i)$ and $l_2 = \max_i (b_i / c_i)$. If
the nodes in a 2-D computation space are clustered into supernodes by parallelograms with
edges along $d_1$ with size $l_1$ and along $d_2$ with size $l_2$, then no matter how many original
dependency vectors there are and what they are, there exist only three dependency vectors
for the supernodes, namely [1,0], [0,1] and [1,1]. This idea is beneficial for the
simplification and unification of parallelisation procedures, because a canonical (or
unchanged) dependency matrix can be used. However, due to computational difficulties, they
did not extend the idea to the 3-D case, did not deal with the problem of space-time
mapping, and so failed to exploit the benefits which may result from the canonical form of
the dependency matrices.
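The canonical supernode dependencies can be observed directly on a small case. The sketch below is illustrative only: it assumes the two basis dependencies are the unit vectors d1 = [1, 0] and d2 = [0, 1], so that every c_i equals 1 and the coefficients a_i and b_i are simply the components of each dependency. The three dependency vectors and the extent of the scanned region are hypothetical.

```python
# Hypothetical 2-D dependencies, written in the basis d1 = [1, 0], d2 = [0, 1]
# (an assumption that makes a_i, b_i the plain components and c_i = 1).
deps = [(2, 1), (1, 3), (3, 0)]

l1 = max(a for a, _ in deps)     # l1 = max(a_i / c_i)
l2 = max(b for _, b in deps)     # l2 = max(b_i / c_i)


def tile_of(node):
    """Supernode (tile) index of a node for l1 x l2 rectangular tiles."""
    return (node[0] // l1, node[1] // l2)


def supernode_dependencies(extent=12):
    """Collect every non-zero dependency between tiles induced by deps."""
    found = set()
    for x in range(extent):
        for y in range(extent):
            src = tile_of((x, y))
            for dx, dy in deps:
                dst = tile_of((x + dx, y + dy))
                delta = (dst[0] - src[0], dst[1] - src[1])
                if delta != (0, 0):
                    found.add(delta)
    return found


tile_deps = supernode_dependencies()
print(l1, l2, sorted(tile_deps))
```

Because no dependency component exceeds the tile edge in its direction, a dependency can cross at most one tile boundary per axis, which is why only the three canonical vectors appear.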
2.3.3 Comparison
Comparing the two main classes of methods, projecting-partitioning or partitioning-
projecting, we can make a summary of their main advantages and disadvantages:
If space mapping is performed before partitioning, i.e., projecting-partitioning, it is
relatively easy to map a polyhedron onto an array of given size: after the space
projection we have the information about the virtual array, so it is relatively easy to
partition the virtual array to fit the actual array. In addition, and quite importantly,
high efficiency is easily achieved. This class has two weaknesses. Firstly, in LSGP-like
methods the time layer is too thin, so that output data cannot stay in a partition long
enough to be packed with others. Secondly, the shape and size of clustered blocks are
determined by the requirement of no conflicts. As a result, there is no guarantee that
the resulting interconnections will fit the given interconnection primitives, or
that local communications will be possible. In contrast, the partitioning-projecting class
possesses the opposite strong and weak points.
In the tiling method, where partitioning is performed ahead of projection, during the
partitioning process we do not have the information needed to scale the size of a
supernode such that the supernode polyhedron can then be mapped onto a given array. This
information is closely related to the problem of space projection.
Furthermore, unless the space-time matrix for the mapping of the supernode space onto
the actual computational space is unimodular, the efficiency tends to be quite low. Of
course, tiling also has its advantages. As mentioned above, because supernodes can be
treated just like nodes, data flows can easily be transferred as a package. The requirement
that supernode interconnections fit the given primitives, and the locality of data flows,
can be met by determining a suitable shape and size for the supernodes.
As both classes have their own fundamental weaknesses, it is difficult to apply them
to practical applications. As a result, the development of software packages for practical
use in automatic parallelization has been restricted.
2.4 Lower-Dimensional Mapping
In practice, it is desirable to consider lower dimensional arrays especially when N > 3 so
as to produce physically realistic architectures.
The problem of mapping an algorithm onto a lower dimensional processor array has
received relatively little attention to date. Some work concentrates on a specific problem
[51], such as mapping four- or five-dimensional bit-level algorithms onto a two-dimensional
processor array. There are also a number of papers dealing with more general cases [104],
[58], [59], [94], [110].
The method in [104] was an early attempt to deal with the issue of lower dimensional
mapping and is essentially simple to understand. A non-singular N x N matrix
$T = \begin{bmatrix} \Pi \\ S \end{bmatrix}$ is used to map the original computational N-D polyhedron to a K-D time
polyhedron by $\Pi$, which is a K x N timing matrix, and to an (N-K)-D processor array by S,
which is an (N-K) x N space mapping matrix. A K-D minimum hypercube is built to contain the
K-D polyhedron. Finally, a K-D projecting vector is used to map the hypercube into a 1-D
time domain without producing computational conflicts. The shortcoming of the technique
is that when an irregular K-D polyhedron is regularised to a hypercube, the total volume
of the computational region can be increased significantly, thus reducing efficiency. An
alternative, but very similar, method involving a mapping onto a lower-dimensional array
by means of a multiprojection is given in [99]. Unfortunately there is no evaluation to
show that the approach achieves an optimum multiprojection for the case of irregular
polyhedra.
An alternative to using two time projections (first from N-D to K-D, then
from K-D to 1-D time) is to use a single timing vector t to map the original N-D polyhedron
onto a 1-D time domain directly (see [58], [59]). However, the resulting
$T = \begin{bmatrix} t \\ S \end{bmatrix}$ is a singular matrix, because S is still (N-K) x N. It is well known that a
singular transformation matrix can result in a conflicting mapping, i.e., more than one
node in the polyhedron is mapped to the same processor and executed at the same instant,
which is strictly forbidden. To avoid this problem, a number of conditions are usually
introduced.
The same concept is expressed in a better way by [94], where the original computational
polyhedron is assumed to be a hypercube. Because rank(T) < N, the null-space of T
consists of the vectors c' by which two nodes j and j+c' are mapped to the same space-time
point. Once S is given, the problem is focused on looking for the minimum t, under some
other conditions, such that no integral members of the null-space of T can be contained
within the hypercube. Such a T is termed conflict-free. In principle, it is always possible
to find a conflict-free T.
Unfortunately, it is very difficult to find the minimum t associated with a conflict-free
T. The method for seeking t is: (1) starting from a small norm, produce all of the vectors
t having the same norm; (2) for each of these t, check whether the hypercube contains any
integral member of the kernel of T. If it does, increase the norm of t and repeat (1) and
(2). Clearly this search procedure is extremely time consuming. For example, step (2) may
have to be repeated as many times as the volume of the computational polyhedron, a
tremendous number of computations when the size of the polyhedron is large. In addition,
step (2) also poses a difficult problem in the theory of linear programming [110].
Furthermore, the performance of the method is often not that good. For example, if
the original polyhedron is a hypercube, the method is good, because it enumerates all the
possible t one by one. However, when the shape of the original polyhedron is irregular,
which is very common (e.g., banded sparse matrix problems), the method still uses a
hypercube to contain the polyhedron, and is unable to exploit the benefit arising from
the irregularity. This is because it is more difficult to perform step (2) on an
irregularly shaped polyhedron, and also because the method uses only a conflict-free
criterion to select t, which is not sufficient. Besides the conflict-free condition, we
must take the order of execution into consideration. Otherwise the execution of a
computation may violate data dependency requirements. For example, if the original
polyhedron is defined by the vertices $[0,0,0]^T$, $[0,0,10]^T$, $[10,0,10]^T$ and $[10,10,10]^T$,
it can be verified that no more than one node is mapped to the same instant by
t = [64, 10, 1], i.e., t is conflict-free. But this t cannot be used, because the node
$[9,9,10]^T$ is mapped to 676, while $[10,0,10]^T$ is mapped to 650, which violates the
lexicographic order of execution that is usually assumed. To avoid this problem with the
conflict-free mapping method, we have to embed the irregular polyhedron into the smallest
possible hypercube.
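The example can be checked by brute force. The following sketch assumes that the polyhedron consists of the integer points of the convex hull of the four vertices, which is the set 0 <= y <= x <= z <= 10 for a node [x, y, z]. It confirms that the timing vector separates all nodes of the polyhedron, that it would not do so on the enclosing cube, and that the two quoted nodes are scheduled against lexicographic order.

```python
from itertools import product

t = (64, 10, 1)

# Integer points of the tetrahedron with vertices (0,0,0), (0,0,10), (10,0,10),
# (10,10,10), assumed here to be the set 0 <= y <= x <= z <= 10.
nodes = [(x, y, z) for x, y, z in product(range(11), repeat=3) if y <= x <= z]
cube = list(product(range(11), repeat=3))


def schedule(node):
    """Time instant assigned to a node by the timing vector t."""
    return sum(a * b for a, b in zip(t, node))


conflict_free = len({schedule(nd) for nd in nodes}) == len(nodes)
cube_conflict_free = len({schedule(nd) for nd in cube}) == len(cube)

first, second = (9, 9, 10), (10, 0, 10)
# Python compares tuples lexicographically, so first < second holds,
# yet the lexicographically later node is scheduled earlier.
order_violated = first < second and schedule(first) > schedule(second)

print(conflict_free, cube_conflict_free, schedule(first), schedule(second), order_violated)
```

The contrast between the two conflict checks shows where the irregular shape helps, while the last flag shows why conflict-freeness alone is not a sufficient selection criterion.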
The General Parameter Method (GPM) was proposed in [34]. The GPM uses (M+1) x N
independent parameters to represent the matrix $T_{(M+1) \times N}$ and searches for optimal
solutions under some constraints on these parameters. It was shown that this method
is equivalent to [94], so the method suffers similar shortcomings although the search
complexity is improved.
In general the methods of [58], [59], [94] and [34], which all attempt to construct
$T_{(M+1) \times N}$, encounter serious obstacles in synthesising an actual array. We must determine
which node is addressed by a particular processor at a particular time. It is easy to map
an N-D domain to an (M+1)-D domain by $T_{(M+1) \times N}$, but an obvious difficulty is how to
do the reverse: the mapping from the (M+1)-D domain to the N-D domain. No papers have dealt
with this problem so far, but it is essential for code generation.
2.5 Summary
In this survey, the two main classes of partitioning and mapping, independent partitioning
and general partitioning and mapping, are discussed. From the discussion, we can say
that the theory on independent partitioning for UREs is essentially mature.
The general partitioning and mapping for UREs is still an open area for further study.
The existing methods, such as LSGP, LPGS and Tiling, all suffer from various kinds of
disadvantages, ranging from hardware limitations and low computation or communication
efficiency to design difficulties. We will address and solve some of these problems in the
following chapters by both improving existing methods and proposing a new integrated
methodology.
Chapter 3
Improvements of Some Existing
Partitioning and Mapping Methods
In this chapter we consider some improvements to the independent partitioning, LSGP and
LPGS methods. In Chapter 2 we discussed these partitioning approaches [93], [25] and [73]
and the important roles they have played in the development of the field. For the
independent partitioning of [93], a simplified SNF partitioning method is suggested and the
generation of parallel codes is presented. For the LSGP of [25], we propose a synthesis
method which guarantees that a mapping of a given regular shape is produced easily. For the
LPGS of [73], an improved approach is presented which removes the long-distance links for
data flow from one side of the array to the other and helps to reduce the communication and
the complexity of design.
3.1 On the Maximal Independent Partitioning
3.1.1 Introduction
As mentioned in Section 2.2, the best and simplest way of parallelising nested loops is to
partition them into several independent partitions if possible. Each of the partitions can
be completed in one processor, so there is no need for inter-processor communications.
The SNF method of [93] is successful for this objective.
To place the improvements in context it is necessary to give a brief description of the
SNF method. In algebra, if a matrix M can be transformed into M' by only elementary
row and column operations, M is said to be equivalent to M', denoted $M \sim M'$. The
SNF of D is such that
\[
U^S D V^S = R^S = \begin{bmatrix} R_s^S & 0 \\ 0 & 0 \end{bmatrix},
\]
where $U^S$ and $V^S$ are unimodular matrices, $R_s^S = \mathrm{diag}(r_0^S, \ldots, r_{s-1}^S)$ is an integral
s x s diagonal matrix and $r_0^S \mid r_1^S \mid \cdots \mid r_{s-1}^S$ ("|" means "divides").
The SNF approach of producing the independent partitioning $(U^S, r^S)$ is described as
follows:
Let the partitioning matrix $U^S$ be such that $U^S D V^S = R^S$. Define
$r^S = [r_0^S, \ldots, r_{s-1}^S, \infty, \ldots, \infty]^T \in Z^N$. The partitions $\{I_{y(1)}, \ldots, I_{y(\mu)}\}$ of the
original computational domain S are such that
$I_{y(j)} := \{ i : U^S i \ (\mathrm{mod}\ r^S) = y(j),\ i \in S \}$, where $y(j) \in Z^N$ is referred to as the
index vector of the partition $I_{y(j)}$, $j = 1, \ldots, \mu$ (the notation $U i \ (\mathrm{mod}\ r)$ means that
$u_k i \ (\mathrm{mod}\ r_k)$ is computed for every k, where $r_k \in r$ and $u_k$ is the k-th row of U).
Two nodes $i^1$ and $i^2$ are said to be pseudo-connected if there exists a vector $\lambda$ such
that $i^1 = i^2 + D\lambda$; that is, they are dependent on each other if they are
pseudo-connected, and vice versa. So, the independent partitioning is such that the
pseudo-connected, and only the pseudo-connected, nodes are grouped into one partition. It
can be proved [93] that $i^1$ and $i^2$ are pseudo-connected iff
$U^S i^1 \ (\mathrm{mod}\ r^S) = U^S i^2 \ (\mathrm{mod}\ r^S)$. It is not easy to give an intuitive
explanation for the SNF method, but we will explain its nature later in Subsection 3.1.4.
However, the SNF is not the only way to achieve the maximal partitioning. We
propose two methods, Diagonalisation Form (DF) and a simplified version of SNF (HSNF),
which can do the same task but HSNF is preferable. An algorithm to produce HSNF,
as well as SNF, via DF is presented, which shows a significant computational advantage.
Furthermore, the problem of algorithm transformation, which was left open in [93], is also
discussed.
3.1.2 Improved Methods of Independent Partitioning
Inspecting carefully the proof of Lemma 5.1 of [93], one can find that the condition
USDVs = RS is a sufficient condition for the maximal partitioning but not the necessary
one. The condition rg I rf I ... r;_l is unnecessary. In fact, it was not mentioned in the
33
proof of the lemma. Therefore, US and rS can be replaced by UD and rD such that
UDDyD = RD = [~ g 1
R~ = diag(r{?, ... ,r~_l) is an integral diagonal matrix. RD is a diagonal matrix but not
SNF. Let rD = [r{?, ... ,r~_lloo,· .. .oc]" E ZN. The partitioning (UD,rD) is called the
Diagonalisation Form partitioning (DF), and it can be proved by the proof of Lemma 5.1
of [93] that P and i2 are pseudo-connected iff UDP(mod rD) = UDP(mod rD).
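A very small instance may help to see what a DF partitioning computes. The sketch below assumes a hypothetical dependency matrix D = diag(2, 3), which is already diagonal (but is not an SNF, since 2 does not divide 3), so that the transformation matrices can be taken as identities and r = [2, 3]. It labels every node with U i (mod r) and checks that nodes differing by D times an integer vector always receive the same label.

```python
from itertools import product

# Hypothetical example: D is already diagonal, so U = V = I is assumed.
D = [[2, 0], [0, 3]]
U = [[1, 0], [0, 1]]
r = [2, 3]


def label(node):
    """Partition index vector y = U i (mod r), computed row by row."""
    return tuple(sum(u * x for u, x in zip(row, node)) % rk
                 for row, rk in zip(U, r))


def shifted(node, lam):
    """The node i + D * lam, i.e. a node pseudo-connected to i."""
    return tuple(x + sum(D[k][m] * lam[m] for m in range(len(lam)))
                 for k, x in enumerate(node))


domain = list(product(range(6), repeat=2))
labels = {label(node) for node in domain}
same_label = all(label(shifted(node, lam)) == label(node)
                 for node in domain
                 for lam in product(range(-2, 3), repeat=2))

print(len(labels), same_label)
```

The number of distinct labels equals the product of the diagonal entries, which is the upper bound on the number of independent partitions for this D.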
The first matter we are concerned with is whether the SNF partitioning $(U^S, r^S)$ and
the DF partitioning $(U^D, r^D)$ have the same result (that is, whether both obtain the
maximal partitioning). We will see later that $\det(R_s^S) = \det(R_s^D)$, i.e., their upper
bounds on the number of partitions are the same.
The advantage of the DF partitioning $(U^D, r^D)$ lies in the fact that much less
computation is needed to diagonalise a matrix than to produce the SNF of the matrix.
However, as shown later, from the viewpoint of producing parallel codes, $(U^S, r^S)$
sometimes has an advantage over $(U^D, r^D)$, because $r^S$ may have more "1" elements than
$r^D$. The more "1" elements it has, the simpler the corresponding parallel codes will be,
which is explained at the end of Subsection 3.1.4 after we introduce the structure of
parallel programs.
However, in the procedure of producing the SNF, once all 1's have been found, we
can stop the procedure because the actual SNF itself is unnecessary. Therefore, a "half"
SNF (HSNF) is suggested, which is a diagonal matrix that is not necessarily an SNF but
has as many "1" elements as possible on its diagonal. The corresponding partitioning is
$(U^H, r^H)$, where $U^H$ is such that $U^H D V^H = R^H$, and $r^H$ is the diagonal of $R^H$ with each
"0" replaced by $\infty$.
3.1.3 Producing HSNF
A new algorithm to produce the HSNF of the matrix $D_{N \times n_d}$ is developed via a simple
algorithm for diagonalising a matrix. The basic idea is as follows:
Step 1 Diagonalise D.
Step 2 Calculate the gcd of the entries on the diagonal from the top-left one, say $d_{00}$, to
the bottom-right one. If gcd $\neq$ 1, stop.
Step 3 Check whether the top-left entry, say $d_{00}$, divides all the entries on the diagonal.
If any entry, say $d_{jj}$, is not divisible by $d_{00}$, add $d_{jj}$ to $d_{j0}$.
Step 4 Diagonalise D again. All entries in the second diagonalised matrix are divisible
by its top-left entry.
Step 5 Go to Step 2 and repeat this procedure with $d_{11}, d_{22}, \ldots$ as the top-left entry,
in turn.
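Algorithm 3.1.1 Diagonalise Matrix-D(Y, D, Z, top-left)
FOR $i$ := top-left TO $\min(N, n_d) - 1$
  DO {
    FOR $j := i+1$ TO $N-1$                                        operation 1
      $g := \gcd(d_{ii}, d_{ji})$ and calculate p and q such that $g = p \times d_{ii} + q \times d_{ji}$
      $X := \begin{bmatrix} p & -d_{ji}/g \\ q & d_{ii}/g \end{bmatrix}$, $\begin{bmatrix} d_i \\ d_j \end{bmatrix} := X^T \begin{bmatrix} d_i \\ d_j \end{bmatrix}$, $\begin{bmatrix} y_i \\ y_j \end{bmatrix} := X^T \begin{bmatrix} y_i \\ y_j \end{bmatrix}$
    FOR $j := i+1$ TO $n_d - 1$                                      operation 2
      $g := \gcd(d_{ii}, d_{ij})$ and calculate p and q such that $g = p \times d_{ii} + q \times d_{ij}$
      $X := \begin{bmatrix} p & -d_{ij}/g \\ q & d_{ii}/g \end{bmatrix}$, $[d^i : d^j] := [d^i : d^j] X$, $[z^i : z^j] := [z^i : z^j] X$
  } WHILE ($d_{ii}$ is not the only non-zero entry in $d^i$)
End of Algorithm
where $d^i$ and $d_i$ are the column and row vectors of D, respectively (and likewise for
Y and Z), and $d_{ij} \in D$. This algorithm appears straightforward: operation 1 and
operation 2 are designed to annihilate the entries, except $d_{ii}$, of $d^i$ and $d_i$,
respectively. It is worthwhile mentioning that after operation 1, $d_{ii}$ becomes the gcd of
$d^i$. In fact, in the operation $\begin{bmatrix} d_i \\ d_j \end{bmatrix} := X^T \begin{bmatrix} d_i \\ d_j \end{bmatrix}$, because
$[p, q][d_{ii}, d_{ji}]^T = \gcd(d_{ii}, d_{ji})$, we have $d_{ii} := \gcd(d_{ii}, d_{ji})$; so finally
$d_{ii} = \gcd(d_{ii}, d_{i+1,i}, \ldots, d_{N-1,i})$.
This also holds for operation 2. The following algorithm produces the HSNF of D.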
Algorithm 3.1.2 HSNF-D(Y, D, Z)
Make identity matrices Y and Z
$i := 0$
Diagonalise Matrix-D(Y, D, Z, i)                                   operation 1
WHILE ($\gcd(d_{ii}, \ldots, d_{\min,\min}) = 1$)
  FOR $j := i$ TO $\min(N, n_d) - 1$                                 operation 2
    Find the smallest entry, say $d_{kk}$, along the diagonal of D from $d_{jj}$
    Swap $d_j$ with $d_k$, do the same for Y; swap $d^j$ with $d^k$, do the same for Z
  FOR $l := j+1$ TO $\min(N, n_d) - 1$                               operation 3
    IF $d_{jj}$ cannot divide $d_{ll}$
      $d_{lj} := d_{lj} + d_{ll}$ and $z^j := z^j + z^l$
  Diagonalise Matrix-D(Y, D, Z, i)                                 operation 4
  $i := i + 1$
$U^H := Y$, $V^H := Z$
End of Algorithm
We should prove that for any diagonal matrix $D = \mathrm{diag}(d_{00}, d_{11}, \ldots, d_{N-1,N-1})$, after
operation 2 and operation 4, its new $d_{00}$ divides every entry on the diagonal. In fact,
without loss of generality, suppose $d_{00}$ is the smallest entry on the diagonal and that it
divides none of $d_{11}, \ldots, d_{N-1,N-1}$. After operation 3, the column $d^0$ of D becomes
$[d_{00}, d_{11}, \ldots, d_{N-1,N-1}]^T$. In Diagonalise Matrix-D, after annihilating $d^0$,
$d_{00} = \gcd(d_{00}, d_{11}, \ldots, d_{N-1,N-1})$, and every other entry of D is a linear combination of
$d_{11}, \ldots, d_{N-1,N-1}$. As a consequence of this, the new $d_{00}$ divides every other entry in
the modified D. Therefore, at most N-1 iterations are needed to produce the SNF of D. It
remains to explain the WHILE loop test. Less computation is required to produce the HSNF
because the process stops midway, as soon as all "1" elements on the diagonal have been
found.
Compared with the method of [49], our algorithm is faster. In step i, there is an
(N-i-1)-D HNF at the bottom-right of the operating matrix in the method of [49], instead of
an (N-i-1)-D diagonal submatrix in our method. To ensure the property $d_{ii} \mid d_{(i+1),(i+1)}$,
we have to check whether any entry of the (N-i-1)-D submatrix at the bottom-right of the
operating matrix is not divisible by $d_{ii}$. If there is one, the operating matrix must be
modified and the same operation (computing the HNF or the diagonal matrix) must be
repeated. Therefore in step i, there are at most $(N-i-1)^2/2$ operations of computing the
HNF in the method of [49], while there are at most $(N-i-1)$ operations of computing a
diagonal matrix in our method. Even if the computation of diagonalising a matrix is roughly
twice that of producing the HNF, our algorithm is still much faster.
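The diagonalisation and the extraction of unit diagonal entries can be tried out with a short program. The following is a simplified sketch, not a transcription of Algorithms 3.1.1 and 3.1.2: it omits the swap of the smallest diagonal entry, uses plain elimination whenever the pivot already divides an entry (an assumption made here to guarantee termination), and all function names are hypothetical. The test matrix is an arbitrary 3 x 3 integer matrix with determinant 88.

```python
from functools import reduce
from math import gcd


def ext_gcd(a, b):
    """Return (g, p, q) with g = gcd(a, b) >= 0 and g == p*a + q*b."""
    if b == 0:
        return abs(a), (1 if a >= 0 else -1), 0
    g, p, q = ext_gcd(b, a % b)
    return g, q, p - (a // b) * q


def coeffs(a, b):
    """Unimodular 2x2 coefficients (p, q, u, v) sending the pair (a, b) to (g, 0)."""
    if a != 0 and b % a == 0:          # pivot divides b: plain elimination
        return 1, 0, -(b // a), 1
    g, p, q = ext_gcd(a, b)
    return p, q, -(b // g), a // g


def combine_rows(M, i, j, p, q, u, v):
    M[i], M[j] = ([p * x + q * y for x, y in zip(M[i], M[j])],
                  [u * x + v * y for x, y in zip(M[i], M[j])])


def combine_cols(M, i, j, p, q, u, v):
    for row in M:
        row[i], row[j] = p * row[i] + q * row[j], u * row[i] + v * row[j]


def diagonalise(D, Y, Z, top=0):
    """Diagonalise D in place from index top; Y and Z accumulate the row and
    column operations, so that Y * D_original * Z == D is preserved."""
    n_rows, n_cols = len(D), len(D[0])
    for i in range(top, min(n_rows, n_cols)):
        while True:
            for j in range(i + 1, n_rows):      # clear column i below the pivot
                if D[j][i]:
                    c = coeffs(D[i][i], D[j][i])
                    combine_rows(D, i, j, *c)
                    combine_rows(Y, i, j, *c)
            for j in range(i + 1, n_cols):      # clear row i right of the pivot
                if D[i][j]:
                    c = coeffs(D[i][i], D[i][j])
                    combine_cols(D, i, j, *c)
                    combine_cols(Z, i, j, *c)
            if not any(D[j][i] for j in range(i + 1, n_rows)):
                break


def identity(n):
    return [[int(r == c) for c in range(n)] for r in range(n)]


def hsnf(D0):
    """Return (D, Y, Z) with Y * D0 * Z == D diagonal, pulling out unit entries
    for as long as the gcd of the remaining diagonal is 1."""
    D = [row[:] for row in D0]
    Y, Z = identity(len(D)), identity(len(D[0]))
    diagonalise(D, Y, Z)
    m = min(len(D), len(D[0]))
    i = 0
    while i < m and reduce(gcd, (D[k][k] for k in range(i, m)), 0) == 1:
        for l in range(i + 1, m):
            if D[i][i] == 0 or D[l][l] % D[i][i]:
                combine_cols(D, i, l, 1, 1, 0, 1)   # column i += column l
                combine_cols(Z, i, l, 1, 1, 0, 1)
        diagonalise(D, Y, Z, i)
        i += 1
    return D, Y, Z


D0 = [[8, -55, 0], [4, -22, 0], [6, 0, 2]]
R, Y, Z = hsnf(D0)
print([R[k][k] for k in range(3)])
```

The intermediate diagonal form depends on the choice of Bezout coefficients, so it need not coincide with a hand computation; the product of the diagonal entries is invariant up to sign because only unimodular operations are applied.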
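It is easy to show that $\det(R_s^S) = \det(R_s^D)$. In fact, after being diagonalised for
the first time, D has the form $\begin{bmatrix} R_s^D & 0 \\ 0 & 0 \end{bmatrix}$. Any following elementary row and column
operations for the HSNF or SNF are limited to the first s rows and the first s columns. No
nonzero entries are created outside the square area of $R_s^D$. So, because only elementary
row and column operations are applied, there must be $|\det(R_s^S)| = |\det(R_s^D)|$. An
example may be helpful to illustrate the HSNF algorithm (but going on to the SNF):
\[
\underbrace{\begin{bmatrix} 8 & -55 & 0 \\ 4 & -22 & 0 \\ 6 & 0 & 2 \end{bmatrix}}_{\text{Original}}
\overset{\text{operation 1}}{\Longrightarrow}
\underbrace{\begin{bmatrix} 11 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 4 \end{bmatrix}}_{\text{DF}}
\overset{\text{operations 2, 4}}{\Longrightarrow}
\underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 0 & 22 & 0 \\ 0 & 0 & 4 \end{bmatrix}}_{\text{HSNF}}
\overset{\text{operations 2, 4}}{\Longrightarrow}
\underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 44 \end{bmatrix}}_{\text{SNF}}
\]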
3.1.4 Algorithm Generation
The SNF independent partitioning approach is an important advance in theory. However,
the suggested partitioning procedure of assigning every $i \in S$ with
$U^S i \ (\mathrm{mod}\ r^S) = y(j)$ to the partition y(j) is impractical. If S is large, the
computation of $U^S i$ will be very time-consuming. Moreover, no parallel codes are produced
in this way. We hope to transform a serial algorithm into a parallel one as the result of
the independent partitioning procedure. Fortunately, this is not too difficult with the
HSNF.
Having obtained the HSNF of D using Algorithm 3.1.2, in order to move the potential
parallel loops so that they become the outermost loops, reverse $R^H$, that is, form
$J R^H J$, where J denotes an exchange matrix (ones on the anti-diagonal) of the appropriate
size. For simplicity, let $R = J R^H J$. We know that $R = \mathrm{diag}(r_0, \ldots, r_{N-1})$, where
\[
r_i \begin{cases}
= 0, & i = 0, \ldots, n_0 - 1 \\
> 1, & i = n_0, \ldots, n_0 + n_1 - 1 \\
= 1, & i = n_0 + n_1, \ldots, N - 1
\end{cases}
\]
where $n_0 = N - r$ and $n_1$ is the number of diagonal entries larger than 1.
Let the transformation matrix be defined as $T = J U^H$. Because T is unimodular,
the original sequential FOR-loops, indexed by $i = [i_0, \ldots, i_{N-1}]^T$, can be transformed
directly to another set of FOR-loops, indexed by $j = [j_0, \ldots, j_{N-1}]^T$, such that $j := Ti$,
that is
FOR $i_0 := l_0^i$ TO $u_0^i$                              FOR $j_0 := l_0^j$ TO $u_0^j$
  ...                               $j := Ti$         ...
FOR $i_{N-1} := l_{N-1}^i$ TO $u_{N-1}^i$                  FOR $j_{N-1} := l_{N-1}^j$ TO $u_{N-1}^j$
where $l_k^i$ and $u_k^i$ may be functions of $i_0, \ldots, i_{k-1}$; $l_k^j$ and $u_k^j$ may be functions of
$j_0, \ldots, j_{k-1}$.
It is interesting to see the data dependencies in the transformed space. Let
$V^J = (V^H J)^{-1}$. Note that $V^J$ is a unimodular matrix. In fact, the new dependencies,
$D^j$, are
\[
D^j = TD = R V^J = \begin{bmatrix} 0_{n_0} \\ r_{n_0} \times v^J_{n_0} \\ \vdots \\ r_{N-1} \times v^J_{N-1} \end{bmatrix}
\]
where $0_{n_0}$ is an $n_0$-row zero matrix, $v^J_i \in V^J$ is the i-th row of $V^J$, and $\gcd(v^J_i) = 1$.
Because the first $n_0$ rows of $D^j$ are zero, the first $n_0$ FOR-loops with respect to
$j_0, \ldots, j_{n_0-1}$ can be carried out concurrently, so they become DOALL-loops.
Consider each of the middle $n_1$ rows of $D^j$. Because $\gcd(d^j_i) = \gcd(r_i v^J_i) = r_i$,
where $d^j_i \in D^j$, data dependencies jump $r_i$ nodes in this dimension, so nodes can be
grouped into $r_i$ independent partitions along this direction. Recalling the analysis of the
nature of independent partitioning in Section 2.2, this is exactly the Greatest Common
Divisor (GCD) method. Therefore the SNF method can be said to be equivalent to the GCD
method after a linear transformation T.
The FOR-loop with respect to $j_i$ corresponding to $r_i > 1$ can be split into two
loops
DOALL $j_i^p = 0$ TO $r_i - 1$
  FOR $j_i := j_i^p + r_i \lceil (l_i^j - j_i^p) / r_i \rceil$ TO $u_i^j$ STEP $r_i$
Because $r_{n_0+n_1} = \cdots = r_{N-1} = 1$, the last $N - n_0 - n_1$ FOR-loops with respect to
$j_{n_0+n_1}, \ldots, j_{N-1}$ cannot be split any more, so they remain unchanged. Let
$n_p = n_0 + n_1$. All "DOALL $j_i^p = 0$ TO $r_i - 1$" can be moved outwards to the outermost
levels. In fact the outward movements change nothing in the lexicographical order of the
execution of the loops, because it is the $j_i$'s (not the $j_i^p$'s) that are the real indices
used to access a node and they stay in their original order. Therefore, the parallel
algorithm has the form:
Loops 3.1.1 The parallel algorithm of Independent Partitioning
    DOALL j_0 := l_0^j TO u_0^j
      ...
    DOALL j_{n_0-1} := l_{n_0-1}^j TO u_{n_0-1}^j
    DOALL j_{n_0}^p := 0 TO r_{n_0} - 1
      ...
    DOALL j_{n_p-1}^p := 0 TO r_{n_p-1} - 1
    FOR j_{n_0} := j_{n_0}^p + r_{n_0} * ceil((l_{n_0}^j - j_{n_0}^p) / r_{n_0}) TO u_{n_0}^j STEP r_{n_0}
      ...
    FOR j_{n_p-1} := j_{n_p-1}^p + r_{n_p-1} * ceil((l_{n_p-1}^j - j_{n_p-1}^p) / r_{n_p-1}) TO u_{n_p-1}^j STEP r_{n_p-1}
    FOR j_{n_p} := l_{n_p}^j TO u_{n_p}^j
      ...
    FOR j_{N-1} := l_{N-1}^j TO u_{N-1}^j
The FOR-loops are executed in one processor.
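As a minimal sketch of the loop splitting above (not the code generated by the thesis's tools), the following Python function lists, for a single loop with bounds `lower`..`upper` and a dependency jump `r`, the indices visited by each of the `r` independent partitions. The name `split_loop` and its arguments are hypothetical; the start index uses the expression $j^p + r\lceil (l - j^p)/r \rceil$ from the split loop.

```python
def split_loop(lower, upper, r):
    """Return one index list per residue class p = 0..r-1.

    Mirrors:  DOALL p := 0 TO r-1
                FOR j := p + r*ceil((lower - p)/r) TO upper STEP r
    """
    classes = []
    for p in range(r):                       # the DOALL loop over partitions
        ceil_div = -(-(lower - p) // r)      # integer ceil((lower - p) / r)
        start = p + r * ceil_div             # first index >= lower with j = p (mod r)
        classes.append(list(range(start, upper + 1, r)))  # sequential FOR loop
    return classes


if __name__ == "__main__":
    for p, indices in enumerate(split_loop(3, 11, 4)):
        print(p, indices)
```

Each inner list can be executed independently of the others, and together the lists cover every index of the original loop exactly once.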
It has been mentioned that in many cases HSNF partitioning $(U^H, \Gamma^H)$ is preferable
to DF partitioning $(U^D, \Gamma^D)$. This statement relies on the fact that although the DF and
HSNF share the same product of the diagonal elements, the latter may contain more
"1" elements than the former. For instance, in the example of the last subsection, DF has
the diagonal elements of 11,2,4 and HSNF has 1, 22, 4. If the DF partitioning is adopted,
two loops have to be split, whereas for HSNF, only one loop needs splitting. The fewer
loops are split, the simpler the parallel algorithm will be. There is essentially no difference
between HSNF and SNF, so HSNF may be more useful because less computation is
required to produce it.
3.2 A Synthesis Method using LSGP Partitioning
for Given-Shape Regular Arrays
3.2.1 Introduction
We present a method to partition and map a computational polyhedron onto processor
arrays. Based on the theoretical framework of an existing LSGP method, a systematic
design procedure is proposed which is able to map the polyhedron onto a regular array
with given regular shape, i.e., given sizes in each dimension.
3.2.2 A New LSGP Synthesis Method
The method proposed in [24] (see Subsection 2.3.1) is an important theoretical advance.
We recall the major points of [24]. There are four essential relationships: $c = ts^p$; $Ss^p = 0$;
$tQ = 0$; $A = SQ$. Finally, $A$ can be made equivalent to a triangular matrix (HNF)
whose diagonal entries can be used to define a conflict-free partitioning box.
However, it has also been pointed out that this partitioning method suffers a significant
drawback when mapping a polyhedron onto a given regular processor array. In view of
the neat formalism in [24], we wish to focus on modifying the method to overcome this
difficulty. To achieve this goal, we consider the problem from a different perspective.
From an $s^p$, we can construct a space mapping $S$, by which a projected virtual array
can be determined provided that the vertices of the computational polyhedron are given¹.
Thus, the required compression factors in each dimension can be found very easily. We
can construct an upper triangular matrix $A$ with the compression factors as its diagonal
entries. Then, we find $Q$ by solving $SQ = A$. Finally, we determine $t$ from $Q$ under some
additional constraints, such as $SD = PK$ [73]. Obviously, it is ensured that the original
computational polyhedron can be partitioned and mapped onto and within the given
regular processor array. Our method is illustrated as
    s^p ===> S ===( k_0, ..., k_{N-2} )===> A ===> Q ===> t
Compared with the method on page 23, because the compression factors are set as the
parameters of the procedure, the given-regular-shape mapping is achieved in principle and
the derivation of the HNF is no longer needed.
This new method is described in detail as follows:
Step 1 Choose an $s^p = [s^p_0, \ldots, s^p_{N-1}]^T$ with $s^p_{N-1} = 1$. Construct the space mapping,
which has the form
$$S = [S' : s'] \qquad (3.1)$$
where $S'$ is an $(N-1) \times (N-1)$ unimodular matrix.
Let $S' = I$ for simplicity (not necessary, but beneficial to the following operations).
By the definition of $S$, we know that $Ss^p = 0$, where $0$ is a null-filled column vector.
$s'$ is obtained as $s' = -S'[s^p_0, \ldots, s^p_{N-2}]^T$, because $s^p_{N-1} = 1$.
¹ This is the usual case. However, as will be seen, there are some cases where the computational
body is infinite, like a semi-infinite cylinder.
Step 2 Determine the compression factors in each dimension. First, project all the
possible vertices $V$ to a set of projected vertices $W = SV$.
The compression factors in each dimension are formulated as, for $i = 0, \ldots, N-2$,
$$k_i = \left\lceil \frac{\max_{0 \le j < n_v} w_{i,j} - \min_{0 \le j < n_v} w_{i,j}}{l_i} \right\rceil \qquad (3.2)$$
where $w_{i,j} \in W$, $n_v$ is the number of vertices and $l_i$ is the length of the processor
array in the $i$-th dimension. The desired $c = \prod_{j=0}^{N-2} k_j$.
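Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration under the simplifying choice $S' = I$; the helper names (`space_mapping`, `matmul`, `compression_factors`) are hypothetical, and the vertex matrix `V`, the vector $s^p = [1,1,1]^T$ and the $4 \times 4$ array size are the ones used in the example of Section 3.2.4.

```python
from math import prod


def space_mapping(sp):
    """S = [I : s'] with s' = -[sp_0 .. sp_{N-2}]^T, assuming sp[-1] == 1 and S' = I."""
    n = len(sp)
    if sp[-1] != 1:
        raise ValueError("the last element of s^p must be 1")
    return [[1 if col == row else 0 for col in range(n - 1)] + [-sp[row]]
            for row in range(n - 1)]


def matmul(a, b):
    """Plain integer matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def compression_factors(w, lengths):
    """k_i = ceil((max_j w_ij - min_j w_ij) / l_i), eqn (3.2), in integer arithmetic."""
    return [-(-(max(row) - min(row)) // l) for row, l in zip(w, lengths)]


# Vertices of the computational polyhedron from the example (one vertex per column).
V = [[0, 20, 0, 0, 0, 20, 20, 20],
     [0, 0, 20, 0, 20, 0, 20, 20],
     [0, 0, 0, 10, 10, 10, 0, 10]]
S = space_mapping([1, 1, 1])
W = matmul(S, V)                       # projected vertices W = SV
k = compression_factors(W, [4, 4])     # 4 x 4 processor array
c = prod(k)                            # desired c = k_0 * ... * k_{N-2}

if __name__ == "__main__":
    print(S, W, k, c, sep="\n")
```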
Step 3 Construct an $(N-1) \times (N-1)$ upper triangular $A$ with $k_0, \ldots, k_{N-2}$ as the
diagonal elements.
The entries above the diagonal may be filled (not necessary, but beneficial to reducing
the design complexity later): $\forall i$ and $j$, $0 \le i < N-1$ and $i < j < N-1$,
$$a_{i,j} = \begin{cases} f & g = \gcd(k_i, k_j) > 1 \\ 0 & \text{otherwise} \end{cases} \qquad (3.3)$$
where $f = k_j / g^r$, GCD$(f, g) = 1$ and $r \ge 1$. That is, $f$ is a factor of $k_j$ without $g$
as its factor. For example, for $k_j = 3 \times 5^2$ and $g = 5$, we have $r = 2$ and $f = 3$.
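The filling rule of eqn (3.3) can be sketched as follows. The function names are hypothetical. One assumption is made loudly here: when $k_j$ is not exactly of the form $f g^r$, the sketch strips from $k_j$ every prime factor it shares with $g$, which coincides with $f = k_j / g^r$ whenever that quotient exists.

```python
from math import gcd


def coprime_part(kj, g):
    """Largest factor f of kj with gcd(f, g) == 1.

    Equals kj / g**r when kj = f * g**r; otherwise this is an assumed
    generalisation (strip all prime factors shared with g).
    """
    f = kj
    while gcd(f, g) > 1:
        f //= gcd(f, g)
    return f


def activity_matrix(k):
    """Upper triangular A with k on the diagonal, filled as in eqn (3.3)."""
    n = len(k)
    a = [[0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = k[i]
        for j in range(i + 1, n):
            g = gcd(k[i], k[j])
            if g > 1:                      # only fill when k_i and k_j share a factor
                a[i][j] = coprime_part(k[j], g)
    return a


if __name__ == "__main__":
    print(coprime_part(3 * 5 ** 2, 5))
    print(activity_matrix([8, 8]))
```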
Step 4 Derive integral solutions for $Q$, an $N \times (N-1)$ matrix, where
$$SQ = A \qquad (3.4)$$
Because rank$(S) = N-1$, we know that the general solution of $Q$ consists of two
parts,
$$Q = Q_p + q_n b^T \qquad (3.5)$$
where $Q_p$ is a particular solution, $q_n$ the null-solution, and $b$ is a combination
vector. $Q_p$ and $q_n$ are given by the following proposition.
Proposition 3.2.1 $q_n = s^p$ and $Q_p = \begin{bmatrix} A \\ 0^T \end{bmatrix}$,
where $0^T$ stands for a null-filled row vector.
Proof: Obviously, since $Sq_n = 0$ and $S' = I$, the first $N-1$ elements of $q_n$ are
$-S'^{-1}s' = -s'$ if we let its last element be 1. Thus $q_n = s^p$. Fortunately, it is not
difficult to find a particular solution under our arrangement for $S$. Letting the last row
of $Q_p$ be a null vector, we have
$$Q_p = \begin{bmatrix} S'^{-1}A \\ 0^T \end{bmatrix} = \begin{bmatrix} A \\ 0^T \end{bmatrix} \qquad (3.6)$$
□
Step 5 Select a $Q$ from its general solutions, obtained by changing $b$, according to the
following condition:
Condition 1. $\forall q_i \in Q$, GCD$(q_i) = 1$, where GCD$(v)$ means the gcd of the elements
of the vector $v$.
Step 6 Derive the timing vector $t$ by solving
$$tQ = 0^T \qquad (3.7)$$
and normalise it by its gcd, GCD$(t)$. Test two conditions:
Condition 2. $ts^p = c$ (the test for this condition is unnecessary if $S' = I$ and
$A$ is filled as in eqn (3.3)).
Condition 3. $tD \ge k_p$, where $k_p$ will be defined later (to allow implementation with a
set of given interconnection primitives).
If the conditions do not hold, go to Step 5.
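Steps 4 to 6 can be sketched as below, assuming $S' = I$ so that $Q_p$ is $A$ with a zero row appended and the general solution is $Q = Q_p + s^p b^T$. All function names are hypothetical. The timing vector is computed from the signed $(N-1) \times (N-1)$ minors of $Q$, which satisfy $tQ = 0$; the combination vector $b = [-5, -3]^T$ in the usage lines is an assumption chosen because it reproduces the $Q$ quoted in the example of Section 3.2.4.

```python
from functools import reduce
from math import gcd


def det(m):
    """Integer determinant by Laplace expansion along the first row (small N only)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** c * m[0][c] * det([row[:c] + row[c + 1:] for row in m[1:]])
               for c in range(len(m)) if m[0][c])


def general_q(a, sp, b):
    """Q = Q_p + s^p b^T with Q_p = [A; 0^T] (valid for S' = I)."""
    qp = [row[:] for row in a] + [[0] * len(a)]
    return [[qp[r][c] + sp[r] * b[c] for c in range(len(b))] for r in range(len(sp))]


def condition1(q):
    """Condition 1: every column of Q has gcd 1."""
    return all(reduce(gcd, (abs(x) for x in col)) == 1 for col in zip(*q))


def timing_vector(q):
    """Return (t, g): t = [R_0 .. R_{N-1}] / g with R_i = (-1)^i det(Q without row i)."""
    minors = [(-1) ** i * det(q[:i] + q[i + 1:]) for i in range(len(q))]
    g = reduce(gcd, (abs(x) for x in minors))
    return [x // g for x in minors], g


if __name__ == "__main__":
    sp = [1, 1, 1]
    Q = general_q([[8, 1], [0, 8]], sp, [-5, -3])
    t, g = timing_vector(Q)
    print(Q, condition1(Q), t, g, sum(x * y for x, y in zip(t, sp)))
```

Condition 2 then amounts to comparing the printed product $ts^p$ with $c$; Condition 3 needs the dependency matrix $D$ and $k_p$ and is left out of this sketch.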
3.2.3 The Conditions for Valid Q and t
In this section, we explain the three conditions and the synthesis of A.
Valid Q
In the proof of Theorem 1 of [24], matrix $Q$ is a basis for a subspace in which all the nodes
are executed at the same time but must be allocated to different processors in order to
avoid conflicts. To avoid any conflicts, the subspace spanned by $Q$ must include all the
nodes which are executed at the same time. In other words, it is necessary that $Q$ can
be extended to a basis for the full space, by which no nodes may be missed. Such a $Q$ is
termed valid. Let $Q$ be extended to a square matrix $R = [r_0 : Q]$, which is the basis to
access the full integer space. $R$ must be unimodular (if not, there must be "holes" in the
integer space spanned by $R$). Obviously, if $\exists q_i \in Q$ such that $g_i = $ GCD$(q_i) > 1$, we have
$|\det R| \ge g_i > 1$. So we give Condition 1 to avoid this case. Condition 1 is necessary but
not sufficient.
The necessary and sufficient condition for a unimodular $R$ is that
$$\mathrm{GCD}(R_0, \ldots, R_{N-1}) = 1 \qquad (3.8)$$
where $R_i$ is the subdeterminant of $Q$ obtained by deleting the $i$-th row and prefixing
$(-1)^i$. In fact, if $R$ is unimodular, we have
$$\det R = \sum_{i=0}^{N-1} r_{i,0} R_i = 1 \qquad (3.9)$$
where $r_{i,0} \in r_0$ and $R_i$ is also the cofactor of $r_{i,0}$. It is well known that the necessary and
sufficient condition for an integral solution $r_0$ of eqn (3.9) is GCD$(R_0, \ldots, R_{N-1}) = 1$.
However, eqn (3.8) is not easily tested because we have to calculate $N$ $(N-1)$-D
determinants. Fortunately, we have another method to provide the sufficient condition.
Theorem 3.2.1 $Q$ is valid, i.e., GCD$(R_0, \ldots, R_{N-1}) = 1$, if $ts^p = c$.
This is Condition 2, where we use the $t$ which is produced from $Q$, instead of $Q$ itself, to
check the validity of $Q$. Since $c$ is known and $t$ is also a result of the design, the
test is easy to carry out.
Proof: First a little preparation. A row vector $t$ is called a valid timing
vector if GCD$(t) = 1$ and $tQ = 0$. Second, $S$ can be extended to a unimodular matrix
$$S_u = \begin{bmatrix} 0 \; \cdots \; 0 \; 1 \\ S \end{bmatrix}$$
It is claimed that the first row of $R^{-1}$ is a valid $t$ and the first column of $S_u^{-1}$ is $s^p$.
In fact, let $g = $ GCD$(R_0, \ldots, R_{N-1})$. Derive $r_0$ by solving $\sum_{i=0}^{N-1} r_{i,0} R_i = g$. Obviously
$\det R = g$. Let the first row of $R^{-1}$ be $b$, which is nothing else but $[R_0, \ldots, R_{N-1}]/g$.
It is clear that $b$ possesses the two properties of $t$. Next,
$$S_u = \begin{bmatrix} 0 \; \cdots \; 0 \; 1 \\ S \end{bmatrix} = \begin{bmatrix} 0 \; \cdots \; 0 & 1 \\ S' & s' \end{bmatrix}$$
Letting the first column of $S_u^{-1}$ be $x$, we know $Sx = 0$ and $[0, \ldots, 0, 1]x = 1$, which means
$x_{N-1} = 1$, by definition. So $x$ must be $s^p$.
Finally, it is known that the cofactor of the first element of $S_u R$ is $c$ and $\det(S_u R) = g$,
because $S_u$ is unimodular. $(S_u R)^{-1} = R^{-1} S_u^{-1}$, whose first element is $ts^p$. Thus $ts^p = c/g$.
So we have $g = 1$ if $ts^p = c$. □
Because all the null solutions of a system of $N-1$ homogeneous N-D equations
differ only by a scalar factor, Step 6 of the method produces the timing vector, and
requires the solution of one $(N-1)$-D diophantine equation. According to Theorem 3.2.1,
Condition 2 removes the possibility of conflicts. Condition 2 is necessary and sufficient,
and includes Condition 1. However, we keep Condition 1 in Step 5, because it is easy to
test without solving diophantine equations and will also be sufficient if $A$ is chosen as in
eqn (3.3).
The Design of A for valid Q
However, if we do not design the activity matrix $A$ carefully, it is possible that only a
small portion, or even none, of the $Q$'s produced by Step 5 can pass the test of Condition
2, which is a great waste of time. It would be best if Step 5 produced only valid $Q$. To discuss
the problem, let us consider the case of 3 dimensions. Our question is this: can $A$ be just
a simple diagonal matrix? If it cannot, then how do we fill the entries above the diagonal?
At first, consider the case when $A$ is a diagonal matrix. We have
Theorem 3.2.2 Let $A = \mathrm{diag}(k_0, k_1)$. If GCD$(k_0, k_1) = g > 1$, no $Q$ is valid;
otherwise all are valid provided that Condition 1 is true.
Proof: In fact, from eqns (3.6) and (3.5), we have, remembering $q_n = s^p$ and $s^p_{N-1} = 1$,
$$Q^T = \begin{bmatrix} k_0 + b_0 s^p_0 & b_0 s^p_1 & b_0 \\ b_1 s^p_0 & k_1 + b_1 s^p_1 & b_1 \end{bmatrix}$$
then
$$R_0 = -k_1 b_0, \quad R_1 = -k_0 b_1, \quad R_2 = k_0 k_1 + k_0 b_1 s^p_1 + k_1 b_0 s^p_0.$$
It can be seen that $R_0$, $R_1$ and $R_2$ have at least one common divisor $g$, if GCD$(k_0, k_1) = g > 1$.
Otherwise, we have GCD$(k_0, k_1) = 1$. Suppose $k_0 b_1$ and $k_1 b_0$ share a common divisor $d > 1$.
Then $k_0 k_1 + k_0 b_1 s^p_1 + k_1 b_0 s^p_0 = k_0 k_1 + dZ$, where $Z$ is an integer. If
$k_0 k_1 + k_0 b_1 s^p_1 + k_1 b_0 s^p_0$ has the same factor $d$, $k_0 k_1$ must contain $d$, i.e., either
$k_0$ or $k_1$ contains $d$ (if not, $d$ cannot divide $R_2$). Because GCD$(k_0, k_1) = 1$, $d$ must be a
factor of either $b_1$ or $b_0$, or of both. Assume $k_0$ contains $d$. Then $b_0$ must contain $d$, that is
GCD$(k_0, b_0) = d$, because $k_1$ cannot contain $d$. This means that the elements of the first
column of $Q$ share a common divisor $d$. But this is impossible, because it violates Condition 1.
The same argument holds assuming $k_1$ contains $d$. Therefore, $d$ is not a factor of
$k_0 k_1 + k_0 b_1 s^p_1 + k_1 b_0 s^p_0$, because neither $k_0$ nor $k_1$ contains it. So we can claim that
$R_0$, $R_1$ and $R_2$ do not share any common factors. □
Obviously, this is not a good choice because there is no guarantee of GCD$(k_0, k_1) = 1$.
Now we have to fill the entries above the diagonal. There are two situations to consider,
GCD$(k_0, k_1) = 1$ or $> 1$. First, consider an $A$ such that GCD$(k_0, k_1) = 1$. We find some
of the $Q$'s are valid, some are not. For example, let
$$A = \begin{bmatrix} 151 & 1 \\ 0 & 50 \end{bmatrix}$$
and $s^p = [5, 2, 1]^T$. We may have a $Q^T$,
$$Q^T = \begin{bmatrix} 146 & -2 & -1 \\ -9 & 46 & -2 \end{bmatrix} \quad \text{when } b = [-1, -2]^T$$
Then $R_0$, $R_1$ and $R_2$ are 50, 301 and 6698, respectively. They have no common divisors.
However we can also have another $Q^T$
$$Q^T = \begin{bmatrix} 146 & -2 & -1 \\ -4 & 48 & -1 \end{bmatrix} \quad \text{when } b = [-1, -1]^T$$
$R_0$, $R_1$ and $R_2$ are 50, 150 and 7000, respectively, sharing the gcd 50. Then $t = [1, 3, 140]$ and
$ts^p = 151$, but $c = 7550$. So this $Q$ is invalid. This means that $a_{i,j}$ should be zero if
GCD$(k_i, k_j) = 1$.
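The two numerical cases above can be re-checked with a short script. This is only a verification sketch: the helper names are hypothetical, and for $N = 3$ the signed minors $R_0, R_1, R_2$ are simply the cross product of the two columns of $Q$. It also illustrates the relation $ts^p = c/g$ from the proof of Theorem 3.2.1.

```python
from functools import reduce
from math import gcd


def signed_minors(q0, q1):
    """R_0, R_1, R_2 of the 3 x 2 matrix Q = [q0 q1] (the cross product q0 x q1)."""
    return [q0[1] * q1[2] - q0[2] * q1[1],
            q0[2] * q1[0] - q0[0] * q1[2],
            q0[0] * q1[1] - q0[1] * q1[0]]


def check(q0, q1, sp):
    """Return (R, g, t, t.s^p) for the given columns of Q."""
    r = signed_minors(q0, q1)
    g = reduce(gcd, (abs(x) for x in r))
    t = [x // g for x in r]                      # normalised timing vector
    return r, g, t, sum(a * b for a, b in zip(t, sp))


sp = [5, 2, 1]
c = 151 * 50                                     # product of the diagonal of A
first = check([146, -2, -1], [-9, 46, -2], sp)   # b = [-1, -2]: valid Q
second = check([146, -2, -1], [-4, 48, -1], sp)  # b = [-1, -1]: invalid Q

if __name__ == "__main__":
    print(first)
    print(second)
```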
As regards the case of GCD$(k_0, k_1) > 1$, we have
Theorem 3.2.3 If $A$ is an upper triangular matrix in which GCD$(k_0, k_1) = g > 1$ and
$a_{i,j}$ ($i < j$) is chosen as $f = k_j / g^r$, all $Q$ are valid if Condition 1 is true.
Proof: $a_{0,1}$ is assigned as $f$ and $k_1 = f g^r$. Then, we have
$$Q^T = \begin{bmatrix} k_0 + b_0 s^p_0 & b_0 s^p_1 & b_0 \\ f + b_1 s^p_0 & k_1 + b_1 s^p_1 & b_1 \end{bmatrix}$$
and
$$|R_0| = |k_1 b_0|, \quad |R_1| = |k_0 b_1 - f b_0|, \quad |R_2| = |k_0 k_1 + k_0 b_1 s^p_1 + k_1 b_0 s^p_0 - f b_0 s^p_1| \qquad (3.10)$$
Assume $R_0$, $R_1$ and $R_2$ share a common divisor $d$. Considering $R_0$, $d$ must be a factor
of either $k_1$ or $b_0$. At first, assume $k_1$ contains $d$.
If $d$ is $g$, considering $R_1$, it must be a factor of $b_0$ (if not, $d$ ($= g$) cannot divide
$R_1 = gZ - f b_0$, since there is no $g$ in $f$). But this violates Condition 1 with respect to the
first row of $Q^T$.
If $d$ is a factor of $f$, $R_1$ can be rewritten as $k_0 b_1 - dZ$. To make $d$ a factor of
$k_0 b_1 - dZ$, it must be a factor of either $k_0$ or $b_1$. It cannot be a factor of $k_0$, because
it belongs to the part of $k_1$ which is not shared by $k_0$. Then $d$ should be a factor of $b_1$.
But this means that $d$ is also a common divisor of the elements of the second row of $Q^T$,
violating Condition 1.
Therefore, $d$ cannot be a factor of $k_1$, so $b_0$ should contain it. To make it divide
$k_0 b_1 - f b_0$, $d$ must be a factor of either $k_0$ or $b_1$. It cannot be a factor of $k_0$. If it is, it will
be a common divisor of the first row of $Q^T$, which is not allowed by Condition 1.
So it should be a factor of $b_1$. Thus $R_2 = k_0 k_1 + dZ$. If $d$ divides $R_2$, it must be a
factor of $k_1$ and must belong to $f$. But this is impossible, because if it is, $d$ is also a
common divisor of the second row of $Q^T$, violating Condition 1.
In summary, we cannot find a common divisor for $R_0$, $R_1$ and $R_2$ without violating
Condition 1. □
To avoid any invalid $Q$, consider the two theorems. We take the form of a diagonal
matrix for $A$, adding a non-zero element $f$ at the position of $a_{i,j}$ if GCD$(k_i, k_j) > 1$. That
is eqn (3.3). These results can be extended to the n-D problem, but
the discussion is rather lengthy and is omitted.
Valid t
Condition 3 is required for physical implementation of intercommunication according to
the criterion $SD = PK$. Let $D^P = SD$ be the dependencies in the virtual projected array.
Now, we should consider the data flows among processors after partitioning. Usually, the
partitioning box is large enough to enclose the projected dependencies in all dimensions,
i.e., $\forall i$, $1 \le i < n$,
$$k_i \ge \max_{0 \le j < n_d} |d^P_{i,j}|$$
where $d^P_{i,j} \in D^P$ and $n_d$ is the number of columns of $D$. Thus, in this case, the data flows
among processors only one step in any direction. We can construct the dependency matrix
$D^A$ of the processor array by just assigning $d^A_{i,j} = d^P_{i,j} / |d^P_{i,j}|$ (zero entries stay zero),
where $d^A_{i,j} \in D^A$ (note that $D^P$ is different from $D^A$: the former is the dependency in the
mapped quasi-processor domain, while the latter is that in the real processor domain).
So, $SD = PK$ becomes $D^A = PK$.
$\forall d^A_j \in D^A$, enumerate the number $k_j$ of $p$'s $\in P$ needed to implement $d^A_j$. Let
$k_p = [k_0, \ldots, k_{n_d-1}]$, which is the lower bound for the time to carry out the
intercommunication among processors.
3.2.4 An Example
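Suppose that we map a computational graph $(V, D)$ with
$$V = \begin{bmatrix} 0 & 20 & 0 & 0 & 0 & 20 & 20 & 20 \\ 0 & 0 & 20 & 0 & 20 & 0 & 20 & 20 \\ 0 & 0 & 0 & 10 & 10 & 10 & 0 & 10 \end{bmatrix}, \quad D = \begin{bmatrix} 1 & 0 & 0 & 0 \\ -1 & 1 & 1 & 1 \\ -3 & 1 & -1 & 0 \end{bmatrix}$$
onto a processor array of size $4 \times 4$ with interconnection primitives $P$.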
Partitioning and mapping
For simplicity, we choose $s^p = [1, 1, 1]^T$. Then $S$ should be
$$S = \begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \end{bmatrix}, \quad W = SV = \begin{bmatrix} 0 & 20 & 0 & -10 & -10 & 10 & 20 & 10 \\ 0 & 0 & 20 & -10 & 10 & -10 & 20 & 10 \end{bmatrix}$$
The matrix on the right-hand side gives the vertices of the projected virtual array. The
desired compression factors are 8, 8. This means that the nodes of the virtual array in
an $8 \times 8$ partitioning box will be allocated to one processor. The dependencies in the
virtual array are
$$D^P = SD = \begin{bmatrix} 4 & -1 & 1 & 0 \\ 2 & 0 & 2 & 1 \end{bmatrix}$$
It can be understood that the $8 \times 8$ blocking box is large enough so that no dependency
vectors in the projected virtual array (i.e., the column vectors of $SD$) can penetrate
through the partitioning box. Therefore
$$D^A = [d^A_0, d^A_1, d^A_2, d^A_3] = \begin{bmatrix} 1 & -1 & 1 & 0 \\ 1 & 0 & 1 & 1 \end{bmatrix}$$
For each column of $D^A$, each non-zero entry must be implemented by a $p \in P$. For
instance, $d^A_0$ is implemented by $p_0 \oplus p_1$, where $\oplus$ means "direct sum", and $p_0 = [1, 0]^T$
and $p_1 = [0, 1]^T$ are interconnection primitives. Thus $k_p = [2, 1, 2, 1]$.
According to eqns (3.3) and (3.6), $A$ is constructed, as well as $Q_p$ and $q_n$:
$$A = \begin{bmatrix} 8 & 1 \\ 0 & 8 \end{bmatrix}, \quad Q_p = \begin{bmatrix} 8 & 1 \\ 0 & 8 \\ 0 & 0 \end{bmatrix}, \quad q_n = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$$
It is known that Condition 1 is necessary and sufficient to obtain a valid $Q$. In
practice, Condition 1 is easily satisfied. However, Condition 3 for a valid $t$ is non-trivial.
In the example, 16 valid $Q$ are tested before a valid $t = [40, 19, 5]$, with $b^T = [-5, -3]$, is
found. The corresponding $Q$ is
$$Q = \begin{bmatrix} 3 & -2 \\ -5 & 5 \\ -5 & -3 \end{bmatrix}, \quad T = \begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \\ 40 & 19 & 5 \end{bmatrix}$$
The matrix on the right-hand side is a space-time mapping matrix, for which it is easy to
verify that $\det(T) = c = 64$.
Figure 3.1: The Endless Cylinder for Computational Domain with One Infinite Index. h
is the axis of the cylinder.
It is also easy to verify that $A = SQ$, so the $8 \times 8$ blocking box is ensured. Let the
diagonal matrix $F = \mathrm{diag}(\frac{1}{8}, \frac{1}{8})$. The LSGP compression stands for a transformation
$$W' = FW = \begin{bmatrix} 0 & 20/8 & 0 & -10/8 & -10/8 & 10/8 & 20/8 & 10/8 \\ 0 & 0 & 20/8 & -10/8 & 10/8 & -10/8 & 20/8 & 10/8 \end{bmatrix}$$
$W'$ has a coverage of $3.875 \times 3.875$, just within the given regular processor array. In
addition, $tD = [6, 24, 14, 19]$. Compared with $k_p = [2, 1, 2, 1]$, as expected, it gives enough
time to carry out data communication relays.
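The figures quoted in this example can be cross-checked with a few lines of Python. This is a verification sketch under stated assumptions, not part of the design procedure: the dependency matrix `D` is the one that is consistent with the quoted $SD$ and $tD$; `kp` simply counts the non-zero entries of each column of $D^A$ (one primitive per non-zero entry); and the coverage is taken as the number of node positions per dimension divided by the compression factor, which is the reading that reproduces 3.875.

```python
def matmul(a, b):
    """Plain integer matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def det3(m):
    """Determinant of a 3 x 3 integer matrix."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


S = [[1, 0, -1], [0, 1, -1]]
t = [40, 19, 5]
Q = [[3, -2], [-5, 5], [-5, -3]]
D = [[1, 0, 0, 0], [-1, 1, 1, 1], [-3, 1, -1, 0]]   # assumed, see lead-in
W = [[0, 20, 0, -10, -10, 10, 20, 10], [0, 0, 20, -10, 10, -10, 20, 10]]

A = matmul(S, Q)                       # activity matrix, should be [[8, 1], [0, 8]]
DP = matmul(S, D)                      # dependencies in the virtual array
tD = matmul([t], D)[0]                 # time available for each dependency
detT = det3(S + [t])                   # space-time mapping T = [S; t]
DA = [[(x > 0) - (x < 0) for x in row] for row in DP]            # sign of each entry
kp = [sum(1 for row in DA if row[j] != 0) for j in range(len(DA[0]))]
coverage = [(max(row) - min(row) + 1) / 8 for row in W]          # assumed definition

if __name__ == "__main__":
    print(A, DP, DA, kp, tD, detT, coverage, sep="\n")
```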
3.2.5 A Particular Area of Application
An important application area of so-called systolic arrays is signal processing, which
is characterised by one semi-infinite index, say $t$, of iteration. For example, a recursive
filter is expressed by
$$y(t) = x(t) + \sum_{j=1}^{m} a_j y(t-j), \qquad t = 0, 1, \ldots$$
see Figure 3.1.(a).
Generally, such kinds of computational domain take the shape of a multi-facet N-D
cylinder. Because the cylinder is straight, it is always possible to find an axis, $h$, parallel
to the sides. Our task is to partition and map the infinite computational polyhedron onto
an (N-1)-D processor array.
One point is obvious: the space mapping vector, $s^p$, must be parallel to $h$. If
not, the infinite polyhedron will be projected to an infinite virtual array which cannot
be realised in practice. Usually, an endless cylinder cannot be expressed directly by a
set of vertices. However, its image projected by $s^p$ in the (N-1)-D space domain is
a polyhedron. To determine the boundary of the polyhedron, we need consider only the
points on the edges of the N-D cylinder, since they will be projected to the vertices of the
(N-1)-D polyhedron. So any point on an edge can represent the edge. In this sense, when
$h$ is known, we call a set $V$ of such points, which represents all the edges of the cylinder,
the vertices of the computational cylinder.
This constraint of $s^p \parallel h$ presents a difficulty in applying supernode partitioning.
Because in supernode partitioning the space mapping vector is either one of the edges
or the diagonal of the supernode parallelepiped, which is determined by other requirements,
it usually diverges from the axis of the polyhedron. Furthermore, in signal processing
it is sometimes demanded that the processing system produces an output corresponding
to an input with a certain delay. This means that the processor array must produce outputs
once after finishing each "thin" time hyperplane. However, with any supernode partitioning,
the time hyperplane becomes "thick", which implies block-data processing.
Our LSGP method is applicable to this problem, because it does not involve any
partitioning similar to supernode partitioning. The only additional operation is to
introduce a linear transformation $B$ which is unimodular such that $s^b_{N-1} = 1$, where
$s^b_{N-1} \in s^b = B^{-1}s^p$ and $s^b$ is the projection vector in the space with $B$ as the
basis. It is easy to find $B$. In fact, since GCD$(s^p) = 1$, it is always possible to find $b_0$
such that $b_0 s^p = 1$. Then $b_0$ can be extended to a unimodular matrix by the method
proposed in [107].
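Finding an integer row vector $b_0$ with $b_0 s^p = 1$ can be sketched with the extended Euclidean algorithm applied entry by entry. This is only one possible way to obtain $b_0$ and is not claimed to be the method of [107]; the function names are hypothetical, and the extension of $b_0$ to a full unimodular matrix is not shown.

```python
def ext_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b) and g >= 0."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    if old_r < 0:                       # keep the gcd non-negative
        old_r, old_x, old_y = -old_r, -old_x, -old_y
    return old_r, old_x, old_y


def unit_combination(sp):
    """Return (g, b0) with sum(b0[i] * sp[i]) == g == gcd of the entries of sp."""
    g, coeffs = 0, []
    for s in sp:
        g, x, y = ext_gcd(g, s)         # g_new = g_old * x + s * y
        coeffs = [c * x for c in coeffs] + [y]
    return g, coeffs


if __name__ == "__main__":
    g, b0 = unit_combination([5, 2, 1])
    print(g, b0, sum(b * s for b, s in zip(b0, [5, 2, 1])))
```

When the returned `g` is 1, `b0` satisfies $b_0 s^p = 1$ as required.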
Apply the transformation on the given computation graph $(V, D)$,
$$V^b = B^{-1}V, \qquad D^b = B^{-1}D$$
As for $s^b$, the superscript "b" means in the space with $B$ as the basis. The projection
and partitioning of Step 1 to Step 6 is performed in the space spanned by $B$.
Figure 3.2: Demonstration of LSGP and LPGS. Each segment consists of 6 nodes. (a) LSGP. (b) LPGS.
3.3 Bouncing LPGS Method
3.3.1 Introduction
The disadvantage of the LSGP methods in general is that when an irregular computational
polyhedron is mapped within a regular processor array, there will be a great waste of the
computational resource at the edges. The higher the dimension, the larger the overall
waste will be. The LPGS method in [73] is superior with regard to this problem.
For example, suppose a 2-D problem is implemented by a linear array with 6 processors,
see Figure 3.2. In LSGP, the 6 nodes in a segment are carried out sequentially by one
processor and one processor is responsible for one column (for instance, processor $p_0$ is
for column $v_0$), while in LPGS the 6 nodes are calculated in parallel by 6 processors, one
segment after another from left to right, from bottom to top. In the case of LSGP, at time
$t_5$, processors $p_0$, $p_1$ and $p_2$ are busy, while $p_3$, $p_4$ and $p_5$ are idle, but we cannot enter the
next time hyperplane $t_6$ until $p_0$, $p_1$ and $p_2$ finish their jobs on $t_5$. In contrast, in LPGS,
there is no computation in $b_3$, $b_4$ and $b_5$, so we can jump from point $(t_5, b_2)$ directly to
point $(t_6, b_1)$, which is more efficient.
However, as mentioned before, one of the main disadvantages of LPGS is the need for
additional long distance buffering hardware, which seriously restricts its application. It can
be seen from Figure 3.3.(a) and (c) that the data flow from band $b_0$ to band $b_1$ requires a long
link from processor $p_2$ to processor $p_0$. This kind of long link is not always available in a
given regular processor array, or covers a large area when manufacturing VLSI chips. One
might try naively to use a clever re-arrangement of processors in the architecture,
but this is not always possible in higher dimensional structures.
3.3.2 Bouncing LPGS
We can address the problem above by introducing another kind of allocation strategy,
the so-called Bouncing LPGS. The method is shown in Figure 3.3.(b) and (d). When
mapping band $b_1$, the allocation of nodes is just the reverse order of band $b_0$, so that the
transition from one band to the next results in a "data flow" inside a processor, instead of
a long distance transfer. Data flows from left to right in even-labeled bands, then does the
reverse in the odd-labeled bands. In effect this is like a ball bouncing between the two ends
of a sealed tube. The precondition for implementing the Bouncing LPGS is a bi-connected
interconnection mesh.
The allocation function of the Bouncing LPGS in the example is expressed as
$$A(i^p) = \begin{cases} i' & i' < l \\ 2l - 1 - i' & l \le i' \end{cases} \qquad (3.11)$$
where $i^p = [i^p] = Si$ is a node in the projected space determined by the transformation
$T$, $l$ is the number of processors in the linear array and $i' = i^p \bmod 2l$.
This method can be extended straightforwardly to n dimensions. $A(i^p) = [i^a_1, \ldots, i^a_{N-1}]^T$,
$i^p = [i^p_1, \ldots, i^p_{N-1}]^T$, where
$$i^a_j = \begin{cases} i'_j & i'_j < l_j \\ 2l_j - 1 - i'_j & l_j \le i'_j \end{cases} \quad \text{and} \quad i'_j = i^p_j \bmod 2l_j \qquad (3.12)$$
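A minimal sketch of the bouncing allocation is given below. The function names `bounce` and `bounce_nd` are hypothetical; the rule implemented is the reflection described above, in which odd-numbered bands are allocated in the reverse order of even-numbered bands.

```python
def bounce(ip, l):
    """Processor index for projected coordinate ip on a linear array of l processors."""
    i = ip % (2 * l)                    # position inside a pair of consecutive bands
    return i if i < l else 2 * l - 1 - i


def bounce_nd(ip, lengths):
    """Apply the same rule independently in every dimension of the array."""
    return [bounce(x, l) for x, l in zip(ip, lengths)]


if __name__ == "__main__":
    # Three processors: bands alternate 0,1,2 then 2,1,0.
    print([bounce(i, 3) for i in range(12)])
    print(bounce_nd([4, 2], [3, 2]))
```

Consecutive coordinates always land on the same or a neighbouring processor, which is why no long link from one end of the array to the other is needed.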
Although LPGS (including the improved LPGS's) has the potential to be more
efficient in the case of an irregular computational polyhedron, more run-time control is
required to achieve the goal. As mentioned, in Figure 3.2.(b), for instance, we can jump
from $(t_5, b_2)$ directly to $(t_6, b_1)$. But the jump will result in serious confusion for data
dependencies in the time domain, since such jumps happen without regularity, depending only
on the boundary of the mapped polyhedron. For an ordinary LPGS, for example, we can
determine a fixed time interval after which a datum produced in a processor will be used in
another processor. It will be very difficult to determine the time interval with the
unpredictable jumps. Thus some control mechanisms have to be added and this remains an
open problem for actual application.
Figure 3.3: LPGS and Bouncing LPGS. Vector t is the normal of the time hyperplane and s^p
the vector of space mapping. (a) LPGS allocation of nodes to processors; (b) Bouncing
LPGS allocation of nodes to processors; (c) data flow of LPGS; (d) data flow of Bouncing
LPGS.
Figure 3.4: One time hyperplane of 2-D Bouncing LPGS. The intersections of Band H's
and Band V's are regions which the processor array computes at one moment. The number
in each node indicates its allocation in a 3 x 2 array.
3.4 Summary
In this chapter, we address improvements of three existing methods of partitioning. Firstly,
based on the theoretical framework of [94], a new algorithm for HSNF and SNF is developed
which significantly reduces the computational complexity of the independent partitioning
problem. More importantly, we propose a method of performing algorithm transformation
so that the real-time partitioning is carried out systematically instead of collecting
computations node by node; finally, parallel codes are generated systematically.
Secondly, a method is proposed to modify the LSGP partitioning in [24], which guarantees,
in principle, that the given-regular-shape mapping is obtained easily. Furthermore, due
to the proper design of two key matrices, the computation process is easily carried out.
The method exhibits fundamental advantages over Darte's method, such as no need to
compute HNFs and the production of a given regular array.
Finally, a Bouncing LPGS is proposed to improve the widely known LPGS partitioning
method [73]. The Bouncing LPGS is characterised by mapping each consecutive band in
two opposite directions so that the data flow goes forwards and backwards in turn in
the processor array like a bouncing ball; therefore the long distance links from one side
of the array to the other are no longer needed.
Chapter 4
A Methodology of Partitioning and
Mapping for Given (N-1)-D Regular
Arrays
In this chapter a methodology is proposed to partition and map an arbitrary computation
graph onto a given regular M-D array under the URE condition, where M = N - 1. We start
with a positive expressing basis to obtain a set of canonical dependencies. Based on these
canonical dependencies, two basic models of space mappings, as well as, timing vectors are
derived. The partitioning parallelepipeds are scaled to map the original polyhedron into
a given array, then an LSGP method is used to improve efficiency. This methodology has
significant advantages in mapping an irregular computation graph onto a given regular
processor array while having high efficiency in both communication and computation.
4.1 A New Methodology
We found that mainly due to the difficulty of fixed-size mapping, the Partitioning-Projecting
method has not received enough attention compared with Projecting-Partitioning. But
if we have some kind of information about the space mapping before the space projection
actually takes place, it is possible to scale firstly the partitioning so that the partitioned
space can be mapped within a given-size array. This suggests that we may begin with a
set of standard dependencies and a set of fixed interconnections to exploit the possible
space mappings and find a way of scaling the partitioning.
We extend C. T. King's concept [52] to form a supernode with N partitioning vectors
such that the entries of a supernode dependency matrix are only 0 or 1, that is, the
dependency of a supernode is very local and is orientated along the positive direction in
every dimension. This allows us to produce a canonical dependency matrix, independent
of the dependencies in the problem being considered. With the canonical dependency
matrix and two basic interconnection meshes, we are able to consider two models of space
mapping. They are the simplest mappings possible but possess the properties we require.
With the two space mapping models, we can predict the projected image of a com-
putational polyhedron. The lengths of the edges (the N partitioning vectors) of the
parallelepiped are determined such that the resulting supernode polyhedron can be allo-
cated within a given size array, and the canonical dependencies can be ensured. Thus,
we obtain an available partitioning strategy to map the original computation polyhedron
onto a given-regular-shape and fixed-interconnection array. Such a mapping will usually
have a low computational efficiency. To improve efficiency, the LSGP method is employed
to partition the supernode space further. By exploiting the knowledge of the first supern-
ode partitioning, the procedure for implementing the second (LSGP) becomes relatively
easy, and can even be embedded into the first partitioning. The two basic space map-
ping matrices are not unique and their permuted versions can be utilised for optimisation
purposes. See Figure 4.1.
In the figure, the original polyhedron is partitioned to a supernode polyhedron with
canonical dependencies. Because the supernode polyhedron may be larger than
required, it has to be partitioned further (actually the two partitionings are carried out
together) so that the resulting supernode polyhedron can be mapped within the processor
array (or, for the case of SBC, the resulting supernode polyhedron is mapped within an
EVPA and then the EVPA is LSGP partitioned into the processor array).
The methodology in this chapter focuses on the derivation of two integral transformation
matrices, B and T, and an integer vector k. B is a transformation from the original
computation space to a supernode space. T is a transformation from the supernode space
to a 1-D time domain and an M-D Enlarged Virtual Processing Array (EVPA) domain.
The quantity k is an LSGP compression vector containing compression factors for all
dimensions and allows the LSGP partitioning to compress the EVPA onto a given regular
processor array.
Figure 4.1: The Basic Idea of Our Partitioning and Mapping Method. The arrows within
each polyhedron indicate the pattern of dependency vectors.
The rest of the chapter is organised as follows: first the construction of a canonical
dependency matrix is discussed, second the selection of the two space mappings and
their timing vectors. Next LSGP partitioning, scaling the supernode parallelepiped and
integralization of the partitioning matrix are discussed. Finally we consider some examples
to illustrate the method.
4.2 A Transformation for Canonical Dependencies
4.2.1 A Cone Including Dependencies
It is shown [52] that in a 2-D computation space, any vector in a set of dependency
vectors can be expressed by a positive linear combination of two vectors from the set.
This can be seen clearly by a geometrical explanation. All of the dependency vectors
form a 2-D cone whose two extreme edges are two of the dependency vectors, see Figure
4.2, where $d_0$ and $d_3$ form a two-edge cone including all dependency vectors. As a result,
any vector within the cone can be expressed by a positive linear combination of the two
edges. For example, let $d_0 = [2, -3]^T$, $d_2 = [3, 2]^T$ and $d_3 = [1, 3]^T$. It can be checked that
$d_2 = \frac{7}{9} d_0 + \frac{13}{9} d_3$. The two edges, which are linearly independent, can be used as
the basis of a linear transformation, so that in the resulting space any dependency vector has
only positive components, an important feature that we will make use of later.
Figure 4.2: The Cone of Dependency Vectors.
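The coefficients of such a combination can be computed exactly with Cramer's rule. The sketch below uses Python's `fractions` module; the function name `cone_coefficients` is hypothetical, and the vectors are the ones from the 2-D example in the text.

```python
from fractions import Fraction


def cone_coefficients(e0, e1, d):
    """Solve a*e0 + b*e1 = d for 2-D integer vectors; returns exact fractions (a, b)."""
    det = e0[0] * e1[1] - e0[1] * e1[0]
    if det == 0:
        raise ValueError("the two cone edges must be linearly independent")
    a = Fraction(d[0] * e1[1] - d[1] * e1[0], det)
    b = Fraction(e0[0] * d[1] - e0[1] * d[0], det)
    return a, b


if __name__ == "__main__":
    a, b = cone_coefficients([2, -3], [1, 3], [3, 2])
    print(a, b)   # both coefficients are positive, so d lies inside the cone
```

A vector lies inside the cone exactly when both coefficients are non-negative, and an edge vector has one coefficient equal to 1 and the other equal to 0.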
Generally, in N-D space we cannot always form an N-edge cone from the set of dependency vectors that includes all other dependency vectors. For example, in 3-D space, if there are four dependency vectors which form a 4-edge convex cone, no three of them can form a 3-edge cone which includes the fourth.
However, as we prove later, it is always possible to find N lexicographically positive vectors in N-D space such that any of the dependency vectors can be expressed as a positive linear combination of them.
Definition 4.2.1 A vector $v = [v_0, \dots, v_n]^T$ is termed i-th unit lexicographically positive (simply i-th ULP) if $v_i = 1$ and $v_j = 0$ for all $j < i$. For example, $[0, \dots, 0, 1, v_{i+1}, \dots, v_n]^T$, where $v_j$, $j = i+1, \dots, n$, is any integer.
Theorem 4.2.1 For a set of any N-D dependency vectors, there exists at least one set
of N lexicographically positive vectors, called a Positive Expressing Basis (PEB) E, such
that any of the dependency vectors can be expressed by a positive linear combination of
them.
Proof: The proof of the theorem is also the procedure to find the positive expressing basis. Without loss of generality, suppose that the first positive component $d_i$ of any dependency vector is 1. If this is not the case, the vector can be normalised (that is, divide the entries of the vector by $d_i$; some entries of the normalised vector may become non-integer, but this does not affect the proof of the theorem). An N-D dependency matrix can be decomposed into N submatrices, each of which is i-th ULP, called submatrix i, $i = 0, \dots, M$.
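The proof proceeds by induction. The first k vectors of E are supposed to be k column vectors in i-th ULP form, $i = 0, \dots, k-1$, such that the submatrices $0, \dots, k-1$ are positively expressed by them. Now, it is sufficient to show that the submatrices $0, \dots, k$ can be positively expressed by $k+1$ ULP vectors. This involves constructing the k-th ULP vector, merged with the existing k vectors, so as to express the submatrix k. For the sake of brevity, in the derivation of the k-th ULP vector below, we omit the first $N-k$ zero elements in each column vector, e.g., $[0, \dots, 0, d_{N-k-1}, \dots, d_{N-1}]^T$ is expressed by $[d_{N-k-1}, \dots, d_{N-1}]^T$. So the submatrix k will be expressed as
\[
\begin{bmatrix}
1 & \cdots & 1 \\
d_{k-1,0} & \cdots & d_{k-1,l-1} \\
\vdots & & \vdots \\
d_{0,0} & \cdots & d_{0,l-1}
\end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
e_{k,k-1} & 1 & 0 & \cdots & 0 \\
e_{k,k-2} & e_{k-1,k-2} & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
e_{k,0} & e_{k-1,0} & e_{k-2,0} & \cdots & 1
\end{bmatrix}
\times
\begin{bmatrix}
a_{k,0} & \cdots & a_{k,l-1} \\
a_{k-1,0} & \cdots & a_{k-1,l-1} \\
a_{k-2,0} & \cdots & a_{k-2,l-1} \\
\vdots & & \vdots \\
a_{0,0} & \cdots & a_{0,l-1}
\end{bmatrix}
\tag{4.1}
\]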
where the matrix on the left-hand side is the submatrix k with l vectors. The first matrix on the right-hand side contains the first $k+1$ vectors of E, in which the first column (the k-th ULP vector) is to be determined, and the second matrix $[a_{i,j}]$ on the right-hand side is the coefficient matrix for the positive combination, which is also to be determined.
All entries in the coefficient matrix must be positive as required by the definition, and all elements of the first row, $a_{k,0}, \dots, a_{k,l-1}$, must be 1. In fact, considering eqn (4.1), it can be seen that the first row of the left-hand side matrix is the product of the first row of the middle matrix and the right-hand side matrix. Since the first row of the left-hand side matrix is $[1 \cdots 1]$ (the vectors are normalised) and the first row of the middle matrix is $[1\ 0 \cdots 0]$, we have $[1 \cdots 1] = [a_{k,0} \cdots a_{k,l-1}]$.
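Next, choose $e_{k,k-1}$ and determine $a_{k-1,0}, \dots, a_{k-1,l-1}$. Let
\[ e_{k,k-1} = \min_{0 \le i \le l-1} \left( \lfloor d_{k-1,i} \rfloor,\ 0 \right) \tag{4.2} \]
then
\[ a_{k-1,i} = d_{k-1,i} - e_{k,k-1}, \quad i = 0, \dots, l-1 \tag{4.3} \]
Because $e_{k,k-1} \le d_{k-1,i}$, we have $a_{k-1,i} \ge 0$.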
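Now, the coefficients with respect to the (k-1)-th ULP vector in the subset are known, and we can modify the submatrix k such that the contribution of the (k-1)-th ULP vector is eliminated. The modified submatrix k can be expressed as
\[
\begin{bmatrix}
1 & \cdots & 1 \\
d_{k-2,0} & \cdots & d_{k-2,l-1} \\
\vdots & & \vdots \\
d_{0,0} & \cdots & d_{0,l-1}
\end{bmatrix}
-
\begin{bmatrix} 0 \\ e_{k-1,k-2} \\ \vdots \\ e_{k-1,0} \end{bmatrix}
\begin{bmatrix} a_{k-1,0} & \cdots & a_{k-1,l-1} \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & \cdots & 0 \\
e_{k,k-2} & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
e_{k,0} & e_{k-2,0} & \cdots & 1
\end{bmatrix}
\times
\begin{bmatrix}
a_{k,0} & \cdots & a_{k,l-1} \\
a_{k-2,0} & \cdots & a_{k-2,l-1} \\
\vdots & & \vdots \\
a_{0,0} & \cdots & a_{0,l-1}
\end{bmatrix}
\]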
and we proceed similarly to determine $e_{k,k-2}$ and $a_{k-2,0}, \dots, a_{k-2,l-1}$. Repeat the process until $e_{k,0}$ is produced.
If the submatrix k is empty, let $e_{k,k-1} = \dots = e_{k,0} = 0$. □
Let the final coefficient matrix be $A^p$. Obviously, $D = EA^p$. Theorem 4.2.1 implies that by a linear transformation with E the dependencies in the new space consist of only positive components and, as will be seen later, this turns out to be quite an important property.
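Example 4.2.1 Derive E from a given D
Suppose a given dependency matrix D and decompose it as follows:
\[
D = \left[ \begin{array}{cc|c|c}
1 & 1 & 0 & 0 \\
-1 & 1 & 1 & 0 \\
0 & -3 & -2 & 1
\end{array} \right]
\]
where the first two columns form submatrix 2, the third column is submatrix 1 and the last column is submatrix 0.
Obviously, the 0-th column of E should be $[0,0,1]^T$. An equation can be formed to determine the 1st vector of E:
\[
\begin{bmatrix} 1 \\ -2 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 \\ e_{1,0} & 1 \end{bmatrix}
\times
\begin{bmatrix} 1 \\ a_{0,0} \end{bmatrix}
\]
Let $e_{1,0} = -2$, then $a_{0,0} = 0$.
In a similar way, the 2nd vector of E can be determined according to submatrix 2:
\[
\begin{bmatrix} 1 & 1 \\ -1 & 1 \\ 0 & -3 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0 \\ e_{2,1} & 1 & 0 \\ e_{2,0} & -2 & 1 \end{bmatrix}
\times
\begin{bmatrix} 1 & 1 \\ a_{1,0} & a_{1,1} \\ a_{0,0} & a_{0,1} \end{bmatrix}
\]
Let $e_{2,1} = -1$, then $[a_{1,0}, a_{1,1}] = [0,2]$. So the submatrix 2 is modified as
\[
\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}
=
\begin{bmatrix} 1 & 1 \\ 0 & -3 \end{bmatrix}
-
\begin{bmatrix} 0 \\ -2 \end{bmatrix}
\times
\begin{bmatrix} 0 & 2 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 \\ e_{2,0} & 1 \end{bmatrix}
\times
\begin{bmatrix} 1 & 1 \\ a_{0,0} & a_{0,1} \end{bmatrix}
\]
Let $e_{2,0} = 0$, then $[a_{0,0}, a_{0,1}] = [0,1]$. The procedure of producing E is finished. As a result, we obtain
\[
E = \begin{bmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ 0 & -2 & 1 \end{bmatrix}
\quad \text{and} \quad
A^p = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 2 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix}
\]
It can be checked that $D = EA^p$.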
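The constructive proof of Theorem 4.2.1 can be sketched in code. The block below is an illustrative reading of the procedure, not the thesis implementation: it assumes integer, lexicographically positive input vectors, stores the basis vector with leading 1 in row c as column c of `E`, and uses exact rational arithmetic. The names `leading_index` and `positive_expressing_basis` are hypothetical.

```python
from fractions import Fraction
from math import floor

def leading_index(v):
    """Index of the first non-zero entry of v."""
    return next(i for i, x in enumerate(v) if x != 0)

def positive_expressing_basis(deps):
    """Return (E, A) with D = E * A, E unit lower triangular, A non-negative."""
    n = len(deps[0])
    E = [[Fraction(int(r == c)) for c in range(n)] for r in range(n)]
    A = [[Fraction(0)] * len(deps) for _ in range(n)]
    for p in range(n - 1, -1, -1):           # vectors with the shortest tails first
        group = [j for j, d in enumerate(deps) if leading_index(d) == p]
        # normalise so that the leading entry of every vector in the group is 1
        resid = {j: [Fraction(x, deps[j][p]) for x in deps[j]] for j in group}
        for j in group:
            A[p][j] = Fraction(1)
        for r in range(p + 1, n):
            if group:                         # eqn (4.2)
                E[r][p] = Fraction(min([0] + [floor(resid[j][r]) for j in group]))
            for j in group:
                A[r][j] = resid[j][r] - E[r][p]          # eqn (4.3)
                for q in range(r + 1, n):                # eliminate basis vector r
                    resid[j][q] -= A[r][j] * E[q][r]
        for j in group:                       # undo the normalisation
            for r in range(n):
                A[r][j] *= deps[j][p]
    return E, A

# Columns of D as read from Example 4.2.1
D_cols = [(1, -1, 0), (1, 1, -3), (0, 1, -2), (0, 0, 1)]
E, A = positive_expressing_basis(D_cols)
```

For the example matrix this yields the E and $A^p$ of Example 4.2.1, and multiplying them back gives D.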
4.2.2 Forming Canonical Dependencies in Supernode Space
The vectors in E indicate the directions of the edges of a clustering parallelepiped for grouping the computation graph nodes into supernodes.
Definition 4.2.2 A quasi-supernode space $S^q \subset R^N$, indexed by q, is an image of an integral space by a mapping $M^{-1}$ with $\det(M) > 1$. All integral points in the quasi-supernode space form a subspace of $S^q$, termed a supernode space $S^s$ and indexed by s.
The quasi-supernode space is a compressed version of the original computational space
which allows partitioning along its coordinate axes. It remains to show that the depen-
dencies in the supernode space can have a canonical form.
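We are interested in a pair consisting of a PEB matrix and its corresponding coefficient matrix, $B_d$ and $A^d$, such that
\[ D = B_d A^d \tag{4.4} \]
and $0 \le a^d_{i,j} \le 1$, $\forall a^d_{i,j} \in A^d$.
It is easy to determine a positive diagonal matrix $F_d = \mathrm{diag}(f^d_0, \dots, f^d_M)$ ($F_d^{-1} = \mathrm{diag}(1/f^d_0, \dots, 1/f^d_M)$) such that
\[ D = EA^p = EF_d^{-1}F_dA^p = B_dA^d \tag{4.5} \]
where $B_d = EF_d^{-1}$ and $A^d = F_dA^p$. In fact, $\forall i$, $0 \le i < N$,
\[ a^d_{i,j} = f^d_i a^p_{i,j}, \quad \forall j,\ 0 \le j < n_d \]
where $a^d_{i,j} \in A^d$, $a^p_{i,j} \in A^p$. Let $a^{max}_i = \max_{0 \le j < n_d} a^p_{i,j}$. Then,
\[ f^d_i \le \frac{1}{a^{max}_i} \tag{4.6} \]
Let $a^{max} = [a^{max}_0, \dots, a^{max}_M]$.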
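As an illustration of eqns (4.5) and (4.6), the sketch below scales a basis E with the largest admissible factors $f^d_i = 1/a^{max}_i$. The matrices are the E and $A^p$ as read from Example 4.2.1; the function name `scale_basis` is hypothetical, and the code assumes every row of $A^p$ contains a positive entry.

```python
from fractions import Fraction

E = [[1, 0, 0], [-1, 1, 0], [0, -2, 1]]
Ap = [[1, 1, 0, 0], [0, 2, 1, 0], [0, 1, 0, 1]]

def scale_basis(E, Ap):
    """Return (f, Bd, Ad) with Bd = E * Fd^-1 and Ad = Fd * Ap."""
    n = len(E)
    f = [Fraction(1, max(row)) for row in Ap]       # upper bound of eqn (4.6)
    Bd = [[E[r][c] / f[c] for c in range(n)] for r in range(n)]   # scale columns
    Ad = [[f[r] * a for a in Ap[r]] for r in range(n)]            # scale rows
    return f, Bd, Ad

f, Bd, Ad = scale_basis(E, Ap)
```

Here the second edge of the parallelepiped is doubled, so $\det(B_d) = 2$ and every entry of $A^d$ lies in [0, 1].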
We make parallelepipeds with the columns of $B_d$ as their edges to partition the computation polyhedron. To cluster more than one node into a partition, $\det(B_d) > 1$. This is equivalent to a linear transformation with $B_d$ as the basis from the original computational space $S^c$ to a quasi-supernode space $S^q$, so $B_d$ can be called the basis of the quasi-supernode space. For any node $i \in S^c$
\[ q = B_d^{-1} i \tag{4.7} \]
where $q \in S^q$ is the image of i. In $S^q$, the mapped dependency matrix $D^q$ is exactly $A^d$, i.e.,
\[ D^q = B_d^{-1}D = A^d \tag{4.8} \]
So we have
\[ 0 \le d^q_{i,j} \le 1 \tag{4.9} \]
where $d^q_{i,j} \in D^q$. Because $\det(B_d)$ is larger than 1, the nodes of the original space may be mapped to non-integral points of $S^q$. In $S^q$, collect all projected nodes $q = [q_0, \dots, q_M]^T$ in a hypercube
\[ s_j \le q_j < s_j + 1, \quad s_j \in Z, \quad j = 0, \dots, M \tag{4.10} \]
to the integral point $s = [s_0, \dots, s_M]^T$; s will be a supernode, including all the projected nodes in the hypercube. Any projected node can be re-expressed as
\[ q = s + \Delta; \quad 0 \le \Delta_j < 1; \quad j = 0, \dots, M \tag{4.11} \]
where $\Delta \in R^N$ and $\Delta_j \in \Delta$.
Now, we can derive the dependency relations among the supernodes. In the original computational space, any dependency relation between two nodes can be expressed as
\[ i^2 = i^1 + d_i \tag{4.12} \]
where $d_i \in D$. Applying the linear transformation of eqn (4.7) to eqn (4.12) produces
\[ q^2 = B_d^{-1}i^2 = B_d^{-1}i^1 + B_d^{-1}d_i = q^1 + d^q_i \tag{4.13} \]
where $q^2$ and $q^1 \in S^q$; $d^q_i \in D^q$. Considering eqn (4.9) and eqn (4.11), we have
\[ q^2_j = s^1_j + \Delta^1_j + d^q_j, \quad j = 0, \dots, M \tag{4.14} \]
where $q^2_j \in q^2$, $s^1_j \in s^1$, $s^2_j \in s^2$, and $d^q_j \in d^q_i$; $s^1$ and $s^2$ are two supernodes.
Let $x_j = \Delta^1_j + d^q_j$. Because $0 \le x_j < 2$, $s^2_j = s^1_j$ if $x_j < 1$ and $s^2_j = s^1_j + 1$ if $x_j \ge 1$. Therefore, we can list all possible dependency vectors $d^s_j$, $j = 1, \dots, 2^N - 1$, in $S^s$. For example, with N = 3,

$d^s_1 = [0,0,1]^T$;  $x_2 < 1$,  $x_1 < 1$,  $x_0 \ge 1$
$d^s_2 = [0,1,0]^T$;  $x_2 < 1$,  $x_1 \ge 1$,  $x_0 < 1$
$d^s_3 = [0,1,1]^T$;  $x_2 < 1$,  $x_1 \ge 1$,  $x_0 \ge 1$
$d^s_4 = [1,0,0]^T$;  $x_2 \ge 1$,  $x_1 < 1$,  $x_0 < 1$
$d^s_5 = [1,0,1]^T$;  $x_2 \ge 1$,  $x_1 < 1$,  $x_0 \ge 1$
$d^s_6 = [1,1,0]^T$;  $x_2 \ge 1$,  $x_1 \ge 1$,  $x_0 < 1$
$d^s_7 = [1,1,1]^T$;  $x_2 \ge 1$,  $x_1 \ge 1$,  $x_0 \ge 1$

Figure 4.3: Supernode Partitioning and Canonical Dependencies. (a) the original domain with irregular dependencies is partitioned by partitioning parallelepipeds, (b) the supernode domain with canonical dependencies
Collecting all $d^s$'s into a matrix, we obtain the 3-D canonical dependency matrix
\[
D^s = \begin{bmatrix}
1 & 0 & 0 & 1 & 1 & 0 & 1 \\
0 & 1 & 0 & 1 & 0 & 1 & 1 \\
0 & 0 & 1 & 0 & 1 & 1 & 1
\end{bmatrix}
\tag{4.15}
\]
in supernode space. Note that there is no difference when permuting the columns of a dependency matrix. We have the following general theorem.
Theorem 4.2.2 By the transformation $B_d^{-1}$ all these dependency vectors in $S^s$ form a canonical dependency matrix $D^s$ which looks like a table of binary numbers without 0.
This theorem can be explained intuitively. By using E as the edges of the supernode
parallelepiped, any components of dependencies in the quasi-supernode domain become
positive. By scaling the size of the parallelepiped so that no components of the dependen-
cies penetrate an adjacent parallelepiped, we guarantee that the components of dependen-
cies between supernodes (parallelepipeds) can be only 0 or 1. Hence the theorem means
that by partitioning with $B_d$ any original arbitrary dependencies can be transformed to a universal set of canonical dependencies. See Figure 4.3.
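The canonical dependency set and the supernode offset rule can be sketched as follows. This is an illustration only: `canonical_dependencies` enumerates the non-zero binary vectors, and `supernode_dependency` applies the rule that component j of the supernode offset is 1 exactly when $x_j = \Delta_j + d^q_j \ge 1$, assuming $0 \le \Delta_j < 1$ and $0 \le d^q_j \le 1$. Both function names are hypothetical.

```python
from itertools import product
from math import floor

def canonical_dependencies(n):
    """All n-component 0/1 vectors except the zero vector (2**n - 1 of them)."""
    return [list(bits) for bits in product((0, 1), repeat=n) if any(bits)]

def supernode_dependency(delta, dq):
    """Supernode offset s2 - s1 for a node at offset delta inside its supernode."""
    return [floor(x + d) for x, d in zip(delta, dq)]

Ds = canonical_dependencies(3)
print(len(Ds))                                                 # 7
print(supernode_dependency([0.5, 0.2, 0.9], [0.5, 0.5, 1.0]))  # [1, 0, 1]
```

Whatever the offsets inside the supernode are, the result is either the zero vector (both nodes in the same supernode) or one of the $2^N - 1$ canonical vectors.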
The number of supernode dependencies is fixed as $2^N - 1$, independent of the original ones. Sometimes the number of supernode dependencies may be more than that of the original problem. This does not add to the complexity of the dependency problem. In fact, the supernode dependencies rearrange the data dependencies more efficiently so that data flow paths are simplified for some data. For example, an original dependency vector $[2,4]^T$ becomes a combination of $[1,0]^T$, $[0,1]^T$ and $[1,1]^T$, so some data is routed locally and by point-to-point connections.
4.3 Selection of Space Projection and Timing Vector
4.3.1 S-T Transformation and Interconnection Primitives
The supernode space $S^s$ behaves like an ordinary integral space, therefore we can use the space-timing mapping concept to transform it onto a 1-D time domain and an M-D space domain.
We assume that our computational resource consists of a mesh-connected network of processing cells which form a regular array. In many applications, only simple patterns of interconnection primitives such as these are available. Therefore, we work with the two simple ones, SUC and SBC, mentioned in Chapter 1. It is obvious that if a computation can be implemented with the SUC and SBC patterns, it can also be carried out using more complex patterns of interconnection primitives with fewer data relays, provided that the complex pattern includes the SUC and SBC as a subgraph. Of course the SBC includes the SUC as a subgraph but, because there is a significant difference between the methods dealing with them, it is worthwhile discussing them individually.
This selection of networks is not arbitrary. In fact, they are just the uni-directional and bi-directional connections of a hypercube structure. Our task is not to invent new structures but to develop strategies based on the most widely available interconnection meshes.
The matrices for SUC and SBC can be expressed as a unit matrix and a direct sum of a unit matrix and a minus unit matrix, respectively, that is,
\[ P_{SUC} = I_{M \times M} \tag{4.16} \]
\[ P_{SBC} = [I_{M \times M} : -I_{M \times M}] \tag{4.17} \]
where ":" indicates a partitioning of the matrix.
4.3.2 Choosing S and t
The canonical dependency matrix is the base from which we can derive the timing vectors
according to some criteria and conditions together with a space mapping which has a
certain kind of canonical form.
At first, it is easy to see that for the canonical dependencies, the condition of eqn (1.5) is equivalent to
\[ t_i > 0, \quad \forall i,\ 0 \le i < N \tag{4.18} \]
In fact, considering $D^s$ in eqn (4.15), whose first N column vectors compose an identity matrix, we have $tD^s = [t_0 \cdots t_{N-1}, d^t_N, \cdots]$, where the $d^t$'s are positive linear combinations of $t_0, \dots, t_{N-1}$, which indicates that eqn (1.5) is equivalent to eqn (4.18). Another obvious fact is that for $P_{SUC}$ the corresponding space mapping $S_{SUC}$ must be non-negative to satisfy the criterion $S_{SUC}D^s = P_{SUC}K$. In fact, $S_{SUC}D^s = P_{SUC}K$ is equivalent to $[S_{SUC} : M] = K$, where M is a matrix. $S_{SUC}$ must be non-negative because K is non-negative as required in [73], see page 21.
With the canonical dependency matrix and the simple patterns of interconnections, the criterion $SD^s = PK$ can be simplified to the following theorem.
Theorem 4.3.1 With the canonical dependency matrix, the data flows can be implemented with the simple patterns of interconnections and arrive at the correct place at the correct time if and only if, $\forall j$, $j = 0, \dots, M$,
\[ t_j \ge \sum_{i=0}^{M-1} |s_{i,j}| \tag{4.19} \]
where $s_{i,j} \in S$.
Proof: Note that any $d^s \in D^s$ can be written as $d^s = [d_0, \dots, d_i, \dots, d_M]^T$, where $d_i \in \{0, 1\}$, $\forall i$, $0 \le i < N$. The image of the transformation T in the time-space domain is
\[
Td^s =
\begin{bmatrix}
\sum_{i=0}^{M} s_{0,i}d_i \\
\vdots \\
\sum_{i=0}^{M} s_{M-1,i}d_i \\
\sum_{i=0}^{M} t_id_i
\end{bmatrix}
= \sum_{j=0}^{M-1} \Big( \sum_{i=0}^{M} s_{j,i}d_i \Big) 1_j + \Big( \sum_{i=0}^{M} t_id_i \Big) 1_M
\tag{4.20}
\]
where $1_j$ is a vector whose j-th element is 1 and whose other elements are zero. The first term on the right-hand side of eqn (4.20) is $Sd^s$ and will be implemented by the column vectors of $P_{SUC}$ or $P_{SBC}$. Because the columns of $P_{SUC}$ and $P_{SBC}$ are given in terms of $1_j$ or $-1_j$, m "vector-times" of the primitive vectors are used to implement $Sd^s$, where
\[ m = \sum_{j=0}^{M-1} \Big| \sum_{i=0}^{M} s_{j,i}d_i \Big| \tag{4.21} \]
and one vector-time means using a primitive vector once (for instance, if $Sd^s = [1,-1]^T$, it will be implemented by $p_0$ and $p_3$ of $P_{SBC}$ and the vector-times are two).
This means that a data flow indicated by $d^s$ must be relayed m times to arrive where it will be used, via the available interconnection primitives of $P_{SUC}$ or $P_{SBC}$. So, to have enough time to perform the relays,
\[ \sum_{i=0}^{M} t_id_i \ge m = \sum_{j=0}^{M-1} \Big| \sum_{i=0}^{M} s_{j,i}d_i \Big| \tag{4.22} \]
If eqn (4.19) holds, it can be deduced that
\[
\sum_{i=0}^{M} t_id_i \ge \sum_{i=0}^{M} \sum_{j=0}^{M-1} |s_{j,i}d_i| = \sum_{j=0}^{M-1} \sum_{i=0}^{M} |s_{j,i}d_i| \ge \sum_{j=0}^{M-1} \Big| \sum_{i=0}^{M} s_{j,i}d_i \Big|
\]
Therefore the condition of eqn (4.19) is sufficient. Furthermore, it can be seen that the condition of eqn (4.19) is also necessary. In fact, for all the dependencies $d^s = 1_j$, $\forall j$, $0 \le j < N$, eqn (4.22) is equivalent to eqn (4.19) because
\[ d_i = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \]
therefore any violation of eqn (4.19) will result in a violation of eqn (4.22). □
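The equivalence stated by Theorem 4.3.1 can be checked by brute force for small N. The sketch below is illustrative: `satisfies_4_19` tests the column condition of eqn (4.19), while `relays_fit` tests the relay condition of eqn (4.22) on every canonical dependency. The 2 x 3 matrix S and both function names are example choices, not taken from the thesis software.

```python
from itertools import product

def satisfies_4_19(S, t):
    """t_j >= sum over the rows of |s_ij|, for every column j."""
    return all(t[j] >= sum(abs(row[j]) for row in S) for j in range(len(t)))

def relays_fit(S, t):
    """t.d >= sum_j |(S d)_j| for every canonical dependency d."""
    for d in product((0, 1), repeat=len(t)):
        if not any(d):
            continue
        time = sum(ti * di for ti, di in zip(t, d))
        relays = sum(abs(sum(s * di for s, di in zip(row, d))) for row in S)
        if time < relays:
            return False
    return True

S = [[-1, 1, 0], [0, -1, 1]]   # illustrative 2 x 3 space mapping
print(satisfies_4_19(S, [1, 2, 1]), relays_fit(S, [1, 2, 1]))  # True True
print(satisfies_4_19(S, [1, 1, 1]), relays_fit(S, [1, 1, 1]))  # False False
```

With t = [1, 1, 1] the dependency [0, 1, 0] needs two relays but only one time step is available, so both checks fail together.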
Theorem 4.3.1 simplifies the criterion for selecting the space mapping S in our context,
but there are still many choices. We need further criteria to limit the scope of the choices
for the mapping. It is pointed out in [94] that for a fixed-size domain of computation,
the total execution time is a monotonically increasing function of the absolute values of
the entries of t. This is true when partitioning is not considered. However, in some
circumstances, as will be seen later, we may enlarge the entries of t to improve efficiency,
i.e., reducing the total execution time.
However the principle can be used here. It is hoped that the norm of t can be made as
small as possible but with a lower bound given by eqn (4.18) and eqn (4.19), since a large
positive t will enlarge the extension of the mapping along the time domain. Furthermore
it is clear that we need to construct S with as few non-zero entries as possible and with
the smallest absolute value possible for the non-zero entries. Otherwise S may produce
a large mapping extension in the space domain, as a result of which more nodes have to
be squeezed into a supernode so that the supernode domain can be mapped into a given
regular processor domain.
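We choose $S_{SUC}$ as
\[ S_{SUC} = [0, I_M] \tag{4.23} \]
The corresponding projection vector is $s^p_{SUC} = [1, 0, \dots, 0]^T$, and the timing vector $t_{SUC}$ is chosen as
\[ t_{SUC} = [1, \dots, 1] \tag{4.24} \]
Clearly the pair of t and S satisfies eqn (4.19), and possesses the agreeable property that $e = t_{SUC}s^p_{SUC} = 1$. The image of $D^s$ in the domain mapped by the space mapping is (permuted),
\[
S_{SUC}D^s = \begin{bmatrix}
0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1
\end{bmatrix}
\tag{4.25}
\]
when N = 3. This means that both the dependency vectors of $D^s$, $[0, d_1, \dots, d_M]^T$ and $[1, d_1, \dots, d_M]^T$, must be implemented by one combination of the columns of $P_{SUC}$.
We propose that $S_{SBC}$ takes the form
\[
S_{SBC} = [0, I_M] + [-I_M, 0] =
\begin{bmatrix}
-1 & 1 & & & \\
 & -1 & 1 & & \\
 & & \ddots & \ddots & \\
 & & & -1 & 1
\end{bmatrix}
\tag{4.26}
\]
The image of $D^s$ mapped by $S_{SBC}$ is,
\[
S_{SBC}D^s =
\begin{bmatrix}
0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1
\end{bmatrix}
-
\begin{bmatrix}
0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\
0 & 0 & 1 & 1 & 0 & 0 & 1 & 1
\end{bmatrix}
=
\begin{bmatrix}
0 & -1 & 1 & 0 & 0 & -1 & 1 & 0 \\
0 & 0 & -1 & -1 & 1 & 1 & 0 & 0
\end{bmatrix}
\tag{4.27}
\]
when N = 3.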
Comparing eqn (4.25) and eqn (4.27), it is found that in $S_{SBC}D^s$, data flows are dispersed by using the interconnection primitives $[1,0]^T$ and $-[1,0]^T$ (and $[0,1]^T$ and $-[0,1]^T$) once instead of $[1,0]^T$ (and $[0,1]^T$) twice, i.e., a balanced usage of the bi-directional interconnection primitives (this is why we choose $S_{SBC}$ in this way). Furthermore this property is true for N dimensions, due to the following proposition.
Proposition 4.3.1 $S_{SBC}D^s$ has the same number of non-zero entries as $S_{SUC}D^s$, but half of them are positive and half are negative.
Proof: As mentioned before, $D^s$ is like a transposed table of binary numbers from 1 to $2^N - 1$. It is trivial to add 0 at the beginning to make it a full table of binary numbers. The i-th row of the full binary table can be expressed as a pattern $\bar{0}\,\bar{1}\,\bar{0}\,\bar{1}$ repeated $2^{N-i-2}$ times, where $\bar{0} = [0 \cdots 0]$, in total $2^i$ "0"s in a line, and $\bar{1} = [1 \cdots 1]$, in total $2^i$ "1"s in a line. Similarly, the (i+1)-th row can be written as a pattern $\bar{0}\,\bar{0}\,\bar{1}\,\bar{1}$ repeated $2^{N-i-2}$ times.
In eqn (4.26), the item $[0, I_M]D^s$ is the table with the first row deleted, and the item $[-I_M, 0]D^s$, the table with the last row deleted. Thus, the i-th row of $S_{SBC}D^s$ is the difference of the (i+1)-th and the i-th rows of the table, i.e.,
\[ \bar{0} \quad {-\bar{1}} \quad \bar{1} \quad \bar{0} \]
repeated $2^{N-i-2}$ times, where $-\bar{1} = [-1 \cdots -1]$, in total $2^i$ "-1"s in a line. It is clear, therefore, that there are $2^{M-1}$ "1"s and the same number of "-1"s in every row of $S_{SBC}D^s$. Obviously, every row of $S_{SUC}D^s$ has $2^M$ non-zero entries, but all are positive. □
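Proposition 4.3.1 can be checked numerically for small N. The sketch below builds the normal forms $[0, I_M]$ and $[0, I_M] + [-I_M, 0]$, maps every canonical dependency, and counts the positive and negative entries in each row of the image. The helper names (`s_suc`, `s_sbc`, `image`, `sign_counts`) are illustrative.

```python
from itertools import product

def s_suc(n):
    """Normal form [0, I_M] with M = n - 1."""
    return [[1 if c == r + 1 else 0 for c in range(n)] for r in range(n - 1)]

def s_sbc(n):
    """Normal form [0, I_M] + [-I_M, 0] with M = n - 1."""
    return [[-1 if c == r else 1 if c == r + 1 else 0 for c in range(n)]
            for r in range(n - 1)]

def image(S, n):
    """Rows of S * D^s, where D^s holds all non-zero 0/1 vectors as columns."""
    cols = [d for d in product((0, 1), repeat=n) if any(d)]
    return [[sum(s * x for s, x in zip(row, d)) for d in cols] for row in S]

def sign_counts(row):
    """(number of positive entries, number of negative entries)."""
    return sum(v > 0 for v in row), sum(v < 0 for v in row)

for n in (3, 4, 5):
    m = n - 1
    assert all(sign_counts(r) == (2 ** m, 0) for r in image(s_suc(n), n))
    assert all(sign_counts(r) == (2 ** (m - 1), 2 ** (m - 1))
               for r in image(s_sbc(n), n))
```

Each row of the SBC image therefore carries the same number of non-zero entries as the SUC image, split evenly between the two directions.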
It can be proved that the corresponding projection vector is $s^p_{SBC} = [1, \dots, 1]^T$. In fact, considering eqn (4.26), there is one "1" and one "-1" in each line, thus $S_{SBC}s^p_{SBC} = 0$, as required by definition. The t which satisfies eqn (4.18) and eqn (4.19) with the smallest norm has the form $[1, 2, \dots, 2, 1]$. So $e = \sum_{i=0}^{M} t_i$. Unfortunately, $e > 1$ can occur, implying poor efficiency.
4.3.3 Permuting the Space Projection Matrix
Obviously the matrices $[0, I_M]$ and $[0, I_M] + [-I_M, 0]$ are not the only matrices for $P_{SUC}$ and $P_{SBC}$, respectively, satisfying the criteria. In fact, by permuting the columns of $S_{SUC}$ and $S_{SBC}$, we obtain N! versions. We term eqn (4.23) the normal form of $S_{SUC}$, and eqn (4.26) the normal form of $S_{SBC}$.
Any other version of $S_{SUC}$ (and $S_{SBC}$) is expressed by $S_{SUC}^{(p_0 \cdots p_M)}$ (and $S_{SBC}^{(p_0 \cdots p_M)}$), where $p_i$ means that the i-th column of $S_{SUC}$ is permuted to the $p_i$-th column of $S_{SUC}^{(p_0 \cdots p_M)}$. For instance, with N = 4
\[
S_{SUC}^{(1302)} = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\quad \text{and} \quad
S_{SBC}^{(1302)} = \begin{bmatrix} 0 & -1 & 0 & 1 \\ 1 & 0 & 0 & -1 \\ -1 & 0 & 1 & 0 \end{bmatrix}
\]
$S_{SUC}^{(p_0 \cdots p_M)}$ can also be expressed by $S_{SUC}^{(p_0 \cdots p_M)} = S_{SUC}P_p^{(p_0 \cdots p_M)}$, where $P_p^{(p_0 \cdots p_M)}$ is a permuting matrix obtained by permuting the i-th column of I to the position of the $p_i$-th column. For example,
\[
P_p^{(1302)} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\]
In many places, for simplicity, the superscript of $P_p^{(p_0 \cdots p_M)}$ is omitted when the meaning is clear. By permuting the columns of $S_{SUC}$, its projection vector $s^p$ may be changed. For instance, the projection vector of $S_{SUC}^{(1302)}$ is $[0,1,0,0]^T$. However, the column permutation of $S_{SBC}$ does not change the projection vector. It can be seen from $S_{SBC}^{(1302)}$ that, for any permutation, there is one "1" and one "-1" in every row. Thus $S_{SBC}^{(1302)}s^p_{SBC} = 0$. For general cases, we deduce that
\[ 0 = S_{SBC}s^p_{SBC} = S_{SBC}P_pP_p^{-1}s^p_{SBC} = S_{SBC}^{(p_0 \cdots p_M)}s^p_{SBC} \]
where $P_p^{-1}$ is a permuting matrix, while $P_p^{-1}s^p_{SBC}$ changes nothing.
When N is large, flexibility from the N! versions of $S_{SUC}$ and $S_{SBC}$ apparently complicates matters. However, it does give the opportunity to optimise the partitioning and mapping. So it is quite worthwhile keeping all of them as candidates for S in the follow-up procedures of design.
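The permuted versions can be generated mechanically. The sketch below builds the permuting matrix from the index list $(p_0 \cdots p_M)$ as described in this subsection (column i of the identity moves to column $p_i$) and reproduces the N = 4 example; `perm_matrix` and `matmul` are illustrative helper names.

```python
from itertools import permutations

def perm_matrix(p):
    """Permuting matrix: column i of the identity is moved to column p[i]."""
    n = len(p)
    P = [[0] * n for _ in range(n)]
    for i, pi in enumerate(p):
        P[i][pi] = 1
    return P

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

S_SUC = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
S_SBC = [[-1, 1, 0, 0], [0, -1, 1, 0], [0, 0, -1, 1]]
P = perm_matrix((1, 3, 0, 2))
print(matmul(S_SUC, P))  # [[0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 0]]
print(matmul(S_SBC, P))  # [[0, -1, 0, 1], [1, 0, 0, -1], [-1, 0, 1, 0]]

# Every permuted S_SBC still annihilates the projection vector [1, ..., 1]^T.
assert all(sum(row) == 0
           for p in permutations(range(4))
           for row in matmul(S_SBC, perm_matrix(p)))
```

Enumerating `permutations(range(N))` in this way gives all N! candidate space mappings considered in the later optimisation steps.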
4.4 Further LSGP Partitioning for SBC
If e > 1, only one processing cell in every e cells is active at any moment, which is a great waste of computational resources. We perform a second partitioning (the first is the supernode partitioning) to improve efficiency using an LSGP method. A number of LSGP methods have been proposed to deal with the problem. We choose the method in [24] here due to its brief mathematical nature (see Chapter 2, pages 22-24).
We proceed by defining an enlarged virtual processor array (EVPA) of size $(l_0 \times k_0) \times \cdots \times (l_{M-1} \times k_{M-1})$. At first, the supernode parallelepiped is scaled so that the resulting supernode polyhedron can be mapped within the EVPA using the space mapping, then the supernode polyhedron is compressed by the LSGP partitioning onto the original processor array.
We propose a method to perform the LSGP partitioning, which is expressed as the following strategy.
Strategy 4.4.1 Use an "enlarged" t such that $e = \sum_{i=0}^{M} t_i = g^M$, where g is a small integer such as 2 (when N > 3) or 3. "Enlarged" means that the entries $t_i$ of t are greater than their lower bounds. For all $S_{SBC}^{(p_0 \cdots p_M)}$, changing t under the conditions of eqn (4.19) and with $\sum_{i=0}^{M} t_i = g^M$, evaluate $Q : tQ = 0$, $S_{SBC}^{(p_0 \cdots p_M)}Q$, A and the M! k's. If there exists a k such that $k_0 = \dots = k_{M-1} = g$, put the t into a set $T^{(p_0 \cdots p_M)}$, otherwise ignore it. In this way, each permutation $S_{SBC}^{(p_0 \cdots p_M)}$ is accompanied by a set of t's, $T^{(p_0 \cdots p_M)}$. In fact by employing any pair of $S_{SBC}^{(p_0 \cdots p_M)}$ and $t \in T^{(p_0 \cdots p_M)}$, we are guaranteed to make an LSGP partitioning of even compressions with a factor g on every dimension.
Clearly it is a very time-consuming task to work out all the available $T^{(p_0 \cdots p_M)}$'s when N is large. Fortunately the $T^{(p_0 \cdots p_M)}$'s can be pre-computed. In addition, we introduce the following proposition to reduce the required computation by half.
Proposition 4.4.1 $S_{SBC}P_p$ and $S_{SBC}(J_NP_p)$ have the same $T^{(p_0 \cdots p_M)}$, where $J_N$ is an antidiagonal identity matrix.
Proof. Because
\[ J_MS_{SBC}J_N = -S_{SBC} \]
we have
\[ HNF = (S_{SBC}P_pQ)U = (J_MS_{SBC}J_NP_pQ)(-U) \]
where U is a unimodular matrix. So, $S_{SBC}P_p \mapsto HNF$ and $S_{SBC}J_NP_p \mapsto J_MHNF$, but $J_MHNF$ does not change the diagonal elements of the HNF. □
Let $P'_p = J_NP_p$. $P'_p$ is also a member of the set of the permutation matrices. If the $T^{(p_0 \cdots p_M)}$ for $S_{SBC}P_p$ has been computed, we need not do it for $S_{SBC}P'_p$, because they share the same one as proved in Proposition 4.4.1.
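The identity $J_MS_{SBC}J_N = -S_{SBC}$ used in the proof can be verified directly for small N. The following sketch assumes the normal form of $S_{SBC}$ (row i holds -1 in column i and +1 in column i+1); the helper names `antidiag`, `matmul` and `s_sbc` are illustrative.

```python
def antidiag(n):
    """Antidiagonal identity matrix J_n."""
    return [[1 if r + c == n - 1 else 0 for c in range(n)] for r in range(n)]

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def s_sbc(n):
    """Normal form [0, I_M] + [-I_M, 0] with M = n - 1."""
    return [[-1 if c == r else 1 if c == r + 1 else 0 for c in range(n)]
            for r in range(n - 1)]

for n in range(2, 7):
    S = s_sbc(n)
    lhs = matmul(matmul(antidiag(n - 1), S), antidiag(n))
    assert lhs == [[-v for v in row] for row in S]
```

Reversing both the row order and the column order of $S_{SBC}$ therefore only flips its sign, which is why the two permuted versions share the same set of timing vectors.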
A further reduction by half is achieved by the following proposition.
Proposition 4.4.2 Suppose S is a member of the set of permuted $S_{SBC}$ and is accompanied by T; if $S' = SJ$ (i.e., $P'_p = P_pJ$), then S' will be accompanied by TJ.
Proof. In fact, suppose $A = SQ$ and $tQ = 0$. Since $SQ = (SJ)(JQ)$ and $(tJ)(JQ) = 0$, $t' = tJ$ with S' will result in the same A as t with S. □
Notice that S' is also a member of the N! permuted versions of $S_{SBC}$, therefore if T for S has been computed, we can simply obtain the T' for S' by just computing TJ, without complex computation.
4.5 Scaling the Supernode Parallelepiped and Optimisation
Once a space mapping is known, we can determine the scale of the parallelepiped for partitioning nodes into supernodes. In subsection 4.2.2, the cone basis E was scaled to $B_d$, which defines the partitioning parallelepipeds for forming the canonical supernode dependencies. However, to map the supernode domain within a given regular processor array, we must rescale E to $B_s$, which defines the partitioning parallelepiped for fixed-size mapping. Usually, the parallelepiped of $B_s$ is larger than that of $B_d$ so that the former can take over the latter; otherwise we should combine them, as discussed later.
Similar to subsection 4.2.2, the scaling of E can be expressed by introducing a positive diagonal matrix $F = \mathrm{diag}(f_0, \dots, f_M)$, forming a scaled cone basis $B_s = EF^{-1}$.
As also known from subsection 4.2.2, partitioning the computation polyhedron with the parallelepiped defined by $B_s$ is equivalent to a linear transformation with $B_s$ as the basis from the original space to a quasi-supernode space $S^q$. Let $V^q$ be the transformed polyhedron vertices in $S^q$, i.e.
\[ V^q = B_s^{-1}V = FE^{-1}V = FW \tag{4.28} \]
where $W = E^{-1}V$ contains the vertices in a space with E as the basis. The space mapping S maps $V^q$ onto the processor array space
\[ U = SV^q = SFW = S_FW \tag{4.29} \]
where $S_F = SF$. Eqn (4.29) shows that the scaling of E is equivalent to a scaling of S (i.e. scaling each column of S) associated with the unchanged vertices W, which simplifies the scaling procedure greatly.
It is known that S has N! permuted versions. We have to compute the $F^{(p_0 \cdots p_M)}$ corresponding to each of them. One way to do this follows from the fact that for any permuted version $S^{(p_0 \cdots p_M)}$ of S, eqn (4.29) is modified to
\[ U = S^{(p_0 \cdots p_M)}V^q = (SP_p)(F^{(p_0 \cdots p_M)}W) = SF'(P_pW) = SF'W^{(p_0 \cdots p_M)} \tag{4.30} \]
where $F' = P_pF^{(p_0 \cdots p_M)}P_p^{-1}$ (see footnote 1). These formulas are applicable to a permuted vertex matrix $W^{(p_0 \cdots p_M)}$ in order to find a scaling diagonal matrix F', and the scaling diagonal matrix $F^{(p_0 \cdots p_M)}$ which corresponds to $S^{(p_0 \cdots p_M)}$ will be
\[ F^{(p_0 \cdots p_M)} = P_p^{-1}F'P_p \tag{4.31} \]
Definition 4.5.1 The upper bound and lower bound of U in the i-th dimension are defined as
\[ u^u_i = \max_{0 \le j < n_v} u_{i,j}, \qquad u^l_i = \min_{0 \le j < n_v} u_{i,j} \]
respectively, where $u_{i,j} \in U$. And $w^u_i$ is a vertex associated with $u^u_i$, and $w^l_i$ a vertex associated with $u^l_i$, i.e.,
\[
\begin{aligned}
w^u_i &= \{ w_j : u^u_i = 1_i^TS_Fw_j,\ 0 \le j < n_v \} \\
w^l_i &= \{ w_j : u^l_i = 1_i^TS_Fw_j,\ 0 \le j < n_v \}
\end{aligned}
\tag{4.32}
\]
where $w^u_i$, $w^l_i$ and $w_j \in W$, and $1_i$ is a sampling (or choosing) vector.
Definition 4.5.2 An accurate scaling of S is such that U is just within the processor array (see footnote 2), i.e., $\forall i$, $0 \le i < M$, $u^u_i - u^l_i = l_i$. If $\forall i$, $0 \le i < M$, $u^u_i - u^l_i \le l_i$, U is said to be acceptable. $u^u_i - u^l_i$ is called the projected size in the i-th dimension.
It would seem that an accurate U is the best result of re-scaling E. However, this is not what we actually want in practice. In fact, an accurate U means that, in the case N = 1, $u^l_0$ is assigned to processor 0 and $u^u_0$ to processor $l_0$, so $l_0 + 1$ processors are needed in total instead of $l_0$ processors.
Footnote 1: F' must be and is a diagonal matrix whose diagonal entries are those of F but permuted by $P_p$. In fact, $P_p$ can be decomposed as $P_p = P_kP_{k-1} \cdots P_1$, where $P_i$ is a basic permuting matrix that permutes two rows. $F' = P_k(P_{k-1}(\cdots(P_1FP_1^{-1})\cdots)P_{k-1}^{-1})P_k^{-1}$. Each term in $(\cdots)$ from the innermost to the outermost is a diagonal matrix produced by permuting two diagonal entries of its predecessor.
Footnote 2: Unless otherwise indicated, the processor array also refers to the EVPA.
4.5.1 SUC Cases
Scaling $S_{SUC}$
The scaling of $S_{SUC}$ is straightforward. Because there is one and only one "1" in each row of $S_{SUC}$, $\forall i$, $0 \le i < M$, we have,
\[ u_{i,j} = f_{i+1}w_{i+1,j} \tag{4.33} \]
where $w_{i,j} \in w_j$. Now we cannot know $u^u_i$ and $u^l_i$, because $f_{i+1}$ is unknown. However, due to the requirement that $f_{i+1} > 0$, it is possible to find expressions for them, viz
\[ u^u_i = f_{i+1}w^u_{i+1}, \qquad u^l_i = f_{i+1}w^l_{i+1} \tag{4.34} \]
where
\[ w^u_i = \max_{0 \le j < n_v} w_{i,j}, \qquad w^l_i = \min_{0 \le j < n_v} w_{i,j} \tag{4.35} \]
An accurate scaling of $S_{SUC}$ demands that
\[ f^a_i = \frac{l_{i-1}}{w^u_i - w^l_i}, \quad \forall i,\ 1 \le i < N \tag{4.36} \]
where superscript "a" means that the variable is for accurate scaling. Note that $f^a_0$ is not determined. Let $f^a_0 = +\infty$, which will disappear later.
Eqn (4.36) can also be expressed in matrix form. Letting $L = \mathrm{diag}(+\infty, l_0, \dots, l_{M-1})$ and $W^{u-l} = \mathrm{diag}\left(\frac{1}{w^u_0 - w^l_0}, \dots, \frac{1}{w^u_M - w^l_M}\right)$, eqn (4.36) is equivalent to
\[ F^a = LW^{u-l} \tag{4.37} \]
where $F^a = \mathrm{diag}(+\infty, f^a_1, \dots, f^a_M)$.
Eqn (4.36) or eqn (4.37) are not the only formulas for determining the $f_i$. Recall that one purpose of the supernode partitioning is to include all the dependencies in the parallelepiped; in eqn (4.5) a diagonal matrix, $F_d$, was introduced for this purpose, and its elements are determined by eqn (4.6). Therefore, the $f_i$ could be formulated as
\[ f_i = \min(f^a_i, f^d_i), \quad \forall i,\ 0 \le i < N \tag{4.38} \]
Obviously, if $\exists i$, $1 \le i < M$, $f^d_i < f^a_i$, we cannot obtain the accurate scaling. Consequently to create the canonical dependencies we have to sacrifice some efficiency.
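The SUC scaling rule can be sketched in a few lines, reading eqn (4.36) as $f^a_i = l_{i-1}/(w^u_i - w^l_i)$ and eqn (4.38) as the element-wise minimum with the dependency bound $f^d_i$. The vertex matrix W, the array sizes and the bounds below are made-up illustration values, and the function names are hypothetical; a row of W with zero extent would need separate handling.

```python
from fractions import Fraction

def accurate_suc_scaling(W, l):
    """W: N x n_v integer vertex matrix in the basis E; l: sizes l_0..l_{M-1}.
    Returns [f_1^a, ..., f_M^a]; f_0^a is left undetermined (eqn 4.36)."""
    return [Fraction(l[i - 1], max(W[i]) - min(W[i])) for i in range(1, len(W))]

def final_scaling(fa, fd):
    """Eqn (4.38): cap each factor by the bound that keeps dependencies canonical."""
    return [min(a, d) for a, d in zip(fa, fd)]

W = [[0, 8, 0, 8], [0, 0, 12, 12], [0, 6, 3, 9]]   # hypothetical vertices
fa = accurate_suc_scaling(W, [4, 3])
f = final_scaling(fa, [Fraction(1, 2), Fraction(1, 4)])
```

In this made-up case the second factor is capped by its dependency bound, so the mapping is acceptable but no longer accurate in that dimension.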
Scaling permuted versions of $S_{SUC}$
In order to produce $F^{(p_0 \cdots p_M)a} = \mathrm{diag}(f^a_0, \dots, f^a_M)$ for any permuted version $S_{SUC}P_p^{(p_0 \cdots p_M)}$ of $S_{SUC}$ (we write $P_p^{(p_0 \cdots p_M)}$ simply as $P_p$), it is better to make use of eqn (4.30) and eqn (4.31). Comparing eqn (4.30) with eqn (4.29), we know that F' can be derived in the same way as F, by just permuting the rows of W. For the case of SUC, this is equivalent to permuting the diagonal elements of $W^{u-l}$ in eqn (4.37). Therefore (see footnote 3)
\[ F'^a = L\,P_pW^{u-l}P_p^{-1} \tag{4.39} \]
Substituting eqn (4.39) into eqn (4.31) produces
\[ F^{(p_0 \cdots p_M)a} = P_p^{-1}LP_pW^{u-l} \tag{4.40} \]
Then every element of $F^{(p_0 \cdots p_M)a}$ should be checked and modified with eqn (4.38).
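Footnote 3: If M is a diagonal matrix, $P_pMP_p^{-1}$ simply performs the permutation of the diagonal entries of M.
4.5.2 SBC Case
The scaling of $S_{SBC}$ is a much more difficult problem than that of $S_{SUC}$ because there are two non-zero elements in each row of $S_{SBC}$. Note that
\[
S_F = \begin{bmatrix}
-f_0 & f_1 & 0 & \cdots & 0 \\
0 & -f_1 & f_2 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & -f_{M-1} & f_M
\end{bmatrix}
\tag{4.41}
\]
then, $\forall i$, $0 \le i < M$
\[ u_{i,j} = s^F_iw_j \tag{4.42} \]
where $s^F_i = 1_i^TS_F = [0, \dots, 0, -f_i, f_{i+1}, 0, \dots, 0]$. According to the definition of accurate scaling, a system of equations is obtained, $\forall i$, $0 \le i < M$, such that
\[
\begin{aligned}
l_i &= u^u_i - u^l_i = 1_i^T[S_Fw^u_i - S_Fw^l_i] \\
&= -f_iw^u_{i,0} + f_{i+1}w^u_{i,1} + f_iw^l_{i,0} - f_{i+1}w^l_{i,1} \\
&= (w^l_{i,0} - w^u_{i,0})f_i + (w^u_{i,1} - w^l_{i,1})f_{i+1}
\end{aligned}
\tag{4.43}
\]
where
\[
w^u_{i,0} = 1_i^Tw^u_i, \quad w^u_{i,1} = 1_{i+1}^Tw^u_i, \quad
w^l_{i,0} = 1_i^Tw^l_i, \quad w^l_{i,1} = 1_{i+1}^Tw^l_i
\tag{4.44}
\]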
There are (N - 1) equations but N variables, so one of $f_0, \dots, f_M$ can be taken as an argument, say $f_k$, and the rest can be determined as piece-wise linear functions of $f_k$. Two criteria, that $f_0, \dots, f_M > 0$ and that the parallelepiped must be large enough to enclose the dependencies D, are used to delimit the valid interval $[f_k^{lb}, f_k^{ub}]$ of $f_k$. The actual procedure is complex (see Appendix A). We describe this procedure briefly as follows.
Step 1 Take $f_k$ as the argument and let $f_k = 0$ be an initial point, then compute a set of initial $w^u_i$ and $w^l_i$ corresponding to the initial $f_k$.
Step 2 Derive the linear functions for $F^a$ with these $w^u_i$ and $w^l_i$.
Step 3 Increase $f_k$ from the initial point. At some points of $f_k$, $w^u_i$ and $w^l_i$ are overstepped by other $w_j$'s. Replace the old $w^u_i$ and $w^l_i$ with the new ones.
Step 4 Repeat Step 2 and Step 3 until any of $f_0, \dots, f_M$ is less than zero.
Step 5 Redetermine the valid interval of $f_k$ such that $f_i \le f^d_i$, $\forall i$, where $f^d_i$ is defined by eqn (4.6).
In Appendix A, only the situation where $S_{SBC}$ is the space mapping is discussed. The algorithms derived there can be extended to any permuted versions of $S_{SBC}$. Fortunately this extension is not very difficult, considering eqn (4.30).
It can be concluded that a complex model of S also complicates the procedure of fixed-size partitioning dramatically.
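Step 2 of the procedure above can be illustrated for one linear piece. The sketch below is not the algorithm of Appendix A: it assumes the extreme vertices are fixed, takes $f_0$ as the argument, and reads eqn (4.43) as $l_i = (w^l_{i,0} - w^u_{i,0})f_i + (w^u_{i,1} - w^l_{i,1})f_{i+1}$, which it solves by forward substitution. The function name and the numbers are hypothetical.

```python
from fractions import Fraction

def sbc_scaling_for_fixed_vertices(f0, l, wu, wl):
    """Forward-substitute eqn (4.43) with f_0 as the argument.
    wu[i] = (w^u_{i,0}, w^u_{i,1}) and wl[i] = (w^l_{i,0}, w^l_{i,1})."""
    f = [Fraction(f0)]
    for i, size in enumerate(l):
        a = wl[i][0] - wu[i][0]          # coefficient of f_i
        b = wu[i][1] - wl[i][1]          # coefficient of f_{i+1}, assumed non-zero
        f.append((size - a * f[i]) / b)
    return f

# Hypothetical extreme-vertex components for a 2-D processor array of size 4 x 6
f = sbc_scaling_for_fixed_vertices(2, [4, 6], wu=[(0, 2), (1, 3)], wl=[(1, 0), (2, 0)])
```

In the full procedure the extreme vertices change as $f_k$ grows, so this substitution would be repeated on each interval where they stay the same, and any result with a non-positive $f_i$ would be rejected.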
4.5.3 Optimising
Optimising for the SUC Case
The permuted versions give N! choices for optimisation. The procedure of optimisation is given in Algorithm B.2.2. We should work out all the N! $F^{(p_0 \cdots p_M)}$ and compute the transformed polyhedron vertices $V^q$ by eqn (4.28) for each of them. Note that $t_{SUC}V^q$ gives the time moments for executing each of the vertices, and the difference between the longest and the shortest time moments indicates the total computation time. We have to compute the total computation times for all N! cases of $F^{(p_0 \cdots p_M)}$ and choose the one which minimises the total computation time.
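The selection rule for the SUC case can be sketched as follows. This is only an illustration of the criterion (the span of $tV^q$ over the vertices), not Algorithm B.2.2: the candidate vertex sets are made-up values standing in for the $V^q$ obtained from different permutations, and the names `total_time` and `best_candidate` are hypothetical.

```python
def total_time(t, Vq):
    """Vq: list of vertices of V^q; returns max - min of the time moments t.v."""
    moments = [sum(ti * vi for ti, vi in zip(t, v)) for v in Vq]
    return max(moments) - min(moments)

def best_candidate(t, candidates):
    """candidates maps a label (e.g. a permutation) to its list of vertices."""
    return min(candidates, key=lambda k: total_time(t, candidates[k]))

t_suc = (1, 1, 1)
candidates = {
    "(012)": [(0, 0, 0), (2, 0, 0), (0, 3, 0)],   # hypothetical vertices
    "(102)": [(0, 0, 0), (1, 1, 0), (1, 0, 1)],   # hypothetical vertices
}
print(best_candidate(t_suc, candidates))  # (102)
```

With N! candidates the same comparison is simply run over every permutation label.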
Optimising for the SBC Case
Once the problem is scaled we need to choose the best mapping. It can be seen that up to now, we have many possible choices and must
1. determine $f_k$ within its interval $[f_k^{lb}, f_k^{ub}]$,
2. select which of $f_1, \dots, f_{M-1}$ is used as the argument $f_k$,
3. select the timing vector t from the set $T^{(p_0 \cdots p_M)}$ and its accompanying $S_{SBC}^{(p_0 \cdots p_M)}$,
4. select an $S_{SBC}^{(p_0 \cdots p_M)}$ from all permuted versions of $S_{SBC}$.
Optimising each of these tasks is considered below.
1. The criterion to determine $f_k$ is to directly minimise the volume of the supernode, i.e., to maximise
\[ V_k = \prod_{i=0}^{M} \frac{1}{f_i} \tag{4.45} \]
which gives the minimal total computation time. But some of the edges of the resulting parallelepiped may be quite long while some others may be quite short, so the shape of the parallelepiped may be strange. However, for the sake of minimising computation time, let us take it as the optimum criterion for determining $f_k$, that is, determining the volume and shape of the parallelepiped. Let the minimal $V_k$ be $V_k^{min}$ (we may change the criterion in practice).
2. For completeness, we should let the argument $f_k = f_i$, and evaluate $V_k^{min}$, $\forall i$, $1 \le i < M$. Let $V^{min} = \max_{0 \le k < n} V_k^{min}$. Choose the $f_k$, as well as the corresponding $f_0, \dots, f_M$, such that $V_k^{min} = V^{min}$.
3. Here we make a transformation with $EF^{-1}$ as the basis to determine the vertices $V^q$ in quasi-supernode space by eqn (4.28). Pre-multiply $V^q$ by each $t_i \in T^{(p_0 \cdots p_M)}$ accompanying the current S, $0 \le i < n_t$, where $n_t$ is the number of the timing vectors in $T^{(p_0 \cdots p_M)}$. The shortest computation time $t^{min}$ is
\[ t^{min} = \min_{0 \le i < n_t} \left( \max_j t_iv^q_j - \min_j t_iv^q_j \right) \tag{4.46} \]
where $v^q_j \in V^q$; this is equivalent to forming the matrix $T^{(p_0 \cdots p_M)}FE^{-1}V$ and then finding, among all rows, the minimal difference between the maximal element and the minimal element of a row. The optimal timing vector $\mathbf{t}^{min}$ will be
\[ \mathbf{t}^{min} = \arg\min_{t_i \in T^{(p_0 \cdots p_M)}} \left( \max_j t_iv^q_j - \min_j t_iv^q_j \right) \tag{4.47} \]
4. For each $S_{SBC}^{(p_0 \cdots p_M)}$, compute $t^{min}_{(p_0 \cdots p_M)}V^{min}_k$, select the minimal one and take its corresponding $S_{SBC}^{(p_0 \cdots p_M)}$ as S. By the following proposition, the searching space, N! permuted versions, can be cut down to half.
Proposition 4.5.1 For the processor array, if $l = J_M l$ (e.g., $l = [5,6,8,6,5]^T$), then
the two permuted versions of $S_{SBC}$ produced by $P_p$ and $J P_p$ behave the same.
Proof Let $U = S_{SBC} P_p F V$ and $U' = S_{SBC} J_N P_p F' V$, and let $U$ be accurately
mapped. Suppose $F = F'$. Then we have $U = -J_M U'$, because
$J_M S_{SBC} J_N = -S_{SBC}$. Thus $U'$ is also accurately mapped, because $l = J_M l$. This
means that the assumption $F = F'$ is correct.
Meanwhile, we know from Proposition 4.4.1 that the two versions have the same
accompanying $T^{(p_0 \cdots p_M)}$, therefore they result in the same $t_{\min}$. □
Because $S_{SBC} P_p$ and $S_{SBC} J_N P_p$ both belong to the set of permuted versions of
$S_{SBC}$ and, according to Proposition 4.5.1, achieve the same $t_{\min}$ if $l = J_M l$, only
one of them needs to be tested.
4.5.4 Integralization of the Quasi-supernode Transformation Matrix
Up to now, the matrix of the quasi-supernode transformation has been obtained, but in
the real number domain. The real matrix $B_a$ has to be approximated by an integral
matrix $B$ (for all $i$, $0 \le i < N$, each real column of $B_a$ is approximated by
$b_i \in Z^N$, and the $b_i$ form $B$) to guarantee that supernodes are equal up to a
translation.
The integralization process makes the elements of the integral $b_i$ deviate from those
of their real-valued counterparts, so it is possible that the quasi-supernode polyhedron
transformed by $B$ no longer fits within the boundaries of the processor array. For the
case of SUC, if an integral $b_i$ is shorter than its real-valued counterpart, the size of
the supernodes decreases and the size of the supernode domain increases, so that the
supernode domain can no longer be mapped into the given array by $S$. For the case of
SBC, the situation is more complex, because the span of $U$ in the processor domain
along any one dimension is affected by the sizes of the supernode domain along two
dimensions. When the integral $b_i$ deviate from their real-valued counterparts, if the
sizes of the supernode domain along the two dimensions do not change together
properly, the spans of $U$ may grow beyond the sizes of the processor domain.
Therefore, from a number of candidates we must choose suitable $b_i$ which force the
transformed quasi-supernode polyhedron inside the domain of the processor array.
Unfortunately, an exact match between the size of the quasi-supernode polyhedron and
that of the processor array cannot be guaranteed in general. The best we can hope for
is to find an integral version $B$ of $B_a$ such that the size of the resulting
quasi-supernode polyhedron is smaller than, but close to, the size of the processor
array in every dimension.
We produce the candidates for $b_i$ in a simple way. Suppose
$$g_i = \lceil 1/f_i \rceil + j \qquad (4.48)$$
where $j = 0, \ldots, n_c - 1$ and $n_c$ is the number of candidates for $b_i$. Then let
$$b_i = g_i e_i \qquad (4.49)$$
where $e_i \in E$. These integers form a vector $g = [g_0, \ldots, g_{N-1}]^T$ indicating the
size of a supernode, and a diagonal matrix $G = \mathrm{diag}(g_0, \ldots, g_{N-1})$.
Obviously $B = EG$.
For the case of SUC, we can just let $n_c = 1$, i.e., $b_i = \lceil 1/f_i \rceil e_i$. This
formula always ensures an acceptable mapping, because the integral $b_i$ is at least as
long as its real-valued counterpart. For the case of SBC, we begin by taking, for every
$b_i$, a vector from the bottom of its candidate list as the actual $b_i$, forming a $B$.
Compute $U = S B^{-1} V$ and check whether the size of $U$ exceeds the processor array.
If it does, take another set of vectors from the lists and repeat the process until a $B$
for which $U$ is acceptable is found. Generally speaking, if $n_c$ is made big enough, it
is always possible to find an acceptable $B$. In fact, more choices for $b_i$ from eqn
(4.48) allow the lengths of the $b_i$ to increase evenly in every direction, so that the
sizes of the supernode polyhedron decrease correspondingly without change in shape.
As a result, the sizes of the mapped $U$ decrease in every dimension instead of increasing.
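The candidate search can be sketched as below. This is a minimal illustration under stated assumptions, not the thesis implementation: because $B = EG$, the check uses $U = S G^{-1} W$ with $W = E^{-1} V$; the candidates are tried with the same $j$ in every dimension (the text does not fix the search order); and "acceptable" is taken to mean that every row span of $U$ is at most the array size. All function names and the toy data are hypothetical.

```python
from math import ceil


def candidate_sizes(f, n_c):
    """g_i = ceil(1/f_i) + j for j = 0..n_c-1 (eqn 4.48), one list per dimension."""
    return [[ceil(1.0 / fi) + j for j in range(n_c)] for fi in f]


def projected_spans(S, g, W):
    """Span of each row of U = S G^{-1} W, where G = diag(g)."""
    scaled = [[w / gi for w in row] for gi, row in zip(g, W)]  # G^{-1} W
    spans = []
    for s_row in S:
        u = [sum(s * scaled[i][c] for i, s in enumerate(s_row))
             for c in range(len(W[0]))]
        spans.append(max(u) - min(u))
    return spans


def first_acceptable(S, f, W, sizes, n_c):
    """Try j = 0, 1, ... in all dimensions at once (assumed search order)."""
    lists = candidate_sizes(f, n_c)
    for j in range(n_c):
        g = [lst[j] for lst in lists]
        if all(sp <= l for sp, l in zip(projected_spans(S, g, W), sizes)):
            return g
    return None  # n_c was too small


# Toy 2-D problem mapped onto a 1-D bi-directional array of size 6.
S = [[-1, 1]]
f = (0.5, 0.5)
W = [[0, 10, 0, 10],
     [0, 0, 10, 10]]
sizes = [6]
print(first_acceptable(S, f, W, sizes, n_c=4))
```

In the toy case the spans shrink from 10 to about 6.7 to 5 as $j$ grows, which mirrors the remark that larger $g_i$ shrink the mapped $U$ in every dimension.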
4.6 Examples and Discussions
Examples are given here to aid understanding of the methodology and illustrate its prop-
erties.
4.6.1 A 3-D Example
Example 4.6.1 3-D computation Problem and 2-D SBC Array
The computational graph is arbitrarily given by
$$V = \begin{bmatrix} 0 & 100 & 0 & 0 & 0 & 100 & 100 & 100 \\ 0 & 0 & 100 & 0 & 100 & 0 & 100 & 100 \\ 0 & 0 & 0 & 100 & 100 & 100 & 0 & 100 \end{bmatrix} \qquad (4.50)$$
$$D = \begin{bmatrix} 1 & 0 & 0 & 1 \\ -1 & 1 & 0 & 1 \\ 0 & -2 & 1 & -3 \end{bmatrix} \qquad (4.51)$$
which are extracted from a nested loop program of the form
FOR i0 := 0 TO 100
  FOR i1 := 0 TO 100
    FOR i2 := 0 TO 100
      A(i0,i1,i2) := A(i0-1,i1+1,i2) + A(i0,i1-1,i2+2)
                     + A(i0,i1,i2-1) + A(i0-1,i1-1,i2+3)
Let a mesh-connected regular array be defined by the interconnection primitives
$P = P_{SBC}$ and the processor array $l = [4,4]^T$. Let $k = [3,3]^T$, and pre-compute the
possible $T^{(p_0 p_1 p_2)}$ shown in the table below.
(p0 p1 p2)        (210)      (120)      (201)
                  (012)      (021)      (102)
T^(p0 p1 p2)     [1 2 6]    [2 1 6]    [1 3 5]
                 [1 5 3]    [2 3 4]    [1 6 2]
                 [2 4 3]    [2 4 3]    [2 3 4]
                 [3 2 4]    [2 6 1]    [3 1 5]
                 [3 4 2]    [4 2 3]    [3 2 4]
                 [3 5 1]    [4 3 2]    [3 4 2]
                 [4 2 3]    [5 1 3]    [4 3 2]
                 [6 2 1]    [5 3 1]    [6 1 2]
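(210) and (012) share the same set of $t$, as do (120) and (021), and (201) and (102).
For example,
$$P_p^{(120)} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix} \quad\text{and}\quad P_p^{(021)} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}.$$
Obviously, $P_p^{(120)} = J P_p^{(021)}$. Furthermore, the columns of (120)+(021) and
(201)+(102) are just the reverse of each other. It is easy to check that
$P_p^{(120)} = P_p^{(102)} J$ and $P_p^{(201)} = P_p^{(021)} J$.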
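Next derive $E$. With the aid of Example 4.2.1, $D$ can be re-expressed as
$$D = \begin{bmatrix} 1 & 0 & 0 & 1 \\ -1 & 1 & 0 & 1 \\ 0 & -2 & 1 & -3 \end{bmatrix} = E A_p = \begin{bmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ 0 & -2 & 1 \end{bmatrix} \times \begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 2 \\ 0 & 0 & 1 & 1 \end{bmatrix}$$
where the first matrix of the product is $E$ and the second is the coefficient matrix.
This shows that $D$ is positively expressed by $E$, and we have $a^{\max} = [1, 2, 1]$.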
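The decomposition can be checked numerically. The sketch below types in the offsets of the four right-hand-side references of the loop body and the two factors as read from this example; `matmul` is a small hypothetical helper, not part of the thesis software.

```python
def matmul(X, Y):
    """Plain integer matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


# Index offsets of the four references A(i0-1,i1+1,i2), A(i0,i1-1,i2+2), ...
offsets = [(-1, 1, 0), (0, -1, 2), (0, 0, -1), (-1, -1, 3)]

# A dependency vector is (index of the write) - (index of the read) = -offset;
# the vectors are stored as the columns of D.
D = [[-off[r] for off in offsets] for r in range(3)]

E = [[1, 0, 0],
     [-1, 1, 0],
     [0, -2, 1]]
A = [[1, 0, 0, 1],
     [0, 1, 0, 2],
     [0, 0, 1, 1]]

positively_expressed = matmul(E, A) == D and all(a >= 0 for row in A for a in row)
a_max = [max(row) for row in A]   # largest coefficient per basis vector
print(D, positively_expressed, a_max)
```

The check confirms that every dependency is a non-negative integer combination of the columns of $E$, which is what "positively expressed" requires here.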
Next the processor array is enlarged to an EVPA defined by $l_0 = l_1 = 4 \times 3 = 12$.
Because the EVPA is square, there are in total three distinct kinds of permutations:
(210), (120) and (201). Let us take one, (120), to show the procedure of scaling a
supernode parallelepiped (read Appendix A first to understand the procedure).
$$P_p^{(120)} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}$$
Note that $P_p^{(120)} P_p^{(201)} = I$. Thus, with the rows of $E^{-1}V$ permuted by $P_p^{(120)}$,
$$W = P_p^{(120)} E^{-1} V = \begin{bmatrix} 0 & 100 & 100 & 0 & 100 & 100 & 200 & 200 \\ 0 & 200 & 200 & 100 & 300 & 300 & 400 & 500 \\ 0 & 100 & 0 & 0 & 0 & 100 & 100 & 100 \end{bmatrix} \qquad (4.52)$$
When $N = 3$, only $f_1$ is used as the argument to express $f_0$ and $f_2$, so a piecewise
function is derived by means of Appendix A with
$$f_0 = \begin{cases} 0.2 f_1 + 0.024 & 0 < f_1 \le 0.08 \\ -f_1 + 0.12 & 0.08 < f_1 \le 0.12 \end{cases}$$
$$f_2 = 0.5 f_1 + 0.06, \qquad 0 < f_1 \le 0.12$$
$\prod_i f_i$ is an N-order piecewise polynomial of $f_k$, and we use the optimum
searching method to maximise it. The optimum $f_0$, $f_1$ and $f_2$ are 0.036, 0.084 and
0.102, respectively. It is known that $F$ should be re-permuted as
$F^{(120)} = \mathrm{diag}(0.084, 0.102, 0.036)$.
To compute the executing times of the quasi-supernode vertices $V^q$ associated with all
possible timing vectors $t \in T^{(120)}$, we evaluate $T^{(120)} F^{(120)} W$, where the
vector set $T^{(120)}$ behaves like a matrix. Evaluating the difference between the maximum
and minimum values of each row gives the executing times
[145 150 152 157 128 130 116 121] for the eight timing vectors in $T^{(120)}$, respectively.
The shortest is 116, thus the timing vector [5, 1, 3] gives the minimal executing time
$t_{\min}^{(120)} = 116$. However, this is not the only thing that we are concerned with; the
size of the parallelepiped is important too. Thus, the total computation time is
proportional to
$t_{total}^{(120)} := t_{\min}^{(120)} V^{\min} = t_{\min}^{(120)} / (f_0 f_1 f_2) = 214272$ steps.
In the same way, the total computation times associated with the other permutations
are computed:
(p0 ... pM)           (210)           (120)           (201)
f0 x f1 x f2       9.5 x 10^-4     3.1 x 10^-4     3.6 x 10^-4
t_min                 152             116             110
t_total              160062          377373          303369
t (optimal)         [6,2,1]         [5,1,3]         [6,1,2]
Obviously, the permutation (210) gives the minimal total computation time. Its
corresponding $F^{(210)} = \mathrm{diag}(0.12, 0.114, 0.07)$ and $t = [6,2,1]$; the real-valued
$B_a$ contains the column $[0, 8.7, -17.5]^T$.
Now, $B_a$ has to be integralised to $B$. Let $b_0$, $b_1$ and $b_2$ each have a list of
candidates. Selecting $b_0 = [9,-9,0]^T$, $b_1 = [0,10,-20]^T$ and $b_2 = [0,0,16]^T$ from
their own candidate lists produces a candidate $B$, which is checked for validity using
$U = S_{SBC}^{(210)} B^{-1} V$ and is found to be valid: the projected sizes are
$u_0^{\max} - u_0^{\min} = 11.5$ and $u_1^{\max} - u_1^{\min} = 11.6$, just inside the
12 x 12 EVPA.
In summary, we obtain two matrices,
$$B = \begin{bmatrix} 9 & 0 & 0 \\ -9 & 10 & 0 \\ 0 & -20 & 16 \end{bmatrix}$$
and $T$, and an LSGP compression vector $k = [3,3]^T$. Observe that this example is
relatively simple, because when $N = 3$ the scaling job is easy. There are still many
details we cannot cover here, so the example gives only an indication of how this
methodology works. Consult the appendices for more details. □

Table 4.1: 4-D examples.
              small polyhedron              large polyhedron
              SUC           SBC             SUC           SBC
           EVPA  proj.   EVPA  proj.     EVPA  proj.   EVPA  proj.
0-th d       4   3.97      8   7.8         4   3.94      8   7.96
1st d        4   3.93      8   7.9         4   3.97      8   7.86
2nd d        4   3.75      8   7.2         4   3.97      8   7.7
time          2.4s          0.6s            0.3s          0.7s
4.6.2 More Results and Discussions
More results are given below to show the properties of the methodology. In the following
examples, the size of the processor array is 4 in all dimensions. Let k = [2,2]T for an
SBC.
Table 4.1 gives the results of partitioning and mapping of 4-D graphs, where "proj."
means the projected array and "time" indicates the time consumed in executing the
algorithm, in seconds. The "small polyhedron" is a computational polyhedron of the form
$$V = \begin{bmatrix} 0 & 10 & 0 & 10 & -10 & 0 & 10 & 0 & 20 & 10 \\ 0 & 0 & 20 & 20 & 0 & 40 & 20 & 20 & 40 & 40 \\ 0 & 0 & 0 & 10 & 0 & 10 & 10 & 0 & 10 & 10 \\ 0 & 0 & 0 & 0 & 10 & 10 & 10 & 10 & 0 & 10 \end{bmatrix}$$
and the "large polyhedron" is one with the same shape but enlarged 10 times in every
dimension.
Table 4.2 is similar, but the "small polyhedron" is
$$V = \begin{bmatrix} 0 & 10 & 0 & 10 & -10 & 0 & 0 & 10 & 0 & 20 \\ 0 & 20 & 20 & 0 & 0 & 0 & 20 & 20 & 40 & 40 \\ 0 & 0 & 0 & 10 & 0 & 0 & 10 & 10 & 0 & 10 \\ 0 & 0 & 0 & 0 & 10 & 0 & 10 & 10 & 10 & 0 \\ 0 & 0 & 0 & 0 & 0 & 10 & 10 & 10 & 10 & 10 \end{bmatrix}$$
and the "large polyhedron" is again the same one but enlarged 10 times in every dimension.

Table 4.2: 5-D examples.
              small polyhedron              large polyhedron
              SUC           SBC             SUC           SBC
           EVPA  proj.   EVPA  proj.     EVPA  proj.   EVPA  proj.
0-th d       4   3.94      8   7.9         4   3.94      8   7.8
1st d        4   3.87      8   7.78        4   3.95      8   7.9
2nd d        4   4.        8   6.7         4   3.96      8   7.9
3rd d        4   3.75      8   7.9         4   3.84      8   7.9
time          2.6s          296s            1.s           3.4s

From the results above, it can be seen that the projected arrays usually match the
EVPAs well, especially for the "large polyhedron".
4.7 Summary
In this chapter, a methodology is proposed for the general partitioning and mapping
problem under the URE condition. The algorithm can be briefly summarised as follows and
is illustrated in Figure 4.4:
Pre-compiling
Find the timing vector sets for all permuted versions of one basic space mapping
model.
Compiling
1. Derive a positive expression basis from a dependency matrix.
2. Using the positive expression basis as the directions of the supernode partitioning
parallelepipeds, scale the parallelepiped according to the basic space mapping
model.
3. Optimise the parallelepiped from the permuted versions of the basic model
according to their timing vector sets.
4. Approximate the edges of the parallelepiped by integral vectors.
The methodology shows significant advantages over the previous methods [73] and
[24]. In particular, it considers all the properties with which we are concerned in this
thesis: mapping onto a given regular processor array, implementing dependencies with
given interconnection primitives, transferring data efficiently (locally and packaged),
and high computational efficiency (no "holes" in the computation space). There is no
further restriction upon it except the URE condition, so it is a practical working model.

Figure 4.4: A Conceptual Chart of the Partitioning and Mapping Method. (Boxes: the two
models $P_{SUC}$ and $P_{SBC}$ of interconnection primitives; the two models of space
projection and timing vector, $S_{SUC}$ & $t_{SUC}$ and $S_{SBC}$ & $t_{SBC}$; the processor
array $A_{N-K}$, enlarged by $k$ if $P_{SBC}$ and $K = 1$, implying an LSGP with $k$; the
EVPA; scaling of $E$ so that the original polytope $V$ can be mapped within the EVPA.)
Chapter 5
Optimal Mapping Onto Lower-Dimensional Regular Arrays
In this chapter a new methodology for partitioning and mapping an N-D computation
graph onto a given regular, fixed-mesh M-D array is proposed (M = N - K and K > 1).
The partitioning method of Chapter 4 is also applied here as an initial step to obtain
a supernode polyhedron with the canonical dependencies. The resulting N-D supernode
polyhedron is then projected into a K-D time polyhedron and an M-D space polyhedron.
The latter can be allocated within a given M-D array by scaling the supernode
partitioning parallelepipeds properly. The former is re-projected along a 1-D time
domain by a valid minimum projection vector p, which is derived to achieve high efficiency.
5.1 A Methodology for Partitioning and Mapping onto Lower Dimensional Array
A significant challenge to researchers is to partition and map a computational
polyhedron onto a given regular processor array of lower dimension with fixed shape and
fixed interconnections. We propose an integrated method for this purpose by building on
the methodology developed in Chapter 4. We extend this method to the cases of given M-D
processor arrays, where M = N - K and K > 1. This extension is non-trivial when we
consider a valid and efficient mapping onto a lower-dimensional array. We abandon the
method of searching directly for a conflict-free t as in [58], [59], and [94], because of
the obvious difficulties outlined in Chapter 2. Instead, we employ the method given
briefly below:
Step 1 Derive a positive expression basis from a dependency matrix.
Step 2 Find a unimodular space-time transformation T by which the supernode computa-
tional polyhedron is mapped into a K-D time domain and an M-D space domain.
Step 3 Scale the parallelepiped in some dimensions, according to S, such that the supernode
polyhedron can be mapped within a given regular processor array.
Step 4 Find a valid and minimum projecting vector p by which the K-D time domain of
the supernode polyhedron can be mapped to a 1-D time domain.

Figure 5.1: The Basic Idea of Lower dimensional Partitioning and Mapping Methods.
(Labels: partitioning as in Chapter 4; space mapping S to the space domain; mapping to
the multi-D time polyhedron; remapping to the 1-D time domain.)

Figure 5.1 shows that the supernode polyhedron is mapped within the processor array
and a multi-D time domain. The key point is that the multi-D time polyhedron should
be remapped by p along a 1-D time domain that is as compact (short) as possible.
Since Step 1 and Step 3 have been discussed in the previous chapter, we deal with
Step 2 and focus especially on Step 4, under the assumption that the supernode
partitioning has been carried out, so that a supernode space has been established and
canonical dependencies have been obtained.
Some researchers have pointed out the necessity of avoiding data link conflicts in the
synthesis of lower-dimensional arrays. In fact, this conflict can occur anywhere if the
number of links is less than the number of dependency vectors. To overcome this obstacle,
all the outgoing data, whatever the variables and sources, must and can be transferred
as a packet along a link as long as they go in the same direction, which will be
discussed in Chapters 6 and 7. Nevertheless, the canonical dependencies can prevent the
kind of data link conflict described in [58], [59] if links are provided in the same way
as in [58], [59], because the greatest common divisor of the entries of each canonical
dependency vector is one.
The rest of the chapter is organized as follows. Firstly, we select two families of
space-time transformation matrices for SUC and SBC. Next, we focus on mapping the K-D
time domain to a 1-D domain efficiently by deriving a valid minimum projection vector.
Finally, optimisation is discussed and a number of examples are given to illustrate
the operation and performance of the method.
5.2 The Transformation into K-D Time Domain and M-D Processor Array
5.2.1 Selecting a Family of T
Let a non-singular matrix
$$T = \begin{bmatrix} S \\ \Pi \end{bmatrix}$$
be used to map the original N-D computational polyhedron to a K-D time polyhedron
(where $\Pi$ is a K x N timing matrix) and an M-D processor array (where $S$ is an M x N
space mapping matrix).
For the SUC and SBC interconnection patterns, we choose $S_{SUC}$ and $S_{SBC}$ on the
understanding that they should have as few non-zero entries as possible, and that
$S_{SUC} D^s$ and $S_{SBC} D^s$ can be implemented with $P_{SUC}$ and $P_{SBC}$, respectively.
In addition, it is desired that $S_{SBC} D^s$ has bi-directional data flows. Let
$$S_{SUC} = [0_{M \times K}, I_M]$$
$$S_{SBC} = [0_{M \times K}, I_M] + [0_{M \times (K-1)}, -I_M, 0] \qquad (5.1)$$
As before, it can be verified that $S_{SUC}$ and $S_{SBC}$ satisfy the requirements above,
and that in each row of $S_{SBC} D^s$ there are the same number of "1" and "-1" values.
When choosing $\Pi$, a number of criteria must be taken into account. Firstly, the
resulting $T$ must be non-singular to produce a conflict-free mapping, and there must be
enough time for data flow between processors. If these conditions are satisfied, we say
that $\Pi$ is valid. To maintain high efficiency, the resulting $T$ must be unimodular,
which means there are no "holes" in the resulting computational polyhedron. Furthermore,
$\Pi$ should have the fewest non-zero entries, and the non-zero entries should be as small
as possible, in order for the transformed supernode polyhedron to be as compact as
possible. Such a $\Pi$ is termed priori-optimum. The matrices $\Pi_{SUC}$ and $\Pi_{SBC}$ are
selected as
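$$\Pi_{SUC} = [I_K, L_{K \times M}]$$
$$\Pi_{SBC} = \left[ I_K, \begin{matrix} L_{(K-1) \times M} \\ 0_{1 \times M} \end{matrix} \right] \qquad (5.2)$$
where $L$ is a matrix such that there is one and only one "1" in each column. Then
$T_{SUC} = \begin{bmatrix} S_{SUC} \\ \Pi_{SUC} \end{bmatrix}$ and
$T_{SBC} = \begin{bmatrix} S_{SBC} \\ \Pi_{SBC} \end{bmatrix}$. For example, when N = 5, K = 3,
we may write
$$T_{SUC} = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 & 0 \end{bmatrix} \qquad T_{SBC} = \begin{bmatrix} 0 & 0 & -1 & 1 & 0 \\ 0 & 0 & 0 & -1 & 1 \\ 1 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{bmatrix}$$
Obviously $\det(T_{SUC}) = 1$, because
$\det \begin{bmatrix} S_{SUC} \\ \Pi_{SUC} \end{bmatrix} = \det \begin{bmatrix} \Pi_{SUC} \\ S_{SUC} \end{bmatrix}$
and the latter is an upper-triangular matrix with only "1"s on the diagonal. It can be
proved that $\det(T_{SBC}) = 1$. In fact, adding the i-th column to the (i-1)-th column,
for i = N-1, ..., K, and swapping the first M rows with the remaining K rows, we observe
that
$$T_{SBC} \Longrightarrow \begin{bmatrix} 0_{M \times K} & I_M \\ U_K & M_{K \times M} \end{bmatrix} \Longrightarrow \begin{bmatrix} U_K & M_{K \times M} \\ 0_{M \times K} & I_M \end{bmatrix}$$
where $U_K$ is an upper-triangular matrix. The key point to observe is that in the last
row of $\Pi_{SBC}$ there is only one "1", in the (K-1)-th position.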
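The N = 5, K = 3 example can be rebuilt from the block definitions and checked for unimodularity. The sketch below is illustrative only: `build_T` and `det` are hypothetical helpers, and the two matrices passed as `L` are the ones read off the example above.

```python
from fractions import Fraction


def identity(n):
    return [[int(i == j) for j in range(n)] for i in range(n)]


def build_T(K, M, L, kind):
    """Stack S on top of Pi. L is K x M for SUC and (K-1) x M for SBC."""
    S = [[0] * K + row for row in identity(M)]       # [0_{MxK}, I_M]
    if kind == "SBC":
        for i in range(M):
            S[i][K - 1 + i] -= 1                     # add [0_{Mx(K-1)}, -I_M, 0]
        L = L + [[0] * M]                            # zero row below L
    Pi = [identity(K)[i] + L[i] for i in range(K)]   # [I_K, L]
    return S + Pi


def det(m):
    """Exact determinant by Gaussian elimination over fractions."""
    a = [[Fraction(x) for x in row] for row in m]
    n, sign, result = len(a), 1, Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if a[r][c] != 0), None)
        if pivot is None:
            return 0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            sign = -sign
        result *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            a[r] = [x - factor * y for x, y in zip(a[r], a[c])]
    return int(sign * result)


T_suc = build_T(3, 2, [[0, 0], [0, 1], [1, 0]], "SUC")
T_sbc = build_T(3, 2, [[0, 1], [1, 0]], "SBC")
print(det(T_suc), det(T_sbc))
```

Exact rational arithmetic is used so that the unimodularity test is not disturbed by rounding.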
As will be discussed in the next section, the K-D time domain will be projected into a
1-D time domain by a projecting vector $p = [p_0, \ldots, p_{K-2}, p_{K-1}]$ (usually,
$p_{K-1} = 1$ and $p_0, \ldots, p_{K-2} > 1$). Thus, eventually, we still attain a timing
vector $t$ such that
$$t = p\Pi \quad\text{and}\quad T' = \begin{bmatrix} S \\ t \end{bmatrix}. \qquad (5.3)$$
The transformed dependency matrix is
$T' D^s = \begin{bmatrix} S D^s \\ t D^s \end{bmatrix}$, and the last row $t D^s$ indicates
the time for data flow (or delay) among processors to implement the data dependency
$S D^s$.
In Chapter 4, Theorem 4.3.1 was given to guarantee enough time for data delay; it
can be seen that the condition is satisfied by $p\Pi_{SUC}$ and $p\Pi_{SBC}$. From eqn (5.2),
we obtain $t = [p_0, \ldots, p_{K-1}, x_0, \ldots, x_{M-1}]$, where
$x_i \in \{p_0, \ldots, p_{K-1}\}$ for SUC and $x_i \in \{p_0, \ldots, p_{K-2}\}$ for SBC. In the
case of SBC, for $j = K, \ldots, N-2$, $\sum_{i=0}^{M-1} |s_{i,j}| = 2$, while the
corresponding $t_j$ is larger than 1, since $t_j \in \{p_0, \ldots, p_{K-2}\}$.
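The structure of $t = p\Pi$ can be seen on the N = 5, K = 3 matrices of this section. The sketch below is a small illustration; the value of `p` is a hypothetical choice and `row_times_matrix` is a helper introduced here.

```python
def row_times_matrix(p, Pi):
    """Row vector p times matrix Pi (list of rows)."""
    return [sum(pk * Pi[k][j] for k, pk in enumerate(p))
            for j in range(len(Pi[0]))]


Pi_suc = [[1, 0, 0, 0, 0],
          [0, 1, 0, 0, 1],
          [0, 0, 1, 1, 0]]
Pi_sbc = [[1, 0, 0, 0, 1],
          [0, 1, 0, 1, 0],
          [0, 0, 1, 0, 0]]

p = [6, 3, 1]                        # hypothetical projecting vector, p_{K-1} = 1
t_suc = row_times_matrix(p, Pi_suc)  # last M entries come from {p_0, ..., p_{K-1}}
t_sbc = row_times_matrix(p, Pi_sbc)  # last M entries come from {p_0, ..., p_{K-2}}
print(t_suc, t_sbc)
```

For SBC the trailing entries never pick up $p_{K-1} = 1$, which is why the corresponding $t_j$ stay larger than 1.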
Furthermore, it can be checked that $\Pi_{SUC}$ and $\Pi_{SBC}$ contain the fewest and
smallest non-zero entries (i.e., removal of any element makes them invalid). Therefore
they are priori-optimum. Note that because $\det(T_{SBC}) = 1$, further LSGP partitioning
is no longer needed, unlike the case of the (N-1)-D array in Chapter 4.
5.2.2 Scaling the Supernode Parallelepiped
As proposed in Chapter 4, the positive expression basis $E$ has already been derived;
it should then be scaled and approximated by an integral matrix $B$ which defines the
partitioning parallelepipeds. The scaling procedure is quite similar to that of
Chapter 4. However, since $S$ is lower-dimensional, there is more freedom in determining
the partitioning parallelepiped (i.e., the diagonal matrix
$F = \mathrm{diag}(f_0, \ldots, f_{N-1})$ in eqn (4.29)). In fact, because $U = SFW$, for a
column $s_i$ of $S$ such that $s_i = 0$, $f_i$ makes no contribution to $SF$ and $U$, so
$f_i$ is not determined by the requirement of mapping within a given array. Instead,
$f_i$ is delimited only by an upper bound $1/a_i^{\max}$. In other words, the size of the
parallelepiped in the i-th dimension has a lower bound of $a_i^{\max}$, which is required
by the canonical dependencies in supernode space.
Observe that in the (N-1)-D SUC case, there is only one free dimension, caused by
the 1-D time vector. In the lower-D case, however, there are K free dimensions for SUC
and K-1 free dimensions for SBC. The problem is how to determine the actual sizes of the
supernode parallelepiped along these free dimensions. In theory, the sizes should be as
small as possible to achieve the minimum total executing time. In practice, a smaller
granularity means more supernodes, and hence more overhead operations. A tradeoff has
to be made.
5.3 Maximum Local K-D Time Domain
The supernode polyhedron $P^s$ produced by the partitioning transformation can be
defined by its set of vertices $V^s$. However, this supernode polyhedron is not the
polyhedron which will be computed in one processor. In fact, a particular processor $p$
is assigned a part of $P^s$, called the local supernode polyhedron $P^s_p$. With $S$, we
can split $P^s$ into the $P^s_p$ for each processor. Before doing so, some preliminary
work must be presented.
5.3.1 Intersecting a Polyhedron with a Hyperplane
For any face of an n-D polyhedron, the facet is a more precise description. A facet is
bounded by a number of edges, and each edge is defined by a pair of vertices. So the
facet is determined by a number, say $n_v$, of vertices $v_j$, $n_v \ge n$. When a facet is
determined by $v_0, \ldots, v_{n_v - 1}$, we define the facet space range $g_i^l$ and $g_i^u$
in the i-th dimension as
$$g_i^l = \min_{0 \le j < n_v} v_{j,i}, \qquad g_i^u = \max_{0 \le j < n_v} v_{j,i}, \qquad i = 0, \ldots, n-1 \qquad (5.4)$$
where $v_{j,i} \in v_j$.
In the example above, we saw that an n-D polyhedron is intersected by a hyperplane,
which produces an (n-1)-D sub-polyhedron. The vertices and faces of the sub-polyhedron
must be derived. Fortunately, this job is not too difficult, since in our case the
intersecting hyperplane is perpendicular to a coordinate axis. A hyperplane $j_r = h$
intersects all facets whose space range is such that $g_r^l \le h \le g_r^u$. Substituting
$h$ for $j_r$ in the faces corresponding to these facets, we obtain the faces of the
sub-polyhedron.
The vertices of the sub-polyhedron can only be the intersections of the hyperplane
with the edges of the original convex hull (when n > 2), not with the faces. That is, any
point on a facet but not on an edge of the facet cannot be a vertex of a sub-polyhedron
of the original; see Figure 5.2, where a 3-D polyhedron "ABCD" is intersected by plane p,
creating the 2-D polyhedron "abc". It can be seen that points a, b and c are the
intersections of edges AD, BD and CD with plane p, respectively.

Figure 5.2: The Intersection of Polyhedron with a Hyperplane.

More strictly, if $j = [j_0, \ldots, j_{r-1}, h, j_{r+1}, \ldots, j_{n-1}]^T$ is such a point on
a facet, it is always possible to find a short enough vector
$\Delta j = [\Delta j_0, \ldots, \Delta j_{r-1}, 0, \Delta j_{r+1}, \ldots, \Delta j_{n-1}]^T$ such that
$j + \Delta j$ and $j - \Delta j$ are still on the facet. After intersection, the points
$[j_0, \ldots, j_{r-1}, j_{r+1}, \ldots, j_{n-1}]^T$,
$[j_0 + \Delta j_0, \ldots, j_{r-1} + \Delta j_{r-1}, j_{r+1} + \Delta j_{r+1}, \ldots, j_{n-1} + \Delta j_{n-1}]^T$ and
$[j_0 - \Delta j_0, \ldots, j_{r-1} - \Delta j_{r-1}, j_{r+1} - \Delta j_{r+1}, \ldots, j_{n-1} - \Delta j_{n-1}]^T$
are all on the surface of the sub-polyhedron. Therefore,
$[j_0, \ldots, j_{r-1}, j_{r+1}, \ldots, j_{n-1}]^T$ cannot be a vertex, because it lies between
two points. On the other hand, only the intersection of a hyperplane with an edge
determines a point. Then, for all edges, if the pair of vertices $v^1$ and $v^2$ is such
that $v_r^1 \le h \le v_r^2$ or $v_r^2 \le h \le v_r^1$, a vertex $w$ of the sub-polyhedron
is determined as
$$w_i = v_i^1 + \frac{h - v_r^1}{v_r^2 - v_r^1} (v_i^2 - v_i^1), \qquad i = 0, \ldots, r-1, r+1, \ldots, n-1 \qquad (5.5)$$
Obviously, the vertices of the sub-polyhedron are not necessarily integral vectors.
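Eqn (5.5) translates almost directly into code. The sketch below is a minimal illustration with integer end points and exact fractions; `cut_edge` is a hypothetical name, and an edge lying entirely inside the hyperplane is skipped, on the assumption that its end points are handled separately.

```python
from fractions import Fraction


def cut_edge(v1, v2, r, h):
    """Vertex where edge (v1, v2) meets the hyperplane j_r = h, or None.

    The returned point drops coordinate r, so it lives in the (n-1)-D
    sub-polyhedron, following eqn (5.5).
    """
    lo, hi = min(v1[r], v2[r]), max(v1[r], v2[r])
    if not (lo <= h <= hi) or v1[r] == v2[r]:
        return None  # edge misses the hyperplane, or is parallel to it
    ratio = Fraction(h - v1[r], v2[r] - v1[r])
    return [v1[i] + ratio * (v2[i] - v1[i]) for i in range(len(v1)) if i != r]


# Edge from (0,0,0) to (4,2,6) cut by the plane j_0 = 1.
w = cut_edge((0, 0, 0), (4, 2, 6), r=0, h=1)
print(w)
```

The result here is the non-integral point (1/2, 3/2), which illustrates the closing remark that sub-polyhedron vertices need not be integral.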
5.3.2 Maximum Local Supernode Domain
Notice that the computation task splitting is easily done in the processor-time domain.
Therefore, the process of splitting consists of a domain transformation from supernode to
processor-time, then splitting, and finally a domain transformation from processor-time
back to supernode.
Step 1 Map $P^s$ into the processor-time domain by $U$ to form an N-D processor-time
polyhedron $P^q = U P^s$, defined by its vertices $V^q = U V^s$. $U$ can be any unimodular
matrix with $S$ as its first M row vectors. Thus any of the $T$ in the above section
can be used. Suppose the processor-time domain is indexed by $j$.
Step 2 Use M hyperplanes $j_0 = p_0, \ldots, j_{M-1} = p_{M-1}$ to intersect the N-D $P^q$ in
series, determining the $P^q_p$ assigned to a particular processor
$p = [p_0, \ldots, p_{M-1}]^T$. That is, firstly, intersecting the N-D $P^q$ with $j_0 = p_0$
results in an (N-1)-D sub-polyhedron; then, intersecting the (N-1)-D sub-polyhedron with
$j_1 = p_1$ results in an (N-2)-D sub-polyhedron, and so on. At the end, we obtain an
(N-M)-D sub-polyhedron indexed by $j_M, \ldots, j_{N-1}$, as well as its vertices $v^b$,
which have the form $[v_M, \ldots, v_{N-1}]^T$. Collect all $v^b$ into $V^b$, the vertex set
of the (N-M)-D sub-polyhedron with respect to processor $p$.
Step 3 Remap the (N-M)-D sub-polyhedron back to the supernode space. Attaching $p$
to the front of each $v^b$, we obtain the vertices $v^q_p$ of the sub-polyhedron in the
N-D processor-time domain, i.e., $v^q_p = [p_0, \ldots, p_{M-1}, v_M, \ldots, v_{N-1}]^T$. Now
compute $v^s_p = U^{-1} v^q_p$ for all $v^q_p$, which confine the local supernode polyhedron
assigned to the processor $p$. Collect all $v^s_p$ into $V^s_p$, the vertex set of $P^s_p$.
Step 4 Repeat this procedure for every processor.
Note that $P^s_p$ is independent of $U$ as long as the condition in Step 1 for constructing
$U$ holds. Unfortunately, for a large processor array there are many $P^s_p$. In some
cases the $P^s_p$ share the same shape up to a translation, and we need take only one
(only the shape of $P^s_p$ affects the following operations). Otherwise we can merge them
into a maximum local supernode polyhedron.
First, translate each $P^s_p$ to the origin by subtracting the average value of each
row vector of $V^s_p$, and then merge all the centred $V^s_p$ into one large set of
points. Finally, feed the set of points to an algorithm which produces the convex hull
of the set (a convex-hull algorithm [65] [98] can be used, which gives the vertex set and
all faces, characterized by their normal equations, of the convex hull). The maximum
local polyhedron and its vertices are denoted by $P^l$ and $V^l$, respectively.
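The centring and merging step can be sketched as follows. This is a minimal illustration with hypothetical helper names and toy 2-D vertex sets; the final convex-hull computation is left to a separate algorithm, as in the text, and is not shown.

```python
from fractions import Fraction


def centre(vertices):
    """Translate a vertex set so that the mean of every coordinate is zero."""
    n = len(vertices)
    dims = range(len(vertices[0]))
    mean = [Fraction(sum(v[i] for v in vertices), n) for i in dims]
    return [tuple(x - m for x, m in zip(v, mean)) for v in vertices]


def merge_centred(vertex_sets):
    """Union of all centred vertex sets, ready for a convex-hull algorithm."""
    return sorted({pt for vs in vertex_sets for pt in centre(vs)})


# Two local polyhedra with the same shape at different positions (toy data).
local_a = [(0, 0), (2, 0), (0, 2), (2, 2)]
local_b = [(5, 5), (7, 5), (5, 7), (7, 7)]
merged = merge_centred([local_a, local_b])
print(merged)
```

Because the two toy sets differ only by a translation, centring makes them coincide and the merged set has four points rather than eight.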
5.3.3 Mapping to K-D Time Domain
Because the number of integer points (supernodes) enclosed within $P^l$ is finite,
$V^l$ also defines the polytope of $P^l$ and, by definition, any non-vertex (integer)
point $s$ of $P^l$ can be expressed as a convex combination of $V^l$, that is,
$$s = \sum_{j=0}^{n_v - 1} \lambda_j v_j^l \qquad (5.6)$$
where $v_j^l \in V^l$, $\sum_{j=0}^{n_v - 1} \lambda_j = 1$ and $0 \le \lambda_j < 1$.
By applying $\Pi_K$ to $P^l$, we obtain a K-D polyhedron $\Pi_K P^l \to P^K$. Let $V^K$
denote the vertex set of $P^K$ (henceforth, "polyhedron" simply indicates such a
polyhedron in the K-D time domain). Note that $V^K \subseteq \Pi_K V^l$, and this is true
for any arbitrary $\Pi_K$. In fact, applying $\Pi_K$ to eqn (5.6), we have
$\Pi_K s = \sum_{j=0}^{n_v - 1} \lambda_j \Pi_K v_j^l$, where the $\lambda_j$ are the same as
those of eqn (5.6). This means that $\Pi_K s$ cannot be a vertex of $P^K$, so the vertices
of $P^K$ can only be found among the $\Pi_K v_j^l$. A convex-hull algorithm is applied to
the set of points $\Pi_K V^l$ to find its convex hull $\mathcal{CH}^K$.
5.4 Valid Minimum Projecting Vector
It is easy to determine a hypercube with bounds $b_0, \ldots, b_{K-1}$ to contain the
polyhedron, i.e.,
$$b_j = \max_{0 \le i < n_K} v^K_{i,j} - \min_{0 \le i < n_K} v^K_{i,j} + 1, \qquad j = 0, \ldots, K-1 \qquad (5.7)$$
where $v^K_{i,j} \in v^K_i \in V^K$ and $n_K$ is the cardinality of $V^K$.
As mentioned before, the straightforward method (see [104]) to map the K-D polyhedron
to a 1-D domain is to construct a projecting vector $p = [p_0, \ldots, p_{K-1}]$, where
$$p_{K-1} = 1, \quad\text{and}\quad p_i = b_{i+1} p_{i+1}, \qquad i = K-2, \ldots, 0 \qquad (5.8)$$
Obviously, when the shape of the polyhedron is not a regular hypercube, the efficiency
can be very low, because too much of the surrounding hypercube is identified with null
computations. The problem is that the elements of $p$ defined in eqn (5.8) rely on a
sufficient condition for avoiding conflict, but not a necessary one.
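Eqns (5.7) and (5.8) can be sketched in a few lines. The vertex list below is the six-vertex polyhedron of Example 5.4.1 later in this section; the function name is a hypothetical choice.

```python
def bounding_box_projection(vertices):
    """Return (b, p): box bounds from eqn (5.7) and projecting vector from (5.8)."""
    K = len(vertices[0])
    b = [max(v[j] for v in vertices) - min(v[j] for v in vertices) + 1
         for j in range(K)]
    p = [1] * K                       # p_{K-1} = 1
    for i in range(K - 2, -1, -1):    # p_i = b_{i+1} * p_{i+1}
        p[i] = b[i + 1] * p[i + 1]
    return b, p


vertices = [(0, 0, 2), (0, 0, 4), (0, 1, 4), (2, 2, 0), (2, 4, 0), (2, 4, 3)]
b, p = bounding_box_projection(vertices)
print(b, p)
```

The projecting vector produced this way depends only on the bounding box, not on the shape of the polyhedron inside it, which is the source of the inefficiency discussed above.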
An example in Figure 5.3(a), which is rather extreme, illustrates the possibility of
decreasing the elements of $p$ while keeping the conflict-free property. The method in
eqn (5.8) yields $p = [6,1]$. However, it is easy to verify that the vector [2,1] is
sufficient and necessary to achieve a conflict-free mapping. Intuitively this is because
the band width of the polyhedron is 3. This observation is not true in general, as
demonstrated by the polyhedron of Figure 5.3(b), which is simply the reverse of
Figure 5.3(a): the same $p$ causes a conflicting mapping.

Figure 5.3: 2-D Time Polyhedron. For both (a) and (b), the number at the bottom-left
corner of a square is the index of the node and the number at the top-right corner is
the allocation in the 1-D domain projected by p = [2,1]. In (b), the numbers at the
top-left and bottom-right corners are the allocations projected by p = [-2,1] and
[2,-1], respectively.
From Figure 5.3(b), it is also clear that a small $p$ (e.g., [-2,1]) can avoid
conflicts. But that $p$ is unacceptable, because it violates the dependency relation in
the $i_0$ direction. A naive attempt to fix the problem is to reverse the polyhedron along
the $i_1$ axis (i.e., a transformation $i_1' = -i_1$, such that Figure 5.3(b) has the same
shape as (a)). This transformation is equivalent to using $p = [2,-1]$. However, since the
execution sequence is from left to right, this projection, or the transformation,
violates the dependency relation in the $i_1$ direction.
Definition 5.4.1 A projecting vector $p$ is said to be valid if, for all
$j^1, j^2 \in P^K$ with $j^1 \succ j^2$, $p j^1 > p j^2$, and to be minimum if any decrease
of its elements causes the projection to be invalid.
Now it can be understood that "valid" guarantees no violation of data dependency by
such a mapping. Henceforth, p indicates the valid minimum projecting vector. We use
an example to illustrate the idea of obtaining a valid minimum projecting vector.
Example 5.4.1 The Projection of 3-D Time Domain to 1-D Time Domain
A polyhedron is defined by a set of 6 vertices
$$\begin{bmatrix} 0 & 0 & 0 & 2 & 2 & 2 \\ 0 & 0 & 1 & 2 & 4 & 4 \\ 2 & 4 & 4 & 0 & 0 & 3 \end{bmatrix}$$
The layouts of the 3-D polyhedron intersected by the three planes $i_0 = 0, 1, 2$ are
shown in Figure 5.4.

Figure 5.4: The layouts of a 3-D Polyhedron: (a) $i_0 = 0$ layer, (b) $i_0 = 1$ layer,
(c) $i_0 = 2$ layer. The number at the bottom of each box is the index of the node and
the number at the top is the allocation in the 1-D domain projected by p = [4,2,1].
the series of adjacent nodes implies that P2 = 1 is required to avoid conflict. We need
to determine the minimum Pl. Since PI is used to project the rows of different it, it is
desired that the first node, say jl = rio, il + 1, IV, of the (il + 1) row is just behind the
last node of the il row, say jl = rio, it, I]T, that is [Po,Pt, P2]jI = [Po,PI, P2]j' +1. Therefore
PI = 1- 1+1;
However, there are a number of adjacent rows in each layer of the polyhedron, we must
choose the maximum Pl. Both the pair of rows il = 1 and il = 2 on the layer io = 1 and
the pair il = 3 and il = 4 on the layer io = 2 produce a PI = 2.
Finally, we derive the minimum $p_0$. Because $p_0$ is used to project the layers of different $i_0$, we hope that $p_0$ is chosen such that each layer can be projected one by one. That is, if the last node of a layer is $j^l = [i_0, l_1, l_2]^T$ and the first node of the layer above it is $j^f = [i_0 + 1, f_1, f_2]^T$, then $[p_0, p_1, p_2]\,j^f = [p_0, p_1, p_2]\,j^l + 1$. So, similar to $p_1$,
$$p_0 = ([l_1, l_2] - [f_1, f_2]) \begin{bmatrix} p_1 \\ p_2 \end{bmatrix} + 1$$
The last node of the layer $i_0 = 0$ is $[0, 1, 4]^T$; the first and the last nodes of the layer $i_0 = 1$ are $[1, 1, 1]^T$ and $[1, 2, 3]^T$, respectively. Similarly the first node in layer $i_0 = 2$ is $[2, 2, 0]^T$, so the pair $[0, 1, 4]^T$ and $[1, 1, 1]^T$ produces $p_0 = 4$. The pair of nodes $[1, 2, 3]^T$ and $[2, 2, 0]^T$ does the same. Thus we obtain a valid projection vector p = [4,2,1], and it is easy to verify that a decrease of any element of p will cause conflicts. The instance of execution for each node projected by p is also shown in Figure 5.4. The polyhedron has 16 nodes and the executing time is 18 steps. In contrast, by means of the conventional method of eqn (5.8), the projecting vector will be p = [25,5,1] and the corresponding executing time is 72 steps. In this example, the efficiency is improved significantly, from 0.22 to 0.89. □
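The figures quoted in Example 5.4.1 can be checked with a small Python sketch. The node list below is an assumption of this sketch: it was enumerated by hand, layer by layer, from the six vertices, and is not taken from the thesis software. The helper names are hypothetical, and `is_minimum` only tries unit decreases of each element.

```python
# Integer nodes of the polyhedron of Example 5.4.1, enumerated by hand
# from its six vertices (an assumption of this sketch).
NODES = [
    (0, 0, 2), (0, 0, 3), (0, 0, 4), (0, 1, 4),
    (1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 2, 2), (1, 2, 3),
    (2, 2, 0), (2, 3, 0), (2, 3, 1),
    (2, 4, 0), (2, 4, 1), (2, 4, 2), (2, 4, 3),
]


def project(p, node):
    """1-D time instant assigned to a node by the projecting vector p."""
    return sum(pi * ji for pi, ji in zip(p, node))


def is_valid(p, nodes):
    """Lexicographically later nodes must get strictly later times."""
    times = [project(p, j) for j in sorted(nodes)]  # tuple sort is lexicographic
    return all(a < b for a, b in zip(times, times[1:]))


def executing_time(p, nodes):
    """Number of time steps spanned by the projected nodes."""
    times = [project(p, j) for j in nodes]
    return max(times) - min(times) + 1


def is_minimum(p, nodes):
    """Valid, and decreasing any single element by one breaks validity."""
    if not is_valid(p, nodes):
        return False
    for k in range(len(p)):
        smaller = list(p)
        smaller[k] -= 1
        if is_valid(smaller, nodes):
            return False
    return True


if __name__ == "__main__":
    for p in [(4, 2, 1), (25, 5, 1)]:
        steps = executing_time(p, NODES)
        print(p, is_valid(p, NODES), steps, round(len(NODES) / steps, 2))
```

Under these assumptions the sketch reports 18 steps for p = [4,2,1] and 72 steps for p = [25,5,1], matching the efficiencies 0.89 and 0.22 quoted above.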
It must be pointed out that the time mapping efficiency is not the same as the overall
efficiency of the whole processor array. The former indicates the internal efficiency of a
processor during its active period, while the latter is not only proportional to the internal
efficiency of a processor but also depends on the distribution of the active periods of
every processor. The distribution of the active periods of processors is determined by
data dependency, that is, some processor must be inactive until other processors provide
necessary data.
5.5 General Methodology for Valid Minimum Projecting Vector
We will develop a general methodology for deriving p. Unfortunately, this is not an easy
algorithm to describe and requires careful reading.
5.5.1 The First and Last Nodes of Polyhedrons
Before giving the method of deriving p, more information about an arbitrary polyhedron
is required.
Definition 5.5.1 A (K-r)-D subpolyhedron, denoted $P^{K-r}$, is a polyhedron created by intersecting a K-D polyhedron with r hyperplanes $i_0 = h_0, i_1 = h_1, \ldots, i_{r-1} = h_{r-1}$ step by step. For uniformity, the indices of $P^{K-r}$ are marked $i_r, \ldots, i_{K-1}$.
A pair of adjacent (K-r-1)-D subpolyhedra is such that the lower, $P_0^{K-r-1}$, is produced with hyperplanes $i_0 = h_0, \ldots, i_{r-1} = h_{r-1}, i_r = h_r$, and the upper, $P_1^{K-r-1}$, with $i_0 = h_0, \ldots, i_{r-1} = h_{r-1}, i_r = h_r + 1$.
Definition 5.5.2 An integral point in a K-D polyhedron P (or subpolyhedron) is referred to as the first node, $f^K$, if $\forall$ integral points $j \in P$, $j \succeq f^K$. Similarly, an integral point is referred to as the last node, $l^K$, if $\forall$ integral points $j \in P$, $l^K \succeq j$.
If the polyhedron represents a nested loop computation, $f^K$ is the first iteration to be executed, and $l^K$ the last one. If the polyhedron is the original convex hull, so that all the vertices are integers, $f^K$ and $l^K$ must belong to the vertex set. By the definition it is easy to look for $f^K$ and $l^K$ in the vertex set. However, if the polyhedron is a (K-r)-D subpolyhedron, we require a different method to find $f^{K-r}$ and $l^{K-r}$.
Consider $f^{K-r} = [f_r^{K-r}, \ldots, f_{K-1}^{K-r}]^T$ for $P^{K-r}$. Start from the index $i_r$. First, find the lower bound $b_r$ of $P^{K-r}$ in the $i_r$ direction according to $b_r = \min_j w_{r,j}$, where $w_{r,j} \in w_j$ and $w_j$ is a vertex of $P^{K-r}$. Let $f_r^{K-r} = \lceil b_r \rceil$. Then intersect $P^{K-r}$ with the hyperplane $i_r = f_r^{K-r}$. We obtain a further (K-r-1)-D subpolyhedron $P^{K-r-1}$. Next, find the lower bound $b_{r+1}$ of $P^{K-r-1}$ and let $f_{r+1}^{K-r} = \lceil b_{r+1} \rceil$. Continue the recursive procedure until $f_{K-1}^{K-r}$ is found. If $\lceil b_{r+1} \rceil$ is outside $P^{K-r-1}$, this means that the intersection by $i_r = \lceil b_r \rceil$ is invalid and we must reassign $f_r^{K-r} = \lceil b_r \rceil + 1$, and intersect $P^{K-r}$ by $i_r = \lceil b_r \rceil + 1$. The process is repeated until $\lceil b_{r+1} \rceil$ is inside $P^{K-r-1}$. The process is illustrated in Figure 5.5. The intersection made by $i_r = \lceil b_r \rceil$ is invalid, because $\lceil b_{r+1} \rceil$ is outside the $P^{K-r-1}$ of Figure 5.5(b). Letting $f_r^{K-r} = \lceil b_r \rceil + 1$, we try again. The $P^{K-r-1}$ of Figure 5.5(c) is acceptable. Thus, finally, we obtain $f^{K-r} = [2, 2]^T$.
The same procedure is applicable for determining $l^K$, but we replace "lower bound" with "upper bound" and replace $\lceil\ \rceil$ with $\lfloor\ \rfloor$.
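The search for the first and last nodes can be sketched in Python for the 2-D case, where a subpolyhedron is a convex polygon given by its vertices in order. This is a minimal illustration under stated assumptions, not the thesis algorithm (Algorithms C.3 and C.4): the function names and the sample triangle are hypothetical, and exact rational arithmetic is used so that ceilings and floors are not affected by rounding.

```python
from fractions import Fraction
from math import ceil, floor


def slice_interval(polygon, x):
    """Range (lo, hi) of the second index on the line 'first index == x',
    or None if that line misses the convex polygon."""
    hits = []
    n = len(polygon)
    for k in range(n):
        (x0, y0), (x1, y1) = polygon[k], polygon[(k + 1) % n]
        if x0 == x1:
            if x0 == x:               # edge lies on the slicing line
                hits += [y0, y1]
        elif min(x0, x1) <= x <= max(x0, x1):
            t = Fraction(x - x0) / Fraction(x1 - x0)
            hits.append(y0 + t * (y1 - y0))
    return (min(hits), max(hits)) if hits else None


def first_node(polygon):
    """Lexicographically smallest integral point: start at ceil(lower bound)
    and move the hyperplane up by one while the slice holds no integer."""
    x = ceil(min(v[0] for v in polygon))
    x_max = max(v[0] for v in polygon)
    while x <= x_max:
        interval = slice_interval(polygon, x)
        if interval is not None:
            lo, hi = interval
            if ceil(lo) <= hi:
                return (x, ceil(lo))
        x += 1
    return None


def last_node(polygon):
    """Same search with upper bounds and floors."""
    x = floor(max(v[0] for v in polygon))
    x_min = min(v[0] for v in polygon)
    while x >= x_min:
        interval = slice_interval(polygon, x)
        if interval is not None:
            lo, hi = interval
            if floor(hi) >= lo:
                return (x, floor(hi))
        x -= 1
    return None


# Hypothetical triangle: the slice at x = 1 contains no integral point,
# so the search has to move on to x = 2.
TRIANGLE = [(Fraction(1, 2), Fraction(3, 2)), (3, 1), (3, 3)]

if __name__ == "__main__":
    print(first_node(TRIANGLE), last_node(TRIANGLE))
```

For this triangle the first hyperplane tried is rejected, which mirrors the situation of Figure 5.5(b); the higher-dimensional case would apply the same step recursively, one index at a time.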
In a (K-r)-D subpolyhedron, for a pair of adjacent (K-r-1)-D subpolyhedra, a cross-layer distance $d^{K-r}$ indicates the difference between the first node of the upper subpolyhedron and the last node of the lower one, i.e.,
$$d^{K-r} = \begin{bmatrix} h_r + 1 \\ f_1^{K-r-1} \end{bmatrix} - \begin{bmatrix} h_r \\ l_0^{K-r-1} \end{bmatrix}$$
[Figure 5.5: panels (a), (b) and (c)]
Figure 5.5: Finding $f^{K-r}$ for $P^{K-r}$. (a) is the $P^{K-r}$ being intersected by two planes. (b) and (c) are the resulting subpolyhedra $P^{K-r-1}$ produced by intersecting with $i_r = \lceil b_r \rceil$ and $i_r = \lceil b_r \rceil + 1$, respectively. The selection of the hyperplane which determines $f^{K-r}$ is affected by the resulting subpolyhedra, that is, begin with $i_r = \lceil b_r \rceil$; if $\lceil b_{r+1} \rceil$ is outside the resulting $P^{K-r-1}$, change to $i_r = \lceil b_r \rceil + 1$.
5.5.2 Deriving p
From Example 5.4.1, we know that the process to derive p is a recursive procedure from $p_{K-1}$ to $p_0$. In a (K-r)-D subpolyhedron, for a pair of adjacent (K-r-1)-D subpolyhedra, $p_r$ should be determined such that no overlapped projection is made, that is
$$\bar{p}^{K-r} d^{K-r} \ge 1 \;\Rightarrow\; p_r \ge \bar{p}^{K-r-1}\,(l_0^{K-r-1} - f_1^{K-r-1}) + 1 \qquad (5.9)$$
where $\bar{p}^{K-r} = [p_r, \bar{p}^{K-r-1}]$. Furthermore this should hold for all possible (K-r)-D subpolyhedra and for all possible pairs of adjacent (K-r-1)-D subpolyhedra. We must find the lexicographically greatest $d^{K-r}$. In principle, this can be achieved by trying all the intersecting hyperplanes $i_0 = h_0, \ldots, i_r = h_r$, but obviously, this approach is not practical when the size of the convex hull is large.
Fortunately, it is sufficient to search for the potential $p_r$ by considering only the intersecting hyperplanes associated with the vertices of all possible $P^{K-s}$, $s = 0, \ldots, r+1$. We explain the method in real space for simplicity.
Lemma 5.5.1 $p_r$ is a piecewise linear function of $h_0, \ldots, h_r$.
Proof In fact, $f^{K-r-1}$ and $l^{K-r-1}$ belong to the vertices of the subpolyhedra $P^{K-r-1}$. Thus each of them is the intersection of K-r-1 faces of $P^{K-r-1}$. The K-r-1 faces of $P^{K-r-1}$ are produced from K-r-1 faces of the original $CH^K$, intersected by the r+1 hyperplanes $i_0 = h_0, \ldots, i_r = h_r$. Let $h = [h_0, \ldots, h_r]^T$. For $P_0^{K-r-1}$, from the K-r-1 equations related to $l_0^{K-r-1}$, $i = 1, \ldots, K-r-1$,
$$[a_{i,0}, \ldots, a_{i,r}]\,h + [a_{i,r+1}, \ldots, a_{i,K-1}]\,l_0^{K-r-1} + a_{i,K} = 0$$
we see that $l_0^{K-r-1}$ can be expressed in the form $l_0^{K-r-1} = b' + A'h$. Similarly, for the $P_1^{K-r-1}$ which is produced by using $i_0 = h_0, \ldots, i_{r-1} = h_{r-1}$ and $i_r = h_r + 1$, $f_1^{K-r-1}$ has the same form, $f_1^{K-r-1} = b'' + A''h$. Therefore $p_r$ is a linear function of $h_0, \ldots, h_r$, namely $p_r = a_0 h_0 + \cdots + a_r h_r + a_{r+1}$. But, when $h_0, \ldots, h_r$ vary, they may go outside the boundaries of some facets which are associated with $l_0^{K-r-1}$ or with $f_1^{K-r-1}$. Thus some of the faces become unrelated, while other faces will join in. When the membership of the related faces changes, the linear function for $p_r$ will change too, that is, $a_0, \ldots, a_{r+1}$ are also piecewise linear functions of $h_0, \ldots, h_r$. □
If $h_0, \ldots, h_{i-1}$ are fixed, $p_r$ can be rewritten as $p_r = a_i h_i + \cdots + a_r h_r + a'_{r+1}$. For a linear function, there is no critical point for a local maximum or minimum value within its area of definition. However, for a piecewise linear function, the critical points appear only at those points where a change of any of $h_0, \ldots, h_r$ will cause a change of the function to another set of $a_0, \ldots, a_{r+1}$. If this were not the case, say when $h_0$ varies while $a_0$ stays constant, $p_r$ would still be a linear function with respect to $h_0$, so there would be no critical point in a small interval around that $h_0$.
Theorem 5.5.1 If $h^c = [h_0^c, \ldots, h_r^c]^T$ is a critical point, then $\forall i$, $h_i^c$ (or $h_i^c + 1$ when $i = r$) must be at a vertex of $P^{K-i}$, where "at a vertex" means $h_i^c$ is the $v_i$ of the vertex $[v_i, \ldots, v_{K-1}]^T$.
Proof: The proof proceeds by induction. At first, fix $h_0, \ldots, h_{r-1}$ to produce a $P^{K-r}$; then $p_r = a_r h_r + a'_{r+1}$. The $p_r$ cannot achieve a local maximum value unless $h_r^c$ is at a vertex of $P^{K-r}$, where any movement of $h_r$ will cause a change of $a_r$. If not, when $h_r$ varies in an interval small enough, the set of facets intersected by $h_r$ remains unchanged, because $h_r$ does not move outside these bounds in the $i_r$ direction (where the bound is the component of a vertex in the given direction), so $a_r$ is unchanged.
Suppose $\forall j$, $i+1 \le j \le r$, $h_j^c$ is at a vertex of $P^{K-j}$. When $h_i$ varies, $h_{i+1}^c, \ldots, h_r^c$ must change as well to keep the property, since their corresponding vertices are changed. This implies that if $h_i$ does not move over the bounds of the facets it currently intersects, there is no change of associated facets of $P^{K-i}$, so $h_{i+1}$ is still at the vertex of the same set of facets, and so on for $h_{i+2}, \ldots, h_r$. Thus, $a_i, \ldots, a_r$ remain unchanged. In contrast, $h_{i+1}, \ldots, h_r$ determined in this way are linear functions of $h_i$ (because a vertex of $P^{K-i-1}$ is a linear function of the intersecting hyperplane $i_i = h_i$, and so on, so there is a recursive linear relation between $h_i$ and $h_{i+1}, \ldots, h_r$). Then $p_r = a_i h_i + \cdots + a_r h_r + a'_{r+1}$ can be rewritten as $p_r = a'_i h_i + a''_{r+1}$. Therefore, as in the case $i = r$, to achieve a local maximum value of $p_r$, $h_i$ must be located at a vertex of $P^{K-i}$. □
By the above theorem, unlike [94] and [34] whose searching spaces are related to
the size (volume) of the original polyhedron, the computing complexity of our method
is related only to the number of the vertices of a series of polyhedra. For any sensible
computational polyhedron, the number of vertices is quite limited, no matter how large
the polyhedron becomes. This is the major computing advantage of our method over
others.
For a subpolyhedron $P^{K-i}$ with a vertex set $V^{K-i}$, we form a set $H^{K-i} = \{v_0, \ldots, v_{n_h}\}$ of candidates of intersecting hyperplanes by collecting all different $v_{i,j} \in v_j \in V^{K-i}$, where $n_h$ is the cardinality of $H^{K-i}$. Now, the operation to derive p can be described as follows.
Step 1 Create a set of polyhedra $SP_i$ which consists of all subpolyhedra $P^{K-i}$ ($SP_0 = \{P^K\}$). When $i > 0$, $SP_i$ is produced from $SP_{i-1}$. For each of the subpolyhedra of $SP_{i-1}$, say $P^{K-i+1}$, find intersecting hyperplanes $i_{i-1} = v_j$ for all $v_j \in H^{K-i+1}$, producing $n_h$ $P^{K-i}$'s; collect all such created $P^{K-i}$'s into $SP_i$. K-1 SP's are created, including $SP_0$.
Suppose that $\bar{p}^{(K-r-1)} = [p_{r+1}, \ldots, p_{K-1}]$ has been determined. Now, derive $p_r$. For brevity, let $r' = K - r$.
Step 2 Take one subpolyhedron of $SP_r$, say $P^{r'}$. For each $v_i \in H^{r'}$, make three intersecting hyperplanes, $i_r = v_i - 1$, $i_r = v_i$ and $i_r = v_i + 1$. Then intersect $P^{r'}$ with the three parallel hyperplanes, obtaining $P_{-1}^{(r'-1)}$, $P_0^{(r'-1)}$ and $P_1^{(r'-1)}$.
Step 3 Find the first and the last nodes for each of the three subpolyhedra, $f_{-1}^{(r'-1)}$ and $l_{-1}^{(r'-1)}$, $f_0^{(r'-1)}$ and $l_0^{(r'-1)}$, and $f_1^{(r'-1)}$ and $l_1^{(r'-1)}$. Calculate a possible $p_r$
$$p_r = \max\left\{ \bar{p}^{(r'-1)}\,(l_{-1}^{(r'-1)} - f_0^{(r'-1)}),\ \bar{p}^{(r'-1)}\,(l_0^{(r'-1)} - f_1^{(r'-1)}) \right\} + 1 \qquad (5.10)$$
Step 4 Repeat Step 2 and Step 3 for all subpolyhedra of $SP_r$. Let $p_r$ be the greatest value evaluated by eqn (5.10).
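The evaluation in Step 3 is a small computation once the first and last nodes are known. The following Python sketch is illustrative only: the function name is hypothetical, and the node values are those quoted in Example 5.5.1 for the hyperplanes 9, 10 and 11, as best they can be read.

```python
def dot(p, v):
    """Inner product of a projecting vector and a node difference."""
    return sum(a * b for a, b in zip(p, v))


def candidate_pr(p_tail, l_below, f_mid, l_mid, f_above):
    """Evaluate eqn (5.10) for one candidate hyperplane.

    p_tail           already determined tail [p_{r+1}, ..., p_{K-1}]
    l_below, f_mid   last node of the slice at v_i - 1, first node at v_i
    l_mid, f_above   last node of the slice at v_i, first node at v_i + 1
    """
    def diff(a, b):
        return [x - y for x, y in zip(a, b)]

    return max(dot(p_tail, diff(l_below, f_mid)),
               dot(p_tail, diff(l_mid, f_above))) + 1


if __name__ == "__main__":
    # Values read from Example 5.5.1 (hyperplanes i1 = 9, 10, 11).
    print(candidate_pr([5, 1], [-2, 4], [-5, 0], [-2, 5], [-5, 1]))
```

With these values both terms of the maximum equal 19, which gives the candidate 20 used in the example. Step 4 would simply take the largest such candidate over all hyperplanes and subpolyhedra.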
Example 5.5.1 The Projection of 4-D Time Domain to 1-D Domain
An arbitrary example is given to show the method. The original $P^4$ is
$$\begin{bmatrix} -5 & 0 & 5 & 0 & 10 & 5 \\ 0 & 0 & 0 & 0 & 10 & 10 \\ 20 & 20 & 0 & 0 & 0 & 0 \\ 0 & -5 & 0 & -5 & -5 & 0 \end{bmatrix}$$
The expressions of the faces are omitted, for brevity. At first we create the SP's. $SP_0 = \{P^4\}$. The $H^4$ of $P^4$ is {-5, 0, 5, 10}. So four hyperplanes $i_0$ = -5, 0, 5, 10 are used to intersect $P^4$. However, since the intersections with -5 and 10 produce only a single-point polyhedron, they are omitted. Two polyhedra $P_0^3$ and $P_1^3$ are produced
$$\begin{bmatrix} 0 & 15 & 15 & 10 & 10 & 20 & 20 \\ 0 & -2.5 & -2.5 & -5 & -5 & -5 & -5 \\ 0 & 5 & 7.5 & 5 & 0 & 7.5 & 10 \end{bmatrix} \quad \begin{bmatrix} 10 & 10 & 5 & 0 & 20 & 5 \\ 0 & 0 & -2.5 & 0 & -5 & -2.5 \\ 5 & 10 & 5 & 2.5 & 10 & 2.5 \end{bmatrix}$$
forming $SP_1$. Then we create $SP_2$ from $SP_1$. The $H_0^3$ which corresponds to $P_0^3$ is {0, 10, 15, 20}. The intersections with hyperplanes $i_1$ = 10, 15, 20 produce polyhedra $P_0^2$, $P_1^2$ and $P_2^2$,
[vertex matrices of $P_0^2$, $P_1^2$ and $P_2^2$]
while the hyperplane $i_1 = 0$ yields a single-point polyhedron, omitted. For $P_1^3$, similarly, two 2-D subpolyhedra $P_3^2$ and $P_4^2$ are produced. All the five polyhedra together form $SP_2$.
We know $p_3 = 1$, and suppose that $p_2 = 5$ has been determined from $SP_2$. Let us derive $p_1$ from $SP_1$. Take $P_0^3$ as an example. In $H_0^3$, at first, take 10 to form three hyperplanes 11, 10 and 9 intersecting $P_0^3$. The 2-D subpolyhedra produced by the hyperplanes 11 and 9 are
$$\begin{bmatrix} -1.8 & -1.8 & -4.5 & -5 & -5 \\ 3.6 & 5.5 & 1 & 0.75 & 5.5 \end{bmatrix} \quad \begin{bmatrix} -1.5 & -1.5 & -4.5 & -4.5 \\ 3 & 4.5 & 4.5 & 0 \end{bmatrix}$$
Note that the polyhedra $P_1^2$, $P_0^2$ and $P_{-1}^2$ are the 2-D subpolyhedra produced by $i_1$ = 11, 10, 9, respectively. We can find $f_1^2 = [-5, 1]^T$ and $l_0^2 = [-2, 5]^T$, and $f_0^2 = [-5, 0]^T$ and $l_{-1}^2 = [-2, 4]^T$. Since $\bar{p}^2 = [5, 1]$, by eqn (5.10), $p_1$ is evaluated as 20. Repeating the procedure for the remaining elements of $H_0^3$, we find no new $p_1$ greater than 20. Repeating the procedure for $P_1^3$, $p_1 = 20$ is still the greatest. So, at last, $\bar{p}^3 = [20, 5, 1]$.
Repeating the procedure for $p_0$, the valid minimum projecting vector p is derived to be [386,20,5,1]. The $l^4$ and $f^4$ are $[10, 10, 0, -5]^T$ and $[-5, 0, 20, 0]^T$. The actual executing time is 6166 time units. The conventional projecting vector by eqn (5.8) is [1386,66,11,1], and the corresponding executing time is 22056 units. □
Again a great improvement (3.57 times) to the executing time is observed. Of course,
everything depends on the shape of the polyhedron. However, for many irregular poly-
hedra, the improvement is quite significant. A testing program was used to check for all
the examples used so far that every node in a given polyhedron is mapped by the p's
into their proper position. Algorithms C.3 and C.4, collected in Appendix C, describe the
processes for intersecting a polyhedron and finding its first and last nodes.
5.6 Optimisation
5.6.1 Method
Optimisation of the mapping can be made with both S and Π, where the columns of S can be permuted. However, since there are K empty columns in $S_{SUC}$ and K-1 empty columns in $S_{SBC}$, whose order makes no difference, there are $N!/K!$ permuted versions of $S_{SUC}$, and $N!/(K-1)!$ permuted versions of $S_{SBC}$. It is obvious that if the columns of Π are permuted with S, T keeps the desired properties, such as unimodularity and satisfying eqn (4.19). The criterion for selecting S is that the F evaluated using S should yield the largest supernode, that is, $\prod_i f_i$ should be as small as possible. This criterion expresses the practical choice of granularity for the supernodes, and will be discussed later.
As for Π, we can change the position of the only non-zero element "1" of each column of $L_{K \times M}$ and $L_{(K-1) \times M}$, resulting in $K^M$ versions for SUC and $(K-1)^M$ versions for SBC. Furthermore, the rows of Π can be permuted to produce K! versions, which causes no violation of eqn (4.19) and has no effect on the unimodularity of T. Therefore, the total optimisation space is $K!\,K^M$ and $K!\,(K-1)^M$ for SUC and SBC, respectively.
The optimisation procedure is briefly described as follows:
Step 1 Find all permuted versions of S.
Step 2 For each S, evaluate the F's. Choose the F which results in the maximum volume of supernode. From the F, determine B and then compute the supernode polyhedron.
Step 3 Produce all Π's containing all the possible L's. Then for each of them generate all permuted versions.
Step 4 Produce the K-D time domain polyhedra by multiplying each of the Π's by the supernode polyhedron. Then, for each of the K-D polyhedra, find the valid minimum p, as well as the corresponding executing time. Choose the Π and p which result in the minimum executing time.
To compare with the method in [94], we fix S. Because t = pΠ, we divide the optimisation procedure into two steps, instead of one step as in [94]. From the construction of Π, we know that Π is priori-optimum. Because the shape of the original polyhedron is changeable, we have to exploit all possible priori-optimum Π to achieve the posteriori-optimum, i.e., to find which of them matches the shape of the polyhedron best when combined with its corresponding p. Of course, what we actually want to obtain is an optimum t.
5.6.2 An Example
Example 5.6.1 Example of an Irregular Problem
FOR i0 = 0 TO 9
  FOR i1 = i0 TO i0 + 9
    FOR i2 = i1 TO i1 + 9
      A(i0, i1, i2) = A(i0 - 1, i1 - 1, i2 - 1) + A(i0, i1 - 1, i2 - 1) + A(i0, i1, i2 - 1)
Suppose that a linear array is used to implement the computation. Thus, N = 3 and K = 2. The vertices of the computational polyhedron are found to be
$$V = \begin{bmatrix} 0 & 0 & 0 & 0 & 9 & 9 & 9 & 9 \\ 0 & 0 & 9 & 9 & 9 & 9 & 18 & 18 \\ 0 & 9 & 9 & 18 & 9 & 18 & 18 & 27 \end{bmatrix}$$
The computational polyhedron is a skewed long prism which is quite irregular, stretching from point (0, 0, 0) to point (9, 18, 27) and including 1000 nodes. Since this chapter focuses on lower-dimensional mapping only, we do not pay attention to the partitioning, which is fully discussed in Chapter 4, and do not perform it here; that is, we let B be an identity matrix. However, since D is a subset of the canonical dependencies $D^c$, the design of T proposed in Subsection 5.2.1 can be applied directly to this example.
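Letting S = [1,0,0], the computational polyhedron is mapped onto 10 processors. Then we have the model of Π,
$$\Pi = \begin{bmatrix} \begin{matrix} 1 \\ 0 \end{matrix} & l_0 & l_1 \end{bmatrix}$$
where only one element of the column $l_0$ is "1", and the same holds for $l_1$. There are 2 candidates produced from the model,
$$\Pi_0^s = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix} \quad\text{and}\quad \Pi_1^s = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
We also row-permute the 2 candidates and obtain
$$\Pi_2^s = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} \quad\text{and}\quad \Pi_3^s = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}$$
Let $U = \begin{bmatrix} S \\ \Pi_2^s \end{bmatrix}$. It can easily be checked that
$$V^u = UV = \begin{bmatrix} 0 & 0 & 0 & 0 & 9 & 9 & 9 & 9 \\ 0 & 0 & 9 & 9 & 9 & 9 & 18 & 18 \\ 0 & 9 & 9 & 18 & 18 & 27 & 27 & 36 \end{bmatrix}$$
There are 4 relevant edges which can be expressed in parametric form
Edge $v_0^u v_4^u$: $[j_0, j_0, 2j_0]^T$, Edge $v_1^u v_5^u$: $[j_0, j_0, 2j_0 + 9]^T$,
Edge $v_2^u v_6^u$: $[j_0, j_0 + 9, 2j_0 + 9]^T$, Edge $v_3^u v_7^u$: $[j_0, j_0 + 9, 2j_0 + 18]^T$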
[Figure 5.6: panels (a) and (b); labelled points include (9,9), (9,18) and (18,9).]
Figure 5.6: Two Maximum Local 2-D Time Polyhedra. (a) is $P_1^s$ defined by $V_1^s$ and (b) is $P_2^s$ defined by $V_2^s$.
Using $j_0 = p$ to intersect the 4 edges, we obtain
$$V_p^u = \begin{bmatrix} p & p & p+9 & p+9 \\ 2p & 2p+9 & 2p+9 & 2p+18 \end{bmatrix}$$
for any processor p. Adding "p"s as the first row of $V_p^u$ and then pre-multiplying it by $U^{-1}$, we obtain
$$V_p = \begin{bmatrix} p & p & p & p \\ p & p & p+9 & p+9 \\ p & p+9 & p+9 & p+18 \end{bmatrix}, \quad\text{for instance}\quad V_0 = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 9 & 9 \\ 0 & 9 & 9 & 18 \end{bmatrix}$$
Notice that in this particular example all $V_p$ are the same subject to a translation $(p, p, p)^T$; thus we just take $V_0$ as the $V'$ defining the maximum local supernode polyhedron. Furthermore we should map it to the 2-D time domain for all $\Pi_i^s$. That is,
$$V_1^s = \Pi_1^s V' = \Pi_2^s V' = \begin{bmatrix} 0 & 0 & 9 & 9 \\ 0 & 9 & 9 & 18 \end{bmatrix} \quad\text{and}\quad V_2^s = \Pi_0^s V' = \Pi_3^s V' = \begin{bmatrix} 0 & 9 & 9 & 18 \\ 0 & 0 & 9 & 9 \end{bmatrix}$$
all of which are parallelograms and are already the 2-D time convex hulls. They define two maximum local 2-D time polyhedra $P_1^s$ and $P_2^s$, respectively, see Fig 5.6. Both of them contain 100 points.
Note that for V, $l^3 = [9, 18, 27]^T$ and $f^3 = [0, 0, 0]^T$. It is easy to compute the 4 timing vectors and their corresponding executing times: $t_0 = p\Pi_0^s = [9,1,9]$ and 343 units, $t_1 = p\Pi_1^s = [9,9,1]$ and 271 units, $t_2 = p\Pi_2^s = [1,9,1]$ and 199 units, and $t_3 = p\Pi_3^s = [1,1,9]$ and 271 units. As a result, we take $\Pi_2^s$ and p = [9,1] as the optimum Π and the optimum p, and we have
$$T = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} \text{ and } p = [9,1], \quad\text{or}\quad T_{2 \times 3} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 9 & 1 \end{bmatrix}$$
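The comparison of the four timing vectors can be reproduced with a short Python sketch. Two things here are assumptions rather than statements from the thesis: the four Π matrices are inferred from the quoted timing vectors with p = [9, 1], and the executing time is taken to be t(l - f) + 1, which reproduces the quoted figures. The names are hypothetical.

```python
# Candidate Pi matrices, inferred from the timing vectors quoted in
# Example 5.6.1 under p = [9, 1] (an assumption of this sketch).
PI = {
    0: ((1, 0, 1), (0, 1, 0)),
    1: ((1, 1, 0), (0, 0, 1)),
    2: ((0, 1, 0), (1, 0, 1)),
    3: ((0, 0, 1), (1, 1, 0)),
}
P = (9, 1)            # valid minimum projecting vector
FIRST = (0, 0, 0)     # first node of the computational polyhedron
LAST = (9, 18, 27)    # last node of the computational polyhedron


def timing_vector(p, pi):
    """t = p * Pi (row vector times K x N matrix)."""
    return tuple(sum(p[r] * pi[r][c] for r in range(len(p)))
                 for c in range(len(pi[0])))


def executing_time(t, first, last):
    """Time steps between the first and last nodes, inclusive."""
    return sum(tc * (l - f) for tc, l, f in zip(t, last, first)) + 1


def best_candidate():
    """Index of the Pi candidate with the smallest executing time."""
    times = {i: executing_time(timing_vector(P, pi), FIRST, LAST)
             for i, pi in PI.items()}
    return min(times, key=times.get), times


if __name__ == "__main__":
    for i, pi in PI.items():
        t = timing_vector(P, pi)
        print(i, t, executing_time(t, FIRST, LAST))
    print(best_candidate())
```

The candidate with index 2 gives t = [1,9,1] and the smallest executing time, in line with the choice made in the example.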
The speed-up is 5.02 = 1000/199 and the overall efficiency is 50.2%. The major
reason for the 50.2% overall efficiency is that there is a delay, 11 units, between adjacent
processors along the array. The accumulated delays over 9 processors result in 99 time
units. Since [34], [58], [59], [94] and [104] did not give a clue of how to deal with such an
irregular problem, it is difficult to make an evaluation of these methods on the problem. If
there are no extra measures added to deal with irregular problems, the straight approach
we can consider is to embed the irregular polyhedron into a large cube of size 10 x 19 x 28
and then apply these methods on the cube. However, it can be expected that the result
cannot be any good, because the cube is only 18.7% filled. Suppose that these methods
can achieve xx% efficiency for a full-filled cubic problem, the efficiency for this particular
problem can be only xx% × 18.7%, which will be very poor. □
5.7 Special Example: Partitioning and Mapping a
Knapsack Problem onto a Linear Array
Generally speaking, the design processes of partitioning and mapping are based on numerical computations, since the computational problem is described by much data. Unfortunately, we have as yet no general technique for parameterising the design process. For the following example, which is defined by a small number of parameters, we can parameterise the design procedure.
5.7.1 Description of Computational Structure
The problem of minimizing or maximizing some linear function subject to a set of linear constraints is called a linear program, or an integer program if the unknowns are restricted to be non-negative integers. A particular integer program is the so-called Knapsack problem, which is described by
$$F_k(y) = \max \sum_{j=1}^{k} v_j x_j, \qquad k = 0, \ldots, n$$
$$\text{subject to } \sum_{j=1}^{k} w_j x_j \le y, \qquad y = 0, \ldots, b \qquad (5.11)$$
where $w_j$ and $v_j$ are the weight and value of the j-th item and $x_j$ is the number of items of type j to be included in the Knapsack of capacity b. They all are integers.
Figure 5.7: Knapsack data dependency graph, n = 4 and b = 10
In general the Knapsack problem is NP-hard. A dynamic programming method was proposed for solving the Knapsack problem. The basic method consists of two passes [42].
Forward Pass: $F_k(y) = \max\{F_{k-1}(y),\ F_k(y - w_k) + v_k\}$
Backward Pass: find the solution vector $x^* \in \mathbb{N}^n$ such that $\sum_{i=1}^{n} v_i x_i^* = F_n(b)$
where $F_0(y) = 0$, $F_k(0) = 0$, $F_k(i) = -\infty$ for $k = 0, \ldots, n$, $y = 0, \ldots, b$ and $i < 0$.
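The two passes can be written down directly as a sequential program. The Python sketch below is a plain dynamic-programming illustration of the recurrence above, not the systolic algorithm of [71]; the function name and the sample weights and values are hypothetical, and positive integer weights are assumed.

```python
def knapsack(weights, values, b):
    """Unbounded knapsack by the two-pass dynamic programming method.

    Returns (F_n(b), x) where x[j] is the number of items of type j+1.
    Weights are assumed to be positive integers.
    """
    n = len(weights)
    neg_inf = float("-inf")
    # F[k][y] with the boundary conditions F_0(y) = 0 and F_k(0) = 0.
    F = [[0] * (b + 1) for _ in range(n + 1)]

    # Forward pass: F_k(y) = max{F_{k-1}(y), F_k(y - w_k) + v_k},
    # with F_k(i) = -infinity for i < 0.
    for k in range(1, n + 1):
        w, v = weights[k - 1], values[k - 1]
        for y in range(1, b + 1):
            take = F[k][y - w] + v if y >= w else neg_inf
            F[k][y] = max(F[k - 1][y], take)

    # Backward pass: retrace which branch of the max was taken.
    x = [0] * n
    k, y = n, b
    while k > 0:
        if F[k][y] == F[k - 1][y]:
            k -= 1                      # item type k not needed here
        else:
            x[k - 1] += 1               # one more item of type k
            y -= weights[k - 1]
    return F[n][b], x


if __name__ == "__main__":
    # Hypothetical instance with n = 4 item types and capacity b = 10.
    print(knapsack([2, 3, 4, 5], [3, 4, 6, 7], 10))
```

Note that the forward pass reads `F[k][y - w]`, a distance that depends on the data `w`; this is the reason the dependency vectors change with the problem instance, as discussed next.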
This is a 2-D recursive procedure, but not a URE problem; see Figure 5.7, which shows the dependency graph of an example with n = 4 and b = 10. Note that the dependency vectors change when n and b change. [71] developed a run-time dependency algorithm which extends the computational polyhedron to 3 dimensions but, with a set of URE dependencies, gives a significant improvement in computability. This algorithm has the form of a set of recurrence equations
F(i) = f(F(i - ed, D3(i), V(i))
o
V(i - r)
V(i) = v(J(i))
A(i - r)
A(i) = a(J(i))
R(i - r)
R(i) = b(J(i))
A(i)
C(i) = C(i - d.) - 1
R(i)
SCi) = SCi - dd
{
D3(i - d3)
D3(i) = Dl(i - d2)
Dl(i)
. {F(i)
C(l) = D, (i - dl)
if iE P
if i3 P
if i-rEP
if i- r 3 P
if i-rEP
if i- r 3 P
if i- rEP
if i- r 3 P
if i E P
if i E P - P
if iE P
if i E P - P
if C(i) =f:. 0
if C(i) = 0 and SCi) = 1
if C(i) = 0 and SCi) = 0
if i E P
if i E P - P
(5.12)
where $r = [0,1,0]^T$, $e_1 = [1,0,0]^T$, $d_1 = [0,1,1]^T$, $d_2 = [0,1,0]^T$ and $d_3 = [0,1,-1]^T$;
P, the computation polyhedron, is a rectangular cube n x b x c, where c = r~l. We do
not give a full explanation of the set of recurrence equations and the meaning of their
variables, which is beyond the scope of the thesis, see [71] for detail. The matter we
are concerned with is the computation graph produced by eqn (5.12). Because P is a
rectangular cube, the computational polyhedron is also defined by the set of vertices $V^0$:
$$V^0 = \begin{bmatrix} 0 & n & 0 & 0 & 0 & n & n & n \\ 0 & 0 & b & 0 & b & 0 & b & b \\ 0 & 0 & 0 & c & c & c & 0 & c \end{bmatrix} \qquad (5.13)$$
By comparing the left-hand sides and right-hand sides of eqn (5.12), the dependencies are
$$D = [v_0, v_1, v_2, v_3] = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 \\ 0 & 1 & 0 & -1 \end{bmatrix} \qquad (5.14)$$
This is an URE problem with conditional statements to control computation.
5.7.2 Supernode Polyhedron
Suppose that a linear array with m processors and unidirectional interconnections is given to implement the computation of the Knapsack problem.
We derive a positive expressing basis from D
$$E = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -1 & 1 \end{bmatrix} \qquad (5.15)$$
It is easy to verify that all the column vectors of D can be positively expressed by E.
The supernode parallelepiped B is obtained by scaling each column of E. In this special case, since the processor array is 1-D, only the last column is required to be scaled. Thus
$$B = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -1 & k \end{bmatrix}, \qquad B^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & \frac{1}{k} & \frac{1}{k} \end{bmatrix} \qquad (5.16)$$
is generated to transform the original polyhedron to a supernode polyhedron.
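Both claims, that E positively expresses D and that the stated inverse belongs to B, can be checked mechanically. The Python sketch below uses exact rational arithmetic and a small Gauss-Jordan inverse; the helper names are hypothetical, and the scaling factor k = 4 is an arbitrary value chosen only for illustration.

```python
from fractions import Fraction


def inverse(m):
    """Gauss-Jordan inverse of a small square matrix over the rationals."""
    n = len(m)
    a = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        pv = a[col][col]
        a[col] = [x / pv for x in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0:
                factor = a[r][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [row[n:] for row in a]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


D = [[1, 0, 0, 0], [0, 1, 1, 1], [0, 1, 0, -1]]   # eqn (5.14)
E = [[1, 0, 0], [0, 1, 0], [0, -1, 1]]            # eqn (5.15)


def positive_coefficients(basis, deps):
    """Coefficients X with basis * X = deps; all must be >= 0."""
    return matmul(inverse(basis), deps)


def supernode_basis(k):
    """B of eqn (5.16): E with its last column scaled by k."""
    return [[1, 0, 0], [0, 1, 0], [0, -1, k]]


if __name__ == "__main__":
    print(positive_coefficients(E, D))
    print(inverse(supernode_basis(4)))   # k = 4 is a hypothetical value
```

Every coefficient returned for D is a non-negative integer, and the inverse of B has 1/k in the last two positions of its last row, as in eqn (5.16).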
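We need to know the boundary of the resulting supernode polyhedron. It is easy to find the images $V^s$ of the original vertices $V^0$ in the supernode domain. That is
$$V^s = \lfloor B^{-1} V^0 \rfloor = \begin{bmatrix} 0 & n & 0 & 0 & 0 & n & n & n \\ 0 & 0 & b & 0 & b & 0 & b & b \\ 0 & 0 & \lfloor \frac{b}{k} \rfloor & \lfloor \frac{c}{k} \rfloor & \lfloor \frac{b+c}{k} \rfloor & \lfloor \frac{c}{k} \rfloor & \lfloor \frac{b}{k} \rfloor & \lfloor \frac{b+c}{k} \rfloor \end{bmatrix} \qquad (5.17)$$
As seen later, since the space mapping matrix takes the form [0, 0, 1] (linear array), the last row of $V^s$ will determine the projected coverage of the supernode domain along the 1-D array. Thus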
b+c
-k- < m or {
ti£ - 1 if ti£ is integralk- 7:. m- r~l otherwise (5.18)
5.7.3 Transformation onto a Time-Processor Domain
A unimodular space-time projection matrix $T_{3 \times 3} = \begin{bmatrix} S_{1 \times 3} \\ \Pi_{2 \times 3} \end{bmatrix}$ is used to project the supernode polyhedron into a 1-D processor domain by S and a 2-D time domain by Π. The 2-D time domain is then re-projected along a 1-D time domain by a valid minimum projection row vector p. Letting S = [0,0,1], the supernode polyhedron can be allocated onto the m-processor linear array as follows.
There are four candidates for Π which have the fewest and smallest non-zero entries (this is the criterion for priori-optimisation) and make T unimodular. Among them, we find that $\Pi = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}$ is optimal. Therefore, we have
$$T = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}, \qquad T^{-1} = \begin{bmatrix} -1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix} \qquad (5.19)$$
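Suppose that the time-processor domain is indexed by $j = [j_0, j_1, j_2]^T$. Then the supernode polyhedron is mapped into a 2-D time domain
$$\Pi V^s = \begin{bmatrix} 0 & 0 & b & 0 & b & 0 & b & b \\ 0 & n & \lfloor \frac{b}{k} \rfloor & \lfloor \frac{c}{k} \rfloor & \lfloor \frac{b+c}{k} \rfloor & \lfloor \frac{c}{k} \rfloor + n & \lfloor \frac{b}{k} \rfloor + n & \lfloor \frac{b+c}{k} \rfloor + n \end{bmatrix}$$
with convex hull
$$V^t = \begin{bmatrix} 0 & 0 & b & b \\ 0 & \lfloor \frac{c}{k} \rfloor + n & \lfloor \frac{b}{k} \rfloor & \lfloor \frac{b+c}{k} \rfloor + n \end{bmatrix}$$
where $V^t$ defines a trapezium in the 2-D time domain. The upper and bottom edges of the trapezium are $\lfloor \frac{b+c}{k} \rfloor - \lfloor \frac{b}{k} \rfloor + n$ and $\lfloor \frac{c}{k} \rfloor + n$, respectively. Because $\lfloor \frac{b+c}{k} \rfloor \ge \lfloor \frac{b}{k} \rfloor + \lfloor \frac{c}{k} \rfloor$, the upper edge is equal to or larger than the bottom.
[Figure 5.8: linear array from processor 0 to processor m-1]
Figure 5.8: The m processor linear array implementing the Knapsack problem
Let $p = [p_0, p_1]$, where
$$p_0 = \lfloor \tfrac{b+c}{k} \rfloor - \lfloor \tfrac{b}{k} \rfloor + n \quad\text{and}\quad p_1 = 1 \qquad (5.20)$$
It can be verified that p maps the trapezium $V^t$ to a 1-D time domain without mapping two points onto one location, and gives the minimum executing time $t_e = b\,(\lfloor \frac{b+c}{k} \rfloor - \lfloor \frac{b}{k} \rfloor + n) + \lfloor \frac{b+c}{k} \rfloor + n + 1$.
We can also give a space-time projection matrix $T' = \begin{bmatrix} S \\ p\Pi \end{bmatrix} = \begin{bmatrix} 0 & 0 & 1 \\ 1 & p_0 & 1 \end{bmatrix}$ which projects the supernode polyhedron directly into the 1-D processor domain and the 1-D time domain.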
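The claim that p of eqn (5.20) maps the trapezium without collisions can be spot-checked numerically. The Python sketch below is only a check for one parameter set, not a proof: the values n = 4, b = 10, c = 5 and k = 3 are hypothetical, the trapezium is taken to be the convex hull of the four vertices of $V^t$ as read from the text, and the function names are invented for the illustration.

```python
def projecting_vector(n, b, c, k):
    """p = [p0, 1] with p0 from eqn (5.20)."""
    return ((b + c) // k - b // k + n, 1)


def executing_time(n, b, c, k):
    """t_e = b * p0 + floor((b + c) / k) + n + 1."""
    p0, _ = projecting_vector(n, b, c, k)
    return b * p0 + (b + c) // k + n + 1


def trapezium_points(n, b, c, k):
    """Integral points of the trapezium with vertices (0, 0),
    (0, floor(c/k) + n), (b, floor(b/k)) and (b, floor((b+c)/k) + n)."""
    low_b = b // k
    top_0 = c // k + n
    top_b = (b + c) // k + n
    return [(j0, j1)
            for j0 in range(b + 1)
            for j1 in range(top_b + 1)
            # lower edge:  j1 >= low_b * j0 / b
            # upper edge:  j1 <= top_0 + (top_b - top_0) * j0 / b
            if low_b * j0 <= b * j1 <= top_0 * b + (top_b - top_0) * j0]


def projected_times(n, b, c, k):
    """1-D time instants of all trapezium points under p."""
    p0, p1 = projecting_vector(n, b, c, k)
    return [p0 * j0 + p1 * j1 for j0, j1 in trapezium_points(n, b, c, k)]


def timing_row(n, b, c, k):
    """t = p * Pi with Pi = [[0, 1, 0], [1, 0, 1]], i.e. [1, p0, 1]."""
    p0, p1 = projecting_vector(n, b, c, k)
    pi = ((0, 1, 0), (1, 0, 1))
    return tuple(p0 * pi[0][j] + p1 * pi[1][j] for j in range(3))


if __name__ == "__main__":
    args = (4, 10, 5, 3)   # hypothetical n, b, c, k
    times = projected_times(*args)
    print(projecting_vector(*args), timing_row(*args), executing_time(*args))
    print(len(times) == len(set(times)), max(times) - min(times) + 1)
```

For this parameter set no two points share a time instant and the span of the projected times equals the value of $t_e$ given by the formula.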
Finally, the input of the Knapsack problem is the series of $v_i$'s and $w_i$'s in eqn (5.11). According to the boundary conditions in [71], $v_i$ and $w_i$ should be sent to the node (i, 0, 0) of the original polyhedron, for i = 0, ..., n. When considering how to assign the $v_i$'s and $w_i$'s to the linear array, we must find which processor the node (i, 0, 0) is allocated to. It is easy to see that the node (i, 0, 0) is allocated to processor j, where $j = \lfloor S B^{-1} [i, 0, 0]^T \rfloor$. Obviously j = 0 for all i = 0, ..., n. Thus all $v_i$'s and $w_i$'s enter the array at processor 0, see Figure 5.8. Similarly, output is from processor m - 1.
5.8 Summary
A methodology of partitioning and mapping an arbitrary computation graph onto a given array with lower dimension is proposed. The original computational problem is transformed to a computational polyhedron with canonical dependencies. Two models, for SUC and SBC, of unimodular space-time mapping matrix T consisting of S and priori-optimum Π are proposed. The resulting N-D polyhedron is then projected into a K-D time polyhedron by Π and an M-D space polyhedron by S. The K-D time polyhedron is re-projected along a 1-D time domain by a valid minimum projection vector p. For clarity the whole process is illustrated by the conceptual chart given in Figure 5.9.
[Figure 5.9 flowchart; legible boxes: original dependencies D; positive expressing basis; canonical dependencies; processor array A; $S_{SUC}$ and $S_{SBC}$; scale E to B so that V is transformed to $V^s$, which can be mapped within the array; interconnection primitives; valid minimum projection; timing vectors $t_{SBC}$ and $t_{SUC}$.]
Figure 5.9: A Conceptual Chart of Lower dimensional Partitioning and Mapping Methods
The main effort is to develop an optimal lower-dimensional mapping method which
can cope with irregular polyhedra and gives improved efficiency while at the same time
delivering fundamentally reduced computing complexity (polynomial with respect to the
number of vertices of the K-D time polyhedron).
The methodology shows many advantages: implementing dependencies with given
interconnection primitives, transferring data efficiently and mapping onto a given-shape
processor array with, more importantly, lower dimension. The lower dimension mapping
is highly efficient, especially for an irregular computational body. The method proposed
here can handle irregular polyhedra effectively to improve the efficiency in time domain
to an acceptable level.
Chapter 6
The Structure of Parallel programs
In this chapter the emphasis moves away from the mapping process proposed in chapters
3 and 4 and concentrates more on the practicalities of generating the associated parallel
programs. A method of algorithm generation is developed for the situations where M-
D processor arrays are given to implement a N-D nested loop structure. The resulting
parallel algorithm is characterised by M DOALL loops in the space domain and a FOR
loop in the time domain. Let K = N - M. There are two cases to consider: K> 1 and K
= 1, which are treated separately.
6.1 Introduction
The task of this chapter is to transform sequential algorithms to parallel algorithms, which have the following forms:
Sequential Algorithm    ===>    Parallel Algorithm
FOR ...                         DOALL ...
  FOR ...                         DOALL ...
    statements                      FOR ...
                                      statements
using the partitioning and mapping techniques developed in the two previous chapters, where "statements" may involve some sequential loops. Clearly the DOALL then corresponds to spatial position on a processor array, and the sequential loop to the algorithm for individual processors.
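The target structure can be illustrated with a small Python simulation that reuses the mapping of Example 5.6.1 (S = [1,0,0], t = [1,9,1]). This is a sketch under stated assumptions and not the code generator described in this chapter: the DOALL over processors is simulated in lock-step inside the time loop, array elements outside the iteration domain are assumed to read as 1, and the closed-form inverse mapping in the comments was derived for this sketch.

```python
def value(A, i0, i1, i2):
    """Read an element; anything not yet defined reads as 1
    (hypothetical boundary condition)."""
    return A.get((i0, i1, i2), 1)


def update(A, i0, i1, i2):
    """Loop body of Example 5.6.1."""
    A[(i0, i1, i2)] = (value(A, i0 - 1, i1 - 1, i2 - 1)
                       + value(A, i0, i1 - 1, i2 - 1)
                       + value(A, i0, i1, i2 - 1))


def run_sequential():
    """Original nested loops."""
    A = {}
    for i0 in range(10):
        for i1 in range(i0, i0 + 10):
            for i2 in range(i1, i1 + 10):
                update(A, i0, i1, i2)
    return A


def run_parallel():
    """One FOR loop over time, one (simulated) DOALL over 10 processors.

    Processor = i0 and time = i0 + 9*i1 + i2. Writing i1 = i0 + u and
    i2 = i1 + v with 0 <= u, v <= 9 gives time = 11*i0 + 10*u + v.
    """
    A = {}
    executed = 0
    last_busy = -1
    for time in range(199):              # FOR loop in the time domain
        for proc in range(10):           # DOALL in the space domain
            r = time - 11 * proc         # local time on this processor
            if 0 <= r <= 99:
                u, v = divmod(r, 10)
                i0, i1 = proc, proc + u
                update(A, i0, i1, i1 + v)
                executed += 1
                last_busy = time
    return A, executed, last_busy + 1


if __name__ == "__main__":
    sequential = run_sequential()
    parallel, executed, steps = run_parallel()
    print(parallel == sequential, executed, steps)
```

Every operand needed at a given time step was produced at a strictly earlier step, so the lock-step simulation reproduces the sequential result; the offset of 11 between adjacent processors is the delay mentioned in Example 5.6.1.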
As shown above, partitioning is necessary to ensure that the computational polyhe-
dron will be mapped within a given regular shape processor array, but it also presents
many difficulties to automatic algorithm generation. A major problem of an algorithm
generation method is to determine the structure of the parallel algorithm and the struc-
ture of the "statements" under the DOALL loops. In our case, the statements will be a
set of sequential FOR-loops for a single supernode.
Obviously, we must know the bounds of both the DOALLs and FOR-loops. These
bounds result from the supernode partitioning, so, at first, in this chapter we must exploit
the features of supernodes and supernode space to determine the boundary of the supern-
ode domain and the boundary of a particular supernode. The bounds of the DOALLs
and FOR-loops are obtained by applying transformations on these boundaries.
The method of algorithm generation is heavily dependent on the techniques of par-
titioning and mapping. When K > 1, a lower-dimension mapping exists and, we have a
problem of addressing N-D space from a (M+1)-D processor-time space. To solve this
problem, an interesting mechanism is invented to make a K-D time domain behave as a
1-D time domain (this is termed the Lower-D Case). In addition, if the Locally-Sequential-
Globally-Parallel (LSGP) method is involved in the partitioning, we have to cope with
another difficult situation where a special time basis is introduced to access nodes. A fast
algorithm is developed here to produce the index of the required nodes (and is termed
the LSGP Case). For each of the two cases we wish to derive an algorithm generation
technique. For brevity the case of an (N-1)-D processor array with SUC mesh, which
involves neither lower-dimensional mapping nor LSGP partitioning, is taken as special
case of the LSGP Case. Furthermore, we target a distributed-memory machine and so
have to consider the problem of communications. Thus rules to create data flow packets
among processors have to be established.
There have been many papers on the problem of partitioning and mapping from a
theoretical viewpoint. However, so far, the actual generation of parallel algorithms remains
quite an open topic for which little work has been done. It is a "real obstacle to application",
as was pointed out by an anonymous referee of our paper [18], because of its difficulty. All
the methods presented in the chapter are original, and pave a practical way to applications.
In the following discussions, B and T, for the different cases, p and $k_0, \ldots, k_{N-2}$ are
all given without derivation (refer to Chapters 3 and 4 if necessary). In order to
clarify concepts and methods we will refer to the following example throughout
the chapter.
Example 6.1.1 A 3-D URE Problem
Loops 6.1.1 Original Sequential Nested Loops
FOR i0 := 0 TO 20
  FOR i1 := 0 TO 20
    FOR i2 := 0 TO 10
      A(i0, i1, i2) := A(i0 - 1, i1 + 1, i2 + 3) + A(i0, i1 - 1, i2 - 1)
                     + A(i0, i1 - 1, i2 + 1) + A(i0, i1 - 1, i2)
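The computational polyhedron can be determined by the set of inequalities:
\[
\begin{array}{l} i_0 \ge 0 \\ i_1 \ge 0 \\ i_2 \ge 0 \\ i_0 \le 20 \\ i_1 \le 20 \\ i_2 \le 10 \end{array}
\quad \text{or} \quad
\begin{bmatrix} -1 & 0 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & -1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} i \le
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 20 \\ 20 \\ 10 \end{bmatrix}
\tag{6.1}
\]
which contains 4851 = 21 x 21 x 11 nodes, and whose dependency matrix is
\[
D = \begin{bmatrix} 1 & 0 & 0 & 0 \\ -1 & 1 & 1 & 1 \\ -3 & 1 & -1 & 0 \end{bmatrix}
\tag{6.2}
\]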
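The dependency matrix of eqn (6.2) can be read directly off the statement in Loops 6.1.1. The following is a minimal sketch of that reading, not part of the original method; the names RHS_OFFSETS and dependency_matrix are hypothetical and introduced only for illustration.

```python
# Offsets of the right-hand-side references relative to A(i0, i1, i2),
# read off the statement of Loops 6.1.1.
RHS_OFFSETS = [(-1, 1, 3), (0, -1, -1), (0, -1, 1), (0, -1, 0)]


def dependency_matrix(offsets):
    """Return D as a list of rows; column k is the k-th dependency vector."""
    # A dependency vector points from the producing node to the consuming
    # node, so it is the negated offset of the reference.
    deps = [tuple(-x for x in off) for off in offsets]
    return [[d[row] for d in deps] for row in range(3)]


D = dependency_matrix(RHS_OFFSETS)
# Loop ranges are 0..20, 0..20 and 0..10, all inclusive.
NODE_COUNT = (20 + 1) * (20 + 1) * (10 + 1)
```

Under these assumptions D reproduces the three rows of eqn (6.2) and NODE_COUNT is the 4851 quoted above.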
In the example, the cubic computational polyhedron is selected only because it is easy
to count the nodes it contains. We point out that our methods can also cope
with irregular polyhedra. For Example 6.1.1, we also suppose that a 1-D array with 4
processors and SUC mesh is given to implement the computation of the problem. This
choice demonstrates the technique for the lower-dimensional case. 2-D arrays with 4 x 4
processors and SUC or SBC mesh are used for the other cases.
The rest of the chapter is organised as follows. Firstly, we exploit the features of
supernodes and supernode space, which are the basis we work on. In Section 6.3, we
determine the out-going data from a supernode. Section 6.4 transforms the algorithm onto a
lower-dimensional array, while Section 6.5 considers the case of LSGP. In Section 6.6, the
derivation of the bounds of nested loops from a set of inequalities, which is essential for
forming a loop program, is discussed.
6.2 On Supernodes
Once the partitioning matrix B is obtained, a transformation from the original com-
putational domain to the quasi-supernode domain can be made by i = Bq, where q,
which may not be integral, is the index of the quasi-supernode space. The volume of the
computational polyhedron is compressed by this transformation because det(B) > 1.
Collecting all the projected nodes $q = [q_0, \ldots, q_{N-1}]^T$ in a hypercube
\[
s_j \le q_j < s_j + 1, \quad s_j \in Z, \quad j = 0, \ldots, N-1, \quad \text{or} \quad s \le q < s + \mathbf{1}
\tag{6.3}
\]
to the integral point $s = [s_0, \ldots, s_{N-1}]^T$, s will be a supernode, including all the projected
nodes in the hypercube, where $\mathbf{1} = [1, \ldots, 1]^T$. All such supernodes form a supernode
polyhedron. We say that s is the floor integration of q, denoted $s = \lfloor q \rfloor$, where
$\lfloor \text{matrix} \rfloor$ means taking the floor function of all elements of the matrix.
For such a supernode domain, we are concerned with two problems: what is the
boundary of the supernode polyhedron, and which nodes are contained in a single
supernode?
6.2.1 Boundary of the Supernode Polyhedron
For the original set of nested loops, a system of inequalities expresses the original
computational polyhedron
\[
A_{m,N}\, i \le c
\tag{6.4}
\]
where m is the number of the inequalities, see eqn (6.1).
It is easy to obtain the quasi-supernode polyhedron by substituting i = Bq into eqn
(6.4)
\[
A'_{m,N}\, q \le c
\tag{6.5}
\]
where $A'_{m,N} = A_{m,N} B$.
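For instance in Example 6.1.1, here we use the B which is derived by means of the
method of Chapter 5,
\[
E = \begin{bmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ -3 & -1 & 1 \end{bmatrix}, \qquad
B^{-1} = \frac{1}{6} \begin{bmatrix} 1 & 0 & 0 \\ 6 & 6 & 0 \\ 12 & 3 & 3 \end{bmatrix}
\tag{6.6}
\]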
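Obviously g = [6, 1, 2] and det(B) = 12, which means that there will be 12 nodes enclosed
in each partition. Eqn (6.1) is transformed to
\[
\begin{array}{r} -q_0 \le 0 \\ 6q_0 - q_1 \le 0 \\ 18q_0 + q_1 - 2q_2 \le 0 \\ 3q_0 \le 10 \\ -6q_0 + q_1 \le 20 \\ -18q_0 - q_1 + 2q_2 \le 10 \end{array}
\quad \text{or} \quad
\begin{bmatrix} -1 & 0 & 0 \\ 6 & -1 & 0 \\ 18 & 1 & -2 \\ 3 & 0 & 0 \\ -6 & 1 & 0 \\ -18 & -1 & 2 \end{bmatrix} q \le
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 10 \\ 20 \\ 10 \end{bmatrix}
\tag{6.7}
\]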
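The substitution i = Bq that leads to eqn (6.7) can be checked mechanically. The sketch below is only an illustration under one stated assumption: B is taken to be the matrix whose inverse is the $B^{-1}$ of eqn (6.6), written out explicitly here. The helper name transform_constraints is hypothetical. Each transformed row is divided by the common factor of its coefficients and constant, which is how $6q_0 \le 20$ becomes $3q_0 \le 10$.

```python
from functools import reduce
from math import gcd

# Original polyhedron A i <= c of eqn (6.1).
A = [[-1, 0, 0], [0, -1, 0], [0, 0, -1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
C = [0, 0, 0, 20, 20, 10]

# Assumption: B written out as the inverse of the B^{-1} given in eqn (6.6).
B = [[6, 0, 0], [-6, 1, 0], [-18, -1, 2]]


def transform_constraints(a_rows, consts, b):
    """Substitute i = B q into A i <= c and reduce each row by its gcd."""
    new_rows, new_consts = [], []
    for row, const in zip(a_rows, consts):
        # Row of A' = A B.
        prod = [sum(row[k] * b[k][j] for k in range(3)) for j in range(3)]
        # Common factor of coefficients and constant (math.gcd is non-negative).
        factor = reduce(gcd, prod + [const])
        new_rows.append([x // factor for x in prod])
        new_consts.append(const // factor)
    return new_rows, new_consts


ROWS_Q, CONSTS_Q = transform_constraints(A, C, B)
```

With these inputs ROWS_Q and CONSTS_Q coincide with the matrix and right-hand side of eqn (6.7).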
However, finding a set of hyperplanes confining the supernode polyhedron is nontrivial,
because applying the $\lfloor\ \rfloor$ operation to all $q = B^{-1}i$, so as to condense them to integral
points, produces some supernodes outside the quasi-supernode polyhedron confined by eqn
(6.7). For instance, in Example 6.1.1, the point $i = [1, 20, 1]^T$, which is within the original
polyhedron of eqn (6.1), is transformed to $q = [0.166, 21, 12.5]^T$ and then condensed to
$s = [0, 21, 12]^T$. However, substituting s into the left-hand side of the fifth inequality of
eqn (6.7), we see that $[-6, 1, 0][0, 21, 12]^T = 21 > 20$, so s is outside the quasi-supernode
polyhedron confined by eqn (6.7) (when a point is substituted into the set of inequalities
confining a polyhedron, it is said to be inside the polyhedron if all the inequalities hold,
and outside if any of the inequalities fails to hold). This phenomenon is shown
more clearly in a 2-D case.
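The numerical example above can be reproduced with exact rational arithmetic. This is a small sketch for checking the figures, using the $B^{-1}$ of eqn (6.6) and the inequalities of eqn (6.7); the function names to_quasi, floor_integration and violated are hypothetical.

```python
from fractions import Fraction
from math import floor

# B^{-1} of eqn (6.6), kept exact with fractions.
B_INV = [[Fraction(1, 6), 0, 0],
         [1, 1, 0],
         [2, Fraction(1, 2), Fraction(1, 2)]]

# Quasi-supernode polyhedron of eqn (6.7).
A_Q = [[-1, 0, 0], [6, -1, 0], [18, 1, -2], [3, 0, 0], [-6, 1, 0], [-18, -1, 2]]
C_Q = [0, 0, 0, 10, 20, 10]


def to_quasi(i):
    """q = B^{-1} i."""
    return [sum(Fraction(c) * x for c, x in zip(row, i)) for row in B_INV]


def floor_integration(q):
    """s = floor(q), element by element."""
    return [floor(x) for x in q]


def violated(point, rows=A_Q, consts=C_Q):
    """Indices (0-based) of the inequalities that fail at the given point."""
    return [k for k, (row, c) in enumerate(zip(rows, consts))
            if sum(a * x for a, x in zip(row, point)) > c]


q = to_quasi([1, 20, 1])
s = floor_integration(q)
```

Here q satisfies every inequality of eqn (6.7), while s fails only the fifth one (index 4), in agreement with the text.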
In Fig 6.1, ABCD is the transformed quasi-supernode polyhedron. The "o"s indicate
the images of the original vertices. The dots "." are the supernodes which are outside the
quasi-supernode polyhedron (the point j, marked by "x", will be explained later). We
must construct an enlarged polyhedron in ZN which includes all and only the supernodes.
The polyhedron abcda is the required supernode polyhedron.
Enclosing all supernodes
The method of constructing the supernode polyhedron is to translate some of the hyperplanes
of the quasi-supernode polyhedron. Any hyperplane consisting of coefficients smaller than
-1 is translated along its outward normal so that all supernodes are in the halfspace
confined by the corresponding inequality. It can be understood that we must find the
outermost supernode along the normal direction of such a hyperplane (termed the
furthest-supernode of the hyperplane), and that a furthest-supernode is made by floor integration
of a point on the hyperplane. In Figure 6.1, the integral point e is the furthest-supernode
associated with the plane (line) AB. The line ab results from the translation of AB so
that it passes through e.
Figure 6.1: 2-D quasi-supernode polyhedron and the corresponding supernode polyhedron.
We restrict ourselves to the situation where all nodes produced by a hyperplane of the
original set of inequalities are integral¹, so the nodes are on the hyperplane. It is possible
to produce all nodes on the hyperplane. A basis consisting of N-1 integral vectors can be
found which generates all the integral nodes on the hyperplane.
Let a hyperplane be $h: a\,i + c = 0$. An algorithm is given to search for the
furthest-supernode for a hyperplane and translate the hyperplane to pass through the
furthest-supernode.
¹ This is quite a common case, especially in polyhedra generated from loop programs. In fact, this
holds if $|a_r| = 1$, where $a_r$ is the last non-zero coefficient of the hyperplane, for instance, $2i_0 + 3i_1 - i_2 = 3$.
Algorithm 6.2.1 The translation of a hyperplane h
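Define $\mathrm{Dist}(q, h) := |aq + c| / \sqrt{a a^T}$
Generate a unimodular matrix $U_a$ with a as its first row
Derive $U_a^{-1} = [z_0, z_1, \ldots, z_{N-1}]$
FOR i = 1, N-1
    compute $\mathbf{y}_i / y_i = B^{-1} z_i$ and normalise such that $\gcd(\mathbf{y}_i, y_i) = 1$
$q_j := \{q(x) : \mathrm{Dist}(q(x), h) = \max_{x_1 = 0, \ldots, y_1 - 1;\ \ldots;\ x_{N-1} = 0, \ldots, y_{N-1} - 1} \mathrm{Dist}(q(x), h)\}$
    where $x = [x_1, \ldots, x_{N-1}]^T$ and $q(x) = \lfloor q_s + \sum_{i=1}^{N-1} (\mathbf{y}_i / y_i)\, x_i \rfloor$
$c := -a' q_j$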
End of Algorithm
The algorithm is explained as follows. For a hyperplane with $a = [a_0, \ldots, a_j, 1, 0, \ldots]$,
because gcd(a) = 1, we can build a unimodular matrix $U_a$ with a as its first row
[107]. Form $U_a^{-1}$, the inverse matrix of $U_a$. Obviously, the last N-1 column vectors,
$Z = [z_1, \ldots, z_{N-1}]$, of $U_a^{-1}$ are perpendicular to a and span the null-space of a. Any integral
node on $a\,i + c = 0$ can be expressed as a linear combination of $z_1, \ldots, z_{N-1}$ plus a
special point, which may be a vertex on the hyperplane.
After projection to the quasi-supernode space, the hyperplane becomes $aBq + c = 0$,
and any transformed node on the hyperplane can be expressed in the form
\[
q = q_s + (B^{-1}Z)x = q_s + \sum_{i=1}^{N-1} \frac{\mathbf{y}_i}{y_i}\, x_i
\tag{6.8}
\]
where $q_s$ is an initial point on the hyperplane; the elements of the vector x are the
combinational coefficients; $\mathbf{y}_i / y_i = B^{-1} z_i$, where $\mathbf{y}_i$ is an integral vector, $y_i$ is an integer,
and $\gcd(\mathbf{y}_i, y_i) = 1$. Compute $\lfloor q \rfloor$ for all q which are expressed by eqn (6.8) to find a
point which is at the maximum perpendicular distance from the hyperplane, that is, the
furthest-supernode. We need to check only the transformed nodes expressed by eqn (6.8)
and covered by $x_i = 0, \ldots, y_i - 1$ for $i = 1, \ldots, N-1$, because the distances of $\lfloor q \rfloor$ are
a periodic function of the $x_i$'s. The hyperplane for a supernode polyhedron is created by
translating $aBq + c = 0$ to pass through the furthest-supernode.
By applying the procedure, the quasi-supernode polyhedron of eqn (6.5) is expanded
to
(6.9)
Eqn (6.9) is different from eqn (6.5) only with respect to the constant vector.
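In Example 6.1.1, consider the last inequality of eqn (6.7), $-18q_0 - q_1 + 2q_2 - 10 \le 0$,
which corresponds to the original inequality $i_2 - 10 \le 0$. It can be verified that
\[
U_a = U_a^{-1} = \begin{bmatrix} 0 & 0 & 1 \\ 0 & -1 & 0 \\ 1 & 0 & 0 \end{bmatrix}
\quad \text{and} \quad
B^{-1}Z = \frac{1}{6} \begin{bmatrix} 0 & 1 \\ -6 & 6 \\ -3 & 12 \end{bmatrix}
\]
So all supernodes can be expressed by
\[
s = \left\lfloor [0, 0, 5]^T + B^{-1}Z\, x \right\rfloor
\]
where $[0, 0, 5]^T$ is a vertex on the plane. The furthest-supernode is $[0, 5, 15]^T$. Therefore
the plane is translated to $-18q_0 - q_1 + 2q_2 - 25 = 0$.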
Similarly, eqn (6.7) is modified to
\[
\begin{array}{r} -q_0 \le 0 \\ 6q_0 - q_1 \le 0 \\ 18q_0 + q_1 - 2q_2 \le 1 \\ 3q_0 \le 10 \\ -6q_0 + q_1 \le 25 \\ -18q_0 - q_1 + 2q_2 \le 25 \end{array}
\tag{6.10}
\]
to enclose all supernodes. Obviously the supernode $[0, 21, 12]^T$, previously outside the
polyhedron, is now within the enlarged supernode polyhedron.
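The constants of eqn (6.10) can be cross-checked by brute force. The sketch below does not implement Algorithm 6.2.1, which exploits the periodicity of the floor operation along each hyperplane; it simply condenses every node of Example 6.1.1 to its supernode and pushes each constant outwards as far as needed. The name expanded_constants is hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import floor

# B^{-1} of eqn (6.6) and the quasi-supernode polyhedron of eqn (6.7).
B_INV = [[Fraction(1, 6), 0, 0],
         [1, 1, 0],
         [2, Fraction(1, 2), Fraction(1, 2)]]
A_Q = [[-1, 0, 0], [6, -1, 0], [18, 1, -2], [3, 0, 0], [-6, 1, 0], [-18, -1, 2]]
C_Q = [0, 0, 0, 10, 20, 10]


def expanded_constants():
    """Enlarge each constant of eqn (6.7) until every supernode is enclosed."""
    consts = list(C_Q)
    for i in product(range(21), range(21), range(11)):
        # Floor integration s = floor(B^{-1} i) of one original node.
        s = [floor(sum(Fraction(c) * x for c, x in zip(row, i))) for row in B_INV]
        for k, row in enumerate(A_Q):
            # Hyperplanes are only ever translated outwards, never inwards.
            consts[k] = max(consts[k], sum(a * x for a, x in zip(row, s)))
    return consts


C_EXPANDED = expanded_constants()
```

For Example 6.1.1 this exhaustive check gives the same right-hand sides as eqn (6.10); for large polyhedra the periodic search of Algorithm 6.2.1 would be the cheaper route.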
Enclosing only the supernodes
When enlarging the quasi-supernode polyhedron to enclose all the supernodes, it is quite
possible to enclose some integral points which are not valid, i.e., false supernodes, in the
new polyhedron, for instance, the point j marked "x" in Figure 6.1. Because point j is
included within the enlarged supernode polyhedron, it will be treated as a supernode,
but it contains no nodes at all. These false supernodes should be removed because extra
time is required to process them in real algorithms.
To remove the false supernode, let us consider the 2-D problem shown in Figure 6.1.
The following facts can be observed.
1. The false supernodes appear only around the vertices associated
with any translated lines, for instance, point A in Figure 6.1. This makes it possible to
dismiss the false supernodes by adding some extra cutting lines passing the vertices, such
as the line fg, which dismisses the false supernode j.
2. For the vertex A, which is not integral, such a cutting line must
pass the supernode k, which is the floor integration of point A, and is not allowed to pass
through the original polyhedron. This is because the cutting line must not force any valid
supernodes outside the domain.
3. Some of the extra cutting lines, i.e. the dashed lines, are not necessary, because
there are no false supernodes to be dismissed around the relevant vertex. For example, eh
passing the supernode e cuts nothing, because there are no supernodes (integral points)
in the triangular area ehbe.
As a result, Figure 6.1 requires only one cutting line fg, and the supernode polyhedron
is confined by 5 lines fdcbgf.
These results can be extended to N-D cases. The false supernodes may appear only
near the margins of facets of the polyhedron, that is, along and about the edges associated
with any translated hyperplanes, rather than the vertices. So all such edges should be
found initially. The edges can be found as follows. Suppose that there are $n_v$ vertices lying
on a hyperplane. It is easy to find the $n_v(n_v - 1)/2$ connecting lines between any two vertices.
For each of the connecting lines, generate an auxiliary hyperplane which contains the
connecting line, and then check the other $n_v - 2$ vertices which are off the connecting
line. If the $n_v - 2$ vertices are only on one side of the auxiliary hyperplane, take the line
as an edge; otherwise, ignore the line. The main computation lies in the checking, which
needs about $N n_v (n_v - 1)(n_v - 2)/2$ arithmetic operations. If $n_v = N$, the checking is no longer
needed, because all the connecting lines are edges.
Recall that all the edges are intersections of N-1 hyperplanes of eqn (6.9). It is easy
to find which N-1 hyperplanes are associated with an edge. Suppose an edge exists
between two vertices $w^1 = [w^1_0, \ldots, w^1_{N-1}]^T$ and $w^2 = [w^2_0, \ldots, w^2_{N-1}]^T$, corresponding
to the vertices $v^1$ and $v^2$ in the original space. We give an algorithm to generate the
necessary cutting hyperplanes to remove false supernodes along the edge.
Algorithm 6.2.2 Generate Hyperplanes to Remove False Supernodes
$d := [d_0, \ldots, d_{N-1}]^T = w^2 - w^1$, d is the integral direction of the edge
$d^v := v^2 - v^1$ and normalise such that $\gcd(d^v) = 1$
$d^w := B^{-1} d^v$
$j = \{r : d^w_r = \max_{i=0}^{N-1} d^w_i\}$
Generate two planes $p^1: q_j = \lfloor w^1_j \rfloor$ and $p^2: q_j = \lceil w^2_j \rceil$
Generate $C_N^2$ hyperplanes $h_i: d_{i_2} q_{i_1} - d_{i_1} q_{i_2} - (d_{i_2} w_{i_1} - d_{i_1} w_{i_2}) = 0$, where $C_N^2$ is a binomial
coefficient, $i_1$ and $i_2 \in [0, N-1]$, and $i_1 \ne i_2$
FOR i = 1, $C_N^2$
    IF $h_i$ not cutting through the quasi-supernode polyhedron
        $d_{i_1} := m d^w_{i_1}$ and $d_{i_2} := m d^w_{i_2}$ such that $d_{i_1}$, $d_{i_2}$ and m are integers,
        and $\gcd(d_{i_1}, d_{i_2}, m) = 1$, where $i_1$ and $i_2$ indicate the subspace $h_i$ lies on
        $q_j := \{q(r) : \mathrm{Dist}(q(r), h_i) = \max_{j=0}^{m-1} \mathrm{Dist}(q(j), h_i)\}$,
        where $q(j) = \lfloor [\, w^1_{i_1} + j\, d_{i_1}/m,\ w^1_{i_2} + j\, d_{i_2}/m \,] \rfloor$
Translate hi to pass through qj
Form a prism with the translated hi and the N-l hyperplanes of eqn (6.9)
Close it by pI and p2
IF there are integral points inside the closed prism
Append the translated hi into the set of inequalities of eqn (6.9)
End of Algorithm
The basic idea of the algorithm is:
Step 1 Generate $C_N^2$ potential cutting hyperplanes containing the edge, each of which is
perpendicular to a coordinate plane. If a cutting hyperplane penetrates the domain,
remove it.
Step 2 Translate a cutting hyperplane so that it passes the related furthest-supernode of
the edge.
Step 3 Form a prism with the translated hyperplane and the N-1 hyperplanes which are
associated with the edge.
Step 4 Check if there are integral points, i.e. false supernodes, in the prism. If yes, add the
hyperplane to eqn (6.9) to remove the false supernodes.
We still use Fig 6.1 to explain. Point A is equivalent to an edge; line fg corresponds to
the cutting hyperplane which originally contains point A but is then translated to pass
point k, which is the furthest-supernode related to point A. Lines ad and ab are equivalent
to the N-1 hyperplanes associated with the edge; together with fg they form a prism (a
triangle, here). Point j is an integral point in the prism. It is a false supernode which has
to be removed by adding fg.
The supernode polyhedron has the form of
(6.11)
where $m' \ge m$. Eqn (6.11) is the same as eqn (6.9) except that $m' - m$ cutting hyperplanes
are appended.
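For Example 6.1.1, we omit the complex operations and find that 4 cutting planes
are needed to remove all possible false supernodes. Finally, the supernode polyhedron,
indexed with an integral tuple $s = [s_0, \ldots, s_{N-1}]^T$, is confined by ten inequalities
\[
\begin{bmatrix} -1 & 0 & 0 \\ 6 & -1 & 0 \\ 18 & 1 & -2 \\ 3 & 0 & 0 \\ -6 & 1 & 0 \\ -18 & -1 & 2 \\ 0 & -2 & 1 \\ 0 & 2 & -1 \\ 0 & -1 & 2 \\ 0 & 1 & 0 \end{bmatrix} s \le
\begin{bmatrix} 0 \\ 0 \\ 1 \\ 10 \\ 25 \\ 25 \\ 5 \\ 30 \\ 70 \\ 40 \end{bmatrix}
\tag{6.12}
\]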
6.2.2 Vertices of the Enlarged Quasi-Supernode Polyhedron
The next problem we are concerned with is to find the vertices, $V^e$, for the enlarged
quasi-supernode polyhedron. In Subsection 6.5.1, the vertices will be projected by T to
find a bounding box in the processor-time space. The straightforward way of solving
the problem is to find the intersections of the new set of hyperplanes, i.e., to solve $C_{n_q}^N$
systems of N-D equations selected from the new $n_q$ hyperplanes confining the enlarged
quasi-supernode polyhedron, and then collect the resulting intersection points and remove
those which are outside the polyhedron. This process is very time-consuming, so we use
a faster approach given below.
Step 1 For a translated hyperplane, find all edges which pass through the vertices on its
previous hyperplane and penetrate the hyperplane.
Step 2 Extend each of the edges to penetrate the translated hyperplane. The intersection
points are new temporary vertices, take them in place of the corresponding old
vertices. Then renew all the relevant edges according to the new vertices.
Step 3 Repeat Step 1 and 2 for all translated hyperplanes.
6.2.3 Boundary of a Single Supernode
A single supernode containing a collection of the original nodes is a basic unit of compu-
tation in a processor for a given time step. To do the computing, for any supernode s,
we must know which actual nodes are contained inside. There are two ways to solve this
problem.
1. Substituting $q = B^{-1}i$ into eqn (6.3) yields $bs \le B'i < bs + b\mathbf{1}$, where B' is the
adjoint matrix of B and b = det(B). However, we must remember that no nodes can be
outside the original polyhedron. Therefore, the polyhedron for a supernode addressed by
s is confined by these inequalities together with eqn (6.4).
2. Because E is integral and unimodular, we build a transformation $q' = E^{-1}i$ which
maps i to another integral space $S^{q'}$ indexed by q'. Obviously, q' = Gq. Substituting
$q = G^{-1}q'$ into eqn (6.3), we have $Gs \le q' < Gs + g$. Therefore, in the q' space, the nodes
in a supernode form a hypercube of size g. In this space, the dependencies become
$D^{q'} = E^{-1}D$. Furthermore, let $q'' = q' - Gs$, where $0 \le q'' < g$. The dependencies in terms of
q'' are the same as $D^{q'}$, because q'' is equivalent to q' except for a shift.
However, the original polyhedron boundary still has to be imposed to keep out invalid
nodes. Substituting $i = E(q'' + Gs)$ into eqn (6.4) produces $A_{m,N}Eq'' \le c - A_{m,N}Bs$.
In addition, as will be seen later, in most cases a supernode is addressed by a mapping
$s = T^{-1}j$, where j is the index in the processor-time domain. Thus, the nodes indexed by
q'' in the supernode addressed by j are confined by
(6.13)
where $\begin{bmatrix} -I \\ I \end{bmatrix} q'' < \begin{bmatrix} \mathbf{1} \\ g \end{bmatrix}$ is termed the hypercube-boundary-subset, and the remaining
part is referred to as the non-hypercube-boundary-subset because it is associated
with the boundary of the original polyhedron instead of with the hypercube-boundary
of a supernode. We also say the supernode is confined in an area $[0, g_0) \times \cdots \times [0, g_{N-1})$,
where [x, y) indicates an integral interval from x to y-1. Since manipulations in a
supernode become much simpler in the second method, we use this approach for the following
work.
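The second method can be sketched in a few lines for Example 6.1.1. The code below is an illustration only, assuming E is the unimodular matrix with B = EG and G = diag(g), g = [6, 1, 2]; the function name nodes_in_supernode is hypothetical. It walks the hypercube $0 \le q'' < g$, maps each point back through $i = E(q'' + Gs)$, and keeps only the nodes inside the original polyhedron of eqn (6.1).

```python
from itertools import product

# Assumption: E is unimodular and B = E * diag(G) for Example 6.1.1.
E = [[1, 0, 0], [-1, 1, 0], [-3, -1, 1]]
G = (6, 1, 2)
UPPER = (20, 20, 10)  # loop bounds 0..20, 0..20, 0..10 of Loops 6.1.1


def nodes_in_supernode(s):
    """Original nodes i contained in the supernode addressed by s."""
    nodes = []
    for q_dd in product(*(range(g) for g in G)):          # q'' in [0, g)
        q_d = [q_dd[k] + G[k] * s[k] for k in range(3)]   # q' = q'' + G s
        i = tuple(sum(E[r][k] * q_d[k] for k in range(3)) for r in range(3))
        # Impose the original polyhedron boundary to keep out invalid nodes.
        if all(0 <= i[k] <= UPPER[k] for k in range(3)):
            nodes.append(i)
    return nodes


# Every original node belongs to exactly one supernode, so summing over a
# box that covers all supernodes should recover the 4851 nodes.
TOTAL = sum(len(nodes_in_supernode(s))
            for s in product(range(4), range(44), range(56)))
```

For instance, the supernode $[0, 21, 12]^T$ obtained earlier contains the node $[1, 20, 1]^T$, reached with $q'' = [1, 0, 1]^T$.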
6.3 Data Flows
An important problem is to find out what data will be transmitted outside the supernode
and in which direction of the processor array they flow. It is not necessary to discuss
the problem of the in-coming data, because the outgoing data of a supernode is just the
in-coming data of another supernode at the correct location and correct time determined
by the dependencies in the processor-time domain.
6.3.1 Outgoing Data of a Single Supernode
Let us consider one of the original dependency vectors $d \in D$. The data dependency
projected in the quasi-supernode domain is $d^q = B^{-1}d = [d^q_0, \ldots, d^q_{N-1}]^T$.
For a transformed node q where $s \le q < s + \mathbf{1}$, if $q_i + d^q_i \ge s_i + 1$, the variables in q
involved with $d^q$ will flow over the i-th boundary of the supernode. Such data flows can
be thought of as a data transfer "1" in the i-th dimension of the supernode domain. We
denote the inequality $s_j + 1 - d^q_j \le q_j$ as $C_j$, and $s_j + 1 - d^q_j > q_j$ as $\bar{C}_j$. $C_j$ and $\bar{C}_j$ can
also be expressed as $g_j - d^{q'}_j \le q''_j$ and $g_j - d^{q'}_j > q''_j$, respectively. For simplicity, consider
the 2-D case of Figure 6.2.
If there is a quasi-supernode dependency vector $d^q = [d^q_0, d^q_1]^T$, a supernode is divided
into four parts. The transformed nodes in A2 are confined by $C_0\bar{C}_1$ and transfer data
out in the 0-th dimension, forming a supernode dependency vector $d^{A2} = [1, 0]^T$; the
transformed nodes in A1 are confined by $\bar{C}_0C_1$ and transfer data out in dimension 1,
forming a supernode dependency vector $d^{A1} = [0, 1]^T$; the transformed nodes in A3 are
confined by $C_0C_1$ and transfer data out in both the 0 and 1 dimensions, forming a supernode
dependency vector $d^{A3} = [1, 1]^T$. Notice that there is no data flow out from A4, where the
nodes plus $d^q$ are still in the supernode.
Figure 6.2: Dependencies in the quasi-supernode domain and in the supernode domain.
Generally, an N-D supernode can be divided into $2^N$ areas confined by $\bigcap_{j=0}^{N-1} C^*_j$, where
$C^*_j = C_j$ or $\bar{C}_j$. For a given area, if $C^*_j = C_j$, the area will form a data flow just to the
next supernode along the j-th direction. Therefore, we have a set of $2^N - 1$ canonical
dependencies in the supernode space, which form an N-bit binary table without $[0 \cdots 0]$,
i.e., $D^s$, as introduced in Subsection 4.2.2. For instance, from the 3-bit binary table, a
supernode dependency vector $[1, 0, 1]^T$, denoted $d^s_{101}$, can be expected to contribute from
the area $C_0\bar{C}_1C_2$.
With an expression such as $\bigcap_{j=0}^{N-1} C^*_j$ describing the confining area, we can draw the
part of a supernode where variables flow outside in the direction $d^s$. This is achieved by
replacing the inequality $s_i \le q_i$ in $s \le q < s + \mathbf{1}$ with $s_i + 1 - d^q_i \le q_i$ if the term is
$C_i$, or replacing the inequality $q_i < s_i + 1$ with $s_i + 1 - d^q_i > q_i$ if the term is $\bar{C}_i$, for every
term of the expression. It is now straightforward to obtain a hypercube similar to the
hypercube-boundary-subset of eqn (6.13) by replacing the i-th entry of the right-hand side
constants with $-(g_i - d^{q'}_i) + 1$, or the (N+i)-th entry with $(g_i - d^{q'}_i)$, if the term is $C_i$ or
$\bar{C}_i$, respectively. We call this a Data-Flow-Cube (DC). There are at most $2^N - 1$ DCs for
a given N-D dependency vector.
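The construction of the Data-Flow-Cubes can be sketched as follows. This is a minimal illustration in the q'' coordinates, assuming a dependency with $0 \le d^{q'}_i \le g_i$ in every dimension; the function name data_flow_cubes and the dictionary layout are hypothetical. Each non-zero bit pattern of the N-bit binary table selects, per dimension, either the slab $[g_i - d^{q'}_i, g_i)$ (term $C_i$) or the slab $[0, g_i - d^{q'}_i)$ (term $\bar{C}_i$).

```python
from itertools import product


def data_flow_cubes(g, d):
    """Map each supernode dependency (bit pattern) to its Data-Flow-Cube.

    g: size of the supernode hypercube; d: one dependency vector in the
    q' coordinates, assumed to satisfy 0 <= d[i] <= g[i].
    A cube is a list of half-open integral intervals (lo, hi), one per axis.
    """
    cubes = {}
    for bits in product((0, 1), repeat=len(g)):
        if not any(bits):
            continue  # pattern [0...0]: the data stays inside the supernode
        box = [(g[i] - d[i], g[i]) if bits[i] else (0, g[i] - d[i])
               for i in range(len(g))]
        if all(lo < hi for lo, hi in box):  # drop empty cubes
            cubes[bits] = box
    return cubes


EXAMPLE_DCS = data_flow_cubes((6, 1, 2), (1, 0, 0))
TWO_D_DCS = data_flow_cubes((4, 4), (1, 2))
```

With g = (6, 1, 2) and the dependency (1, 0, 0), only the pattern [1, 0, 0] survives, with the cube [5, 6) x [0, 1) x [0, 2). The 2-D call with the made-up values g = (4, 4), d = (1, 2) yields three cubes, matching the areas A1, A2 and A3 discussed above.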
6.3.2 Outgoing Data Packets of a Processor
It is easy to map the dependencies of supernodes onto the processor-time space. That is,
(6.14)
where $d^p$ is the dependency in the processor domain, indicating the direction of the data
flow; $d^t$ is the delay step (time-delay) between the producing time and the consuming time
of the data. Therefore, each DC has three attributes, $d^s$, $d^p$ and $d^t$, written as $DC^{d^s}_{d^p,d^t}$.
All the $d^p$'s contribute to a dependency matrix $D^p$ in the processor domain. For
instance, for a 3-D computational problem with a 2-D processor array, for the two cases of
$S = S_{SUC}$ and $S = S_{SBC}$, we have
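\[
D^p_3 = \begin{cases}
\cdots & \text{for SUC} \\[6pt]
\begin{array}{cccccc}
d^p_{0,1} & d^p_{0,-1} & d^p_{1,0} & d^p_{-1,0} & d^p_{-1,1} & d^p_{1,-1} \\
0 & 0 & 1 & -1 & -1 & 1 \\
1 & -1 & 0 & 0 & 1 & -1
\end{array} & \text{for SBC}
\end{cases}
\]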
The discussion above considers only one original dependency vector. The same
operation must be repeated for all $d \in D$. Note that each dependency produces the same
kinds of dependency vectors, like $S_{SUC}D^s$ or $S_{SBC}D^s$, in the processor array domain, but
defined in different DCs. Considering all the $d^p$'s and d's, we can have $n_{D^p} \times n_d$ such
values. That is, $Q_{i,j} = \{q : \text{along } d^p_i \in D^p \text{ and associated with } d_j \in D\}$, where $n_{D^p}$ and
$n_d$ are the numbers of vectors of $D^p$ and D, respectively. A $Q_{i,j}$ may consist of more than
one DC. For efficient data transference, all DCs which flow in the same direction, i.e. the
same $d^p$, can be collected into a single data packet.
Note that in Example 6.1.1, $A(i_0, i_1, i_2) = A(i_0 - 1, i_1 + 1, i_2 + 3) + A(i_0, i_1 - 1, i_2 - 1) +
A(i_0, i_1 - 1, i_2 + 1) + A(i_0, i_1 - 1, i_2)$, so that all the dependencies are associated with a
single variable A(i, j). This means that it is quite possible to transfer the same members
of A(i, j) along one direction several times. To avoid this, if the original dependencies
$d_{j_0}, \ldots, d_{j_{r-1}}$ are associated with one variable, say A, we may transfer the members of A
in the area $\bigcup_{k=0}^{r-1} Q_{i,j_k}$ along $d^p_i$ only once. In addition, it is desirable to modify the logical
sum to the form of a direct logical sum, that is, one in which no two terms share any
elements. Therefore, $Q_{i,j_k}$ should be modified to $Q_{i,j_k} - Q_{i,j_0} - \cdots - Q_{i,j_{k-1}}$.
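The direct logical sum amounts to removing from each region everything already covered by the regions before it. A minimal sketch, with regions modelled as Python sets of node indices and a hypothetical function name disjoint_packets:

```python
def disjoint_packets(regions):
    """Turn overlapping regions Q_{i,j_0}, ..., Q_{i,j_{r-1}} into disjoint ones.

    Each region is replaced by itself minus all earlier regions, so the union
    is unchanged while no element is transferred twice.
    """
    seen = set()
    result = []
    for region in regions:
        result.append(region - seen)
        seen |= region
    return result


# Made-up regions, used only to show the effect of the subtraction.
PACKETS = disjoint_packets([{1, 2, 3}, {2, 3, 4}, {4, 5}])
```

Here the three overlapping regions become {1, 2, 3}, {4} and {5}, whose union is still {1, ..., 5}.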
6.4 Algorithm Generation For A Lower Dimensional
Processor Array
Given the above techniques we can proceed to the problem of algorithm generation for a
given array. For the Lower-D Case, with a K-D time domain, we analyse the drawback of
the K-D time parallel algorithm, and develop the 1-D time parallel algorithm, which is
similar to the K-D one in form.
6.4.1 K-D Parallel Algorithms
As is known, for the Lower-D Case, the unimodular space-time projection matrix
\[
T = \begin{bmatrix} S_{M \times N} \\ \Pi_{K \times N} \end{bmatrix}
\]
is used to project the supernode polyhedron into an M-D processor domain by S and a
K-D time domain by $\Pi$. That is, a supernode s will be computed at the processor Ss at
the moment $\Pi s$. The M + K (M + K = N) dimensional processor-time space is indexed
with an integral tuple $j = [j_0, \ldots, j_{N-1}]^T$.
However, our interest in this chapter is not the mapping from the supernode domain
to the time-processor domain, $s \Rightarrow j$, but the reverse mapping, $s \Leftarrow j$, since for an
actual parallel computation, we must know which supernode should be executed at a
particular processor at a particular moment. We use $s = T^{-1}j$ to reference a supernode
to be computed at a certain processor at a certain K-D moment. Substituting $s = T^{-1}j$
into eqn (6.11), we can obtain an N-D processor-time polyhedron
\[
A''_{m',N}\, j \le c
\tag{6.15}
\]
where $A''_{m',N} = A'_{m',N} T^{-1}$.
From eqn (6.15) and by means of the algorithm proposed in Section 6.6, we derive
loops with the following form
Loops 6.4.1 N-D Processor-time Loops
DOALL j_0 := l_0 TO u_0
  ...
  DOALL j_{M-1} := l_{M-1}(j_0, ..., j_{M-2}) TO u_{M-1}(j_0, ..., j_{M-2})
    FOR j_M := l_M(j_0, ..., j_{M-1}) TO u_M(j_0, ..., j_{M-1})
      FOR j_{M+1} := l_{M+1}(j_0, ..., j_M) TO u_{M+1}(j_0, ..., j_M)
        ...
        FOR j_{N-1} := l_{N-1}(j_0, ..., j_{N-2}) TO u_{N-1}(j_0, ..., j_{N-2})
          computing a supernode
where $l_i(j_0, \ldots, j_{i-1})$ involves $\lceil\ \rceil$ and max operations, but, here, we pay attention only to
the fact that it is a function of $j_0, \ldots, j_{i-1}$. Similar comments are true for $u_i(j_0, \ldots, j_{i-1})$,
which involves $\lfloor\ \rfloor$ and min operations.
6.4.2 K-D Time Domain to 1-D Time Domain
There are K FOR-loops for a processor in Loops 6.4.1, corresponding to the local K-D
time domain, which is usually not regular and is different for each processor. If we let these
K FOR-loops run individually in each processor without some kind of control, the time
delay of data transfers between processors will become variable. Therefore, we should
reorganise the K FOR-loops into a single loop and let the single loop run synchronously
in each processor.
If the K-D time domain is re-mapped with p along a 1-D time domain indexed by t,
for any dependency $d^s \in D^s$, the dependency in this domain is $p\Pi d^s$, which is a
constant and can be implemented with a uniform pipeline mechanism. Therefore, we modify
the K nested FOR loops with respect to $j_M, \ldots, j_{N-1}$ by introducing a single increasing
argument t and decompose t into $j_M, \ldots, j_{N-1}$ such that $t = \sum_{i=M}^{N-1} p_{i-N+K}\, j_i$. The
decomposition must be unique, because it is used to reference supernodes. If the
decomposition is not unique, more than one supernode may be referenced at a processor
at a particular time t, but only one can be executed. Fortunately, it turns out that by
employing the boundaries of $j_M, \ldots, j_{N-1}$, it is possible to find such a decomposition.
Let the first executed supernode be $s^I$. The supernode $s^I$ is computed at moment
$t' = p\Pi s^I$. Loops 6.4.1 are changed as follows
Loops 6.4.2 Processor-time Loops with 1-D Time domain
DOALL j_0 := l_0 TO u_0
  ...
  DOALL j_{M-1} := l_{M-1}(j_0, ..., j_{M-2}) TO u_{M-1}(j_0, ..., j_{M-2})
    t := t'
    FOR j_M := l_M(j_0, ..., j_{M-1}) TO u_M(j_0, ..., j_{M-1})
      ...
      FOR j_{N-2} := l_{N-2}(j_0, ..., j_{N-3}) TO u_{N-2}(j_0, ..., j_{N-3})
        FOR j_{N-1} := t - \sum_{i=M}^{N-2} p_{i-M} j_i TO u_{N-1}(j_0, ..., j_{N-2})
          IF j_{N-1} >= l_{N-1}(j_0, ..., j_{N-2})
            compute a supernode
          t := t + 1
Let us prove the correctness. For the inner-most loop, because the lower bound of
$j_{N-1}$ is $t - \sum_{i=M}^{N-2} p_{i-M}\, j_i$, the equation
\[
t = \sum_{i=M}^{N-1} p_{i-N+K}\, j_i
\]
holds at the lower end of the inner-most loop, since $p_{K-1} = 1$, which is the coefficient of
$j_{N-1}$. As $j_{N-1}$ and t increase in step, the equation always holds. This is the decomposition
we expect. However, only when $j_{N-1} \ge l_{N-1}(j_0, \ldots, j_{N-2})$ is such a decomposition valid
and a supernode can be referenced; otherwise the decomposition is outside the valid
processor-time domain defined in eqn (6.15), so no supernode will be computed.
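The behaviour of the single time counter can be illustrated on a toy 2-D time domain (K = 2) for one processor. Everything numerical in this sketch is hypothetical: the weight P0 = 6 (so p = [6, 1]), the bounds l2 and u2, and the range of j1 are made up for illustration and are not taken from Example 6.1.1.

```python
P0 = 6  # hypothetical weight of the outer time index, i.e. p = [P0, 1]


def l2(j1):
    """Hypothetical lower bound of the inner time index."""
    return j1


def u2(j1):
    """Hypothetical upper bound of the inner time index."""
    return j1 + 4


def schedule(j1_lo=0, j1_hi=3):
    """Run the loops with a 1-D clock t and record (t, j1, j2) per supernode."""
    t = P0 * j1_lo + l2(j1_lo)   # t' of the first executed supernode
    events = []
    for j1 in range(j1_lo, j1_hi + 1):
        j2 = t - P0 * j1         # lower end derived from the running clock
        while j2 <= u2(j1):
            if j2 >= l2(j1):     # inside the valid time domain
                events.append((t, j1, j2))
            t += 1               # the clock ticks whether or not work is done
            j2 += 1
    return events


EVENTS = schedule()
```

In this toy setting every recorded event satisfies t = P0 * j1 + j2, each valid pair (j1, j2) is visited exactly once, and t never decreases; the clock simply idles over the steps where the decomposition falls outside the bounds.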
It is interesting to compare the two loops. The loops in Loops 6.4.2 look quite similar
to those in Loops 6.4.1. However, in Loops 6.4.2, the real count variable is t, which
increases one by one from t'. Although the K nested FOR loops remain, their purpose is
to decompose t into $j_M, \ldots, j_{N-1}$ subject to the boundary condition.
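In Example 6.1.1, we find that
\[
T^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ -1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}
\tag{6.16}
\]
The corresponding p = [18, 1] gives the minimum executing time of 1034 time units. And
\[
T' = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 18 \end{bmatrix}.
\]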
Substituting $s = T^{-1}j$ into eqn (6.12), the processor-time polyhedron is produced
\[
\begin{array}{ll}
-j_0 \le 0 & -17j_0 + 2j_1 - j_2 \le 25 \\
7j_0 - j_2 \le 0 & 2j_0 + j_1 - 2j_2 \le 5 \\
17j_0 - 2j_1 + j_2 \le 1 & -2j_0 - j_1 + 2j_2 \le 30 \\
3j_0 \le 10 & j_0 + 2j_1 - j_2 \le 70 \\
-7j_0 + j_2 \le 25 & -j_0 + j_2 \le 40
\end{array}
\tag{6.17}
\]
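Then, we can obtain the boundaries of the processor-time domain:
\[
\begin{array}{l}
l_0 = 0 \\
u_0 = 3 \\
l_1(j_0) = \lceil (24j_0 - 1)/2 \rceil \\
u_1(j_0) = \lfloor \min(55,\ 12j_0 + 25,\ (6j_0 + 95)/2,\ (18j_0 + 65)/2) \rfloor \\
l_2(j_0, j_1) = \lceil \max(7j_0,\ -17j_0 + 2j_1 - 25,\ j_0 + 2j_1 - 70,\ (2j_0 + j_1 - 5)/2) \rceil \\
u_2(j_0, j_1) = \lfloor \min(-17j_0 + 2j_1 + 1,\ 7j_0 + 25,\ j_0 + 40,\ (2j_0 + j_1 + 30)/2) \rfloor
\end{array}
\tag{6.18}
\]
where the explicit derivation follows from Section 6.6.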
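The bounds of eqn (6.18) can be exercised directly. The following sketch enumerates the processor-time points they generate and evaluates the 1-D time $t = 18j_1 + j_2$ obtained with p = [18, 1]; it is a check on the example only, and the names ceil_div and processor_time_points are hypothetical.

```python
def ceil_div(a, b):
    """Ceiling of a / b for integers with b > 0."""
    return -((-a) // b)


def processor_time_points():
    """Enumerate (j0, j1, j2) within the bounds of eqn (6.18)."""
    points = []
    for j0 in range(0, 3 + 1):
        l1 = ceil_div(24 * j0 - 1, 2)
        u1 = min(55, 12 * j0 + 25, (6 * j0 + 95) // 2, (18 * j0 + 65) // 2)
        for j1 in range(l1, u1 + 1):
            l2 = max(7 * j0, -17 * j0 + 2 * j1 - 25, j0 + 2 * j1 - 70,
                     ceil_div(2 * j0 + j1 - 5, 2))
            u2 = min(-17 * j0 + 2 * j1 + 1, 7 * j0 + 25, j0 + 40,
                     (2 * j0 + j1 + 30) // 2)
            for j2 in range(l2, u2 + 1):   # empty whenever l2 > u2
                points.append((j0, j1, j2))
    return points


POINTS = processor_time_points()
# 1-D time of each point, t = p . (j1, j2) with p = [18, 1].
TIMES = [18 * j1 + j2 for _, j1, j2 in POINTS]
```

The enumeration visits 1116 points, one per supernode of eqn (6.12); the times span 1034 consecutive steps, and no processor $j_0$ is assigned two supernodes at the same time t.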
Now consider what should be done in a processor at a given moment. According to
eqn (6.13), in Example 6.1.1, the non-hypercube-boundary-subset of a supernode is
\[
\begin{array}{rcl}
-q''_0 & < & -6j_0 + 1 \\
q''_0 - q''_1 & < & 7j_0 - j_2 + 1 \\
3q''_0 + q''_1 - q''_2 & < & 17j_0 - 2j_1 + j_2 + 1 \\
q''_0 & < & 6j_0 + 21 \\
-q''_0 + q''_1 & < & -7j_0 + j_2 + 21 \\
-3q''_0 - q''_1 + q''_2 & < & -17j_0 + 2j_1 - j_2 + 11
\end{array}
\]
and the nested loops for a supernode are
Loops 6.4.3 Single Supernode Loops
FOR q''_0 = max(6j_0, 0) TO min(6j_0 + 20, 5)
  FOR q''_1 = max(q''_0 + 7j_0 - j_2, 0) TO min(q''_0 + 7j_0 - j_2 + 20, 0)
    FOR q''_2 = max(3q''_0 + q''_1 + 17j_0 - 2j_1 + j_2, 0) TO min(3q''_0 + q''_1 + 17j_0 - 2j_1 + j_2 + 10, 1)
Because $g = [6, 1, 2]^T$, the supernode is also said to be confined in $[0, 6) \times [0, 1) \times [0, 2)$.
As regards outgoing data, in Example 6.1.1, $SD^q = [\tfrac{1}{6}, 0, 0, 0]$. Fortunately, since there
is only one nonzero element in $SD^q$ and we are given a linear array with a SUC mesh,
this is the simplest case to find the outgoing data. Note that
\[
D^{q'} = E^{-1}D = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 2 & 0 & 1 \end{bmatrix}
\]
That is, there is only one condition:
\[
C_0: s_0 + 1 - d^q_0 \le q_0 \quad \text{and} \quad d^q_0 = \tfrac{1}{6}
\tag{6.19}
\]
Replacing the constant of the first inequality of eqn (6.13) with -4 (i.e. $-4 = -(g_0 - d^{q'}_0) + 1
= -(6 - 1) + 1$), this condition results in $q''_0 \ge 5$. Therefore, there is only one data
flow cube $DC^{100}$, which is confined by $[5, 6) \times [0, 1) \times [0, 2)$ in a supernode.
In summary, the transformed parallel algorithm becomes
DOALL j_0 := 0 TO 3
  t := 0
  FOR j_1 := ceil((24j_0 - 1)/2) TO floor(min(55, 12j_0 + 25, (6j_0 + 95)/2, (18j_0 + 65)/2))
    FOR j_2 := t - 18j_1 TO floor(min(-17j_0 + 2j_1 + 1, 7j_0 + 25, j_0 + 40, (2j_0 + j_1 + 30)/2))
      t := t + 1
      IF j_2 >= ceil(max(7j_0, -17j_0 + 2j_1 - 25, j_0 + 2j_1 - 70, (2j_0 + j_1 - 5)/2))
        FOR q''_0 = max(6j_0, 0) TO min(6j_0 + 20, 5)
          q''_1 := 0 if q''_0 + 7j_0 - j_2 in [0, 20]
          q''_20 = 3q''_0 + q''_1 + 17j_0 - 2j_1 + j_2
          FOR q''_2 = max(q''_20, 0) TO min(q''_20 + 10, 1)
            A(q''_0, q''_1, q''_2) := A(q''_0 - 1, q''_1, q''_2) + A(q''_0, q''_1 - 1, q''_2 - 2)
                                    + A(q''_0, q''_1 - 1, q''_2) + A(q''_0, q''_1, q''_2 - 1)
        Data transfer operations
The data transfer operations will be discussed in Chapter 7 as a special case of the
(N-1)-D SBC case. Now the correctness of the methods of building the supernode polyhedron
and a supernode can be verified. We know that there are 4851 = 21 x 21 x 11 nodes
in the original polyhedron of Example 6.1.1. Without expanding the quasi-supernode
polyhedron, only 464 supernodes are enclosed in the polyhedron of eqn (6.7) and only
2119 nodes are accessible. With the expansion to eqn (6.10) (without dismissing false
supernodes), 1404 supernodes are enclosed, 288 of which are false, although all the 4851
nodes are accessible. With expansion and dismissal, there are 1116 supernodes enclosed in
the supernode polyhedron confined by eqn (6.12), all the 4851 nodes are accessible, and none
of the 1116 supernodes is empty, i.e., there are no false supernodes. This is exactly as expected.
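The supernode counts quoted above can be reproduced by exhaustive enumeration. The sketch below is a brute-force check, not the method used in the text; it counts the integral points of eqns (6.7), (6.10) and (6.12) inside a search box chosen large enough for this example, and compares them with the set of supernodes actually produced by floor integration. The names points_inside and TRUE_SUPERNODES are hypothetical.

```python
from itertools import product

# Rows of eqn (6.12); the first six are shared with eqns (6.7) and (6.10).
ROWS = [[-1, 0, 0], [6, -1, 0], [18, 1, -2], [3, 0, 0], [-6, 1, 0], [-18, -1, 2],
        [0, -2, 1], [0, 2, -1], [0, -1, 2], [0, 1, 0]]
C_QUASI = [0, 0, 0, 10, 20, 10]            # eqn (6.7)
C_EXPANDED = [0, 0, 1, 10, 25, 25]         # eqn (6.10)
C_FINAL = C_EXPANDED + [5, 30, 70, 40]     # eqn (6.12)

# Search box assumed generous enough to contain all three polyhedra.
BOX = list(product(range(0, 4), range(0, 50), range(0, 70)))


def points_inside(consts):
    """Integral points satisfying the first len(consts) rows of ROWS."""
    rows = ROWS[:len(consts)]
    return {s for s in BOX
            if all(sum(a * x for a, x in zip(r, s)) <= c
                   for r, c in zip(rows, consts))}


# Floor integration of B^{-1} i, written with integer arithmetic.
TRUE_SUPERNODES = {(i0 // 6, i0 + i1, 2 * i0 + (i1 + i2) // 2)
                   for i0, i1, i2 in product(range(21), range(21), range(11))}

QUASI = points_inside(C_QUASI)
EXPANDED = points_inside(C_EXPANDED)
FINAL = points_inside(C_FINAL)
FALSE_SUPERNODES = EXPANDED - TRUE_SUPERNODES
```

The enumeration gives 464, 1404 and 1116 points for the three polyhedra, 288 false supernodes after expansion, and a final polyhedron that coincides exactly with the set of non-empty supernodes.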
6.5 Algorithm Generation For LSGP Case
Now let us consider the LSGP Case, where an (N-1)-D processor array with an SBC mesh
is given. In addition to the supernode partitioning, an LSGP partitioning (compression)
is used to improve efficiency. For a uniform compression, we select a set of equal
compression factors $k = k_0 = \cdots = k_{N-2}$ in advance. The size of the processor array is
also enlarged to $(k \times l_0) \times \cdots \times (k \times l_{N-2})$, i.e., creating an EVPA. See Figure 6.3.
Figure 6.3: LSGP Partitioning layout and dependencies. (a) shows the layout of a 12 x 12
EVPA located in a 4 x 4 processor array. (b) the dependencies of a processor.
For generating the algorithm, we have to deal with three problems:
1. We need only define a rectangular boundary for computing in the processor-time
domain instead of an accurate boundary as in the Lower-D Case. This implies that all
processors are assumed to be working throughout the valid time. The actual work of each
processor is defined by the supernodes assigned to it.
2. In LSGP partitioning, a new coordinate basis is introduced to access the supernode
space. We must find a quick method of accessing the new space from the processor-time
space.
3. Since only the supernodes which are on the boundary of the hypercubes of the LSGP
partitioning can contribute outgoing data, we have to establish rules to produce only the
necessary data flows.
In the following, we focus on the case of (N-1)-D array with SBC mesh, while the case
of (N-1)-D array with SUC mesh and the case of pure LSGP are presented in Appendix
6.5.5 without detail.
6.5.1 Rectangular Boundary in Virtual Processor-Time Domain
$W = TV_e$ gives the vertices of the mapped virtual polyhedron in the virtual processor-time
domain. The maximum and minimum values of the i-th row of W, noted as $w^u_i$ and $w^l_i$
respectively, indicate the coverage of the mapped virtual polyhedron in the i-th
dimension. The $w^u_i$'s and $w^l_i$'s may not be integral and may have to be reduced to integers
within the given array. The virtual processor-time domain is an integral domain and no
supernodes can be mapped outside the array. That is, $w^u_i$ is approximated by $\lfloor w^u_i \rfloor$, and
$w^l_i$ by $\lceil w^l_i \rceil$. For the first N-1 rows of W, which are associated with space allocation, the
relation $\lfloor w^u_i \rfloor - \lceil w^l_i \rceil + 1 \le k_i l_i$ must hold (this is guaranteed by the B and T derived
by the method proposed in Chapter 4). For the last row of W, $\lfloor w^u_{N-1} \rfloor - \lceil w^l_{N-1} \rceil + 1$
indicates the computation time.
Let $t^l = \lceil w^l_{N-1} \rceil$ and $t^u = \lfloor w^u_{N-1} \rfloor$, and let $w^l = [\lceil w^l_0 \rceil, \cdots, \lceil w^l_{N-2} \rceil]^T$ and
$w^u = [\lfloor w^u_0 \rfloor, \cdots, \lfloor w^u_{N-2} \rfloor]^T$ indicate the boundary of the mapped virtual array. The mapped virtual
polyhedron is indexed by $w = [w_0, \cdots, w_{N-2}]^T$.
Note that since the boundary of the supernode polyhedron is not used to confine
the computation, some processors may not perform useful work all the time. The
actual computation is determined by combining the boundary of the original computational
polyhedron with the bounds of the supernodes.
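The rounding rule of this subsection can be sketched as follows. This is a minimal illustration in Python, not the thesis implementation: the function name `rectangular_boundary` and the vertex set are hypothetical, while the matrix T is the one quoted later for Example 6.1.1. Exact rational arithmetic is assumed so that the floor and ceiling are not disturbed by rounding errors.

```python
import math
from fractions import Fraction
from itertools import product

def rectangular_boundary(T, vertices):
    """Map vertices by T, then round the row maxima down and the row minima up."""
    n = len(T)
    rows = [[sum(Fraction(T[i][c]) * Fraction(v[c]) for c in range(n))
             for v in vertices] for i in range(n)]
    low = [math.ceil(min(row)) for row in rows]     # ceil of each row minimum
    high = [math.floor(max(row)) for row in rows]   # floor of each row maximum
    # First N-1 rows: space bounds; last row: time bounds.
    return low[:-1], high[:-1], low[-1], high[-1]

# T as quoted for Example 6.1.1; the vertices form a hypothetical box.
T = [[1, 0, -1], [-1, 1, 0], [5, 1, 3]]
box = list(product((0, Fraction(5, 2)), (0, 3), (0, Fraction(3, 2))))
w_low, w_up, t_low, t_up = rectangular_boundary(T, box)
print(w_low, w_up, t_low, t_up)
```

With these assumed vertices the space bounds are rounded inwards, so no point outside the mapped polyhedron's coverage is introduced by the rounding.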
6.5.2 Algorithm Generation Involving LSGP
In Loops 6.4.1, the coordinate j of the processor-time space references a supernode s
directly, that is
Loops6.4.1 :
However, when LSGP partitioning is involved, j cannot directly reference s, because
T is not unimodular. We have to generate a special N-D time basis, indexed by t,
such that j can easily reference s, and establish a mapping from t to s (see later):
Loops 6.5.1 : $j \xrightarrow{\text{solve equations of } H_k} t \xrightarrow{T_b} s$
where $H_k$ is a lower-triangular matrix and $T_b$ is the N-D time basis, both of which are
discussed below. For convenience of presentation, we generate the N-D time basis $T_b$
first, then find a quick transformation from j to t.
An N-D time basis
In the following discussion, we generate an N-D time basis according to Darte's definition
and then modify it into one which can be easily accessed from j.
The last row of T is the timing vector t. A unimodular matrix $T_b^{-1}$ can be constructed
with t as its last row. Then $T_b = [Q', t_{N-1}]$, where $Q'$ is an $N \times (N-1)$ matrix, has
the significant feature that $tT_b = [0, \cdots, 0, 1]$. Therefore, $T_b$ is an N-D time basis, and $Q'$
spans a sub-space such that all the nodes in it are executed at one instant. $Q'$ is not unique,
but $t_{N-1}$ is, and because $\det(T_b) = 1$, $T_b$ can be a basis to access all points.
For any supernode expressed as $s = T_b t'$, where $t' = [t'_0, \cdots, t'_{N-1}]^T$ is the
index in the space spanned by $T_b$, it will be executed at the moment $t'_{N-1}$ at cell
$$w' = A\,[t'_0, \cdots, t'_{N-2}]^T + r'\,t'_{N-1}$$
in the EVPA, where $A_{(N-1)\times(N-1)} = SQ'$ is the activity matrix, and $r' = St_{N-1}$. We can
produce (N-1)! forms of A by permuting the rows of S. Let $P_{(p_0 \cdots p_i \cdots p_{N-3})}$ be a permuting
matrix.
For the (N-1)! forms of A, we compute their Hermite Normal Forms (HNFs), i.e., HNF
$= A_{(p_0 \cdots p_i \cdots p_{N-3})}U_u = P_{(p_0 \cdots p_i \cdots p_{N-3})}SQ'U_u$, where $U_u$ is a unimodular matrix. From the
method in Chapter 4, T guarantees that among the (N-1)! HNFs there exists one whose
diagonal elements are all k, noted as $H_k$. Suppose $P_k$ is the permuting
matrix and $U_k$ the unimodular matrix which produce $H_k$.
Now we can modify the space-mapping matrix and time basis. Let $S_p = P_kS$, $Q = Q'U_k$
and $T_b = [Q, t_{N-1}]$. The matrix $T_b$ is the new N-D time basis which accesses
all supernodes (see footnote 2) and keeps the feature we are concerned with, $tT_b = [0, \cdots, 0, 1]$. The
corresponding activity matrix is a lower-triangular matrix with equal diagonal elements,
i.e., $S_pQ = H_k$.
Note that since $P_k$ permutes the rows of S, the boundary of the mapped virtual array
should be permuted correspondingly, that is, $w^u := P_kw^u$ and $w^l := P_kw^l$.
Quick accessing from processor-time space to supernodes
Now let us access a supernode in the N-D time space from a given processor. At first, the
N-D time space is mapped onto the EVPA. The EVPA is then divided and assigned to
real processors. Finally, in reverse, we build a transformation from the real processors to
supernodes with respect to time.
A supernode s is expressed by $s = T_bt$ in the N-D time space spanned by $T_b$, where
$t = [t_0, \cdots, t_{N-1}]^T$ is its index in the N-D time space. The s is executed at moment $t_{N-1}$
and at a location
$$w = H_k\,[t_0, \cdots, t_{N-2}]^T + r\,t_{N-1} \qquad (6.20)$$
in the EVPA permuted by $P_k$, where $r = P_kr'$.
Now, the mapped virtual array can be allocated to the actual processor array. We let
the processor j', indexed by $j_0, \cdots, j_{N-2}$, contain all the w such that
$w^l + kj' \le w < w^l + kj' + k\mathbf{1}$, or
$$w^l_i + j_ik \le w_i < w^l_i + (j_i + 1)k, \quad i = 0, \cdots, N-2 \qquad (6.21)$$
Thus, the supernode, indexed by t, to be executed at the processor j' at the moment
$t_{N-1}$ is the t which is an integral solution of eqn (6.20) subject to eqn (6.21) (see footnote 3).
2 However, to make sure that $T_b$ can access all supernodes, that is, $T_b$ is bijective, we must prove that
$T_b$ is unimodular. In fact, note that generally, for any two matrices $M_1$ and $M_2$, deleting the i-th row of
$M_1$ and then post-multiplying by $M_2$ is equivalent to deleting the i-th row of $M_1M_2$. Letting $M'_i$ indicate the
matrix $Q'$ with the i-th row deleted and $M_i$ indicate Q with the i-th row deleted, then $M_i = M'_iU_k$
and $\det(M_i) = \det(M'_i)$. Therefore $\det(T_b) = \sum_{i=0}^{N-1}(-1)^i\det(M_i)\,t_{N-1,i} = \sum_{i=0}^{N-1}(-1)^i\det(M'_i)\,t_{N-1,i} = 1$,
where $t_{N-1,i} \in t_{N-1}$.
3 It can be understood that for any given moment $t_{N-1}$, the solution is unique. In fact, assume that
two different supernodes $t' = [t'_0, \cdots, t_{N-1}]^T$ and $t'' = [t''_0, \cdots, t_{N-1}]^T$ are executed at the same time
$t_{N-1}$ at the same processor. Without loss of generality, suppose $t'_0 - t''_0 = \delta \ne 0$. Then t' is allocated at
w' in the mapped virtual array, while t'' is allocated at w''. Because $w'_0 - w''_0 = k\delta$, w' and w'' cannot be in the
same processor, contradicting our assumption.
We give a quick algorithm for the integral solution.
Algorithm 6.5.1 Derive t
$w'' = [w''_0, \cdots, w''_{N-2}]^T = w^l + kj' - r\,t_{N-1}$
FOR i = 0 TO N-2
  $w''_i := w''_i - \sum_{j=0}^{i-1} h_{i,j}t_j$;  $t_i := \lceil w''_i / k \rceil$
End of Algorithm
where $h_{i,j} \in H_k$. It is easy to check the correctness of the algorithm. In fact, for any
$i \in [0, N-2]$, eqn (6.21) is equivalent to
$$w''_i \le kt_i < w''_i + k \qquad (6.22)$$
where $w''_i = w^l_i + kj_i - r_it_{N-1} - \sum_{j=0}^{i-1} h_{i,j}t_j$. Eqn (6.22) holds if $t_i := \lceil w''_i / k \rceil$, due to the fact
that $x \le \lceil x/k \rceil k < x + k$. The parallel algorithm for LSGP Case has a form of
Loops 6.5.1 Parallel Algorithm for (N-1)-D Array of SBC Mesh
w^l := w^l - t^l r
DOALL j_0 := 0 TO l_0 - 1
  ...
  DOALL j_{N-2} := 0 TO l_{N-2} - 1
    w'' := w^l + k j'
    FOR j_{N-1} := t^l TO t^u
      FOR i = 0, N-2
        w'''_i := w''_i - sum_{j=0}^{i-1} h_{i,j} t_j;  t_i := ceil(w'''_i / k)
      t_{N-1} = j_{N-1},  w' := k t - w''',  w'' := w'' - r
      compute supernode s = T_b t
where $w' = w - (w^l + kj')$, which is useful when collecting outgoing data. The boundary
of a supernode accessed by t is given by eqn (6.13), but note that we replace $T^{-1}$ with $T_b$.
The procedure of deriving t from w is a forward recursion, which is possible because
$H_k$ is lower-triangular. This is why we preferred to modify the space-mapping matrix and
time basis to produce a lower-triangular activity matrix.
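The forward recursion can be sketched in a few lines. The following Python fragment is an illustration only, not the generated Meiko C code: the function name `derive_t` is hypothetical, $H_k$, k and the permuted $w^l$ are the values quoted for Example 6.1.1 in Subsection 6.5.4, and r = [1, 0] is an assumption read off the per-step update of w'' in the example loops.

```python
def derive_t(H, r, w_low, k, j, t_last):
    """Forward substitution: find t with H*t[:N-1] + r*t_last inside processor j's block."""
    t = []
    for i in range(len(H)):
        w = w_low[i] + k * j[i] - r[i] * t_last        # start from w^l + k j' - r t_{N-1}
        w -= sum(H[i][c] * t[c] for c in range(i))     # subtract already-fixed terms
        t.append(-((-w) // k))                         # integer ceiling of w / k (k > 0)
    return t + [t_last]

# Values quoted for Example 6.1.1; r is an assumption (see lead-in).
H_k = [[3, 0], [-2, 3]]
r = [1, 0]
w_low = [0, -11]
k = 3

def location(t):
    """Cell of the EVPA at which supernode index t is executed, as in eqn (6.20)."""
    return [sum(H_k[i][c] * t[c] for c in range(2)) + r[i] * t[2] for i in range(2)]

print(derive_t(H_k, r, w_low, k, (1, 2), 10))
```

Because each $t_i$ depends only on $t_0, \cdots, t_{i-1}$, no linear system has to be solved at run time; one ceiling division per dimension suffices.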
6.5.3 Outgoing Data after LSGP
Due to the LSGP partitioning, a more complex situation is introduced into the dependency
relationship; see Figure 6.3(b). For a 2-D processor array, as defined above, w' indicates
the virtual processors in a single actual processor. The dependencies $d^P$ of Section 6.3
become the dependencies between the virtual processors. The dependencies $d^A$ between
processors are composed from the $d^P$'s. Only the outward-oriented $d^P$'s of the virtual
processors around the boundary of the processor contribute to $d^A$. For instance,
$d^A_{0,1} = d^P_{0,1}\cdot(w'_{0,2} + w'_{1,2} + w'_{2,2}) + d^P_{-1,1}\cdot(w'_{1,2} + w'_{2,2})$ and $d^A_{-1,1} = d^P_{-1,1}\cdot w'_{0,2}$, where $d^P\cdot w'$
means the $d^P$ of the w'. However, it can be understood that $D^A = D^P$, where $D^A$ is the
dependency matrix in the processor domain, composed from all the $d^A$'s.
For general cases, we may characterise a virtual processor with a location vector
$l^w(w') = [l^w_0, \cdots, l^w_i, \cdots, l^w_{N-2}]$, where
$$l^w_i = \begin{cases} -1 & \text{if } w' \text{ is on the i-th lower boundary of the processor, i.e., } w'_i = 0 \\ 1 & \text{if } w' \text{ is on the i-th upper boundary, } w'_i = k - 1 \\ 0 & \text{otherwise} \end{cases}$$
Notice that, in fact, the location vector shows the potential directions of outgoing data
from a particular position w'. Furthermore, we define a special "match" operator $\odot$ such
that
$$a_j \odot b_j = \begin{cases} a_j & \text{if } a_j = b_j \\ 0 & \text{otherwise} \end{cases}$$
and $a \otimes b = [a_0 \odot b_0, a_1 \odot b_1, \cdots]^T$, where $a_j = -1, 0, 1$ and $b_j = -1, 0, 1$. Then, for each
w' in a processor, if $d^P \otimes l^w(w') \in D^A$ for any $d^P \in D^P$, we will attribute the $d^P$ of the
w' to
$$d^A = d^P \otimes l^w(w') \qquad (6.23)$$
where the $D^P$ is that for SBC in Section 6.3. Eqn (6.23) means that at a position w', for
each dimension, $d^P$ becomes $d^A$ only if $d^P$ matches the location vector $l^w$ of the w'.
For instance, for Figure 6.3(b), we have $l^w(w'_{1,2}) = [0,1]^T$ and $l^w(w'_{2,2}) = [1,1]^T$, so
$d^P_{0,1} \otimes l^w(w'_{1,2}) = d^P_{-1,1} \otimes l^w(w'_{1,2}) = d^P_{0,1} \otimes l^w(w'_{2,2}) = d^P_{-1,1} \otimes l^w(w'_{2,2}) = [0,1]^T = d^A_{0,1} \in D^A$;
$l^w(w'_{0,2}) = [-1,1]^T$, so $d^P_{0,1} \otimes l^w(w'_{0,2}) = d^A_{0,1}$, too, and $d^P_{-1,1} \otimes l^w(w'_{0,2}) = [-1,1]^T = d^A_{-1,1} \in D^A$.
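The location vector and the "match" operator are simple enough to state directly in code. The sketch below is in Python with hypothetical function names (`location_vector`, `match`); it assumes k > 1 and reproduces the three virtual processors of Figure 6.3(b) discussed above, where k = 3.

```python
def location_vector(w, k):
    """-1 on a lower boundary (w_i = 0), 1 on an upper boundary (w_i = k-1), else 0."""
    return [-1 if wi == 0 else 1 if wi == k - 1 else 0 for wi in w]

def match(a, b):
    """Element-wise match: keep a_j where it equals b_j, otherwise 0."""
    return [aj if aj == bj else 0 for aj, bj in zip(a, b)]

k = 3
d_01 = [0, 1]     # d^P_{0,1}
d_m11 = [-1, 1]   # d^P_{-1,1}

for w in ([1, 2], [2, 2], [0, 2]):
    lw = location_vector(w, k)
    print(w, lw, match(d_01, lw), match(d_m11, lw))
```

Only the virtual processor in the corner w' = (0, 2) keeps the diagonal dependency [-1, 1]; for the other two positions it collapses to [0, 1], in agreement with the example in the text.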
6.5.4 Example
We use Example 6.1.1 to illustrate the procedure above. Since the procedure is
complex, we cannot present all the details; the example therefore gives only an outline of
how our method works.
For Example 6.1.1, the EVPA has a size of 12 x 12, k = 3. By a series of computations,
we have
$$T = \begin{bmatrix} 1 & 0 & -1 \\ -1 & 1 & 0 \\ 5 & 1 & 3 \end{bmatrix}$$
det(B) = 168 and det(T) = 9. Here, $S = S_{SBC}P_{(201)}$ and $g = [7, 3, 8]^T$. Note that E is the
same as that of eqn (6.6). Without showing the vertices W, we just note that $t^l = 0$ and
$t^u = 62$; $w^l = [-11, 0]^T$ and $w^u = [0, 11]^T$. Obviously, the size of the mapped virtual
array is 12 x 12, which can be allocated in the 12 x 12 EVPA. See Figure 6.3(a).
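As regards the N-D time basis, we start from
$$T_b = \begin{bmatrix} 1 & 0 & 0 \\ -5 & -3 & 1 \\ 0 & 1 & 0 \end{bmatrix}$$
When $P_{(10)} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$, we have
$$A_{(10)} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & -1 \\ -1 & 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ -5 & -3 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} -6 & -3 \\ 1 & -1 \end{bmatrix}$$
It can be verified that $H_k = A_{(10)}U_k$ holds if
$$H_k = \begin{bmatrix} 3 & 0 \\ -2 & 3 \end{bmatrix} \quad \text{and} \quad U_k = \begin{bmatrix} -1 & 1 \\ 1 & -2 \end{bmatrix}$$
So $T_b$ is modified to
$$T_b = [Q'U_k, t_{N-1}] = \begin{bmatrix} -1 & 1 & 0 \\ 2 & 1 & 1 \\ 1 & -2 & 0 \end{bmatrix}$$
Since the rows of S are permuted, the corresponding modifications of
$w^l$, $w^u$ and S must be carried out. Then, we have $w^l = [0, -11]^T$, $w^u = [11, 0]^T$
and $S = \begin{bmatrix} -1 & 1 & 0 \\ 1 & 0 & -1 \end{bmatrix}$. Substituting $B^{-1}$ and $T_b$ into eqn (6.13), we can produce the
boundary of a supernode, which is similar to that for Algorithm 6.4.1.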
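The matrix identities of this example can be checked mechanically. The following Python sketch uses only the matrices stated above ($P_{(10)}$, S, the first two columns $Q'$ of $T_b$, $U_k$, and the timing vector t = [5, 1, 3] taken from the last row of T); the helper `matmul` is a hypothetical utility written for this illustration.

```python
def matmul(X, Y):
    """Plain integer matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

P10 = [[0, 1], [1, 0]]
S = [[1, 0, -1], [-1, 1, 0]]
Q_prime = [[1, 0], [-5, -3], [0, 1]]
U_k = [[-1, 1], [1, -2]]
t = [[5, 1, 3]]                              # timing vector (last row of T)

A10 = matmul(P10, matmul(S, Q_prime))        # permuted activity matrix
H_k = matmul(A10, U_k)                       # should be lower-triangular with diagonal k = 3
Q = matmul(Q_prime, U_k)                     # modified space part of the time basis
det_U = U_k[0][0] * U_k[1][1] - U_k[0][1] * U_k[1][0]

print(A10, H_k, Q, matmul(t, Q), det_U)
```

The product t Q vanishes, which is the property $tT_b = [0, \cdots, 0, 1]$ restricted to the first N-1 columns, and $U_k$ has determinant 1, so the modification preserves unimodularity.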
Next, we discuss the outgoing data problem. Note that $D^{q'}$ is the same as eqn (6.19).
For all potential supernode dependencies $d^s_{xxx}$ and for all $d^q_i$, we list their DC as $DC_{xxx,i}$.
(Notice that $DC_{xxx,i}$, whose subscript is the combination of the subscripts of $d^s_{xxx}$ and
$d^q_i$, stands for the outgoing data resulting from the i-th original dependency vector and
flowing along the direction $[x, x, x]^T$ in the supernode space.) For instance, for $d^s_{001}$, whose
area is confined by $C_0C_1C_2$, we have
$DC_{001,0} = [0,6) \times [0,3) \times [8,8)$; $DC_{001,1} = [0,7) \times [0,2) \times [6,8)$;
$DC_{001,2} = [0,7) \times [0,2) \times [8,8)$; $DC_{001,3} = [0,7) \times [0,2) \times [7,8)$
Obviously, $DC_{001,0} = DC_{001,2} = \emptyset$ and $DC_{001,3} \subset DC_{001,1}$. So the DC for $d^s_{001}$ is
$DC_{001} = \bigcup_{i=0}^{3} DC_{001,i} = [0,7) \times [0,2) \times [6,8)$.
Similarly, we have $DC_{010} = [0,7) \times [2,3) \times [0,8)$, $DC_{011} = [0,7) \times [2,3) \times [6,8)$,
$DC_{100} = [6,7) \times [0,3) \times [0,8)$ and $DC_{101} = DC_{110} = DC_{111} = \emptyset$.
The flow directions and time-delays of these DC's should be determined using eqn (6.14).
Marking them to DC's as superscripts, we obtain DC;X:io,3,DCgi~l, DcO'l/,4 and DC~ihll,5.
Finally, the parallel algorithm for LSGP Case of Example 6.1.1 is as follows
DOALL j0 := 0 TO 3
  DOALL j1 := 0 TO 3
    w0'' := 3 j0
    w1'' := -11 + 3 j1
    FOR t2 := 0 TO 62
      w0''' := w0'',  t0 := ceil(w0''' / 3)
      w1''' := w1'' + 2 t0,  t1 := ceil(w1''' / 3)
      w' := k t - w''',  w0'' := w0'' - 1,  q00 := 7 t0 - 7 t1
      FOR q0'' := max(q00, 0) TO min(q00 + 20, 6)
        q01 := q0'' - 13 t0 + 4 t1 - 3 t2
        FOR q1'' := max(q01, 0) TO min(q01 + 20, 2)
          q02 := 3 q0'' + q1'' - 23 t0 + 40 t1 + 3 t2
/I 1/ +10
FOR q2 := max(r~l, 0) TO min(l~J, 7)
            A(q0'', q1'', q2'') := A(q0'' - 1, q1'', q2'') + A(q0'', q1'' - 1, q2'' - 2)
                                   + A(q0'', q1'' - 1, q2'') + A(q0'', q1'', q2'' - 1)
      Data transfer operations
We can partly check the correctness of our method for Example 6.1.1. As
expected, there are, in total, 4851 nodes contained in 327 valid supernodes. It is difficult
to describe the Data transfer operations here, since they involve a series of complex output,
relay and input operations based on DC's. Chapter 7 will develop a mechanism to deal
with the problem.
6.5.5 Algorithm Generation for (N-1)-D SUC Case
With regard to the (N-1)-D array with SUC mesh, we have det(T) = 1.
There is no need for the further LSGP partitioning step, as it can be taken as a
special case of LSGP Case where Tb = [~ ~]. Without derivation we give the parallel
algorithm for Example 6.1.1
DOALL j0 := 0 TO 3
  DOALL j1 := 0 TO 3
    FOR j2 := 0 TO 62
      q00 := -6 j0
      FOR q0'' := max(q00, 0) TO min(q00 + 20, 5)
        q01 := q0'' + 6 j0 - 11 j1
        FOR q1'' := max(q01, 0) TO min(q01 + 20, 10)
          q02 := 3 q0'' + q1'' + 20 j0 + 13 j1 - 2 j2
FOR q~ := max(f~l, 0) TO min(l qO,;IOJ, 1)
            A(q0'', q1'', q2'') := A(q0'' - 1, q1'', q2'') + A(q0'', q1'' - 1, q2'' - 2)
                                   + A(q0'', q1'' - 1, q2'') + A(q0'', q1'', q2'' - 1)
      Data transfer operations
A simple check shows that the parallel algorithm covers all the 4851 nodes. The Data
transfer operations are also dealt with in Chapter 7 as a special and simplified case of
the (N-1)-D SBC Case; the derivation of the DC's is omitted.
6.6 From Inequalities To Boundaries
In the sections above, a derivation from a set of given inequalities to boundaries of nested
loops is essential. We hope that the derivation gives the minimum valid boundaries and
consists of only necessary expressions in the boundaries. It is easy to form a set of
inequalities from nested loops, but, unfortunately, the reverse derivation is not so easy.
For simplicity, collect all the expressions of the set of inequalities into a matrix
$$E = \{A_{m,N}\,j + B_{m,n}\,m + c\} = \{e_0, \cdots, e_{m-1}\} \qquad (6.24)$$
where $e_i = a_{i,0}j_0 + \cdots + a_{i,N-1}j_{N-1} + b_{i,0}m_0 + \cdots + b_{i,n-1}m_{n-1} + c_i$. $E \le 0$ is the set of
inequalities, while E is a collection of expressions.
Definition 6.6.1 $E^j$ is a sub-set of E such that $a_{i,l} = 0\ \forall l \in [j+1, N-1]$ for all
expressions it contains. $L^j$ is a sub-set of $E^j$ such that $a_{i,j} < 0$ for all expressions in it;
$U^j$ is a sub-set of $E^j$ such that $a_{i,j} > 0$ for all expressions in it; and $m = [m_0, \cdots, m_{n-1}]^T$ is
the vector of variables.
The procedure of determining the boundaries from a set of inequalities consists of two
basic operations, finding all possible lower and upper bounds and deleting redundant
bounds. These operations are discussed below.
6.6.1 Finding All Possible Lower and Upper Bounds
For each $j_i$, create all possible expressions for lower and upper boundaries, $L^i$ and $U^i$,
which involves all possible inequality relationships with $j_0, \cdots, j_i$. The process is
implemented by a recursive procedure:
Step 1 Let $E^{N-1} := E$; $i := N-1$;
Step 2 Build $L^i$ and $U^i$ from $E^i$. Create $E^{i-1}$ by collecting any expressions of $E^i$ such that
$a_i = 0$.
Step 3 For any pair of expressions, one from $L^i$ and the other from $U^i$, make a modified
sum, noted as $\oplus$, so as to cancel the $a_i$, producing a new expression. For instance, let
$e_1 = 2j_0 + 2j_1 + 3$ and $e_2 = j_0 - 3j_1 - 1$. Then $e_1 \oplus e_2 = 3(2j_0 + 2j_1 + 3) + 2(j_0 - 3j_1 - 1) = 8j_0 + 7$.
The created expression is then appended to $E^{i-1}$.
Step 4 If $E^{i-1}$ contains a number of expressions whose coefficients are linearly related with
positive factors, keep the one whose constant item is smallest if $a_l > 0$ (or the
one whose constant item is largest if $a_l < 0$) and remove the others, because these
expressions are associated with the upper bound of $j_l$, where $l : (a_l \ne 0) \cap (a_k = 0,\ \forall k \in [l+1, i])$.
Step 5 $i := i - 1$, goto Step 2.
The procedure is essentially the Fourier-Motzkin elimination method [90]
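The modified sum of Step 3 can be sketched as below. This is an illustrative Python fragment, not the thesis code: expressions are represented as coefficient lists [a_0, ..., a_{N-1}, c], the function name `modified_sum` is hypothetical, and dividing the multipliers by their greatest common divisor is a choice made here for tidiness (the text does not prescribe it).

```python
from math import gcd

def modified_sum(lower, upper, i):
    """Combine a lower-bound expression (a_i < 0) and an upper-bound one (a_i > 0)
    with positive multipliers so that the coefficient of j_i cancels."""
    a_low, a_up = -lower[i], upper[i]
    if a_low <= 0 or a_up <= 0:
        raise ValueError("need a_i < 0 in lower and a_i > 0 in upper")
    g = gcd(a_low, a_up)
    m_low, m_up = a_up // g, a_low // g      # smallest positive multipliers
    return [m_low * x + m_up * y for x, y in zip(lower, upper)]

# Example from Step 3: e1 = 2j0 + 2j1 + 3 (a_1 > 0), e2 = j0 - 3j1 - 1 (a_1 < 0).
e1 = [2, 2, 3]
e2 = [1, -3, -1]
print(modified_sum(e2, e1, 1))               # coefficients of 8 j0 + 7
```

Since both multipliers are positive, the combined expression is non-positive whenever both inputs are, which is what makes the elimination valid for inequalities of the form e <= 0.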
For our example, $E^2$ is the collection of expressions of eqn (6.17), $E^1 = \{-j_0,\ 3j_0 - 10\}$, and
$L^2 = \{\, l^2_0 : 7j_0 - j_2;\ \ l^2_1 : -17j_0 + 2j_1 - j_2 - 25;\ \ l^2_2 : 2j_0 + j_1 - 2j_2 - 5;\ \ l^2_3 : j_0 + 2j_1 - j_2 - 70 \,\}$
$U^2 = \{\, u^2_0 : -7j_0 + j_2 - 25;\ \ u^2_1 : 17j_0 - 2j_1 + j_2 - 1;\ \ u^2_2 : -2j_0 - j_1 + 2j_2 - 30;\ \ u^2_3 : -j_0 + j_2 - 40 \,\}$
We find that $l^2_0 \oplus u^2_0$, $l^2_1 \oplus u^2_1$ and $l^2_2 \oplus u^2_2$ leave only constants, and can be ignored. However,
the following expressions will be appended to $E^1$:
$l^2_0 \oplus u^2_1 : 24j_0 - 2j_1 - 1$
$l^2_0 \oplus u^2_2 : 12j_0 - j_1 - 30$
$l^2_0 \oplus u^2_3 : 6j_0 - 40$
$l^2_1 \oplus u^2_0 : -24j_0 + 2j_1 - 50$
$l^2_1 \oplus u^2_2 : -36j_0 + 3j_1 - 80$
$l^2_1 \oplus u^2_3 : -18j_0 + 2j_1 - 65$
$l^2_2 \oplus u^2_0 : -12j_0 + j_1 - 55$
$l^2_2 \oplus u^2_1 : 36j_0 - 3j_1 - 7$
$l^2_2 \oplus u^2_3 : j_1 - 85$
$l^2_3 \oplus u^2_0 : -6j_0 + 2j_1 - 95$
$l^2_3 \oplus u^2_1 : 18j_0 - 71$
$l^2_3 \oplus u^2_2 : 3j_1 - 170$
$l^2_3 \oplus u^2_3 : 2j_1 - 110$
It can be seen that $24j_0 - 2j_1 - 1$, $12j_0 - j_1 - 30$ and $36j_0 - 3j_1 - 7$ are linearly related.
Since $a_1 < 0$, keep $24j_0 - 2j_1 - 1$, which has the largest constant, and remove the others.
Similarly, keep $-24j_0 + 2j_1 - 50$, while removing $-36j_0 + 3j_1 - 80$ and $-12j_0 + j_1 - 55$;
keep $2j_1 - 110$, remove $3j_1 - 170$ and $j_1 - 85$; retain $18j_0 - 71$, remove $6j_0 - 40$; retain
$3j_0 - 10$, remove $18j_0 - 71$. Then, we have
$E^1 = \{\, -j_0;\ \ 3j_0 - 10;\ \ -18j_0 + 2j_1 - 65;\ \ 24j_0 - 2j_1 - 1;\ \ 2j_1 - 110;\ \ -6j_0 + 2j_1 - 95;\ \ -24j_0 + 2j_1 - 50 \,\}$
and
$L^1 = \{\, 24j_0 - 2j_1 - 1 \,\}$; $U^1 = \{\, 2j_1 - 110;\ \ -6j_0 + 2j_1 - 95;\ \ -24j_0 + 2j_1 - 50;\ \ -18j_0 + 2j_1 - 65 \,\}$
Similarly, $L^0 = \{-j_0\}$, $U^0 = \{3j_0 - 10\}$.
6.6.2 Deleting Redundant Bounds
Although explicitly redundant bounds have been removed, there may be still some implicit
redundancy.
For instance, there are two expressions $e_1 : j_0 + j_1 + 3$ and $e_2 : -j_0 + j_1 + 5$. Make the
modified difference, noted as $\ominus$, such that $a_1$ is cancelled, that is, $d = e_1 \ominus e_2 = 2j_0 - 2$.
Let $d^{max}$ and $d^{min}$ indicate the possible maximum value and the possible minimum value
of d, respectively. Because the two expressions are associated with the upper bound of $j_1$,
we can retain $e_1$ and remove $e_2$ if $d^{min} \ge 0$, or retain $e_2$ and remove $e_1$ if $d^{max} \le 0$. Since
$a_0 > 0$ in d, replacing $j_0$ with its lower bound and upper bound produces $d^{min}$ and $d^{max}$,
respectively. Consider the following four cases: $l_0 = 0$; $l_0 = 2$; $u_0 = 0$ and $u_0 = 2$. For
the first case, we have $d^{min} \ge -2$, so $e_2$ cannot be removed; for the second case, because
$d^{min} \ge 0$, we can remove $e_2$; for the third case, because $d^{max} \le -2$, $e_1$ can be removed;
for the last case, neither can be removed, because $d^{max} \le 2$.
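The four cases above, for two upper-bound expressions, can be reproduced with a small sketch. The Python function below is hypothetical and covers only the situation of this paragraph: d = a j0 + c with a > 0, where either a lower bound or an upper bound of j0 is known.

```python
def removable_upper_bound(a, c, lower=None, upper=None):
    """For d = e1 - e2 = a*j0 + c with a > 0, report which of the two
    upper-bound expressions can be removed ("e1", "e2") or None."""
    if a <= 0:
        raise ValueError("this sketch assumes a > 0")
    if lower is not None and a * lower + c >= 0:   # d_min >= 0: e1 always dominates
        return "e2"
    if upper is not None and a * upper + c <= 0:   # d_max <= 0: e2 always dominates
        return "e1"
    return None

# d = 2*j0 - 2, as in the text.
cases = [removable_upper_bound(2, -2, lower=0),
         removable_upper_bound(2, -2, lower=2),
         removable_upper_bound(2, -2, upper=0),
         removable_upper_bound(2, -2, upper=2)]
print(cases)
```

In the general procedure the bounds of j0 are themselves sets of expressions, so d_min and d_max become sets rather than single numbers.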
If the two expressions are $e_1 : j_0 - j_1 + 3$ and $e_2 : -j_0 - j_1 + 5$, d remains unchanged, as
do $d^{max}$ and $d^{min}$. However, since the two expressions are associated with the lower
bound of $j_1$, we can retain $e_2$ and remove $e_1$ if $d^{min} \ge 0$, or retain $e_1$ and remove $e_2$ if
$d^{max} \le 0$.
It is not an easy job to find the possible minimum value and the possible maximum
value for an expression in general cases. As stated above, suppose the lower
and upper bounds of $j_k$ are noted as collections of expressions $L^k = \{l^k_0, \cdots, l^k_{n_k}\}$ and
$U^k = \{u^k_0, \cdots, u^k_{n_k}\}$, respectively. The $l^k$'s are functions of $j_0, \cdots, j_{k-1}$, as are the
$u^k$'s. The possible minimum values of a given expression $d^i = a_0j_0 + \cdots + a_ij_i$ form a set
of numbers, noted as $D^{min} = \{d^{min}_0, \cdots\}$, instead of a unique number. If $L^k$ and
$U^k$ are known $\forall k \in [0, i]$, the $D^{min}$ of d can be derived recursively:
Step 2 Take an expression $d^{min} \in D^{min}$. For the $d^{min}$, if $a_i > 0$, substitute $j_i$ with its lower
bounds in $L^i$, that is, make $d^{min} \oplus l^i,\ \forall l^i \in L^i$ (this is possible because $a_i < 0$ for all
$l^i$'s in $L^i$). Replace the $d^{min}$ with the newly created $n_i$ expressions not involving $j_i$.
If $a_i < 0$, do the same but with $U^i$.
Do this for all $d^{min}$'s of $D^{min}$.
Step 3 $i := i - 1$, goto Step 2.
We can find the possible maximum value set, $D^{max}$, of $d^i$ in a similar way except that
we work the $d^{max}$ with $U^i$ if $a_i > 0$, or with $L^i$ if $a_i < 0$. We say $d^i \ge 0$ if $d^{min} \ge 0$ for all
$d^{min}$'s in $D^{min}$; similarly, we say $d^i \le 0$ if $d^{max} \le 0$ for all $d^{max}$'s in $D^{max}$.
Removing the implicitly redundant bounds is also a recursive procedure, applied from
$L^1$ and $U^1$ to $L^{N-1}$ and $U^{N-1}$. Note that $L^0$ and $U^0$ are omitted since all redundant bounds
are explicit in this case. The algorithm is as follows:
Step 1 $i := 1$
Step 2 Take two expressions $l_{k_1}$ and $l_{k_2} \in L^i$, $k_1 \ne k_2$. Make $d^{i-1} = l_{k_1} \ominus l_{k_2}$ such that $j_i$'s
term is cancelled. Create $D^{min}$ and $D^{max}$ from $d^{i-1}$. Delete $l_{k_2}$ if $d^{i-1} \ge 0$, or delete
$l_{k_1}$ if $d^{i-1} \le 0$.
Repeat this for all pairs of expressions of $L^i$.
Step 3 Do the same for $U^i$ as Step 2, but delete $u_{k_1}$ if $d^{i-1} \ge 0$, or delete $u_{k_2}$ if $d^{i-1} \le 0$.
Step 4 $i := i + 1$, goto Step 2.
The operation of deleting implicitly redundant bounds is necessary for general cases.
For Example 6.1.1, no implicitly redundant bounds are found in $L^1$, $U^1$, $L^2$ and $U^2$,
so they are the lower and upper boundaries of eqn (6.18).
It is interesting to see the operation of the algorithm for a simple example.
Example 6.6.1 A Simple Example
Given the nested loops
FOR i0 = 1 to m0
  FOR i1 = 2 i0 to m1
    FOR i2 = 2 i0 + i1 - 1 to min(2 i1, m2)
Let us transform the nested loops to a set of inequalities, then transform the set of
inequalities back to nested loops by means of the algorithm. The resulting set of nested
loops is:
FOR i0 = 1 to floor(min((m2 + 1)/4, m0))
  FOR i1 = 2 i0 to min(m2 - 2 i0 + 1, m1)
    FOR i2 = 2 i0 + i1 - 1 to min(2 i1, m2)
Figure 6.4: Chart of Program Transformation
More bounds are imposed on the original nested loop. It is easy to judge their correctness.
Since $i_2 \ge 2i_0 + i_1 - 1$, then $i_1 > m_2 - 2i_0 + 1 \Rightarrow i_2 > m_2$. This means that there
is no room for the $i_2$ loop since $i_2$'s lower bound is beyond its upper bound. Furthermore,
if $i_0 > \frac{m_2+1}{4}$, substituting it into the lower and upper bounds of $i_1$ yields $i_1 > \frac{m_2+1}{2}$ and, at the same
time, $i_1 < \frac{m_2+1}{2}$, so there is no room for the $i_1$ loop. Therefore, the algorithm is useful in
that it reduces the invalid part of the polyhedron generated by the original nested loops.
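The equivalence of the two loop nests of Example 6.6.1 can be checked by brute force for small parameter values. The sketch below is in Python with hypothetical function names; it assumes the tightened bounds are $i_0 \le \lfloor\min((m_2+1)/4,\ m_0)\rfloor$ and $i_1 \le \min(m_2 - 2i_0 + 1,\ m_1)$, which is how the result is read here.

```python
def original(m0, m1, m2):
    """Iteration points of the original nested loops."""
    return [(i0, i1, i2)
            for i0 in range(1, m0 + 1)
            for i1 in range(2 * i0, m1 + 1)
            for i2 in range(2 * i0 + i1 - 1, min(2 * i1, m2) + 1)]

def tightened(m0, m1, m2):
    """Iteration points of the loops with the additional bounds."""
    return [(i0, i1, i2)
            for i0 in range(1, min((m2 + 1) // 4, m0) + 1)
            for i1 in range(2 * i0, min(m2 - 2 * i0 + 1, m1) + 1)
            for i2 in range(2 * i0 + i1 - 1, min(2 * i1, m2) + 1)]

same = all(original(a, b, c) == tightened(a, b, c)
           for a in range(9) for b in range(9) for c in range(9))
print(same)
```

Both versions visit the same points in the same order; the tightened version merely avoids entering outer iterations whose inner loops would be empty.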
6.7 Summary
We propose a method to generate parallel algorithms from sequential algorithms. Using
our partitioning and mapping approach, algorithms are developed to determine the loop
boundaries of the parallel algorithm. The problem of out-going data is discussed in detail,
and the operational structure in a single processor is constructed. See figure 6.4.
To execute on a lower-dimensional array, special attention is paid to the synchronisation
of every processor so that the parallel algorithm can run in a processor array with
systolic features. Fortunately, we find a method for the synchronisation which requires a
very simple mechanism and avoids complex control. For the methods involving LSGP
partitioning, a simple algorithm is derived to determine which supernode should be executed.
The correctness of the methods is checked.
The method presented here gives a realistic approach to automatically implementing a
wide range of URE problems for both hardware systolic array design and software
implementation of regular arrays. In order to achieve wide practical application, our methods are
able to cope with processing arrays with various physical limitations on dimension,
shape and communications. We also pay attention to achieving high efficiency by means
of the supernode structure, which results in packaged data communication.
Chapter 7
Parallel Code Generation
This chapter discusses methods of transforming sequential URE algorithms to parallel
codes. As a result of the theoretical work, proposed in the previous chapters, actual
parallel codes in Meiko C for a transputer array are generated automatically. To achieve
this, some special practical problems for the generation of parallel codes are resolved, such
as memory organization and data communications.
7.1 Introduction
In the last chapter, we solved some important problems with respect to code generation,
such as the boundary of the transformed algorithm and the structure of parallel
algorithms. However, the work is not yet finished. We must actually generate parallel
code automatically. For this purpose, further practical problems have to be taken into
consideration, such as creating a working model for parallel programs which can result
in relatively easy generation of actual parallel code. Finally, since we restrict ourselves to
using a distributed-memory machine, we have to solve the problem of data storage and
data communications at the machine level.
In this chapter, we first build a processor array in Section 7.2. Sections 7.3 and
7.4 address two practical problems of the parallel codes: data storage and data
communications, respectively. Section 7.5 describes the outline of the parallel codes. Finally,
some minor algorithms for building data communications can be found in Appendix E.
7.2 Processor Array
Although in previous chapters we have had much theoretical preparation, to generate an
actual parallel program for a specific computing facility using a specific language is a very
complex and challenging problem.
7.2.1 Multi-Processor Machines
With regard to memory, there are two kinds of parallel computing machines: shared-memory
and distributed-memory machines. In a distributed-memory machine,
each processor has its own local memory, forming a node of an array, while in a
shared-memory machine, all processors share a global memory. Distributed-memory machines
impose stricter conditions than shared-memory machines with regard to the overhead of
operations and the difficulty of programming. Our methods were originally aimed at
the distributed case, but can also be applied to shared-memory models.
We consider regular-structured distributed-memory processor arrays in which each
processor performs identical operations at each step. These kinds of arrays behave like a
software-controlled systolic array. The attractive point of this arrangement is the identity
of operations on the processors and the regularity of the whole array. The "regular-
structured" array means a hypercube-shaped array, with unidirectional or bi-directional
mesh-links (i.e., SUC and SBC).
7.2.2 Building an Array and Creating Communication Meshes
To build a processor array, we need to describe briefly the features of our available parallel
computing resource.
The distributed-memory machine used in this study is a transputer array consisting
of 16 T800 transputers, each with 4Mb memory. A software environment, CSTools, is
also provided to create a computing array [64]. In CSTools, the CSBuild facilities allow a
user-written sequential C program, running on the host computer, to describe the array
and the physical link mesh explicitly and to control the loading of the parallel codes.
The available parallel language for the transputer array is Meiko C [63]. In Meiko C,
inter-process communication channels have to be created in the application. To specify
a processor array, the number of dimensions, M, and the size, $l_0, \cdots, l_{M-1}$, of the array
in each dimension are given as the design parameters. In the CSBuild program, the
processor array is first defined as a regular array (i.e., $A^M$ in Definition 1.2.1). Then, we
specify the object codes of the program fragments to be loaded into each processor of the
array (in fact, all the object codes are identical). Thirdly, the coordinate, a, of the
processor in the array must be passed to each process as a badge for building the
inter-process communication mesh, as well as a parameter for confining the computation domain
assigned to the processor.
In the CSBuild program, all the necessary physical links among the processors are
built. The physical links are created according to the given interconnection primitives
$P = [\cdots, p_i, \cdots]$. For each processor, we have to determine all allowable links, that is:
Algorithm 7.2.1 Build Physical Links
For all a in A^M
  For all p_l in P
    a' = [..., a'_i, ...]^T := a + p_l
    IF a' in A^M (i.e., for all i, 0 <= a'_i < l_i)
      create a link from a to a'
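Algorithm 7.2.1 can be sketched as a plain enumeration. The Python fragment below is only an illustration of the link-building rule and does not use CSTools or Meiko C; the function name `build_links` and the choice of primitives for a 4 x 4 array are assumptions made for the example.

```python
from itertools import product

def build_links(sizes, primitives):
    """Return all (a, a') pairs with a' = a + p inside the array of the given sizes."""
    links = []
    for a in product(*(range(l) for l in sizes)):
        for p in primitives:
            target = tuple(x + y for x, y in zip(a, p))
            if all(0 <= t < l for t, l in zip(target, sizes)):
                links.append((a, target))
    return links

# Hypothetical 4 x 4 array with one unidirectional primitive per dimension.
links = build_links((4, 4), [(1, 0), (0, 1)])
print(len(links))
```

For a bidirectional mesh the opposite primitives, here (-1, 0) and (0, -1), would simply be added to the list, doubling the number of links.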
7.3 Supernode Storage
Once we have specified a connection mesh, we have to reserve local memory for each
supernode allocated to a processor. For every processor, we must allocate a memory
area which is as small as possible and establish a supernode storage structure which can
keep the uniform data dependencies available. Problems with storage structures arise
because we must keep the uniform distance of data dependencies available even where
the dependencies go beyond the boundaries of supernodes. There are two ways to store a
supernode.
One method is to store supernodes independently; see Figure 7.1. That is, we allocate
memory pieces of size $\prod_{i=0}^{N-1}(g_i + c_i)$ for each of the supernodes in a processor (see footnote 1), where $c_i$ (discussed
later) is the extra size required to store the incoming data used by the supernode. If there
are $t^u - t^l + 1$ time steps, we will have $t^u - t^l + 1$ such memory pieces.
Figure 7.1: Memory layout of separate supernodes. Note that $g_0 = mb$, $g_1 = dv$, $c_0 = ba$
and $c_1 = ad$.
The second method is to store all the supernodes of one processor together in such a
way that they keep their neighbouring relations unchanged, so the internal data flows can
be avoided, see Figure 7.2. This can be thought as storing the local supernode domain
together "as a whole".
The first method suffers from a serious setback: there must be internal data flows, e.g.,
the area eonke of S.0.1 to the area pqrtp of S'O'I+1 in Figure 7.1. In fact. If d' =f 0 but
dP = Sd' = 0, there will be a dependency between two supernodes in the same processor.
For the Lower-D Case, the number of internal data flows is greater than inter-processor
data flow. The internal data flows are also costly and have to be avoided. The second
approach complicates the problem of storage. We must know how large the memory
space should be for the local supernode domain. Secondly, each of the supernodes must
be located to a suitable position of the memory space, which may not be obvious.
We cannot say the first method is completely useless, because the second may require
much larger memory space than the first. Here, we focus on the second since it is
more time-efficient. In some circumstances, the first approach may prove to be better,
especially when memory is at a premium.
¹ Note that different types of variables, e.g., double precision, float and integer, take different sizes,
s, of memory, so the actual memory space should be enlarged by a factor of s. For brevity we assume s = 1.
7.3.1 From Supernode Domain to Local Supernode Domains
Obviously, it is unnecessary for every processor to allocate a memory space large enough
to contain the whole supernode domain. In order to allocate sufficient and necessary
memory space, we have to go back to the supernode space to determine a partitioning,
called the local supernode domain which is assigned to a processor (this procedure is
quite similar to that of Subsection 5.3.2). Then the local supernode domain which may
be irregular has to be expanded to form a hypercube because a hypercube-shaped memory
space can produce a uniform distance of data dependencies.
Eqn (6.11) defines the hyperplanes of the supernode polyhedron and Ve are its vertices.
They provide all the information describing the supernode polyhedron. Recalling that
$j' = [j_0, \ldots, j_{M-1}]^T = S s$ assigns supernode s to the virtual processor j'², we do the
reverse: for a particular j', determine all the s's mapped onto it. Applying the algorithm
of Subsection 5.3.2 to the polyhedron of eqn (6.11), we obtain all the vertices of the local
supernode polyhedron assigned to the virtual processor j'.
With the vertices of the local supernode polyhedron, it is easy to find its length, $n_i$,
for each dimension. These lengths give the sizes of the smallest hypercube which can
contain the local supernode polyhedron. It is easy to find the first supernode (e.g., the
left-bottom-corner point in 2-D), denoted $s_0$, of the supernode hypercube, which is used
later as a reference to locate any supernode in the hypercube.
We have to repeat this procedure for every processor. It is wasteful to keep a separate set of
$n_i$'s, $i = 0, \ldots, N-1$, for each of the processors, because we must have one uniform set of $n_i$'s
to generate uniform parallel codes for every processor. Therefore, for the i-th dimension,
we choose the largest $n_i$ over all processors to be the length $n_i$. Clearly, for each processor
we must allocate a memory space of size $\prod_{i=0}^{N-1} (n_i g_i)$, which is necessary and sufficient for the
required data storage.
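The sizing rule above can be illustrated with a small sketch. It is not the thesis's
implementation: the function names and the sample vertex lists are hypothetical, and it
assumes that each local supernode polyhedron is given by integer vertices and that a
length counts supernodes inclusively.

```python
from math import prod

def box_lengths(vertices):
    """Side lengths n_i of the smallest axis-aligned box containing the vertices."""
    dims = range(len(vertices[0]))
    return [max(v[i] for v in vertices) - min(v[i] for v in vertices) + 1
            for i in dims]

def uniform_lengths(per_processor_vertices):
    """For each dimension, take the largest n_i over all processors."""
    all_lengths = [box_lengths(vs) for vs in per_processor_vertices]
    return [max(column) for column in zip(*all_lengths)]

def memory_size(n, g):
    """Memory needed per processor: the product of n_i * g_i over all dimensions."""
    return prod(ni * gi for ni, gi in zip(n, g))

# Hypothetical data: two processors in a 2-D supernode space.
local_domains = [
    [(0, 0), (0, 1)],   # this processor holds 1 x 2 supernodes
    [(1, 0), (1, 2)],   # this processor holds 1 x 3 supernodes
]
n = uniform_lengths(local_domains)
size = memory_size(n, g=[4, 5])
```

With these assumed inputs every processor would reserve room for 1 x 3 supernodes, even
the one that only holds two.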
We can gain an idea of the shape of the local supernode polyhedron as follows. For
the case of an (N-1)-D array with SUC mesh, the local supernode polyhedron is a straight
line segment lying along one dimension, so the local supernode hypercube containing the
line segment is quite compact. For the case of SHC mesh, we also use a line segment, but
stretched in N dimensions. In this case, we have to construct a large loose hypercube to
contain the line segment. The useful points in the hypercube must be very sparse. When
the hypercube is too large to be allocated in available memory, we have to use the first
method of storing supernodes at the cost of time efficiency.
² Note that without LSGP partitioning, a virtual processor is also the real processor, i.e., a = j'. As
is known, with LSGP partitioning, c virtual processors will be compressed onto one real processor a.
Figure 7.2: Local-supernodes memory space layout. Note that $g_0 = mb$, $g_1 = dv$, $c_0^c = ba$,
$c_1^c = ad$, $m_0 = pu$ and $m_1 = mp$.
7.3.2 Supernode Storage and in/out Data Storage
Here, we focus on the second method of supernode storage. The following problems
must be addressed. Firstly, to maintain the uniform distance of data dependencies, some
margin area must be attached to the lower sides of the local supernode hypercube to store
the data transferred from other processors. So the total allocation of memory space will
be larger than $\prod_i (n_i g_i)$. We must know exactly how large it should be. Secondly, we
must determine the locations of each supernode and each of the input/output DC's in
the modified local supernode hypercube (see Chapter 6). Furthermore, since the modified
local supernode hypercube is actually stored as a line in the memory space, we should
derive a formula to figure out the addresses of the supernodes and input DC's.
2-D example
Let us consider a 2-D example in the Q' space defined in Subsection 6.2.3. See Figure 7.2.
In the layout, only $S_{s_0,s_1}$ and $S_{s_0,s_1+1}$ are the supernodes assigned to the same processor,
while $S_{s_0-1,s_1-1}$, $S_{s_0-1,s_1}$, $S_{s_0-1,s_1+1}$ and $S_{s_0,s_1-1}$ are assigned to other processors but
contribute data dependencies to the internal supernodes. Henceforth, we define generally
that a symbol "vxyzv" stands for a cubic region confined by points v, x, y and z. For
instance, abcda indicates the square at the left-bottom corner in Figure 7.2.
First, consider $S_{s_0,s_1}$, i.e., cijkc. Three DC's are received as input data: $DC_{01}$ of size
$(g_0 - c_0^c) \times c_1^c$ from $S_{s_0,s_1-1}$, $DC_{10}$ of size $c_0^c \times (g_1 - c_1^c)$ from $S_{s_0-1,s_1}$ and $DC_{11}$ of size
$c_0^c \times c_1^c$ from $S_{s_0-1,s_1-1}$. Consider $DC_{01}$. We must allocate a memory space bmicb of $g_0 \times c_1^c$
attached to the left of $S_{s_0,s_1}$ as a shadow of $S_{s_0,s_1-1}$. The same arguments hold for $DC_{10}$ and
$DC_{11}$, thus the memory space dckvd is attached underneath $S_{s_0,s_1}$ and the memory space
abcda is attached to the bottom-left corner. The memory space for the 2 supernodes
should be extended from $g_0 \times 2g_1$ to $m_0 \times m_1 = (g_0 + c_0^c) \times (2g_1 + c_1^c)$, where $m_j$ is the size
of the local-supernode memory space in the j-th dimension.
The first point of a square is a key reference point by which the whole square can be
located with its known sizes. It is easy to see that the first point of $S_{s_0,s_1}$ is point
$c = [c_0^c, c_1^c]$ and the first point of $S_{s_0,s_1+1}$ is point $k = [c_0^c, c_1^c + g_1]$.
After the processor receives all input data into an input buffer, the incoming DC's
will be transferred from the input buffer to cubic areas in the supernode memory space. It is
more convenient to express the first points of these cubic areas using the first point of the
relevant supernode as a base address and adding the correct offset. For example, $DC_{01}$
is moved to bghcb, whose first point is point b = point c - $[0, c_1^c]$; $DC_{10}$, to deefd, point d =
point c - $[c_0^c, 0]$; and $DC_{11}$ to abcda, point a = point c - $[c_0^c, c_1^c]$, etc.
For the output data, consider supernode $S_{s_0,s_1+1}$. The first point of the area njotn
which contributes to $d_{10}$ is point n = point k + $[g_0 - c_0^c, 0]$. The outgoing data within njotn
will be transferred to an output buffer. Similarly, point t = point k + $[g_0 - c_0^c, g_1 - c_1^c]$ is
the first point of topqt which contributes to $d_{11}$; and point s = point k + $[0, g_1 - c_1^c]$ is the
first point of stqrs which contributes to $d_{01}$.
The supernode memory square ampua will actually be stored as a line in memory.
Any smaller cubes in the large cube, such as a supernode or a DC, cannot be stored
contiguously, so we call them segment-distributed cubes. In the parallel codes to be
generated later in Section 7.5, there are two specially designed functions,
CopyFromBufferToSupernode and CopyFromSupernodeToBuffer, which can access all nodes
of a segment-distributed cube described by the address of its first point and the size of
the large hypercube ampua stored as a linear pattern.
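A minimal sketch of how such a segment-distributed cube might be traversed is given below.
It is only an illustration: the function names are hypothetical stand-ins for the two
macro-defined functions, and it assumes the large cube is stored in row-major order with
the last dimension contiguous.

```python
def cube_addresses(first, size, mem_shape):
    """Yield flat addresses of a sub-cube stored inside a larger row-major cube.

    first     -- coordinates of the sub-cube's first point
    size      -- extent of the sub-cube in each dimension
    mem_shape -- extent of the enclosing memory cube in each dimension
    """
    # Row-major strides: the last dimension is contiguous.
    strides = [1] * len(mem_shape)
    for j in range(len(mem_shape) - 2, -1, -1):
        strides[j] = strides[j + 1] * mem_shape[j + 1]

    def walk(dim, base):
        if dim == len(size):
            yield base
            return
        for k in range(size[dim]):
            yield from walk(dim + 1, base + (first[dim] + k) * strides[dim])

    yield from walk(0, 0)

def copy_from_cube_to_buffer(memory, first, size, mem_shape):
    """Gather the scattered segments of a sub-cube into one contiguous list."""
    return [memory[addr] for addr in cube_addresses(first, size, mem_shape)]

def copy_from_buffer_to_cube(buffer, memory, first, size, mem_shape):
    """Scatter a contiguous buffer back into the sub-cube's segments."""
    for value, addr in zip(buffer, cube_addresses(first, size, mem_shape)):
        memory[addr] = value

# A 3 x 4 memory square holding the values 0..11; the 2 x 2 sub-square
# starting at point [1, 1] occupies two separate segments of the line.
memory = list(range(12))
segment = copy_from_cube_to_buffer(memory, (1, 1), (2, 2), (3, 4))
copy_from_buffer_to_cube([0, 0, 0, 0], memory, (1, 1), (2, 2), (3, 4))
```

The sub-cube is fully described by its first point and its size, together with the shape
of the enclosing cube, which matches the description in the text.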
General formula and operations
The above arguments can be extended to the general case. Suppose that each DC is
denoted $DC_{d^s_i}$, $i \in [0, N_{DC}-1]$, and has a size of $[h'_{0,i}, h_{0,i}] \times \cdots \times [h'_{N-1,i}, h_{N-1,i}]$, with
supernode dependency $d^s_i = [x_{0,i}, \ldots, x_{N-1,i}]^T$ and $x_{j,i} \in \{0, 1\}$. When compiling, we must
do the following:
Step 1 $\forall j \in [0, N-1]$, let $c_j^c = \max_{i=0}^{N_{DC}-1} x_{j,i}(g_j - h'_{j,i})$ and $m_j = n_j g_j + c_j^c$.
Step 2 Let $f_{N-1} = 1$. Let $f_j = m_{j+1} f_{j+1}$, $j = N-2, \ldots, 0$. Let $f = [f_0, \ldots, f_{N-1}]^T$.
Step 3 $\forall i \in [0, N_{DC}-1]$, let
$$r^I_i = -\sum_{j=0}^{N-1} f_j\, x_{j,i}\, (g_j - h'_{j,i}) \qquad (7.1)$$
$r^I_i$ is the relative distance difference from a supernode to the segment-distributed
cubic area storing the input DC.
Step 4 $\forall i \in [0, N_{DC}-1]$, let
$$r^O_i = \sum_{j=0}^{N-1} f_j\, x_{j,i}\, h'_{j,i} \qquad (7.2)$$
$r^O_i$ is the relative distance difference from a supernode to the segment-distributed
cubic area collecting the output DC.
Then, in the parallel codes to be generated, the supernodes are located as follows:
Step 1 Allocate local-supernodes memory space
$$m = \prod_{j=0}^{N-1} m_j \qquad (7.3)$$
Step 2 At time t, locate supernode s to address
(7.4)
where the $\otimes$ operator is defined as $[a_0, \ldots, a_n]^T \otimes [b_0, \ldots, b_n]^T = [a_0 b_0, \ldots, a_n b_n]^T$,
and $c^c = [c^c_0, \ldots, c^c_{N-1}]^T$.
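The compile-time quantities can be checked numerically against the 2-D example. The sketch
below is an illustration under stated assumptions, not the thesis's code. The names are
hypothetical. The input offset is taken as $-\sum_j f_j x_{j,i}(g_j - h'_{j,i})$ and the supernode
address as $f^T((s - s_0) \otimes g + c^c)$; both forms are assumptions, chosen because they
reproduce the points a, b, c, d, k and n of the 2-D example.

```python
from math import prod

def layout(n, g, x, h_lo):
    """Margins, extents, strides and DC offsets for one processor.

    n[j]       -- uniform number of supernodes along dimension j
    g[j]       -- supernode size along dimension j
    x[i][j]    -- 0/1 entry j of the supernode dependency of DC i
    h_lo[i][j] -- lower bound h'_{j,i} of DC i along dimension j
    """
    dims, ndc = len(g), len(x)
    # Margin c_j and padded extent m_j of the memory cube.
    c = [max(x[i][j] * (g[j] - h_lo[i][j]) for i in range(ndc)) for j in range(dims)]
    m = [n[j] * g[j] + c[j] for j in range(dims)]
    # Row-major strides f_j.
    f = [1] * dims
    for j in range(dims - 2, -1, -1):
        f[j] = m[j + 1] * f[j + 1]
    # Assumed input offsets: they point back into the margin area.
    r_in = [-sum(f[j] * x[i][j] * (g[j] - h_lo[i][j]) for j in range(dims))
            for i in range(ndc)]
    # Output offsets: they point at the DC inside the supernode.
    r_out = [sum(f[j] * x[i][j] * h_lo[i][j] for j in range(dims))
             for i in range(ndc)]
    return c, m, f, r_in, r_out, prod(m)

def supernode_address(s, s0, g, c, f):
    """Assumed address rule: f . ((s - s0) (x) g + c), with (x) element-wise."""
    return sum(fj * ((sj - s0j) * gj + cj)
               for fj, sj, s0j, gj, cj in zip(f, s, s0, g, c))

# Hypothetical sizes for the 2-D example: g = [4, 5], margins of 1 and 2,
# two supernodes stacked along dimension 1. DC order: DC01, DC10, DC11.
g = [4, 5]
x = [[0, 1], [1, 0], [1, 1]]
h_lo = [[0, 3], [3, 0], [3, 3]]
c, m, f, r_in, r_out, total = layout([1, 2], g, x, h_lo)
point_c = supernode_address([0, 0], [0, 0], g, c, f)
point_k = supernode_address([0, 1], [0, 0], g, c, f)
```

With these numbers point c lands at coordinates [1, 2] and point k at [1, 7] of a 5 x 12
memory square, and point c plus the three input offsets gives the addresses of points b, d
and a.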
7.4 Data Flow and Relay
In this section, we consider the problem of data communication. In order to implement
data communications, we must build a data flow and relay mechanism. We establish
two data in/out buffers, IB and OB, because in this way more data can be collected
to form a large data block so as to be transferred efficiently (however, as will be seen in
Subsection 7.4.2, in some situations, to improve efficiency we will try to build direct data
flows avoiding the buffers). Relay Buffers (RB) are also established to buffer the relay
data. This mechanism will carry out the following tasks:
1. Collect data to IB and then determine their orientations and destination OB's.
2. If some data cannot arrive at their destination in one step, relay them via RB's to
their destinations along appropriate directions with appropriate delays.
The mechanism provides data flow packets (IP and OP). Note that because Meiko
C provides the function of transferring data by block, instead of individual items, the
data flow unit is a Data Vector (DV) which contains the information about a data block
and its location. A DV is described by three attributes: the buffer, the position in the
buffer, and the contents. For example, $DV = [OB, n(Q^0), Q^1]$ stands for $Q^1$ parking in
OB at position $n(Q^0)$, where the Q's are data blocks and n(Q) is a function giving
the number of data items in Q. An OP (or IP) consists of a number of DV's. For instance,
$OP = \{[OB, n(Q^0), Q^1], [RB, n(Q^2), Q^3]\}$. Obviously, the buffers involved in OP and IP are the
source and destination of data blocks, respectively.
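The DV and packet notation maps naturally onto small record types. The following sketch is
one possible representation, not the data structure used in the generated Meiko C code; the
class names, field names and block sizes are hypothetical, and n(Q) is assumed to count the
data items of a block.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Block:
    """A data block Q; `items` is its assumed size n(Q) in data items."""
    name: str
    items: int

@dataclass(frozen=True)
class DV:
    """A data vector: which buffer, where in it, and what it carries."""
    buffer: str        # "OB", "IB" or "RB"
    position: int      # offset of the block inside the buffer
    contents: Block

def n(*blocks: Block) -> int:
    """Total number of data items in the given blocks."""
    return sum(block.items for block in blocks)

# Four blocks with made-up sizes, and the example packet from the text.
Q0, Q1, Q2, Q3 = (Block(f"Q{k}", size) for k, size in enumerate([8, 4, 6, 2]))
OP: List[DV] = [DV("OB", n(Q0), Q1), DV("RB", n(Q2), Q3)]
```

Here the first DV says that Q1 sits in OB immediately after the 8 items of Q0.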
If LSGP partitioning is used, we face a more complicated situation, because a number
of different IP's and OP's have to be produced for different locations of supernodes in
the LSGP partitioning block. The non-LSGP case can be taken as a simplified version
of the LSGP case. Therefore we deal with the LSGP case first even though it is more
complex, and then describe briefly the non-LSGP and discuss the associated direct data
flow problem.
Some notation should be introduced. As indicated previously, more than one supernode
dependency may contribute to an inter-processor dependency, so one inter-processor
dependency may involve more than one DC. For instance, if $S = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$, $d^p_{01}$
involves $DC^{01}_{001}$ and $DC^{01}_{101}$; $d^p_{10}$ involves $DC^{10}_{010}$ and $DC^{10}_{110}$; and $d^p_{11}$ involves $DC^{11}_{011}$ and
$DC^{11}_{111}$. Here, the $DC^{yy}_{xxx}$'s are the same as those in Subsection 6.3.2 but with the $d^t$ omitted from
the superscript. For brevity, we merge all the $DC^{yy}_{xxx}$'s which share the same processor
dependency into one data flow, denoted $Q^{yy}$. That is, $Q^{01} = DC^{01}_{001} + DC^{01}_{101}$, and so on.
7.4.1 LSGP Case
As discussed in Subsection 6.5.3, unnecessary data flows should be avoided. In the LSGP
case, there are c different positions, marked with w', in a block of LSGP partitioning,
where $c = k^{N-1}$ is the number of supernodes in a block of LSGP partitioning and
also the total LSGP compression factor mentioned before (see Figure 6.3(b)). For each
of these positions we have to carry out different input/output operations.
We have to design different $IP_p^{w_i}$'s and $OP_p^{w_i}$'s, where the superscript $w_i$, $i = 0, \ldots, c-1$,
indicates the ordinal number of computing each w', and the subscript p shows the direction
of the packet³. The $IP_p^{w_i}$'s and $OP_p^{w_i}$'s describe all the input and output operations for
the w' marked as $w_i$. However, before establishing the IP's and OP's, we must mark each
w' of a processor with its $w_i$ and know the relations of the $w_i$ to those of the adjacent processors.
Determining the $w_i$'s and finding their increment
First of all, we must determine the order of computing the w' of a processor, which also
gives the order of the IP's and OP's. That is, we should mark each w' with $w_i$; see Figure 7.3(a)
for the case of Example 6.1.1. Figure 7.3(a) shows an LSGP block where, for each w', "t"
indicates the time at which the w' is executed and hence also the order of execution.
That is, the w' executed at time t is marked with $w_t$; for instance, the w' executed at t = 0
is marked as $w_0$. In fact, it is difficult to derive an explicit formula relating w' and $w_i$.
Fortunately, the problem has already been considered in Algorithm 6.5.1. A short algorithm
(Algorithm E.1) is presented in Appendix E to determine the computing order of every w'.
For the processor a, letting $\Delta a = 0$ and running Algorithm E.1, we learn which w' is
accessed at time t, where $\Delta a$ indicates a shift from the referenced processor in the
processor domain. Then we mark that w' with $w_t$.
³ Obviously p must be one of the interconnection primitives, i.e., $p \in P$.
Figure 7.3: Marking w' with ordinal number w. (a), (b) and (c) show LSGP partitioning
blocks; (d) shows an initial state of a processor array marked with ordinal number w.
Comparing Figure 7.3(b) and Figure 7.3(a), we find an important fact: the processors
in an array may not work with the same $w_i$ at a given moment, so that an $OP^{w_i}$ operation of
one processor may correspond to an $OP^{w_{i'}}$ operation of an adjacent processor, where
$i \neq i'$. For instance, $OP_p^{w_0}$ of the processor a should correspond to $IP_p^{w_6}$ of the processor
$a + [1, 0]^T$ when t = 0. Therefore, in order to determine the corresponding relations of
the sequences of the IP's and the OP's of two adjacent processors, we must also know the
order of computing $w_i$ of the adjacent processors.
For the adjacent processor $a + [0, 1]^T$, let $\Delta a = [0, 1]^T$ and run Algorithm E.1 again. We
find that it gives the same result as processor a, see Figure 7.3(c). However, for the
processor $a + [1, 0]^T$ in Figure 7.3(b), letting $\Delta a = [1, 0]^T$ and running Algorithm E.1
once more, we see that when t = 0, $w'_{0,0}$, which has been marked as $w_6$, is accessed. Thus,
in this direction, we find an increment of 6 with respect to $w_i$. Therefore, an increment
vector $\Delta w = [6, 0]$ can be defined. At time t, if the referencing processor a accesses $w_i$,
processor a' will access $w_{i'}$, where $i' = (i + \Delta w (a' - a)) \bmod c$, see Figure 7.3(d), which
shows the $w_i$'s for a processor array at time t = 0.
Figure 7.4: The Conceptual Chart of Creating IP's and OP's. The chart reads: mark every
w' with $w_j$; for all w', create all possible $OP_p^{w_j}$; for all $OP_p^{w_j}$, create the corresponding
$IP_{p'}^{w_{j'}}$ to receive them; then, for i = 0 up to M-1: for all $IP_p^{w_j}$, if any data vector needs
to be relayed along the i-th dimension, modify the corresponding OP to output it, and for
any modified OP, modify the corresponding IP to receive the relayed data vector.
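The index relation between neighbouring processors is a one-line computation. The sketch
below simply evaluates $i' = (i + \Delta w (a' - a)) \bmod c$ for the increment vector [6, 0] and
c = 9 discussed above; the function name is hypothetical.

```python
def neighbour_index(i, delta_w, a, a_prime, c):
    """Ordinal i' accessed by processor a' while processor a accesses w_i."""
    shift = sum(dw * (ap - ak) for dw, ap, ak in zip(delta_w, a_prime, a))
    return (i + shift) % c

delta_w, c = [6, 0], 9
same_row = neighbour_index(0, delta_w, a=[0, 0], a_prime=[0, 1], c=c)
next_row = neighbour_index(0, delta_w, a=[0, 0], a_prime=[1, 0], c=c)
prev_row = neighbour_index(0, delta_w, a=[0, 0], a_prime=[-1, 0], c=c)
```

Python's `%` always returns a non-negative result for a positive modulus, so a negative
shift needs no special handling here.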
Establish OP's and IP's
The procedure for establishing OP's and IP's is shown in Figure 7.4. Marking w' with $w_j$
has been described above. Now for every position w' in the LSGP block, an OP can be
worked out and marked with the corresponding $w_j$. For each of the OP's we must create
a properly marked IP to respond. So we have two initial sets of OP's and IP's, and have
yet to consider data relay in case some data cannot arrive at their destination
in one step. Next, we use recursive modifications to complete all the IP's and OP's. For
each direction in turn, we check whether some data have not reached their terminal
destination in this direction, and then modify the relevant $OP_p^{w_j}$'s to send the data out
once more and modify the corresponding IP's to receive them. In the end, all data arrive
at their correct destinations.
More precisely, establishing OP's and IP's consists of the following steps.
Step 1 For each w' and for all $d^p \in D^p$, compute $d^A$ by eqn (6.23). Set all entries after
the first non-zero entry of $d^A$ to zero, forming a vector p; for instance,
$d^A = [0, -1, 1, 0]^T \Rightarrow p = [0, -1, 0, 0]^T$. The vector p indicates the link we work with.
Create $OP_p^{w_j}$, which consists of a number of data vectors which are attributed with
$d^A$.
Step 2 For each existing $OP_p^{w_j}$, create its corresponding $IP_{p'}^{w_{j'}}$, where $j' = (j + \Delta w\, p) \bmod c$
and $p' = -p$. Now we have the initial $OP_p^{w_j}$'s and $IP_{p'}^{w_{j'}}$'s.
Step 3 For i = 1 to M-1, for any data vector of any existing $IP_{p'}^{w_j}$, null the $d^A$ of the
data vector except the i-th element, forming p. If $p \neq 0$, which means that the data
vector has not yet arrived at its destination, attach the data vector to $OP_p^{w_{j'}}$,
where $j' = (j + 1) \bmod c$. Attach the vector to the corresponding IP of $OP_p^{w_{j'}}$.
Figure 7.5: LSGP dependencies and data flows.
The actual algorithms for generating the in/out packets for general cases are provided
in Appendix E, (Algorithm E.3).
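The vector manipulations in the three steps above can be sketched as follows. This is not
Algorithm E.3; it only illustrates the individual rules, with hypothetical function names,
and it uses the increment vector [6, 0] and c = 9 of the running example as sample inputs.

```python
def link_direction(d_a):
    """Step 1: keep the first non-zero entry of d_a and zero every later entry."""
    p = [0] * len(d_a)
    for k, entry in enumerate(d_a):
        if entry != 0:
            p[k] = entry
            break
    return p

def matching_ip(j, p, delta_w, c):
    """Step 2: index j' and direction p' = -p of the IP that receives OP_p^{w_j}."""
    j_prime = (j + sum(dw * pk for dw, pk in zip(delta_w, p))) % c
    return j_prime, [-pk for pk in p]

def relay_direction(d_a, i):
    """Step 3: zero d_a except its i-th entry; a non-zero result means 'relay'."""
    return [entry if k == i else 0 for k, entry in enumerate(d_a)]

p = link_direction([0, -1, 1, 0])
ip_index, ip_direction = matching_ip(0, [-1, 0], delta_w=[6, 0], c=9)
relay = relay_direction([-1, 1], 1)
needs_relay = any(relay)
```

In the last two lines a data vector with $d^A = [-1, 1]$ has travelled along dimension 0 and
still has a non-zero component along dimension 1, so it would be attached to an OP for a
further hop.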
It is helpful to show the procedure with a 2-D example. See Figure 7.5(a). First, for
each $w_i$, all the Q's which will be output should be collected into OB's which are also
attributed with $w_i$. That is,
$OB^{w_0} = Q^{-1,0} + Q^{-1,1} + Q^{0,1}$
$OB^{w_1} = Q^{-1,1} + Q^{0,1}$
$OB^{w_2} = Q^{-1,1} + Q^{0,1} + Q^{1,-1} + Q^{1,0}$
$OB^{w_3} = Q^{-1,1} + Q^{-1,0}$
$OB^{w_5} = Q^{1,-1} + Q^{1,0}$
$OB^{w_6} = Q^{-1,1} + Q^{-1,0} + Q^{1,-1} + Q^{0,-1}$
$OB^{w_7} = Q^{1,-1} + Q^{0,-1}$
$OB^{w_8} = Q^{0,-1} + Q^{1,-1} + Q^{1,0}$
For example, the right-bottom box marked $w_0$ in Figure 7.5(a) contributes to $d^p_{-1,0}$, $d^p_{-1,1}$
and $d^p_{0,1}$, so the associated data blocks are collected into $OB^{w_0}$ and placed one after another.
Next, we establish the initial OP's. The Q's in OB are distributed to OP's in suitable
directions:
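$OP_{-1,0}^{w_0} = \{[OB, 0, Q^{-1,0} + Q^{-1,1}]\}$, $OP_{0,1}^{w_0} = \{[OB, n(Q^{-1,0} + Q^{-1,1}), Q^{0,1}]\}$.
$OP_{0,1}^{w_1} = \{[OB, 0, Q^{-1,1} + Q^{0,1}]\}$.
$OP_{0,1}^{w_2} = \{[OB, 0, Q^{-1,1} + Q^{0,1}]\}$, $OP_{1,0}^{w_2} = \{[OB, n(Q^{-1,1} + Q^{0,1}), Q^{1,-1} + Q^{1,0}]\}$.
$OP_{-1,0}^{w_3} = \{[OB, 0, Q^{-1,1} + Q^{-1,0}]\}$.
$OP_{1,0}^{w_5} = \{[OB, 0, Q^{1,-1} + Q^{1,0}]\}$.
$OP_{-1,0}^{w_6} = \{[OB, 0, Q^{-1,1} + Q^{-1,0}]\}$, $OP_{0,-1}^{w_6} = \{[OB, n(Q^{-1,1} + Q^{-1,0}), Q^{1,-1} + Q^{0,-1}]\}$.
$OP_{0,-1}^{w_7} = \{[OB, 0, Q^{1,-1} + Q^{0,-1}]\}$.
$OP_{0,-1}^{w_8} = \{[OB, 0, Q^{0,-1}]\}$, $OP_{1,0}^{w_8} = \{[OB, n(Q^{0,-1}), Q^{1,-1} + Q^{1,0}]\}$.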
Then, we have to work out all the corresponding IP's which receive the OP's. An
$OP_p^{w_i}$ is linked to $IP_{p'}^{w_{i'}}$, where $i' = (i + \Delta w\, p) \bmod c$. For instance,
$OP_{-1,0}^{w_0}$ corresponds to $IP_{1,0}^{w_3}$, and $OP_{0,1}^{w_0}$ to $IP_{0,-1}^{w_0}$, see Figure 7.5(b). An $IP_{p'}^{w_{i'}}$ receives a
number of Q's. If a Q arrives at its terminal, put the Q into IB; otherwise, put it into
the relay buffer $RB_p$. We can list all the IP's in an order corresponding to the OP's.
$IP_{1,0}^{w_3} = \{[IB^{w_3}, 0, Q^{-1,0}], [RB_{1,0}, 0, Q^{-1,1}]\}$, $IP_{0,-1}^{w_0} = \{[IB^{w_0}, 0, Q^{0,1}]\}$.
$IP_{0,-1}^{w_1} = \{[IB^{w_1}, 0, Q^{-1,1} + Q^{0,1}]\}$.
$IP_{0,-1}^{w_2} = \{[IB^{w_2}, 0, Q^{-1,1} + Q^{0,1}]\}$, $IP_{-1,0}^{w_8} = \{[IB^{w_8}, 0, Q^{1,-1} + Q^{1,0}]\}$.
$IP_{1,0}^{w_6} = \{[IB^{w_6}, 0, Q^{-1,1} + Q^{-1,0}]\}$.
$IP_{-1,0}^{w_2} = \{[IB^{w_2}, n(Q^{-1,1} + Q^{0,1}), Q^{1,-1} + Q^{1,0}]\}$.
$IP_{1,0}^{w_0} = \{[IB^{w_0}, n(Q^{0,1}), Q^{-1,1} + Q^{-1,0}]\}$, $IP_{0,1}^{w_6} = \{[IB^{w_6}, n(Q^{-1,1} + Q^{-1,0}), Q^{1,-1} + Q^{0,-1}]\}$.
$IP_{0,1}^{w_4} = \{[IB^{w_4}, 0, Q^{1,-1} + Q^{0,-1}]\}$.
$IP_{0,1}^{w_8} = \{[IB^{w_8}, n(Q^{1,-1} + Q^{1,0}), Q^{0,-1}]\}$, $IP_{-1,0}^{w_5} = \{[RB_{-1,0}, 0, Q^{1,-1}], [IB^{w_5}, 0, Q^{1,0}]\}$.
Now the OP's and IP's must be modified to implement data relay. Obviously, all
data in the RB's must be output once more, so we have to modify (or create) some OP's
to relay this data. Since $IP_{1,0}^{w_3}$ puts $Q^{-1,1}$ into $RB_{1,0}$, $Q^{-1,1}$ must be output along the
direction [0, 1] at the next time step, where the superscript vector of Q indicates $d^p$. So
we create
$OP_{0,1}^{w_4} = \{[RB_{1,0}, 0, Q^{-1,1}]\}$ and $IP_{0,-1}^{w_4} = \{[IB^{w_4}, n(Q^{1,-1} + Q^{0,-1}), Q^{-1,1}]\}$
to transmit $Q^{-1,1}$ and then to receive it, respectively. Similarly, create
$OP_{0,-1}^{w_6} = \{[RB_{-1,0}, 0, Q^{1,-1}]\}$ and $IP_{0,1}^{w_6} = \{[IB^{w_6}, n(Q^{-1,1} + Q^{-1,0} + Q^{1,-1} + Q^{0,-1}), Q^{1,-1}]\}$
to transmit $Q^{1,-1}$ and then to receive it.
Finally, we derive and list all the IB's.
$IB^{w_0} = Q^{0,1} + Q^{-1,1} + Q^{-1,0}$
$IB^{w_1} = Q^{-1,1} + Q^{0,1}$
$IB^{w_2} = Q^{-1,1} + Q^{0,1} + Q^{1,-1} + Q^{1,0}$
$IB^{w_3} = Q^{-1,0}$
$IB^{w_4} = Q^{1,-1} + Q^{0,-1} + Q^{-1,1}$
$IB^{w_5} = Q^{1,0}$
$IB^{w_6} = Q^{-1,1} + Q^{-1,0} + Q^{1,-1} + Q^{0,-1} + Q^{1,-1}$
$IB^{w_8} = Q^{1,-1} + Q^{1,0} + Q^{0,-1}$
Obviously, the IP's should be rearranged according to their ordinal number. It takes too
much space and is of no real value to list all the OP's and IP's created, so they are omitted
here.
All the work above is carried out during compilation and all the OP's and IP's are
written into a .h file. They are used repeatedly during the parallel program execution.
In addition, we must use $\Delta w\, a \bmod c$ to determine the start point of the cycle for each
processor a. All the input and output operations must be controlled by the $IP^{w_j}$ and
$OP^{w_j}$, respectively, and all copying of data between supernode memory and OB and IB
must be controlled by $OB^{w_j}$ and $IB^{w_j}$.
7.4.2 Non LSGP Case and Direct Data Flows
Non LSGP Case
The non-LSGP case can be taken as the case c = k = 1 of LSGP. The method of creating
IP and OP for the LSGP case can be applied here, and the superscript $w_j$ for IP and OP is
no longer necessary. For a 2-D SUC array, the OP's and IP's are given as:
$OP_{10} = \{[OB, 0, Q^{10} + Q^{11}]\}$.
$IP_{-1,0} = \{[IB, 0, Q^{10}], [RB_{10}, 0, Q^{11}]\}$.
$OP_{01} = \{[OB, n(Q^{10} + Q^{11}), Q^{01}], [RB_{10}, 0, Q^{11}]\}$.
$IP_{0,-1} = \{[IB, n(Q^{10}), Q^{01} + Q^{11}]\}$.
and
$OB = Q^{10} + Q^{11} + Q^{01}$.
$IB = Q^{10} + Q^{01} + Q^{11}$.
This is easily checked from Figure 7.6. For instance, we first build $OP_{10}$ for $proc_{00}$
along dimension 0, which contains all the Q's which have processor dependencies
of the form "1x", i.e., $Q^{10} + Q^{11}$, so that $OP_{10} = \{[OB, 0, Q^{10} + Q^{11}]\}$. Then, we build $IP_{-1,0}$,
which receives $OP_{10}$ as a whole but separates it into two data vectors, i.e., $Q^{10}$ to IB and
$Q^{11}$ to $RB_{10}$, thus $IP_{-1,0} = \{[IB, 0, Q^{10}], [RB_{10}, 0, Q^{11}]\}$.
Figure 7.6: Data flows and relays.
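A quick way to sanity-check such packets is to verify that every OP sends exactly as many
data items as its matching IP receives. The sketch below does this for the four 2-D SUC
packets; the tuple representation, the helper names and the block sizes are all hypothetical.

```python
def n(sizes, *names):
    """Total number of data items in the named blocks."""
    return sum(sizes[name] for name in names)

def build_packets(sizes):
    """The 2-D SUC packets as lists of (buffer, position, blocks) data vectors."""
    return {
        "OP10": [("OB", 0, ("Q10", "Q11"))],
        "IP-1,0": [("IB", 0, ("Q10",)), ("RB10", 0, ("Q11",))],
        "OP01": [("OB", n(sizes, "Q10", "Q11"), ("Q01",)), ("RB10", 0, ("Q11",))],
        "IP0,-1": [("IB", n(sizes, "Q10"), ("Q01", "Q11"))],
    }

def items_moved(packet, sizes):
    """Number of data items carried by all data vectors of one packet."""
    return sum(n(sizes, *blocks) for _, _, blocks in packet)

sizes = {"Q10": 6, "Q11": 2, "Q01": 5}   # made-up block sizes
packets = build_packets(sizes)
balanced = (
    items_moved(packets["OP10"], sizes) == items_moved(packets["IP-1,0"], sizes)
    and items_moved(packets["OP01"], sizes) == items_moved(packets["IP0,-1"], sizes)
)
```

Note how $Q^{11}$ appears twice on the output side: once from OB along dimension 0 and once
from the relay buffer along dimension 1.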
Figure 7.7: Indirect and direct transference of data.
Direct Data Flows
In the above discussions, we use OB and IB to buffer all the data flows in and out of the
supernode memory space. The advantage of buffering is to collect a segment-distributed
data block together so that it can be transferred as one Meiko data flow vector, i.e., the
DV in Figure 7.7.(b).
However, it also has an obvious disadvantage: copying data between the supernode
memory and the OB and IB takes time, so in some circumstances, we will try to make a
direct data flow instead.
We find that if the length of all of the segments is long enough, say more than 20 data
items, the time for Meiko to open a new data vector is less than the time consumed to
copy data to and from the buffers. Therefore, if the length of every segment is more than
20, we establish data vectors addressing each of the segments. Thus, three DV's are
created for the three data segments of Figure 7.7(a). In the rest of this subsection, we
deal with individual DC's, instead of Q's.
It is complicated to implement the direct data flow. Generally, for directly
transferring a DC, $DC_d$, of size $\prod_{j=0}^{N-1} l_j$, we have to generate $n_d = \prod_{i=0}^{N-2} l_i$ DV's, denoted
$DV^d_0, \ldots, DV^d_{n_d-1}$. All of them share the first point of $DC_d$ as their formal buffer. The
relative distances between the first points of their corresponding segments and the first
point of $DC_d$ are defined as their "positions" in the formal buffer. Obviously, we have to
make two sets of such DV's, one for output and the other for input.
Figure 7.8: Processor Array and Parallel Codes. The parallel codes stand also for a
process. All the parallel codes are identical, with a as the parameter.
In practice, we first generate the OP's and IP's by means of the methods outlined in
the subsection above, then create the $DV^d_0, \ldots, DV^d_{n_d-1}$ to substitute for a DV of the OP's
and IP's which references OB or IB and has a length $l_{N-1}$. This method is summarized
as follows.
Step 1 Scan all the DC's and find the ones, say $DC_d$, whose $l_{N-1} \geq 20$;
Step 2 Remove $DC_d$ from OB;
Step 3 Create $n_d$ DV's for $DC_d$;
Step 4 Scan all DV's of OP's and replace the DV which transfers $DC_d$ from OB with the
$n_d$ DV's. For instance, $DV = [OB, 0, X + DC_d + Y]$ has to be changed to $2 + n_d$
sequential DV's, i.e., $[OB, 0, X]$, $DV^d_0, \ldots, DV^d_{n_d-1}$, $[OB, n(X), Y]$, where X and Y
are arbitrary groups of DC's;
Step 5 Scan all DV's of OP's and modify the DV's which do not transfer $DC_d$ but involve
$DC_d$ in their "position". For instance, $DV = [OB, n(X + DC_d + Y), Z]$ should be
changed to $DV = [OB, n(X) + n(Y), Z]$;
Step 6 Do the same for IB and IP's.
In the parallel program, these $DV^d$'s will be given physical addresses that reference
supernodes.
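Steps 4 and 5 amount to a small rewriting rule on data vectors. The sketch below applies it
to the two examples in the text. It is an illustration only: the tuple layout, the names and
the block sizes are hypothetical, a DV records the blocks stored ahead of it instead of a
numeric position, and the segment index stands in for the real "position" of each direct DV.

```python
from math import prod

def direct_dv_count(lengths):
    """n_d: one DV per contiguous segment, i.e. the product of l_0 .. l_{N-2}."""
    return prod(lengths[:-1])

def n(sizes, names):
    """Total number of data items in the named blocks."""
    return sum(sizes[name] for name in names)

def rewrite_dv(dv, direct, n_d, sizes):
    """Rewrite one DV after the block `direct` has been removed from the buffer.

    dv = (buffer, before, blocks), where `before` lists the blocks stored ahead
    of this DV in the buffer and `blocks` lists the blocks it transfers.
    """
    buffer, before, blocks = dv
    # Step 5: the removed block no longer contributes to any position.
    start = n(sizes, [b for b in before if b != direct])
    if direct not in blocks:
        return [(buffer, start, blocks)]
    # Step 4: split into [X], n_d direct DVs, and [Y].
    k = blocks.index(direct)
    x, y = blocks[:k], blocks[k + 1:]
    out = []
    if x:
        out.append((buffer, start, x))
    out += [("direct", segment, [direct]) for segment in range(n_d)]
    if y:
        out.append((buffer, start + n(sizes, x), y))
    return out

sizes = {"X": 3, "D": 40, "Y": 2, "Z": 5}   # D plays the role of the direct DC
n_d = direct_dv_count([2, 20])              # a 2 x 20 DC has two segments
split = rewrite_dv(("OB", [], ["X", "D", "Y"]), "D", n_d, sizes)
moved = rewrite_dv(("OB", ["X", "D", "Y"], ["Z"]), "D", n_d, sizes)
```

The first call reproduces the $2 + n_d$ sequential DV's of Step 4, and the second moves Z
from position n(X + D + Y) to n(X) + n(Y) as in Step 5.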
7.5 Outline of the Parallel Code
As mentioned before, we shall automatically produce one parallel code which will be
loaded to all the processors and run synchronously. The parallel code behaves like a
process. Every processor has a process. For a processor, the process running in it is also
identified by the a of the processor.
Automatically producing the parallel code does not mean that we have to produce
a completely new program every time. Our idea is to have some program templates in
advance which construct a structure and contain some unchangeable segments of the
program. In the structure, operations which change for particular applications are formally
expressed by macro-defined functions. It is these macro-defined functions that have to
be produced automatically and be installed into a header file for each application. There
are many parameters which describe the particular problem and are used by both the
constant program segments and the macro-defined functions. These parameters are also
collected into the header file. Figure 7.8 shows the assignment of parallel codes onto a
processor array. The program template constructs the main structure of the parallel program.
With a as the parameter which determines the supernodes involved in a particular
processor, it opens communication channels connecting to neighbouring processors, allocates
memory space for each of the supernodes involved and initialises them. It also computes
the supernodes and inputs/outputs data from and to other processors via the channels. See
Figure 7.9 for its conceptual operation.
There are two kinds of program templates. One is ProcNonLSGP, which copes with
the situations of Lower-Dimensional array (Lower-D) and SUC using IP's and OP's. The
second, ProcLSGP, is only for SBC with LSGP partitioning and uses $IP^{w_j}$'s and $OP^{w_j}$'s.
Examples of the real parallel codes are presented in Appendix G, but are not very readable
for the uninitiated. We explain them as follows.
7.5.1 Parallel Codes for Non-LSGP
Codes 7.5.1 The ProcNonLSGP
op 1: Open Communications
    Open Nlinks TRANSPORTs for communications
    Create Nlinks channels to neighbouring TRANSPORTs
op 2: Preparation
    Allocate and clear memory space for all data
    Allocate IB's and OB's and RB's
    Preparations for accessing supernode space
    Locate DV's in IP's and OP's
op 3: Locate and Initialise
    SupernodeSpaceLoops
    {
        IF Condition true
            Locate the supernode in the memory space
            InitializeData(a supernode)
        Break
    }
op 4:
    SupernodeSpaceLoops
    {
        op 4.1: Build direct DV's
        FOR i := 0 TO NDC - 1
            IF DC i is a direct DC
                CreateDirectInputDVs(DC i)
                CreateDirectOutputDVs(DC i)
        op 4.2: Input/output
        FOR i := 0 TO Nlinks - 1
            Receiving input data flow blocks according to IP's
            Transferring output data flow blocks according to OP's
        op 4.3: Waiting until input finish
        FOR i := 0 TO Nlinks - 1
            IF the TRANSPORT is input-valid
                Wait until all input data received
        op 4.4: Data from buffer to memory
        FOR i := 0 TO NDC - 1
            IF DC i is an indirect DC
                CopyFromBufferToSupernode(DC i)
        op 4.5: Compute supernode
        IF Condition true
            ComputeSupernode(a supernode)
        op 4.6: Waiting until output finish
        FOR i := 0 TO Nlinks - 1
            IF the TRANSPORT is output-valid
                Wait until all output data transferred
        op 4.7: Data from memory to buffer
        FOR i := 0 TO NDC - 1
            IF DC i is an indirect DC
                CopyFromSupernodeToBuffer(DC i)
        Break
    }
End of Codes
where InitializeData, ComputeSupernode, CreateDirectInputDVs, CreateDirectOutputDVs,
CopyFromBufferToSupernode and CopyFromSupernodeToBuffer
are macro-defined functions; SupernodeSpaceLoops{ } are macro-defined FOR-loops
similar to the N-M-1 FOR-loops of FOR-Loops 6.4.2, which confine the supernodes
mapped into this processor; Condition and Break are macro-defined expressions.
Condition stands for referencing a valid supernode. $N_{links}$ stands for the number of links
around a processor. In fact, $N_{links}$ is also the number of non-zero entries of P.
Figure 7.9: The flow chart of the template of the parallel codes. The chart reads: take the
identity a of the processor; open communication channels to neighbouring processors; for
each supernode in the processor a, allocate its memory location and initialise the nodes in
it; then, for each supernode in the processor a, input and output data via buffers, wait
until the input has finished, copy data from buffer to memory, compute the nodes in the
supernode, wait until the output has finished, and copy data from memory to buffer.
Brief explanations and comments are given as follows.
1. In op 1, TRANSPORT is the port of a process for input and output data. See
Figure 7.10. $N_{links}$ TRANSPORTs should be established for a process, $TRANSPORT_{i,j}$,
where $i = 0, \ldots, M-1$ and j = 1 in SUC or j = ±1 in SBC. Inter-process communication
channels must be created between the TRANSPORTs inside and outside the process.
Figure 7.10: TRANSPORT and communication channels in the case of 1-D and SBC
We restrict ourselves strictly to the cases where we have to do all necessary data relay
by ourselves. Therefore, the inter-process communication channels coincide exactly with
the physical links, since there is one and only one process in a processor in our cases. Thus,
the procedure of creating the inter-process channels is similar to that of inter-processor
links, i.e., Algorithm 7.2.1, but only for the single a, the badge of the process.
In addition, we should make a record of the available channels for each process, because
for the processors on edges of the array, to try to send data via a non-existent channel will
make the process wait endlessly for response. This is achieved by marking TRANSPORT
input and/or output valid.
Algorithm 7.5.1 Build Inter-processor Communication Channels
For all $p_i \in P$
    For the $p_{i,j} \in p_i$ with $p_{i,j} \neq 0$
        a' := a + $p_i$
        IF $a' \in A^M$
            Create a channel from $TRANSPORT_{i,p_{i,j}}$ of a to $TRANSPORT_{i,p_{i,j}}$ of a'
            Mark $TRANSPORT_{i,p_{i,j}}$ output valid
        a' := a - $p_i$
        IF $a' \in A^M$
            Mark $TRANSPORT_{i,p_{i,j}}$ input valid
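The validity marking of Algorithm 7.5.1 can be sketched for a single process as follows.
This is a hedged illustration, not the generated code: the function name is hypothetical,
a port is identified here by the pair (dimension of the non-zero entry, value of that
entry), which is an assumption about the TRANSPORT indices, and the processor array domain
is supplied as a predicate.

```python
def build_channels(a, primitives, in_array):
    """Return (channels, output_valid, input_valid) for the process at position a.

    primitives -- the interconnection primitives, e.g. [(1, 0), (0, 1)]
    in_array   -- predicate telling whether a position lies inside the array
    """
    channels, output_valid, input_valid = [], set(), set()
    for p in primitives:
        for k, entry in enumerate(p):
            if entry == 0:
                continue
            port = (k, entry)                       # assumed TRANSPORT index
            target = tuple(ak + pk for ak, pk in zip(a, p))
            if in_array(target):
                channels.append((a, port, target))  # channel from a to a + p
                output_valid.add(port)
            source = tuple(ak - pk for ak, pk in zip(a, p))
            if in_array(source):
                input_valid.add(port)
    return channels, output_valid, input_valid

def in_2x2(position):
    """A made-up 2 x 2 processor array."""
    return all(0 <= q < 2 for q in position)

suc = [(1, 0), (0, 1)]
corner_channels, corner_out, corner_in = build_channels((0, 0), suc, in_2x2)
far_channels, far_out, far_in = build_channels((1, 1), suc, in_2x2)
```

Processor (0, 0) can only send and processor (1, 1) can only receive, so neither will wait
on a channel that does not exist.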
2. Before doing any data transference, in op 2 we should give the DV's in IP's and
OP's physical addresses referencing the allocated input and output buffers. Since data
transferences start and end in fixed buffers independent of supernodes, it can be done
independently.
3. In op 2, a memory space defined by eqn (7.3) is allocated for data of all supernodes
which may be involved in the process. For op 9, where SupernodeSpaceLoops{ }
with Condition together confine and access all the supernode assigned in this process
and locate ea.ch supernode to the address at defined by eqn (7.4) in this memory space.
InitializeData generates an initial data matrix by the process itself. If initial data are
obtained from outside it is not required .
4. The second SupernodeSpaceLoops{ }, i.e., op 4, is the real body of communi-
cations and computations.
Communication through the $N_{links}$ TRANSPORTs: in op 4.2, for each supernode, if
the TRANSPORT is input-valid, data is transferred inwards to IB and RB by using the
IP's; if the TRANSPORT is output-valid, data is transferred outwards from OB and RB
by using the OP's.
Internal Data Transfer (buffer ⇒ supernode): if the test in op 4.3 shows that the
reception is finished, then in op 4.4 the data which have arrived at their destination are
transferred from an IB to the memory location of the relevant supernode. More accurately,
at time $t$, $\forall i \in [0, N_{DC} - 1]$, CopyFromBufferToSupernode fetches $DC_i$ from the IB
and stores it in the area indicated by $a_t + r_i$, where $r_i$ is defined by eqn (7.1).
Computation: in op 4.5, if the test Condition shows that the supernode is within the
valid supernode polyhedron, the supernode is computed.
Internal Data Transfer (supernode ⇒ buffer): if the test in op 4.6 shows that the
previous output transfer has finished, the output data are collected in op 4.7 from
the memory location of the relevant supernode into an OB. More accurately, at time $t$,
$\forall i \in [0, N_{DC} - 1]$, CopyFromSupernodeToBuffer copies the $DC_i$ of the area indicated
by $a_{t - d^t_i} + r_i$ to the OB, where $r_i$ is defined by eqn (7.2) and $d^t_i$ is the time-delay
attribute of $DC_i$.
5. The data movements between the memory locations of supernodes and the IB and
OB cost time. Ideally, data would flow directly from or to the supernodes. In some
cases this is beneficial, but not always. For those DCs which may be transferred directly,
op 4.1 has to give the $DV_d$'s in the IP's and OP's physical addresses referencing supernodes.
Unlike the other DV's, which address only the fixed OB or IB, these $DV_d$'s have to be
located dynamically by CreateDirectInputDVs/CreateDirectOutputDVs inside op
4 because they are associated with individual supernodes.
7.5.2 Parallel Codes with LSGP
The algorithm of ProcLSGP is similar to ProcNonLSGP but requires some modifications.
For completeness, we give it as follows:
Codes 7.5.2 The ProcLSGP
op 1: Open Communications
    Open $N_{links}$ TRANSPORTs for communications
    Create $N_{links}$ channels to neighbouring TRANSPORTs
op 2: Preparations
    Allocate and clear memory space for all data
    Allocate OB's and IB's and RB's
    Preparation for accessing supernode space
    Locate in/out DV's in $OP^{w_j}$ and $IP^{w_j}$
op 3:
    Determine initial $w_j$ for the processor
op 4:
    FOR $t_{N-1}$ := $t'$ TO $t''$ {
        op 4.1:
            Derive t
        op 4.2: Locate and Initialise
            Locate the supernode t in the memory space
            InitializeData(supernode t)
    }
op 5:
    FOR $t_{N-1}$ := $t'$ TO $t''$ {
        op 5.1:
            Derive t
        op 5.2:
            Select $IP^{w_j}$ and $OP^{w_j}$
        op 5.3:
            ComputeSupernode(supernode t)
        op 5.4: Data from memory to buffer
            FOR i := 0 TO $N_{DC} - 1$
                IF the DC is indeed outputted according to $OP^{w_j}$
                    CopyFromSupernodeToBuffer(DC i)
        op 5.5: Input/Output
            FOR i := 0 TO $N_{links} - 1$
                Receiving input data flow blocks according to $IP^{w_j}$
            FOR i := 0 TO $N_{links} - 1$
                Transferring output data flow blocks according to $OP^{w_j}$
        op 5.6: Waiting input/output finish
            FOR i := 0 TO $N_{links} - 1$
                IF the TRANSPORT is input-valid
                    Wait until all input data received
            FOR i := 0 TO $N_{links} - 1$
                IF the TRANSPORT is output-valid
                    Wait until all output data transferred
        op 5.7: Data from buffer to memory
            FOR i := 0 TO $N_{DC} - 1$
                IF the DC is indeed input according to $IP^{w_j}$
                    CopyFromBufferToSupernode(DC i)
        j := j + 1 mod c
    }
End of Code
A brief explanation is given as follows:
1. Recalling FOR-Loops 6.5.1, instead of SupernodeSpaceLoops, in op 4 and op
5 we need only one loop with respect to $t_{N-1}$ to access the supernodes assigned to the
processor.
2. Derive t, in op 4.1 and op 5.1, is a function for deriving the time basis $t$ by means of
Algorithm 6.5.1.
3. Now let us consider op 3, op 5.2 and op 5.5, which are related to input and output
packets. The $c$ pairs of $IP^{w_j}$'s and $OP^{w_j}$'s derived in 7.4.1 are used repeatedly in op 5 of
Codes 7.5.2. When running the parallel program, for each process, op 3 determines the
initial $w_j$. After entering the loop of op 5, op 5.2 selects the pair $IP^{w_j}$ and $OP^{w_j}$. In
op 5.5 every Receiving and Transferring must be carried out according to this pair.
The data copying between the supernode memory space and the in/out buffers, op
5.4 and op 5.7, must also be carried out according to $IP^{w_j}$ and $OP^{w_j}$. For the next supernode,
turn to $IP^{w_{j+1}}$ and $OP^{w_{j+1}}$, and so on.
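The cyclic reuse of the $c$ packet pairs can be sketched in a few lines of Python. This is only an illustration of the index update j := j + 1 mod c; the function name packet_schedule and its parameters are hypothetical and do not appear in the generated code.

```python
def packet_schedule(num_supernodes, c, initial_j=0):
    """Yield, for each successive supernode handled by a process, the index j
    of the input/output packet pair that would be selected in op 5.2."""
    j = initial_j                 # op 3: initial index for this processor
    for _ in range(num_supernodes):
        yield j                   # op 5.2: select the j-th pair
        j = (j + 1) % c           # end of op 5: advance cyclically

# Seven supernodes, three packet pairs, starting from pair 1.
schedule = list(packet_schedule(7, 3, initial_j=1))
```

Because the index wraps around, only $c$ pairs need to be prepared in advance, however many supernodes a process handles.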
7.6 Summary
As the result of the theoretical work, actual parallel codes in Meiko C are automatically
generated. The parallel codes consist of two pieces. One is a code template, designed either
with or without LSGP partitioning, which contains the main structure of the code. The second
is a header file which collects all parameters and some specific functions. In this way, we
may reduce the compilation work to a minimum. To achieve this, some special practical
problems in the generation of parallel codes are resolved. See Figure 7.11.
The methods of code generation for the various situations differ slightly. We cannot
say which is better or worse, since only the particular situation in an application decides
which one will be used. Obviously, one factor is the type of processor array we are
given, and its interconnection primitives. In addition, even when given an SBC
array, we may sometimes actually adopt an SUC mesh. This is because SUC invokes only about
half as many communication interfaces as SBC, and in some cases invoking communication
interfaces is very costly. Furthermore, we should choose the method by which the computational
polyhedron is mapped onto the processor array with the best fitness. In many cases, this last
factor may play a key role.
[Figure 7.11: Chart of Parallel Code Generation]
Chapter 8
Experimental Results and
Discussions
The theory and the generated parallel codes presented in previous chapters have been
tested by a great number of experiments. The experiments are designed to check first the
correctness of the parallel programs, then their performance in various situations. The
results show that they work correctly and efficiently. Some experimental results are presented
in Section 8.1. The main factors which affect the performance are then discussed.
Among them, we find that the fitness of mapping an irregular problem onto a regular
time-space domain is the dominant one.
8.1 Experimental Results
In this section, we provide some experimental results to show the correctness and perfor-
mance of our methodologies.
8.1.1 Test of the Correctness
Example 6.1.1 is used to check the correctness of our work, that is, whether the
automatically generated parallel programs give exactly the same results as their sequential
counterpart. A sequential program is written to carry out the Loops 6.1.1. Four kinds of
parallel programs are automatically generated, one for each of the following four cases:
a 1-D array with 4 processors and SUC mesh, a 1-D array with 4 processors and
SBC mesh, a 2-D array with 4 x 4 processors and SUC mesh, and a 2-D array with 4 x 4
processors and SBC mesh. Since Example 6.1.1 is a recursive process, array $A(i_0, i_1, i_2)$
should be given initial values. We suppose that $A(i_0, i_1, i_2) = i_0 + i_1 + i_2$ initially.
Because of the dependencies in Example 6.1.1 and since element A(20, 20, 10) is the
last element to be modified, the correct value of A(20, 20, 10) requires the correctness of
all other modified elements of $A(i_0, i_1, i_2)$. Therefore, it is sufficient to check only element
A(20, 20, 10). We run the five programs, and all of them give the result A(20, 20, 10) =
708639144.
Meanwhile, we find that even a minimal fault in these parallel programs results in a
totally different A(20, 20, 10). In addition, we also check the correctness of all the parallel
programs in other ways. For example, the total number of computed nodes is counted,
which is 4851 for Example 6.1.1. The correctness of data communication is also checked:
an additional subroutine verifies, for every node at the time it is computed, that all the
required data are already in their correct locations to be fetched. These
two checks further support the correctness of the computation. These facts show that all
the automatically generated parallel programs work correctly.
Furthermore, the great number of experiments carried out in the following subsection
are also used to establish the correctness of our method, although they are designed for
testing the performance. For each of them, we have a sequential program and a parallel
program, both doing the same computation, which produce a pair of results for the last
executed node. Comparing each of the pairs of results produced by the sequential and
parallel programs, we find that none of the pairs differs.
In theory it is possible that several errors exactly compensate each
other so that the final answer is correct. However, in our large number of experiments
we find no error at all, so their correctness cannot be attributed to coincidence.
In fact, in our case, because the computation is a linear accumulating procedure,
any functional fault in the program makes errors grow larger and larger, so that
errors hardly compensate each other, let alone exactly.
8.1.2 Measuring Performance
The performance of the mapping method is shown with experimental results. Before
giving the experimental results, we describe the experimental set-up and the major
performance measures we are concerned with, Speed-up (S) and Efficiency (E).
The experimental results are given for three kinds of processor arrays. The first is a 1-D
array with SUC mesh and a variable number of processors, which is designed to show the
performance of linear arrays, the most widely used arrays, and the effects of changing the
size of an array. The second and third are a 2-D array with 4 x 4 processors and SUC mesh
and a 2-D array with 4 x 4 processors and SBC mesh, which show the performance of
multi-dimensional arrays with or without LSGP partitioning, since the use of LSGP
partitioning is an important aspect of our method. We do not intentionally change the shape
of the 2-D array, since the 1-D array is an extreme example of 2-D arrays and we like to
make full use of the number of processors in the parallel computing resource. Also, we do
not use higher-dimensional structures, because the lower-dimensional mappings are designed
to alleviate the problems of constructing higher-dimensional array structures.
Experimental situations
Similar to Example 6.1.1, we still use 3-D problems
    FOR $i_0$ := 0 TO a
        FOR $i_1$ := 0 TO b
            FOR $i_2$ := 0 TO c
                $A(i) := A(i - d_0) + A(i - d_1) + A(i - d_2) + A(i - d_3)$
as a basic experimental model, where $A(i)$ stands for $A(i_0, i_1, i_2)$. The dependency matrix
$D = [d_0, d_1, d_2, d_3]$ can be modified in our experiments. The basic computational
polyhedra are cubes of size $a \times b \times c$. Let $\mathbf{c} = [a, b, c]^T$ be an N-D constant vector
describing the size of the polyhedra.
The actual computational problems are constructed by skewing the basic polyhedron
with a linear transformation $K$, where $K$ is a unimodular lower-triangular
matrix. That is, we first construct a cube-shaped computational polyhedron, indexed by
$i^c$, and then apply the transformation $i = K i^c$ as follows
$$\begin{bmatrix} -I \\ I \end{bmatrix} i^c \le \begin{bmatrix} 0 \\ \mathbf{c} - \mathbf{1} \end{bmatrix} \;\Longrightarrow\; \begin{bmatrix} -K^{-1} \\ K^{-1} \end{bmatrix} i \le \begin{bmatrix} 0 \\ \mathbf{c} - \mathbf{1} \end{bmatrix}$$
where the left and the right systems of inequalities are the cubic polyhedron and the
actual polyhedron, respectively. Because the transformation is bijective, the transformed
polyhedron also contains $N_n = a \times b \times c$ nodes.
[Figure 8.1: The basic cubic polyhedron and the transformed actual polyhedron]
For example, let $\mathbf{c} = 30 \times 40 \times 50$, and
$$D = \begin{bmatrix} 1 & 0 & 0 & 0 \\ -1 & 1 & 1 & 1 \\ -3 & 1 & -1 & 0 \end{bmatrix}$$
This computational graph stands for a For-Loop
    FOR $i_0$ := 0 TO 29
        FOR $i_1$ := $-2i_0$ TO $-2i_0 + 39$
            FOR $i_2$ := $-i_0 + i_1$ TO $-i_0 + i_1 + 49$
                $A(i_0, i_1, i_2) := A(i_0 - 1, i_1 + 1, i_2 + 3) + A(i_0, i_1 - 1, i_2 - 1)$
                                     $+ A(i_0, i_1 - 1, i_2 + 1) + A(i_0, i_1 - 1, i_2)$
The cubic polyhedron and the transformed actual polyhedron are illustrated in Figure
8.1. The statement of computation in the loops again has the form
$A(i) := A(i - d_0) + A(i - d_1) + A(i - d_2) + A(i - d_3)$. In summary, the
experimental situations can be specified by $\mathbf{c}$, $K$ and $D$.
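As a check on the claim that the skewed polyhedron keeps all $a \times b \times c$ nodes, the example loop nest can be enumerated directly. The sketch below is in Python rather than the thesis's Meiko C. The inverse map in to_cube is an assumption inferred from the loop bounds above, which correspond to $K$ with rows $[1, 0, 0]$, $[-2, 1, 0]$, $[-3, 1, 1]$; the function names are illustrative.

```python
def skewed_nodes(a=30, b=40, c=50):
    """Enumerate the index points of the example For-Loop."""
    for i0 in range(0, a):
        for i1 in range(-2 * i0, -2 * i0 + b):
            for i2 in range(-i0 + i1, -i0 + i1 + c):
                yield (i0, i1, i2)

def to_cube(i):
    """Map a skewed index back to the cubic index (assumed inverse of K)."""
    i0, i1, i2 = i
    return (i0, i1 + 2 * i0, i2 + i0 - i1)

nodes = list(skewed_nodes())
node_count = len(nodes)
cube_points = {to_cube(i) for i in nodes}
# Every point must land inside the 30 x 40 x 50 cube.
all_inside = all(x in range(30) and y in range(40) and z in range(50)
                 for x, y, z in cube_points)
```

Under these assumptions the enumeration visits 30 x 40 x 50 distinct points, each of which maps back to a distinct point of the cube, as bijectivity requires.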
In the following experiments, we design a few situations to test. The experiments are
divided into three sets: one is for the lower-dimensional array (a linear array here), the
second is for the (N-1)-D array with SUC mesh and the third is for the (N-1)-D array with
SBC mesh (2-D here). In each case, we test the effects of changes in the skew of the
polyhedron, the dependencies, and the sizes of the computational polyhedron (in individual
dimensions or in all dimensions together). For lower-D cases we also test changes in the
number of processors. These tests are not exhaustive, but we claim that they give a general
idea of how our methodology behaves, and they motivate the later discussion of the factors
which may affect performance.
Speedup and Efficiency
The speed-up is usually defined as $S = T_s / T_p$, where $T_s$ and $T_p$ stand for the times consumed
by the sequential program and the parallel program, respectively. However, in many cases
we cannot run an equivalent sequential program on one processor to measure $T_s$, since the
processor does not provide enough memory space for large computational tasks.
Alternatively, we define $T_s = t_n \times N_n$, where $t_n$ is the time to compute a single node without
any overhead operations. The $t_n$ is obtained by measuring the time for computing a
single large full supernode (the time for running ComputeSupernode() once) and then
dividing this time by the number of nodes in the supernode. ComputeSupernode()
is a typical set of nested loops with the recursive mechanism of computing bounds, whose
overhead is negligible when the supernode contains a large number of nodes.
The efficiency is defined by $E = S / N_p$, where $N_p$ is the number of processors in the array.
Before evaluating the speed-up, $t_n$ should be obtained. Note that $t_n$ varies with the
statements involved in the iterations, independent of the dependencies. In our experiments,
$t_n$ remains basically unchanged. We measure the time for computing a supernode containing
2000 nodes, which takes 35776 µs. Thus $t_n = 17.8$ µs. Some other situations have also
been tested; the results are almost the same, and 17.8 µs is about the smallest value.
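The definitions above translate directly into a few lines of Python. This is only a sketch of the arithmetic; the function names are illustrative, and the sample figures are the $T_s$ and $T_p$ values reported for the 2-D SUC array at size c = 800 with $N_p = 16$.

```python
def sequential_time_ms(t_n_us, n_nodes):
    """T_s = t_n x N_n, with t_n in microseconds and the result in milliseconds."""
    return t_n_us * n_nodes / 1000.0

def speedup(t_s, t_p):
    """S = T_s / T_p (both times in the same unit)."""
    return t_s / t_p

def efficiency(t_s, t_p, n_p):
    """E = S / N_p."""
    return speedup(t_s, t_p) / n_p

# Reported figures: T_s = 22771.2 ms, T_p = 3540.6 ms on 16 processors.
s = speedup(22771.2, 3540.6)
e = efficiency(22771.2, 3540.6, 16)
```

With these figures the speed-up is a little above 6.4 and the efficiency is about 40%.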
Only a few of the experimental results are presented here; all the experimental
results are collected in Appendix F.
A linear array consisting of $N_p$ processors with SUC mesh is used in the following
experiments.
c          50      100     200     400     800
Ts (ms)    355.8   711.6   1423.2  2846.4  5692.8
Tp (ms)    204.3   372.6   715.4   1401.8  2714.5
Lower-D Case 1
a = b = c  20      30      40      50      60      70      80
Ts (ms)    142.3   480.3   1138.6  2223.8  3842.6  6102.0  9108.5
Tp (ms)    77.4    196.0   382.9   713.4   1138.7  1793.2  2553.6
Lower-D Case 4
[Plot for Lower-D Case 4, horizontal axis: Equal Sizes a, b, c]
A 4 x 4 2-D array consisting of 16 processors with SUC mesh is used in the following
experiment.
Case SUC-1: c = [...], K = [...], D = [...], $N_p = 16$
c          50      100     200     400      600      800
Ts (ms)    1423.2  2846.4  5692.8  11385.6  17078.4  22771.2
Tp (ms)    724.5   1038.3  1502.8  2182.2   2860.2   3540.6
2-D SUC Array Case 1
[Plot for 2-D SUC Array Case 1, horizontal axis: Size c]
A 4 x 4 2-D array consisting of 16 processors with SBC mesh is used in the following
experiment.
Case SBC-1: c, K and D are the same as Case SUC-1.
c          50      100     200     400
Ts (ms)    1423.2  2846.4  5692.8  11385.6
Tp (ms)    719.4   843.9   1574.8  2649.0
2-D SBC Array Case 1
[Plot for 2-D SBC Array Case 1, horizontal axis: Size c]
8.2 Discussions on Factors Affecting Efficiency
We have to note that parallel algorithms are not necessarily faster than their
sequential counterparts. Due either to the nature of the computational problem or to a poor
arrangement of the parallel computation, parallel programs may run slower. In our
experiments, the speed-up and efficiency vary widely, so it is worthwhile to
investigate the main factors which affect the performance of our method.
8.2.1 Space and Time Fitness of the Mapping
The fitnesses of the space and time mappings are the major factors that affect the efficiency
in many situations. The problem is discussed in detail as follows.
The poorly-filled supernodes
When the original computational polyhedron is clustered into supernodes, many supernodes
around the boundary of the polyhedron may not be full. In some extreme cases,
a supernode may consist of only a few valid nodes. But even if there is only one valid node
in a supernode, the supernode must be taken into account, and all operations must be
performed. Even though the supernodes away from the boundary are all full, the number of
poorly filled supernodes is significant, and the time wasted on them cannot be ignored.
Usually, the poorly filled supernodes are difficult to avoid, for instance, the point e
in Figure 6.1. Even if the facets of the original polyhedron are parallel to those of the
partitioning parallelepiped, poorly filled supernodes appear whenever the sizes of the former
are not integral multiples of those of the latter. If the facets are not parallel to each other,
the situation becomes worse.
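The effect of non-integral multiples can be made concrete with a small Python sketch for the simplest situation, a rectangular polyhedron cut by an axis-aligned rectangular supernode. The function name supernode_fill and the tile sizes used below are illustrative assumptions, not values from the experiments.

```python
def supernode_fill(sizes, tile):
    """Return (total, full, partial) supernode counts when a rectangular
    polyhedron of the given sizes is cut into axis-aligned tiles."""
    total = 1
    full = 1
    for n, g in zip(sizes, tile):
        total *= -(-n // g)   # ceil(n / g): supernodes needed along this axis
        full *= n // g        # supernodes lying completely inside along this axis
    return total, full, total - full

# A 30 x 40 x 50 polyhedron with 8 x 8 x 8 supernodes (hypothetical tile size).
counts_8 = supernode_fill((30, 40, 50), (8, 8, 8))
# The same polyhedron with 10 x 10 x 10 supernodes, which divide it exactly.
counts_10 = supernode_fill((30, 40, 50), (10, 10, 10))
```

In the first case more than a third of the supernodes are only partly filled, while in the second case none are, even though the facets are parallel in both.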
Another factor is the size of the supernode. The larger the size, the poorer the fullness
(if the supernode is small enough to contain only one node, every supernode is full). This
increases the difficulty of choosing a proper granularity. However, note that in our parallel
programs the invalid nodes of a supernode are removed from the computation by imposing
the boundary of the original polyhedron upon the boundary of supernodes (Subsection
6.2.3), but are still involved in communication in order to keep the communication
mechanism uniform. From our experiments, we find that a large granularity of supernodes
does not result in serious shortcomings, because communication accounts for a smaller share
of the time in the large-granularity situation. So we still prefer large granularity.
The fitness of the mapped supernode with the array
We have taken considerable care to scale the supernode domain such that it can be mapped
within the given array by S. Unfortunately, because the array is regular, a good match
of the mapped supernode with the regular array is not guaranteed in theory. Of course,
this problem exists for all the existing partitioning and mapping methods as well.
In the 1-D linear array, every processor is assigned work to do if the partitioning is
carried out properly. So usually no processors are completely idle, but the loads on the
processors may be unbalanced. For the 2-D cases, some processors around the corners of
the array may be idle during the whole computing process. This is reflected clearly by the
experimental results of the SUC cases (Case SUC-1 to Case SUC-16, pages 234 to
241) and the SBC cases (Case SBC-1 to Case SBC-14, pages 242 to 248), which are
generally poorer than those of the 1-D arrays. As the dimension of the array increases,
the situation deteriorates dramatically.
The main factors influencing the problem are the shape of the supernode domain,
the transformation B and the space mapping matrix S. The load balance is the result of
the combination of these factors. In our method, we consider the B and S separately,
according to different requirements. So our method is guaranteed to succeed for any URE
cases, but loses the freedom for space fitness.
However, it should be pointed out that such freedom is usually quite limited, because so
many constraints and requirements are imposed upon the choices of B and S that they
do not always promise better fitness. Furthermore, regardless of other considerations, we
may still prefer a simple S such as $S_{SUC}$ and $S_{SBC}$, because a complex S will result in a
more irregular mapping for reasonable polyhedra, which makes it more difficult to match
a regular array.
In our experiments, it can be seen that $S_{SUC}$ is preferable to $S_{SBC}$. We think this
mainly comes from the unfitness of $S_{SBC}$ when mapping the supernode domain onto the
processor array. In fact, because the space mapping matrix S, i.e., $S_{SBC}$, has a form of
[...], a 45° rotation happens when mapping the supernode domain onto the
processor array. If the supernode domain is roughly cube-shaped, the 45° rotation
results in a mismatch similar to a diamond within a square. However, we cannot
conclude that such a rotation is always a bad thing. If the supernode domain itself is
diamond-shaped, this rotation may result in a better match.
Time domain skew
There is a skew in the time domain. The nature of time skew comes from the data
dependencies. In an array, a processor cannot work until the data from other processors
is ready, so that, usually, all processors in an array cannot simultaneously start and finish.
Because this delay effect can be accumulated, the final delay can become very significant,
especially for large arrays, (e.g., long 1-D arrays). Sometimes this problem becomes the
dominant factor affecting the efficiency because the whole executing time is counted from
the start of the first processor to the finish of the last one.
For example, for Case-Lower-2 with 10 processors (see page 223), the
running times are:
proc0     proc1     proc2     proc3     proc4
0 ~ 42    5 ~ 47    10 ~ 52   15 ~ 57   20 ~ 62
proc5     proc6     proc7     proc8     proc9
25 ~ 67   30 ~ 72   35 ~ 77   40 ~ 82   45 ~ 87
The delay between neighbouring processors is 5, but it accumulates to 45 for the last
processor, so the upper bound of the efficiency is only 48.8% = 43/88. The actual efficiency
is 42.9%.
For Case-Lower-5 with 10 processors, the running times are:
proc0     proc1     proc2     proc3     proc4
0 ~ 79    1 ~ 80    2 ~ 81    3 ~ 82    4 ~ 83
proc5     proc6     proc7     proc8     proc9
5 ~ 84    6 ~ 85    7 ~ 86    8 ~ 87    9 ~ 88
The single delay is 1, so the upper bound of the efficiency is improved to 91%, while the
actual efficiency is 76.7%.
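The bound for Case-Lower-2 can be reproduced with a short Python sketch. It assumes that every processor is busy for the same number of time steps, that each processor starts a fixed delay after its predecessor, and that intervals are counted inclusively (so 0 ~ 42 is 43 steps), which is what the quoted ratio 43/88 suggests. The function name is illustrative.

```python
def skew_efficiency_bound(busy_steps, delay, n_procs):
    """Upper bound on efficiency when each processor is busy for busy_steps
    time steps and starts `delay` steps after its predecessor."""
    # Total span from the start of the first processor to the end of the last.
    total_steps = busy_steps + delay * (n_procs - 1)
    return busy_steps / total_steps

# Case-Lower-2 figures: proc0 runs 0 ~ 42 (43 steps), delay 5, 10 processors.
bound_lower_2 = skew_efficiency_bound(43, 5, 10)
```

The bound depends only on the ratio of the accumulated delay to the busy time, which is why long arrays with a large single delay suffer most.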
The regularity of supernode domain
The shape of the supernode polyhedron is determined by applying the transformation B
or E upon the original polyhedron. The E derived from D is necessary to make the whole
of our methodology successful, but also has the side-effect of skewing the polyhedron.
The skewing may make the supernode polyhedron more regular in some situations, such
as Case lower-D-6, Case lower-D-7, ..., and Case SUC-5, but worse in other cases.
When determining the directions of the edges of the partitioning parallelepiped, we do
not take the requirement of reducing the data communication as the only consideration.
For instance, if there are N dependency vectors, they can be taken as the edges of
the partitioning parallelepiped so that the data dependencies in supernode space lie
only along the coordinate axes without any relay. This is the best we can expect for data
communications, but we may skew the computational polyhedron too much. Therefore,
in general, without detailed knowledge of the supernode polyhedron, we prefer an E which
is as close to I as possible.
Compare Case lower-D-1 and Case lower-D-3: they share the same rectangular
computational polyhedron but have different dependencies. In Case lower-D-1, there are some
negative entries in its dependencies, thus E is not an identity matrix. Therefore the
supernode polyhedron is no longer rectangular, which gives a poor fitness of space and time
mapping. In contrast, in Case lower-D-3, there are no negative entries in its dependencies,
so E is an identity matrix, and its supernode polyhedron remains rectangular and
results in a good fitness of space and time mapping. As a result, they show a significant
difference in their efficiencies.
If D consists of only non-negative entries, we do have E = I. Therefore, for some benchmark
problems such as matrix product, where D is non-negative and the original polyhedron is
regular, a good match, and so high efficiency, can be expected. This is also shown by
Case lower-D-3, Case lower-D-4, ..., and Case SUC-2, Case SUC-7, where the shape of the
original polyhedron is regular. It should be noted that for the (N-1)-D array with
SBC mesh a similar situation does not bring about a good result (see Case SBC-2 and
Case SBC-6), because the 45° rotation results in a poor match in the space mapping.
8.2.2 Data Communication
If the data has to be distributed to too many places, communication will be a major
obstacle for actual applications. If the data dependency is uniform and relatively local,
the cost of data communication is acceptable. For the case of UREs, some facts should
be mentioned:
Communication separated from computation
Some parallel computing facilities, such as transputers, provide the ability to do
computation and communication simultaneously, so that the communication of large-scale data
can be "hidden" behind computations. Unfortunately, this good feature cannot be taken
advantage of in our case. Consider $d^t = [\ldots, d^t_i, \ldots] = \Pi D'$, which indicates the times
available to implement the corresponding supernode dependencies. If $d^t_i = 1$, the data
produced at one time step is required by the very next time step. Thus the
communication of the data must take place between the two computations, and so
occurs "explicitly". If $d^t_i = 2$, the data produced at one time step is required by the third
time step, so we can arrange for the data communication to take place simultaneously with
the second computation, effectively "hiding" it behind the second computation. In our
case, it is easy to see that the majority of the $d^t_i$'s are 1. There are $d^t_i$'s larger than 1, but
the amounts of the corresponding outgoing data are relatively small. For example, let
$\Pi = [1, 1]$ and $D' = [\ldots]$. Thus, $d^t = [d^t_0, d^t_1, d^t_2] = [1, 1, 2]$. The outgoing data
associated with them are A2, A1, and A3 in Figure 6.2, respectively. A3 can flow simultaneously
with a computation, but it may be too small for this to be worthwhile, since it is an
intersection of two areas.
Time for communication
The time for communication for a supernode consists of two parts. One is for the overhead
operations, such as setting and invoking port interfaces, which depends on how many ports
are involved and how many data vectors are to be transferred (and is independent
of the size of the supernode). The second part is for the actual data transfer
(which is directly related to the size of the supernode). However, as the size of the
supernode increases, the amount of computation in a supernode increases faster than
the amount of communication; thus, increasing the size of the supernode reduces
the ratio of communication time to computation time.
It can be seen from the results of the experiments that when the size of the computational
problem is large, the effect of the communication becomes insignificant. For example,
for Case-Lower-4, the efficiency is up to 89.3%, and its upper bound limited by time
skew is 94%. Of course, if the size of the computational problem is small, the cost of
communication becomes dominant. Also for Case-Lower-4, the lowest efficiency is
46%, while its upper bound limited by time skew is 87%.
8.2.3 Single Supernode Loops
From eqn (6.13) and Loops 6.4.3, we know that the single supernode loops have the form
    FOR $q'_0$ = max($l_0$, 0) TO min($u_0$, $g_0 - 1$)
        ...
            FOR $q'_{N-1}$ = max($l_{N-1}$, 0) TO min($u_{N-1}$, $g_{N-1} - 1$)
where $l_i$ and $u_i$ are linear functions of $q'_0, \ldots, q'_{i-1}$, and the $g_i$'s define the sizes of the
supernode in each dimension. In some situations, the time for calculating the bounds $l_i$
and $u_i$ becomes too significant to be ignored. In fact, $l_i$ will be calculated $\prod_{j=0}^{i-1} g_j$ times,
and so will $u_i$. For instance, for the 2-D case, if $g_0 = 100$ and $g_1 = 1$, we have to calculate
$l_1$ and $u_1$ 100 times, that is, once for each node. This results in a very
poor efficiency if the time for calculating the bounds is comparable to that of computing
a node. In contrast, if $g_0 = 1$ and $g_1 = 100$, we need to calculate $l_1$ and $u_1$ only once for the
whole supernode, so the time for it will be insignificant.
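The count of bound evaluations can be checked with a small Python sketch for a full supernode, where the max/min clipping is inactive. Both function names are illustrative; the second function simply simulates the 2-D loop nest and counts how often the inner bounds would be recomputed.

```python
from math import prod

def bound_evaluations(g):
    """For supernode sizes g = [g_0, ..., g_{N-1}], return how many times the
    bounds l_i and u_i are evaluated: the product of g_j over j < i."""
    return [prod(g[:i]) for i in range(len(g))]

def count_inner_bound_evaluations_2d(g0, g1):
    """Simulate a full 2-D supernode; return (inner bound evaluations, nodes)."""
    evaluations = 0
    nodes = 0
    for q0 in range(g0):
        evaluations += 1      # l_1 and u_1 are recomputed on every outer iteration
        for q1 in range(g1):
            nodes += 1        # one node computed
    return evaluations, nodes

tall = count_inner_bound_evaluations_2d(100, 1)   # g_0 = 100, g_1 = 1
wide = count_inner_bound_evaluations_2d(1, 100)   # g_0 = 1, g_1 = 100
```

Both shapes contain 100 nodes, but the first recomputes the inner bounds for every node while the second computes them once, which is the reason for making the innermost size as large as possible.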
The problem also exists for any general sequential nested loops and becomes serious for
small computational problems. The supernode can be regarded as a sequential loop
nest of small size. So from this point of view, we prefer large granularity.
If there is some freedom for partitioning, such as in the cases of Lower-D and SUC, the
sizes of the inner loops of the supernode, especially $g_{N-1}$, should be as large as possible.
However, the shape of the original polyhedron also influences the choice of the $g_i$, and the
freedom is sometimes limited. For example, compare Case SUC-5 and Case SUC-6. They have the
same K and D, so they should have the same fitness. But they show a significant difference in
efficiency, because the c of Case SUC-5 is larger than that of Case SUC-6, so the former
has a larger $g_2$. Reviewing the results of the lower-D cases, we find that this strategy works
well, since there is more freedom for partitioning. See also Case lower-D-19 and Case
lower-D-20, and Case lower-D-21 and Case lower-D-22. With regard to the SBC cases, because
there is no freedom for partitioning, the efficiency is affected significantly.
8.2.4 The Effect of Granularity
From the discussions above, we propose that the best strategy is to increase the size of
the supernode. This conflicts with the previous criterion of optimising the size of the
supernode where it is claimed that a small supernode can reduce the overall executing
time. Obviously the previous criterion does not take these practical factors into
consideration. How to take these factors into account remains a difficult problem. The
difficulty lies in the fact that the time for an overhead operation is unpredictable and varies
between different computational problems and different computing facilities. The time ratio
of communication to computation is also an important issue and is also unpredictable,
because it depends on the type of data and the kind of computation.
From our experiments, we find that the practical benefits from increasing the size of
the supernode can overwhelm the theoretical loss. In fact for the Lower-D cases, when
we test with the smallest supernode size, the parallel program is much slower than a
sequential one, because all overhead must be carried out for each small supernode. In
practice, we choose large granularity if there is any freedom left in defining the supernode sizes.
8.3 Summary
A large number of experiments have been carried out to test the performance of the
automatically generated parallel codes. The experimental results give us a general idea of
how the methodology performs. Speed-ups can be seen for most of the results. It should be
pointed out that the real computation in these examples is small, because there is only a
single recurrence and the computation is only a number of simple additions. For systems
of recurrences arising from high-level synthesis procedures, and for cases where complex
computations are involved in each iteration, the computation time of a node, and hence of a
supernode, may rise significantly. This will improve the ratio of the computing time
to the overhead time, giving a higher efficiency. In this sense, we have restricted ourselves
to quite harsh situations.
Generally speaking, of the three sets of experiments, the Lower-D array set is the
best in efficiency, the (N-1)-D array with SUC mesh set is reasonable, and the
(N-1)-D array with SBC mesh set produces poor results.
These experimental results and the analysis may give us some clues about how to
select a processor array for a given application. If the size of the computation is not too
large, a linear array is recommended; otherwise we have to consider 2-D arrays. Among
2-D arrays we may try those with SUC meshes first. The 2-D arrays with SBC meshes
should be taken into consideration if the computational polyhedron is diamond-shaped.
Chapter 9
Conclusion
In this final chapter, the whole design procedure we have developed is summarised.
Furthermore, evaluations and comparisons are presented from both theoretical and
practical viewpoints. These show that our methodology makes notable progress in this
area and opens a prospect for automatic parallelisation.
9.1 Summary Of The Whole Design Procedure
Except for some miscellaneous topics in Chapter 3 which improve on existing partitioning
and mapping techniques, we propose a complete methodology of automatic generation
of parallel programs for regular array designs, based on algorithms expressed as Uniform
Recurrence Equations. We start with partitioning and end by producing parallel codes.
Below we attempt to make a summary of, and evaluate, the work of each chapter.
In Chapter 4, a Positive Expression Basis E is derived from the original dependencies
D. This basis plays a key role in our method: it defines the direction of the partitioning
parallelepiped and guarantees that all data communication flows along the edges of the
partitioning parallelepiped. If we scale the sizes of the parallelepiped properly, so that no
data communication can penetrate the parallelepiped, the locality of communication is
ensured. More importantly, we obtain a unique data dependency relation among the
partitions, i.e., the canonical dependencies, independent of the particular problem. Therefore,
it becomes possible to develop a uniform methodology for the whole mapping procedure.
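As a small illustration of the role of the basis, the sketch below checks, in two
dimensions only, whether every dependency vector can be written as a non-negative
combination of two candidate basis vectors. This is our reading of "positive expression"
for the purpose of the example; the precise conditions on E are those of Chapter 4. The
function names and the sample vectors are hypothetical.

```python
from fractions import Fraction


def coefficients_2d(e1, e2, d):
    """Solve d = a*e1 + b*e2 exactly by Cramer's rule (2-D only)."""
    det = e1[0] * e2[1] - e1[1] * e2[0]
    if det == 0:
        raise ValueError("e1 and e2 are not linearly independent")
    a = Fraction(d[0] * e2[1] - d[1] * e2[0], det)
    b = Fraction(e1[0] * d[1] - e1[1] * d[0], det)
    return a, b


def is_positive_expression_basis(e1, e2, dependencies):
    """True if every dependency has non-negative coefficients in (e1, e2)."""
    return all(
        a >= 0 and b >= 0
        for a, b in (coefficients_2d(e1, e2, d) for d in dependencies)
    )


deps = [(1, 0), (1, -1)]  # hypothetical dependency vectors
print(is_positive_expression_basis((1, 0), (0, 1), deps))   # unit vectors fail
print(is_positive_expression_basis((1, -1), (0, 1), deps))  # a skewed basis works
```

With the unit vectors, the dependency (1, -1) needs a negative coefficient, so data would
flow against an edge direction; with the skewed pair both dependencies lie inside the cone
spanned by the basis.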
Based on the two common patterns of interconnection primitives, SUC and SBC, we
propose models of the space-time mapping matrix T, consisting of S and t, which
guarantee that the data communication of the canonical dependencies can be implemented
using the two patterns of interconnection primitives. The model of S must be
selected before deciding the sizes of the partitioning parallelepiped, in order to achieve a
fixed-size partitioning.
Then, we re-scale E to B, which is the basis of the quasi-supernode space and also
defines the partitioning parallelepiped, such that the quasi-supernode domain can be mapped
within a given array. The re-scaling procedure is straightforward for SUC, but is much
more difficult for SBC, since S_SBC has more than one non-zero entry in each row. For
SBC, an LSGP partitioning is applied to improve the efficiency. This does not change the
nature of the procedure at this stage.
Chapter 5 extends the idea of Chapter 4 to the case of lower-dimensional arrays.
Here, however, the single timing function t is replaced by a map of the original domain onto
a K-D virtual time domain. After this transformation we derive a valid minimum vector
which re-projects the K-D virtual time domain onto a 1-D domain, subject to three
conditions: no more than one node maps onto any time point, the execution sequence is
preserved, and the 1-D domain is as compact as possible. Fortunately, we find that
the search time is independent of the size of the problem, which promises a fundamental
advantage over other methods with respect to computational complexity.
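The conditions on the re-projection can be made concrete with a brute-force check on a
small box-shaped K-D time domain. The sketch is only illustrative: it enumerates every
time point, so its cost grows with the problem size, whereas the method of Chapter 5
avoids such enumeration. We assume here that the required execution sequence is the
lexicographic order of the K-D time loops; the function names and the sample extents are
hypothetical.

```python
from itertools import product


def linearise(time_point, vector):
    """Project a K-D time point onto the 1-D time axis."""
    return sum(t * v for t, v in zip(time_point, vector))


def is_valid_projection(extents, vector):
    """True if 1-D times strictly increase along the lexicographic order.

    Strict increase implies both that no two points share a time step
    and that the original execution order is preserved.
    """
    points = list(product(*(range(n) for n in extents)))  # lexicographic order
    times = [linearise(p, vector) for p in points]
    return all(a < b for a, b in zip(times, times[1:]))


def makespan(extents, vector):
    """Number of 1-D time steps spanned by the box under the projection."""
    last = tuple(n - 1 for n in extents)
    return linearise(last, vector) + 1


extents = (3, 4)  # hypothetical 2-D virtual time domain
for vector in ((3, 1), (4, 1), (5, 1)):
    print(vector, is_valid_projection(extents, vector), makespan(extents, vector))
```

For this box, (3, 1) causes two time points to collide, (4, 1) is valid and leaves no
idle steps, and (5, 1) is valid but less compact.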
In Chapters 6 and 7 we address the challenge of generating parallel algorithms and
codes. Since the supernode domain provides the elements of our parallel programs, the
features of the supernode domain are explored first, such as the boundary of the supernode
domain and the boundary of a single supernode. The supernode domain is then mapped
to the processor-time domain, which corresponds to the processor array and the execution
sequence within a processor. For the lower-dimensional case, an algorithm structure is devised
such that the K-D time loops can be executed as a 1-D time sequence. For the LSGP case,
we also propose an algorithm structure which carries out the LSGP partitioning.
Based on the two algorithm structures, we present two parallel code templates for
actual code generation. Other changeable information, such as the bounds of loops, is
collected in a header file, which also contains some minor loop structures for the actual
computations and communications. The data communication problem is given special attention.
[Flow chart with four stages. Design Parameters: the computational graph and the
processor array. Partitioning and Mapping: the two models of space projection (S_SUC and
S_SBC) with their timing vectors and, for a Lower-D array with K > 1, the valid minimum
projection. Algorithm Transformation: the LSGP algorithm structure, the outgoing data
packets and the Lower-D algorithm structure. Code Generation: the LSGP code template,
the Lower-D code template and the header (.h) file, leading to the parallel codes.]
Figure 9.1: Chart of the Whole Procedure.
In Chapter 6, we work out data flow blocks, which confine the outgoing data and describe
their flow directions. In Chapter 7, this data flow is embodied in in/out packets, which
describe the sources or destinations of the data flow and their sizes. Combining the
template and the header file, we produce parallel code which runs correctly on a parallel
computing platform built with existing environment tools. Figure 9.1 gives
a general idea of the whole methodology.
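The division of labour between a fixed template and problem-specific parameters can be
sketched as follows. This is an illustration only and not the generator described in
Chapter 7: it uses plain textual substitution, whereas the actual tool collects the
changeable information in a header file. The loop skeleton, the names receive_packets,
compute_supernode and send_packets, and the bound values are all hypothetical.

```python
from string import Template

# Fixed part: the structure of the time loop executed by each processor.
TEMPLATE = Template(
    "for (t = $t_lo; t <= $t_hi; t++) {\n"
    "    receive_packets(t);    /* incoming data flow */\n"
    "    compute_supernode(t);  /* node computations */\n"
    "    send_packets(t);       /* outgoing data flow */\n"
    "}\n"
)


def generate(header):
    """Fill the template with the problem-specific loop bounds."""
    return TEMPLATE.substitute(header)


# Changeable part: bounds that depend on the problem and the array.
print(generate({"t_lo": 0, "t_hi": 15}))
```

Keeping the structure fixed and confining all problem-dependent values to one place is
what makes fully automatic generation practical.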
9.2 Evaluations and Comparisons
Our work consists of both theory and practice. On the theoretical side, we stand on the
shoulders of previous researchers, so we may see farther and better. On the practical
side, our work is relatively original, since most other researchers have not yet reached
this stage.

Method       Hardware      Computational   Mapping onto given    Implementing data
                           efficiency      regular array         communication
LPGS         more needed   may be low      guaranteed            not guaranteed
LSGP         no more       high            difficult to search   not guaranteed
our method   no more       high            guaranteed            guaranteed

Table 9.1: Comparison of three methods in theory. Here, "low" is due to the invalid holes
in the processor-time domain; "not guaranteed" means that the method does not solve
the problem for all cases.
9.2.1 Theory
The main theoretical contributions are listed:
Mapping onto fixed-size Array
The canonical dependencies are the main highlight. They make it possible to build
a general model of the various URE problems, so that the major design process is
independent of individual applications. The basic models of the space mapping matrix and
timing vector (or matrix) give the foundation of the universal design process. Because
these models are established on the assumption of the simplest processor arrays, they
guarantee the basic feasibility of the methodology in any reasonable situation.
Therefore, in this sense, it can be claimed that this methodology is universally applicable
and feasible.
The method itself belongs to the class of supernode partitioning methods. Previous
supernode partitioning suffers from the drawback that it is difficult to map onto a fixed-size
array. Our method copes with this problem properly. In Chapter 2, we discussed the
major advantages and disadvantages of the best-known partitioning methods, such
as LPGS and LSGP. Table 9.1 compares the original LPGS and LSGP
methods with ours.
Lower-dimensional mapping
The extension of this methodology to lower-dimensional arrays is another great challenge,
and is far from trivial. A complex procedure is presented to derive the valid minimum
projection vector. This method has significant advantages over previous work.
Compared to [104], the efficiency is improved greatly. Compared to [94] and [34], firstly,
our method is based on synthesis computations rather than searching, and these are
independent of problem size, so it has a fundamental merit in design; secondly, our method
can still optimise when the polyhedron is irregular; finally, by means of a non-singular
mapping matrix, we have found a simple way to map from the M-D space domain and 1-D time
domain to the original N-D computational domain, which is the key to a usable practical
method.
9.2.2 Practice
The theoretical work is only half of our work. A more important and potentially more
difficult part is the implementation of the methodology to generate actual parallel codes.
The main contributions in this respect can be listed briefly:
1 Deriving the boundaries of the computation in the processor-time domain. The
difficulty lies in determining the confining supernode polyhedron.
2 Constructing the structure of the parallel algorithms for the two cases: one for the
lower-dimensional array and the other for LSGP partitioning.
3 Establishing the mechanism of data communication from the data dependencies. To
implement our methodology strictly, the data relay is also implemented by ourselves.
4 Suggesting a feasible program structure for the parallel codes, which is easy to generate
automatically.
These results pave the way to real applications. So far we have not found similar work
at the same level of detail in the literature.
From a practical viewpoint, we find that supernode partitioning is necessary to
carry out computation and communication, even though it adds major complexity
to the design procedure. In this respect, the LSGP method suffers further setbacks.
Note that the parallel algorithm of the pure LSGP method, Algorithm D, does not make
use of supernode partitioning. Therefore, according to the analysis of Section 6.5, in order to
decide which node to compute we have to carry out Algorithm 6.5.1 once for each point in
the processor-time domain to find the corresponding node. This is far too expensive in time. In
addition, it is also very time-consuming to invoke the communication mechanism once to
transfer the data of a single node. From the viewpoint of data transfer, the LPGS
method is somewhat better, in that the data produced in one time layer can be transferred
together as a packet. But its data flow packets cannot be large, since without
supernode partitioning the time layers are usually very "thin". In contrast, our method, which
uses supernode partitioning, has a significant advantage.
9.2.3 Results
As the result of the theoretical and practical work, a software tool has been developed
which accepts the description parameters for the computation problem and the given
regular processor array, and then produces parallel codes.
Many experimental results are provided to check and show the performance of the
methodology. They show significant speedup in many cases. The linear array
shows the better efficiency, in some cases as high as 90%. The 2-D array with SUC mesh works
well, too. The 2-D array with SBC mesh is disappointing in efficiency, though it
also shows some speedup. Even so, it remains an alternative choice for some applications.
Unfortunately, there are no experimental results presented in the literature that would allow us
to make direct comparisons with other synthesis methods.
9.3 Closing Remarks
The achievements presented in this thesis can be regarded as significant progress in
the automatic generation of parallel codes and regular (systolic) array design. This
methodology is integrated and self-contained, and may be the only practical working
package in this area.
Obviously, automatic parallel code generation shows great advantages over
manual coding. The time for generating a parallel program can be reduced dramatically,
from weeks (even months) to a few minutes, and the correctness is guaranteed. For the
fundamental class of computational problems expressed as Uniform Recurrence Equations, it
opens a real prospect of applying parallel computing.
However, it is far from true to claim that our work closes the research in this area.
One obvious problem is to explore the possibility of improving efficiency further. In
addition, although our method is developed for the URE problem and for the "software
systolic array", it is possible to extend the methodology to more situations and applications
where non-URE problems can be transformed to URE or similar forms. As regards the general
case of non-URE problems, there is much work waiting for future researchers.
Bibliography
[1] M.Ander and F.Berman "Assessing Partitioning, Scheduling, Storage Trade-offs for
Regular Iterative Algorithms", J VLSI Integration, V.15, N.1, p25-50, 1993
[2] W. L. Athas and C. L. Seitz, "Multicomputers: Message-passing concurrent
computer", IEEE Comput. Mag., pp. 9-24, Aug. 1988
[3] V. Balasubramanian and P. Banerjee, "Compiler-assisted synthesis of algorithm-based
checking in multiprocessors" IEEE Trans. Computer Vol.39, No.4 Apr.1990 pp. 436-446.
[4] U. Banerjee, "Data dependence in ordinary programs" Tech. Rep. 76-837, Univ. of
Illinois Urbana-Champaign, Nov 1976
[5] U. Banerjee, "Dependence analysis for supercomputing" Boston, MA: Kluwer
Academic, 1988
[6] U. Banerjee, "A theory of loop permutations" in Proc. 2nd Workshop Languages
Compilers Parallel Computing, Aug. 1989
[7] U. Banerjee, "Unimodular transformations of double loops", in Proc. 3rd Workshop
Languages Compilers Parallel Computing, Aug, 1989
[8] F. Berman, "On mapping parallel algorithms into parallel architectures", J. Parallel
and Distributed computing, 4, 1987, pp439-458
[9] G.H. Bradley: Algorithm and bound for the greatest common divisor of n integers.
Communications of ACM, Vol.13, No.7, 1970, pp433-436
[10] M.O'Boyle and G.A.Hedayat, "Load Balance of Parallel affine Loops By Unimodular
Transformation", Department of Computer science, University of Manchester.
[11] J.C.Bu, E.F.Deprettere and P.Dewilde, "A Design Methodology for Fixed-Size Sys-
tolic Arrays", Pro., International Conference on Application Specific Array, IEEE
Society, 1990, pp591-602
[12] C. Callahan and K. Kennedy, "Compiling programs for distributed-memory multi-
processors", in Proc. 1988 Workshop on Programming Languages and Compilers for
Parallel Computing, Aug. 1988
[13] Sharat Chandran and Larry S. Davis "Parallel vision algorithms: an approach"
Parallel Processing for Scientific Computing, SIAM, pp235-254
[14] V. Chaudhary and J. K. Aggarwal "A Generalized Scheme for Mapping Parallel
Algorithms", IEEE Trans. Parallel and Distributed System Vol.4, No.3, 1991 pp.328-
346
[15] Zen Chen and Chih-Chi Chang, "Iteration-level execution of DO loops with a reduced
set of dependence relations", J. Parallel and Distributed Computing, 4, 1987, pp488-
504
[16] M. Chen, "A design methodology for synthesizing parallel algorithms and architec-
ture", J. Parallel Distributed Comput., Dec. 1986. pp. 461-491
[17] D.S.Chand and S.S.Kapur, "An Algorithm for Convex Polytopes", JACM, Vol.17(1)
pp78-86, 1970
[18] Xian Chen and G.M.Megson, "A General Methodology of Partitioning and Mapping
for Given Regular Array", IEEE, Trans. Parallel and Distributed System Vol.6, No.10,
1991 pp.1100-1107
[19] Xian Chen and G.M.Megson, "Optimal Mapping onto Lower-dimensional Arrays",
under the second review of IEEE, 1994
[20] Xian Chen and G.M.Megson, "Automatic Parallel Code Generation for Given Array
(Part 1: Theory)", Technical Report No.502, University of Newcastle, Feb, 1995
[21] Xian Chen and G.M.Megson, "Automatic Parallel Code Generation for Given Array
(Part 2: Practice and Results)", Technical Report No.506, University of Newcastle,
Feb, 1995
[22] R. Cytron, "Compiler-time scheduling and optimization for asynchronous machines"
Ph.D dissertation, Uni. of Illinois at Urbana-Champaign. 1984
[23] A. Darte and J. M. Delosme, "Partitioning for array processors" Technical Report,
90-23. LIP, ENS-Lyon Lyon, France, Oct. 1990
[24] A. Darte, "Regular Partitioning for Synthesizing Fixed-Size Systolic Array", J. of
VLSI Integration, 12, 1991, pp293-304
[25] J.C.Bu, E.F.Deprettere and P.Dewilde, "A Design Methodology for Fixed-Size Sys-
tolic Arrays", Pro., International Conference on Application Specific Array, IEEE
Society, 1990, pp591-602
[26] R.F.DeMara and D.I.Moldovan "The SNAP-1 Parallel AI Prototype", IEEE Trans.
Parallel and Distributed System Vol.4, No.8, 1991 pp. 846-854
[27] M. L.Dowling, "Optimal code parallelization using unimodular transformations",
Parallel Computing, 16 (1990) pp157-171
[28] T.Y.Feng "A summary of interconnection networks" IEEE Computer 14 (12) pp12-17,
1981
[29] M.J.Flynn "Some computer organization and their effectiveness" IEEE Trans. Com-
put. C-21, pp948-960, 1972
[30] J.A.B.Fortes, K.S.Fu and B.W.Wah, "Systematic Design Approaches for Algorith-
mically Specified Arrays", In Computer Architecture Concepts and Systems, eds Mi-
lutinovic, 1988, pp454-494
[31] J.A.B.Fortes and D. I. Moldovan, "Parallelism detection and transformation tech-
niques useful for VLSI algorithms", J. Parallel Distributed Comput., vol. 2, pp. 277-
301, 1985
[32] J. A. B. Fortes "Algorithm transformations for parallel processing and VLSI
architecture design", Ph.D dissertation Univ. Southern California, CA, Dec. 1983
[33] P.Gachet, B.Joinnault and P.Quinton, "Synthesizing Systolic Arrays Using
DIASTOL", In Moore, McCabe, Urquhart, Int Workshop on Systolic Arrays, Adam Hilger,
1986, pp25-36
[34] K.N.Ganapathy and B.W.Wah "Optimal Synthesis of Algorithm-Specific
Lower-Dimensional Processor Array", Technical Report, University of Illinois at
Urbana-Champaign, CRHC-93-23
[35] D. C. Grunwald and D. A. Reed, "Networks for parallel processors: Measurement
and prognostication", in Proc. Third Conf. Hypercube Concurrence Comput. Appl.,
Jan. 1988 pp. 238-253
[36] D.A.P.Haie, "Multiprocessors: Discussions of Some Theoretical and Practical
Problems", Ph.D. Dissertation, Univ. of Illinois at Urbana-Champaign, Urbana, IL, 1979.
Rep. UIUCDCS-R-79-99.
[37] G.R.Heijer and E.F.Deprettere "From Algorithm to Parallel Implementation" in 1992
IEEE Workshop on VLSI Signal processing, Napa Valley, p~42-354
[38] B.J.Hellant and R.J.Krueger, "A parallel algorithm for rapid computation of tran-
sient fields", Parallel Processing for Scientific Computing, SIAM, pp270-275.
[39] R.W.Hockney and C.R.Jesshope, "Parallel computer 2"
[40] S.Horiik, S.Nishida and T.Sakaguchi "A design method of systolic array under the
constraint of the number of the processors", Int. Conf ASSP, 1987.
[41] H.M.Hsu, J.K.Peir and D.B.Haidvogel, "Performance of an ocean circulation model
on LCAP" Parallel Processing for Scientific Computing, SIAM, p285
[42] T.C.Hu, Combinatorial Algorithms, Ch3, Addison-Wesley, 1982
[43] K.Hwang and Y.H.Chung, "Partitioning algorithms and VLSI structures for large-
scale matrix computations," in Proc. 5th Symp. Comput. Arithmetic, May 1981,
pp222-232
[44] O. H. Ibarra and S. M. Sohn," On Mapping systolic algorithms onto the Hypercube"
1988 IEEE Trans. Parallel and Distributed System Vol.1, No.1 Jan.1990 pp. 48-63
[45] INMOS, Transputer Reference Manual, Prentice Hall, 1988.
[46] F.Irigoin and R.Triolet "Supernode Partitioning", Pro. of 15th Annual ACM
SIGACT-SIGPLAN Symposium on Principles of Programming Languages, 1988,
pp319-329
[47] F.Irigoin and R.Triolet "Dependence approximation and global parallel code
generation for nested loops", in Parallel Distributed Algorithms, 1989
[48] K.Jainandunsing. "Parallel algorithms for solving systems of linear equations and their
mapping on systolic arrays." PhD thesis, Delft University of Technology, Delft,
Netherlands 1989
[49] Ravindran Kannan and Achim Bachem, "Polynomial algorithms for computing the
Smith and Hermite normal forms of an integer matrix", SIAM J. Comput., Vol.8,
No.4, 1979, pp499-507
[50] R.M.Karp, R.E.Miller and S. Winograd, "The organization of computations for
uniform recurrence equations", J. ACM, Vol.14, No.3, 1967, pp563-590
[51] M.T.O'Keefe and J.A.B.Fortes, "Bit Level Processor Array: Current Architectures
and a Design and Programming Tool", In Proc. 1988 Int. Symp. Circuit Syst., 1988,
pp2751-2755
[52] Chung-Ta King, Wen-Hwa Chou and Lionel M.Ni "Pipelined Data-Parallel Algo-
rithms. Part II - Design", IEEE, Trans. Parallel and Distributed System, VOL.1,
NO.4, 1990, pp486-499
[53] Kung, H.T. and C.E.Leiserson, " Systolic Arrays for VLSI" , Proceedings of the Sparse
Matrix Symposium (SIAM), 1978.
[54] S.Y. Kung. VLSI Array Processors. Prentice-Hall International Editions, 1988.
[55] S.Y. Kung. "Wavefront array processor: Language, architecture, and applications",
IEEE Trans. Comput., vol. C-31, pp1054-1066, 1982
[56] L.Lamport, "The Parallel Execution of DO Loops", Commun. ACM. 1974, pp83-93
[57] C. Lengauer, M. Barnett, and D. G. Hudson, "Towards systolizing compilation",
Distributed Computing, 5 (1991) pp. 7-24
[58] Pizong Lee and Z.M.Kedem "Synthesizing Linear Array Algorithms from nested For
Loop Algorithms", IEEE, Trans. Computer, VOL.c-37, NO.12, 1988, pp1578-1598
[59] Pizong Lee and Z.M.Kedem "Mapping Nested Loop Algorithms into
Multidimensional Systolic Arrays", IEEE, Trans. Parallel and Distributed System, VOL.1, NO.1,
1990, pp64-76
[60] Wei-ming Lin and V.K.Prasanna Kumar, "A note on the linear Transformation
method for systolic array design", IEEE Trans. Computer, Vol.39, No.3, 1990, pp393-399
[61] G.Li and B. W. Wah, "The design of optimal systolic arrays", IEEE Trans. Computer,
pp. 66-77, Jan 1985
[62] S. Manoharan and P. Thanisch, "Assigning dependency graphs onto processor
networks", Parallel Computing, 17 (1991) pp. 63-73
[63] MEiKO Limited, "C Reference Manual for the Computing Surface"
[64] MEiKO Limited, "CS Tools Reference Manual"
[65] G.M.Megson and E.O.Eyoh, "Implementation and Evaluation of Parallel N-D Convex
Hull Algorithms", To appear on PARCO'93, Grenoble, France
[66] G.M.Megson and Xian Chen, "A Survey and Analysis of Partitioning and Mapping
for Regular Arrays", submitted for publication, also Technical Report No.415, Uni-
versity of Newcastle, 1993
[67] G.M.Megson and Xian Chen, "Partitioning and Mapping for Lower Dimensional
Given Regular Array" , in 1993 IEEE Euromicro Workshop on Parallel and Distributed
Processing, pp149-155.
[68] G.M.Megson and Xian Chen, "Systematic Synthesis of Knapsack Problems onto
Fixed Sized Arrays with Lower Dimensions" , submitted for publication, also Technical
Report No.486, University of Newcastle, 1994
[69] G.M.Megson and Xian Chen, "A synthesis method of LSGP partitioning for
given-shape regular arrays", Proc. 9th Int Parallel Processing Symp., IEEE Computer
Society Press, 1995, pp234-238
[70] Megson G.M., "Automating Systolic Algorithm Design I : (basic synthesis
techniques)", Technical Report Series, No.364, University of Newcastle upon Tyne, 1991.
[71] G.M.Megson "The Derivation of Uniform Recurrence Equations for the Knapsack
Problem", J. of Parallel Algorithms and Applications, Vol1, 1993, pp127-140.
[72] G.M.Megson "Mapping a Class of Run-Time Dependencies onto Regular Arrays",
Proc. 7th Int Parallel Processing Symp., IEEE Computer Society Press, 1993, pp97-104.
[73] D.Moldovan and A.B.Fortes, "Partitioning and Mapping Algorithms into Fixed Size
Systolic Arrays", IEEE, Trans. Computer VOL.c-35, NO.1, 1986, pp1-12
[74] D. Moldovan: ADVIS: a software package for the design of systolic arrays. IEEE
Trans. Computer Vol.36, No.1 Jan.1987 pp. 33-40
[75] L. Mordell, Diophantine equations. New York, Academic. 1969
[76] H.Nelis and E.Deprettere "Automatic Design and Partitioning of Systolic/wavefront
Arrays for VLSI". Circuits, System, Signal Processing 7, 1988
[77] L. M. Ni and C. T. King, "On partitioning and mapping for hypercube computing",
Int. J. Parallel Programming, Vol, 17, no. 6, pp. 475-495, Dec. 1988
[78] D.A.Padua, "Multiprocessors: Discussion of Theoretical and Practical Problems",
Ph.D., Dissertation., Rep.UIUCDCS-R-79-990, Univ. of Illinois at Urbana-Cham-
paign. Urbana, 1979.
[79] A.Pashapour, G.A.Pope, K.Sepehrnoori and G.Shiles, "Application of vectoriza-
tion and microtasking for reservoir simulation" Parallel Supercomputing: Meth-
ods,Algorithms and Applications, pp267-281, 1989
[80] D. A. Padua and M. J. Wolfe, "Advanced compiler optimizations for
supercomputers", Commun. ACM, pp. 1184-1201, Dec. 1986
[81] J.K.Peir and R.Cytron, "Minimum Distance: A Method for Partitioning Recurrences
for Multiprocessors", IEEE, Trans. Computer, VOL.38, NO.8, 1989, pp1203-1211
[82] C. D. Polychronopoulos, D. J. Kuck and D. A. Padua, "Utilizing multidimensional
loop parallelism on large-scale parallel processor system", IEEE Trans. Computer pp.
1285-1296, Sept. 1989
[83] Quinton P. and Van Dongen V., "The Mapping of Linear Recurrence Equations on
Regular Arrays". J VLSI Signal Processing, 1, pp95-113, 1989
[84] Quinton P., "Automatic Synthesis of Systolic Arrays from Uniform Recurrent
Equation". Pro. 11th Symp on Computer Architecture, IEEE Computer Society Press.
pp208-214, 1984
[85] S. V. Rajopadhye, "Regular iterative algorithms and their implementations on
processor array", Ph.D dissertation. Stanford Univ., Stanford, CA, Oct. 1985
[86] S. V. Rajopadhye, "Sizing systolic array with control signals from recurrence
equations", Distributed Computing, 3 (1989) pp. 88-105
[87] S.V.Rajopadhye and R.M.Fujimoto, "Synthesizing Systolic Arrays from Recurrence
Equations", Parallel Computing, Vol.14, 1990, pp163-189
[88] S. V. Rajopadhye, "Automating the design of systolic arrays", Integration, the VLSI
Journal 9, 1990, pp225-242
[89] L. Rapanotti, "On the Synthesis of Integral and Dynamic Recurrences" Newcastle
University, Nov 1995 (submitted)
[90] A. Schrijver, Theory of Linear and Integer Programming, New York: Wiley, 1986
[91] Jang-Ping Shu and Chih-Yung Chang, "Synthesizing nested loop algorithms
using nonlinear transformation method." IEEE Trans. Parallel and Distributed System
Vol.2, No.3 Jul.1991 pp. 304-317
[92] Weijia Shang and Jose A. B. Fortes: Time optimal linear schedules for algorithms
with uniform dependencies. IEEE Trans. Computer, Vol.40, No.6 Jun.1991 pp. 723-742
[93] W.Shang and J.A.B.Fortes, "Independent Partitioning of Algorithms with Uniform
Dependencies", IEEE, Trans. Computer, VOL.41, NO.2, 1992, pp190-206
[94] Weijia Shang and Jose A. B. Fortes, "On Time Mapping of Uniform Dependence
Algorithms into Lower Dimensional Processor Arrays", IEEE, Trans. Parallel and
Distributed System, VOL.3, NO.3, 1992, pp350-363
[95] J.E.Shore "Second thoughts on parallel processing" Comput.Elect.Eng. 1, pp95-109,
1973
[96] J.P.Sheu and T.H.Tai, "Partitioning and Mapping Nested Loops on Multiprocessor
Systems", IEEE, Trans. Parallel and Distributed System, VOL.2, NO.4, 1991, pp430-
439
[97] M.A.Stoker, "The Exploitation of Parallelism on Shared Memory Multiprocessors" ,
Ph.D Thesis, University of Newcastle upon Tyne, 1990.
[98] G. Swart "Finding the Convex Hull Facet by Facet", J.Algorithms, Vol.6, pp17-48,
1985.
[99] J. Teich and L. Thiele "Partitioning of Processor Arrays: a Piecewise Regular
Approach" Integration, VLSI Journal 14, 1993, pp297-332
[100] Ping-Sheng Tseng, "A systolic array parallelizing compiler", J. Parallel and Dis-
tributed computing, 9, 1990, pp116-127
[101] T.H.Tzen and L.M.Ni "Dependence Uniformization: A Loop Parallelization Tech-
nique", IEEE Trans. Parallel and Distributed System Vol.4, No.5, 1991 pp.547-558
[102] S.S.Vincentelli, "Parallel Processing for simulation of VLSI circuits", Parallel Pro-
cessing and Application pp3-19 1988
[103] M. Wolfe and U. Banerjee, "Data dependence and its application to parallel
processing", Int. J. Parallel Programming, Vol. 16, no. 2, pp 137-178, 1987
[104] Y.Wong and J.M.Delosme, "Optimal Systolic Implementation of N-dimensional Re-
currences", IEEE Proc. ICCD, 1985 pp.618-621
[105] M.E.Wolf, "Improving parallelism and data locality in nested loops", Ph.D disser-
tation, Stanford Univ. 1991
[106] M.E.Wolf and M.S.Lam, "A data locality optimization algorithm", in Proc. ACM
SIGPLAN '91 Conf. Programming Language Design Implementation, Jun. 1991, pp.
30-44
[107] M.E.Wolf and M.S.Lam, "A Loop Transformation Theory and an Algorithm to Max-
imize Parallelism", IEEE, Trans. Parallel and Distributed System, VOL.2, NO.4, 1991,
pp452-471
[108] M.Wolfe "More Iteration Space Tiling", Proc., Supercomputer' 89
[109] M.Wolfe "Optimizing Supercompilers for Supercomputers", Cambridge, MA: MIT
Press. 1989.
[110] Zhenhui Yang, Weijia Shang and J.A.B.Fortes, "Conflict-free Scheduling of Nested
Loop Algorithms on Lower Dimensional Array Processor Arrays" , Proc. of Int. Parallel
Processing Symp., 1992, pp156-164
[111] Xiaoxiong Zhong and S.Rajopadhye, "Quasi-Linear Allocation Function for Efficient
Array Design", J. of VLSI Signal Processing, 4, 1992, pp97-110
Appendix A
Scaling SBC
If $w_i^u$ and $w_i^l$ were known for all $i$, finding the solutions of the
system of equations (4.43) would be a simple problem. But because the mapping depends on the
scaled space mapping $S^F$, i.e., $w_i^u$ and $w_i^l$ depend on $f_i$ and $f_{i+1}$, we fall into a
mutually recursive cycle. Eqn (4.34) is no longer true, because there are two variables in an equation.
The scaling of SBC can be divided into two stages:
(1). To break the cycle, begin with an initial point of $f_k$. At the initial point, it is
possible to determine $w_i^u$ and $w_i^l$, $\forall i$, $0 \le i < M$.
This is a complicated procedure. $w_i^u$ and $w_i^l$ depend not only on $f_k$ but also on $f_{i+j}$,
where $j = 0, 1$. And there is another mutually recursive cycle between $w_i^u$ and $w_i^l$ and
$f_{i+j}$. Break this cycle by beginning with an initial point of $f_{i+j}$ to establish a relation
between $w_i^u$ and $w_i^l$ and $f_{i+j}$.
In this way, a set of initial $w_i^u$ and $w_i^l$ is found corresponding to the initial $f_k$.
(2). Derive functions for FG, with these initial $w_i^u$ and $w_i^l$. Then, increase $f_k$ from
the initial point. At some points, $w_i^u$ and $w_i^l$ are replaced by other $w_i$'s. Derive the next
piecewise functions for FG.
A.1 Initial $w_i^u$'s and $w_i^l$'s
As mentioned, when $f_k$ is taken as an argument, let $f_k = 0$ be the initial point.
Immediately the (k-1)-th and the k-th equations of eqn (4.42) and eqn (4.43) become
single-variable equations, that is
(A.I)
Therefore, the method for SUC can be adopted here to determine $f_{k-1}$ and $f_{k+1}$, when $f_k$
= 0, i.e.,
fk-l
maXo~j<nv (-Wk-l,j) - miIlo~j<nv (-Wk-l,j)
lk+! (A.2)
The following problem is to determine $f_{k+2}, f_{k+3}, \ldots$ and $f_{k-2}, f_{k-3}, \ldots$ Unfortunately,
that is not an easy job. $f_{k+2}$ should be determined by $u^u_{k+1} = -f_{k+1} w^u_{k+1,0} + f_{k+2} w^u_{k+1,1}$
and $u^l_{k+1} = -f_{k+1} w^l_{k+1,0} + f_{k+2} w^l_{k+1,1}$, but $w^u_{k+1}$ and $w^l_{k+1}$ depend on $f_{k+2}$. So we need to
initialise $f_{k+2}$ (from $f_{k+2} = 0$), and see what happens as $f_{k+2}$ increases.
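Suppose $f_{k+2} = f^i_{k+2}$, where $f^i_{k+2}$ is an initial $f_{k+2}$. Then it is easy to determine
$w^u_{k+1}$ and $w^l_{k+1}$ (footnote 1):
$$w^u_{k+1} = \{ w_j : s^F_{k+1} w_j = \max_{0 \le l < n_v} (s^F_{k+1} w_l),\ 0 \le j < n_v \}$$
$$w^l_{k+1} = \{ w_j : s^F_{k+1} w_j = \min_{0 \le l < n_v} (s^F_{k+1} w_l),\ 0 \le j < n_v \} \qquad \text{(A.3)}$$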
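The selection of the two extreme vertices, including the tie-breaking rule of footnote 1,
can be sketched as follows. This is an illustration under our own assumptions rather
than the implementation used in the software tool: vertices are plain tuples, s_row
stands for the relevant row of the scaled space mapping, tie_index is the zero-based
position of the (k+2)-th element, and the sample data are hypothetical.

```python
def project(s_row, w):
    """Inner product of one row of the space mapping with a vertex."""
    return sum(a * b for a, b in zip(s_row, w))


def extreme_vertices(s_row, vertices, tie_index):
    """Return (upper, lower): the vertices with maximal and minimal projection.

    Ties are broken as in footnote 1: the upper vertex takes the largest
    element at tie_index, the lower vertex the smallest.
    """
    values = [project(s_row, w) for w in vertices]
    hi, lo = max(values), min(values)
    upper = max((w for w, v in zip(vertices, values) if v == hi),
                key=lambda w: w[tie_index])
    lower = min((w for w, v in zip(vertices, values) if v == lo),
                key=lambda w: w[tie_index])
    return upper, lower


# Hypothetical vertices of a small polyhedron.
vertices = [(0, 0, 0), (2, 0, 1), (2, 0, 3), (0, 1, 0)]
print(extreme_vertices((1, 0, 0), vertices, tie_index=2))
```

Here two vertices share the maximal projection, and the one with the larger third
element is chosen as the upper vertex.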
However, as $f_{k+2}$ increases, the $u_{k+1,j}$ projected by some vertices may increase faster
than that of $w^u_{k+1}$, or decrease faster than that of $w^l_{k+1}$, so they should become
the new $u^u_{k+1}$ or $u^l_{k+1}$. Therefore, it is necessary to know the turning point $f^t_{k+2}$ where
such an "overstep" takes place. $\forall j$, $0 \le j < n_v$ (footnote 2), we have
or
and similarly
(A.4)
1 There is the possibility that more than one vertex satisfies
$s^F_{k+1} w_j = \max_{0 \le l < n_v} (s^F_{k+1} w_l)$. Among them, the vertex which diverges fastest
as $f_{k+2}$ increases will be $w^u_{k+1}$, i.e., the one whose (k+2)-th element is the largest.
The same argument holds for selecting $w^l_{k+1}$, but that vertex should have the
smallest (k+2)-th element.
2 In principle, any vertex should be assumed to have the possibility of overstepping $w^u_{k+1}$ or $w^l_{k+1}$.
However, in practice, we can judge some of them to have little possibility or no possibility at all.
For example, the vertices which have been overstepped can be removed from the list for the following
operations.
-206
where $f^u_{k+2,j}$ (or $f^l_{k+2,j}$) is the point where $w_j$ oversteps the present $w^u_{k+1}$ (or $w^l_{k+1}$). Then
the turning point is
(A.5)
that is, $f^t_{k+2}$ is the $f_{k+2}$ where the first overstep takes place.
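As an illustration only, the selection of the turning point can be sketched in C. The sketch assumes that the per-vertex overstep points have already been evaluated by eqn (A.4) and stored in an array, and that the turning point is the smallest of them lying beyond the current initial value, which is one way of reading "where the first overstep takes place". The function name `first_overstep` and its parameters are hypothetical and are not part of the software described in this thesis.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Return the smallest candidate strictly greater than `current`,
 * or INFINITY when no candidate lies beyond it (no overstep occurs).
 * `candidates` is a hypothetical array holding one overstep point
 * per vertex; candidates at or below `current` are treated as invalid. */
double first_overstep(const double *candidates, size_t n, double current)
{
    double best = INFINITY;
    assert(candidates != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (candidates[i] > current && candidates[i] < best) {
            best = candidates[i];
        }
    }
    return best;
}
```

Returning INFINITY for "no overstep" keeps the caller's comparison against the accurate scaling uniform; a separate flag would serve equally well.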
Now within the period of $[f^l_{k+2}, f^t_{k+2}]$, $w^u_{k+1}$ and $w^l_{k+1}$ are known and the $f_{k+2}$ which
produces the accurate scaling is evaluated by means of eqn (4.43)
$$f^a_{k+2} = \frac{l_{k+1} - (w^l_{k+1,0} - w^u_{k+1,0}) f_{k+1}}{w^u_{k+1,1} - w^l_{k+1,1}}$$
(A.6)
If
$$f^l_{k+2} \le f^a_{k+2} < f^t_{k+2}$$
(A.7)
the procedure is terminated by finding $f^a_{k+2}$. If not, this pair of $w^u_{k+1}$
and $w^l_{k+1}$ fails to determine the accurate scaling; then, letting $f^l_{k+2} := f^t_{k+2}$, repeat
the procedure of eqn (A.3)^3, (A.4), (A.5), (A.6) and (A.7) once more, until $f^a_{k+2}$ is found or
no vertices remain in the list for eqn (A.4). Note that we obtain not only $f^a_{k+2}$, but also its
corresponding pair of $w^u_{k+1}$ and $w^l_{k+1}$.
Repeating the algorithm for determining all the remaining $f_i^a$, we find an initial $S^F$ and
an initial set of vertices $w_i^u$ and $w_i^l$ which are projected by the initial $S^F$ to the outmost
vertices of U, called the initial vertex set. It is the initial vertex set that we are actually
interested in. Letting k = 1, ..., M-1, we can have M-1 such initial $S^F$'s and their
accompanying initial vertex sets.
A.2 $S^F$ as a Function of $f_k$
Having the initial vertex set associated with an initial $f_k$, it is possible to determine the
other entries of $S^F$ for the accurate scaling.
3If 1:+2 > Jt+2' need not search a new w.+1, and vice versa
^4 Only a little modification of the procedure is needed so as to be applicable for determining
$f_{k-2}, f_{k-3}, \ldots$.
A.2.1 Determining $f^a$ with a Known Set of $w_i^u$ and $w_i^l$
However, it can be observed that in $S^F$, $f_i$ does not have a direct relation with $f_k$, but an
indirect one, when $i > k + 1$. So there should be a recursive operation to express $f^a_{i+1}$ as
a function of $f_k$. At first, suppose $f^a_i = a_i f_k + b_i$. If $f^a_i$ is known, $f^a_{i+1}$ is formulated by
eqn (A.6). Then
(A.8)
where
$$a = \frac{w^u_{i,0} - w^l_{i,0}}{w^u_{i,1} - w^l_{i,1}}, \qquad b = \frac{l_i}{w^u_{i,1} - w^l_{i,1}}$$
(A.9)
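The recursion described above can be sketched in C under one stated assumption: each application of eqn (A.6) is affine in the preceding entry, i.e. $f_{i+1} = \alpha_i f_i + \beta_i$ for coefficients $\alpha_i$, $\beta_i$ fixed by the current vertex pair. Composing such steps gives $a_{i+1} = \alpha_i a_i$ and $b_{i+1} = \alpha_i b_i + \beta_i$, starting from $a_k = 1$, $b_k = 0$. The names `AffineStep`, `Affine` and `compose_chain` are illustrative and do not come from the thesis software.

```c
#include <assert.h>
#include <stddef.h>

/* One affine step f_{i+1} = alpha * f_i + beta (hypothetical names). */
typedef struct { double alpha; double beta; } AffineStep;

/* Coefficients of an entry written as a * f_k + b. */
typedef struct { double a; double b; } Affine;

/* Fill out[0..n]: out[0] is f_k itself (a = 1, b = 0) and
 * out[i + 1] is steps[i] applied to out[i].
 * The caller must provide room for n + 1 results. */
void compose_chain(const AffineStep *steps, size_t n, Affine *out)
{
    assert(out != NULL);
    assert(steps != NULL || n == 0);
    out[0].a = 1.0;
    out[0].b = 0.0;
    for (size_t i = 0; i < n; ++i) {
        out[i + 1].a = steps[i].alpha * out[i].a;
        out[i + 1].b = steps[i].alpha * out[i].b + steps[i].beta;
    }
}
```

When a vertex pair is replaced in some dimension, only the steps from that dimension onwards change, so the chain can be recomposed from that index rather than from the start.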
A.2.2 Determining the Turning Points
As mentioned before, when $f_k$ increases from zero, $S^F$ changes too, as a result of which
some vertices of U move outwards faster, so they may become the outmost vertices in a
certain dimension. Once this occurs, the $w_i^u$ or $w_i^l$ corresponding to the present outmost
vertex $u_i^u$ or $u_i^l$, whichever is overstepped, should be replaced by the one corresponding
to the new outmost vertex. Therefore $f^a_0, \ldots, f^a_M$ are piecewise functions with respect
to $f_k$, and $f_k^l$ is the lower boundary of a piece. We need to find the turning point
$f_k^t$ where an overstep takes place as $f_k$ increases, which is also the upper boundary of the piece.
This is similar to the last section in concept.
In the i-th dimension, $w_j$ oversteps the present $u_i^u$ or $u_i^l$ at
$$f^u_{k,i,j} = -\frac{b_i w^u_{i,0} - b_{i+1} w^u_{i,1} - b_i w_{i,j} + b_{i+1} w_{i+1,j}}{a_i w^u_{i,0} - a_{i+1} w^u_{i,1} - a_i w_{i,j} + a_{i+1} w_{i+1,j}}$$
$$f^l_{k,i,j} = -\frac{b_i w^l_{i,0} - b_{i+1} w^l_{i,1} - b_i w_{i,j} + b_{i+1} w_{i+1,j}}{a_i w^l_{i,0} - a_{i+1} w^l_{i,1} - a_i w_{i,j} + a_{i+1} w_{i+1,j}}$$
(A.10)
$\forall i$ and $j$^5, $0 \le i < M$ and $0 \le j < n_v$. So there are $2(M)(n_v - 1)$ such possible turning
points. Some of them are invalid if they are smaller than $f_k^l$. Thus, the turning point $f_k^t$
should be
$$f_k^t = \min_{0 \le i < M,\ 0 \le j < n_v} \{ f^u_{k,i,j} \cap (f_k^l < f^u_{k,i,j}),\ f^l_{k,i,j} \cap (f_k^l < f^l_{k,i,j}) \}$$
(A.11)
^5 Exclude $j = \{l : w_l = w_i^u \cup w_i^l,\ 0 \le l < n_v\}$
Let $f_k^l := f_k^t$. If $f_k^t = f^u_{k,i,j}$ (or $f_k^t = f^l_{k,i,j}$), it means that the present $w_i^u$ (or $w_i^l$) may
be replaced by $w_j$, and the corresponding $f^a_{i+1}$ should be re-evaluated by eqn (A.8) and
(A.9). Note that if the above situation takes place in the i-th dimension, the $a_l$ and $b_l$
should be modified by eqn (A.8), $\forall l$, $l > i$. In this way, a new piece of the functions $f_i^a$
with respect to $f_k$ is created. Repeat this procedure until some $f_i^a < 0$, $0 \le i < n$. The
interval from 0 to the last turning point $f_k^t$ is the valid interval of $f_k$ as the argument.
A.2.3 Selecting $w_i^u$ and $w_i^l$ from Multi-Candidates
A complicated case should be mentioned with regard to selecting $w_i^u$ (or $w_i^l$). In the i-th
dimension, at $f_k = f_k^t$, when there is more than one $w_j$ (including the present $w_i^u$ (or $w_i^l$))
such that $s_i w_j = u_i^u$ (or $s_i w_j = u_i^l$), they will move with different speeds as $f_k$ increases
further, and only the one which moves outwards fastest should be taken as $w_i^u$ (or $w_i^l$).
Assume that we know the expression $f_i = a_i f_k + b_i$. Suppose that there are two
vertices $w_{j_1}$ and $w_j$ such that $u_i^u = s_i w_{j_1} = s_i w_j$, and one vertex $w_i^l$. We should select
$w_i^u$ from $w_{j_1}$ and $w_j$ for evaluating $f_i$. If we let $w_i^u = w_{j_1}$, the difference between $u_i^u$
and the image of $w_j$ is
- -(Wj,i1 - wi,j)li + (Wj+l,il - Wj+lJ)
x (Wi,il - wto Ii + Ii I)
Wi+l,jl - wb Wj+1Jl - Wi,l
- adfi + bd = aiadfle + biad + bd (A.12)
where
(W'+l . - W'+l .)l-1,311",
Wi+! ,il - Wtl
(A.13)
It is assumed that when $f_k = f_k^t$, $u_{i,j_1} - u_{i,j} = 0$. As $f_k$ increases, i.e., $f_k = f_k^t + \Delta f_k$,
$\Delta f_k > 0$, it is obtained that $u_i^u - u_{i,j} = u_{i,j_1} - u_{i,j} = a_d a_i \Delta f_k$.
Therefore, $a_i a_d$ is the criterion for judging whether choosing $w_{j_1}$ as $w_i^u$ is correct: if
$a_i a_d \ge 0$, then $u_i^u \ge u_{i,j}$ and the selection is correct; otherwise, it is incorrect. Let us rewrite
$a_i a_d$ as $a_i a_d(w_i^u, w_i^l, w_j)$, because it is a function of them. If there are a number of such
vertices, they form a set $W^u$. If the case takes place with the $w_i^l$, i.e., there is more than
one $w_j$, which form a set $W^l$, such that $u_i^l = s_i w_j$, the selection procedure is the same
but the criterion should be $a_i a_d(w_i^u, w_i^l, w_j) \le 0$, because we need $u_i^l \le u_{i,j}$. We have a
procedure to select a pair of $w_i^u$ and $w_i^l$ from $W^u$ and $W^l$.
(1) take a pair of vertices, one from $W^u$ as $w_i^u$ and the other from $W^l$ as $w_i^l$.
(2) check $a_i a_d(w_i^u, w_i^l, w_j) \ge 0$, $\forall w_j$, $w_j \in W^u$; if any are false, choose another vertex
from $W^u$ as $w_i^u$, and redo (2).
(3) check $a_i a_d(w_i^u, w_i^l, w_j) \le 0$, $\forall w_j$, $w_j \in W^l$; if any are false, choose another vertex
from $W^l$ as $w_i^l$, and redo (2).
(4) replace the old $w_i^u$ and $w_i^l$ with the last pair from (2) and (3).
A procedure for deriving $f_i^a$, $i < k$, as a piecewise function of $f_k$ is similar to the above
with a little modification.
A.3 Delimit $f_k$ According to Dependencies
We know $f^a_0, \ldots, f^a_M$ are the lengths of the clustering parallelepiped. As noted
above, the parallelepiped should be large enough to enclose all of the dependencies. To
do so, we should have, according to eqn (4.6)
$\forall i$, $0 \le i < n$
Note that because $f_i^a$ is a piecewise function of $f_k$, say with $n_p$ pieces, $f^c_{k,i}$ should be
evaluated in this way: for each pair of $a_i$ and $b_i$ of all pieces, calculate $f^c_{k,i}$; the $f^c_{k,i}$ is valid
if it is within the $f_k$ interval in which this pair of $a_i$ and $b_i$ is defined, and is invalid if
outside. So, more generally, $f^c_{k,i}$ is rewritten as
$$f^c_{k,i,j} = \frac{1 - b_{i,j}\, a_i^{max}}{a_i^{max}\, a_{i,j}}$$
(A.14)
where $0 \le j < n_p$, and $a_{i,j}$ and $b_{i,j}$ are the $a_i$ and $b_i$ of the j-th piece, respectively. $f_k$ is delimited
by all the valid $f^c_{k,i,j}$. Obviously, $f^c_{k,i,j}$ is related to the upper bound of $f_k$ if $a_{i,j} > 0$, and
to the lower bound if $a_{i,j} < 0$. Therefore^6
$$f_k^{ub} = \min_{0 \le i < n,\ 0 \le j < n_p} \{ f^c_{k,i,j} \cap (a_{i,j} > 0) \}$$
$$f_k^{lb} = \max_{0 \le i < n,\ 0 \le j < n_p} \{ f^c_{k,i,j} \cap (a_{i,j} < 0) \}$$
(A.15)
If $f_k^{ub} < f_k^{lb}$, no $f_k$ is valid. This means that these dependency vectors are too long to
find a parallelepiped enclosing all of them while keeping the property of mapping the polyhedron
just onto the given array. In this case, if $f_k^{ub} < f_k^{lb}$ arises in the i-th dimension,
we have to use $a_i^{max}$ as the length of the i-th edge of the clustering parallelepiped, at the
cost of wasting processor resources.
^6 Let invalid $f^c_{k,i,j}$ be null
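The delimiting step of Section A.3 can be sketched in C as follows. This is a sketch under stated assumptions, not the thesis implementation: the critical point is taken as $(1 - b\,a^{max})/(a\,a^{max})$, which is our reading of eqn (A.14) and should be checked against eqn (4.6); a critical point falling outside the interval of its own piece is discarded, as the text prescribes. The type and function names (`Piece`, `Bounds`, `delimit_fk`, `fk_is_feasible`) are hypothetical.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* One piece of an entry written as a * f_k + b (hypothetical layout). */
typedef struct {
    double a, b;    /* piece coefficients                            */
    double amax;    /* assumed a_i^max of the piece's dimension, > 0 */
    double lo, hi;  /* interval of f_k on which the piece is defined */
} Piece;

typedef struct { double lb, ub; } Bounds;

/* Intersect the constraints of all pieces, in the spirit of eqn (A.15):
 * valid critical points with a > 0 lower the upper bound, those with
 * a < 0 raise the lower bound. Pieces with a == 0 give no critical point. */
Bounds delimit_fk(const Piece *pieces, size_t n)
{
    Bounds r = { -INFINITY, INFINITY };
    assert(pieces != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        const Piece *p = &pieces[i];
        assert(p->amax > 0.0);
        if (p->a == 0.0) {
            continue;
        }
        double c = (1.0 - p->b * p->amax) / (p->a * p->amax);
        if (c < p->lo || c > p->hi) {
            continue;               /* invalid: outside the piece */
        }
        if (p->a > 0.0 && c < r.ub) {
            r.ub = c;
        }
        if (p->a < 0.0 && c > r.lb) {
            r.lb = c;
        }
    }
    return r;
}

/* No f_k is valid when the upper bound falls below the lower bound. */
bool fk_is_feasible(Bounds r)
{
    return r.lb <= r.ub;
}
```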
Appendix B
Collection of Algorithms for
(N-l)-D Partitioning and Mapping
It will be helpful to make a summary by representing the method with a series of algo-
rithms.
B.1 Pre-compilation Work
Algorithm B.1.1 Computing the timing vector sets for Strategy 4.4.1
FORN =2, ...
create N! permutation matrices PMn,i, forming a set of PMn'
FOR i= 0, n!-l
FOR j = i+l, n!-l
delete PMn,; if PMn,; = JPMn,i'
create N-D SSBC and tSBC'
FOR i = 0, ~!-1
produce a permuted version Si by SSBCPMn,i, PMn,i E PMn'
WHILE( all possible t not done){
- M Md - pT'T 2change t such that L;=o t; = g an t; ~ tS!C Mil;. g = ,3.
calculate the activity matrix A with Si and t.
evaluate M! HNFs of A.
put the t into a set Tn,i if any of the HNFs has equal diagonal entries. }
FOR i = ° n!_l, 2
PMn,~+i = JPMn,i'
T 1lI.+' = Tni•n'2 I ,
End of Algorithm
B.2 Compiling Work
Given dependency D, interconnection primitives P, polyhedron vertices Y and processor
array size 10 x ... X 1M-I,
Algorithm B.2.1 Main Algorithm: Produce B, T and k
derive E from D by means of Theorem 4.2.1.
yield E by normalising each columns of E.
perform Ap := E-1D.
FOR i = 0, M
evaluate arax := maXo<;<m ai,;, ai,; E Ap.
W:= E-ly. -
IF P belong to Psuc, CALL Algorithm B.2.2.
IF P belong to PSBC, CALL Algorithm B.2.3.
B.:= EF-l
CALL Algorithm B.2.5
End of Algorithm
Algorithm B.2.2 Real Parallelepiped and S-T Transformation for SUC
L := diag( +00,10, ... , IM-d
FORi = 0, M
u-l • E Wwi := ma:xo~;<n" Wi,; - mlIloS;<nv Wi,;, Wi,; .
w := [ ~-z, ... , ~_z1T.
Wo WM
FOR i= 0, n!-1
evaluate Fa with eqn (4.40)
FORj = 0, M
Ii := min(fj, CI~ ..S ), fj E F1·,
Ff := diag[f8,···, fM]'
yq:= FfW .
.tF := maxo~j<n" tsucvJ - miIlo~j<n. tsuov1, v1 E yq.
tm1n._· team.- mlno~i<n!-I i .
k := {i : tim = tmin}.
Fa '- Fa.- le'
S := SSUOPMA:.
t:= tsuc.
End of Algorithm
Algorithm B.2.3 Real Parallelepiped and S-T Transformation for SBC
WO:=W
FORi = 0, M
lie := glle
FOR i = 0, n!-1 (or ~! -1 if 10= ... = 1M-I)
W:=PMiwo
FOR k = 1, M-I
CALL Algorithm B.2.4 with fk as argument.
fk'ax = {Ik : maximise ~o f;}
max _ nM f.a IPk _ q=O q hc=/ros
kmax = {k : Pkax = maxo~q<M p;,ax}
P~ax _ pmaxI _ kmos
Fi = diag(fo, ... ,fM) Ih,=Jr.:,fz
Fa '- P-1F Pi'- Mi i Mi·
Vq:= FiWo.
evaluate t? by eqn (4.46)
determine 17 by eqn (4.47)
. tOPt~an ._ _;j_
I .- praz
iOP = {i . t~in = mino tmin}
• I ~p<n!-l P
Fa._ Fa.- iOp.
- LOPt = tiop
S := SSBCPMiop.
End of Algorithm
Algorithm B.2.4 Deriving $F^a$ as a Function of a Single Argument $f_k$.
$f_k := 0$
FOR i = k, M-1
$f^l_{i+1} := 0$.
WHILE ($f^a_{i+1}$ not found) {
determine $w_i^u$ and $w_i^l$ by eqn (A.3)
evaluate turning point $f^t_{i+1}$ with eqn (A.4) and (A.5).
evaluate $f^a_{i+1}$ by eqn (A.6)
break WHILE loop if $f^l_{i+1} \le f^a_{i+1} < f^t_{i+1}$.
$f^l_{i+1} := f^t_{i+1}$ otherwise. }
FOR i = k, 1, -1
similar way to derive $f^a_{i-1}$, as well as $w^u_{i-1}$ and $w^l_{i-1}$.
np := O. (np is the number of the segments of the piecewise function)
Ii '- 0Jk .-
WHILE(all 18, ... , 1M positive){
ak,np = 1
bk,np = 0
FORi = k, M-I
evaluate $a_{i+1,np}$ and $b_{i+1,np}$ with eqn (A.8) and eqn (A.9)
FOR i = k, 1, -1
similar way to evaluate ai-1,np and bi-l,n,.
FOR i = 0, M-1
FOR j = 0, ntl
evaluate/:,i,; and IL,; with eqn (A.10).
evaluate n with eqn (A.ll)
f~ := ft· (f~ is the upper boundary of the segment)
f~: := fL, (f~:is the lower boundary of the segment)
break WHILE loop if any ai,npfL + bi,np < 0,
FOR i = 0, M-I
w" '- W· l'f fU - fti '-) k,i,j - le:
WI '- W· if f' - ft. i ,-) k,i,j - k :
fl .- ftk ,- k:
np:=np+l.}
FOR i = 0, M
FOR j = 0, np - 1
fc _ I-b,.jajQZk,i,j - a' 'a~QZ ''tJ ,
fkc .. := null if (f~b > Ikc .. ) U ( Ikc .. > f~),,,) ) J I ,,,) J I ,,,) s
determine $f_k^{ub}$ and $f_k^{lb}$ with eqn (A.15)
End of Algorithm
Algorithm B.2.5 Integralization B.
FORi = 0, M
B-i := B" deleted the i-th column,
solve B:jh_i = 0 for h-i.
FOR i = 0, M
Hi = [h-o, ... , h-(i-l), h-(i+l)," . ,h-(M), b.], b, E B•.
FORi = 0, M
FORj = 0, M
Hj,_j := Hi deleted the j-th column.
solve Hr._jCi,j = 0 for Ci,j.
Ci := [Ci,O,' •• , Cj,M]
modify the direction of Cit; such that a := C;lj is non-positive, j is in the cone.
find integral vectors jj around the top of b., j = 0,... ,2M.
FORj = 0, 2M
a := C;l (j; - b.)
put jj into a set J, if a is non-negative.
re-arrange the elements of J, according to their distances to b.).
WHILE(valid B not found){
B := Uo, ... ,jj, ... ,jM], ji E Jj and selected differently every times.
U:= SB-1V
break WHILE loop ifU is acceptable.}
End of Algorithm
Appendix C
Collection of Algorithms for
Lower-Dimensional Mapping
Algorithm C.1 Feature of Polyhedron
FOR i=O, nf - 1 (nf is the number of the facets of the pK-r)
Find all vertices associated with the i-th face, forming a set ~
fUP nv, -1 ( • h di ali f V. V.)Jr,' := maxj=O Vr,i nVj IS t e car In ity 0 i, Vr,i E Vj E •
b' = max(b' fU,!)r r' Jr~
flow . nV·-1
Jr,i := mlni=O Vr,i
hi = min(b1 flO!»)r r' Jr~
Collect all pairs of vertices from all $V_i$'s, creating a set of possible edges, E
Remove the terms which appear only once in E, and merge the terms which are the
same. (e.g., {(0,0),(1,1),(1,1)} ==> {(1,1)})
FOR i=O, ne - 1 (ne is the cardinality of E)
e (1 2) ( 1 1 2 2 1 d 2 • th . th d . E)maxi,r := max vr,i' Vr,i vr,i E Vi' Vr,i E Vi' Vi an Vr,i IS e 1- e ge In
. e._ . (1 2)mzn',r .- mzn Vr,i'vr,i
End of Algorithm
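The edge-finding step of Algorithm C.1 (collect all vertex pairs of every face, drop the pairs that occur only once, merge repeated ones) can be sketched in C as below. The sketch assumes faces are given as lists of vertex indices and that the vertex count fits a small fixed table; `find_edges`, `MAX_VERTS` and `MAX_FACE_VERTS` are illustrative names, not part of the thesis software.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_VERTS 16       /* illustrative capacity limits */
#define MAX_FACE_VERTS 8

/* Count, for every unordered vertex pair, the number of faces that
 * contain both vertices. A pair seen in two or more faces is kept as
 * an edge (each edge is reported once); pairs seen only once, such as
 * face diagonals, are dropped. Returns the number of edges written. */
size_t find_edges(int faces[][MAX_FACE_VERTS], const size_t *face_sizes,
                  size_t n_faces, size_t n_verts, int edges[][2])
{
    unsigned count[MAX_VERTS][MAX_VERTS];
    size_t n_edges = 0;

    assert(n_verts <= MAX_VERTS);
    memset(count, 0, sizeof count);
    for (size_t f = 0; f < n_faces; ++f) {
        assert(face_sizes[f] <= MAX_FACE_VERTS);
        for (size_t p = 0; p < face_sizes[f]; ++p) {
            for (size_t q = p + 1; q < face_sizes[f]; ++q) {
                int u = faces[f][p];
                int v = faces[f][q];
                assert(u >= 0 && v >= 0);
                assert((size_t)u < n_verts && (size_t)v < n_verts);
                if (u > v) { int t = u; u = v; v = t; }
                count[u][v]++;
            }
        }
    }
    for (size_t u = 0; u < n_verts; ++u) {
        for (size_t v = u + 1; v < n_verts; ++v) {
            if (count[u][v] >= 2) {
                edges[n_edges][0] = (int)u;
                edges[n_edges][1] = (int)v;
                ++n_edges;
            }
        }
    }
    return n_edges;
}
```

For a unit cube described by its six four-vertex faces, every true edge lies on two faces while each face diagonal lies on one, so the twelve edges survive and the diagonals are removed.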
Algorithm C.2 Intersect Polyhedron (from -px-r to pK-r-l with hr)
FOR i=O, n! - 1
if f:c; ~ h; ~ f:'f, put the i-th face to the set of the faces of pK-r-l, but modifying
the constant term o« := aK + arhr.
FOR i=O, ne - 1 (ne is the number of the edges of the pK-r)
if min~,r ~ h; ~ max~.r, evaluate a vertex byeqn (5.5) to the vertex set of pK-r-l.
CALL Algorithm C.1, finding edges and some feature of pK-r-l
End of Algorithm
Algorithm C.3 Determine f(K-r)
FOR i = r, K-1
fi(K-r) := fbfl
DO {
CALL Algorithm C.2 to intersect the pK-r with i, = fi(K-r), producing pK-i-l
}WHILE( If for pK-i-l rh! 1 < b~ break otherwise f~K-r) '= f~K-r) + 1), , ,+1 - ,+1" Ja',
End of Algorithm
Algorithm C.4 Determine l(K-r)
Similar to Algorithm C.3
Algorithm C.5 Build SPi+1 from SPi
FOR j = 0, ns», - 1 (nsPi is the cardinality of SPi
pK-i := pf-i E SPi
FOR r = 0, nu - 1 (nu is the number of the vertices of the pK-i)
h := Vi (Vi is the first element of v r of the vertices of pK -i)
if the h does not appear before, CALL Algorithm C.2 to intersect pK -i with
h, producing a p~K -i-I), put into SPi+l
End of Algorithm
Algorithm C.6 Determine Pi
FOR j = 0, ns», - 1 (nsp, is the cardinality of SPi
pK-i := p/.<-i E SPi
FOR r = 0, nu - 1 (nu is the number of the vertices of the pK-i)
h := Vi (Vi is the first element of vr of the vertices of pK -i)
FOR s=-I, 1
CALL Algorithm C.2 to intersect -px-! with h + s, producing p~K-i-l).
. CALL Algorithm C.3 to determine r,(K -i-I)
CALL Algorithm C.4 to determine IfK-i-I)
d+ := p(I~K-j-l) _ f1K-i-1») + 1
d- := p(l~~-i-I) _ fJK -i-I») + 1
Pi := max(pi' d+, d-);
End of Algorithm
Algorithm C.7 Main Algorithm: Find a Minimum Valid Projecting Vector
Pt:= IIW
Compute the Convex Hull, C1iK, of Ve (by means of an algorithm [98] and a
program [65];
p := [0, ... ,0,1]
SPo := pK := C1iK
FOR i = 0, K-2
CALL Algorithm C.5 to build SPi+1 from SPj.
FOR i = K-2, 0, i := i-I
CALL Algorithm C.6 to derive Pi from SPi.
End of Algorithm
Appendix D
Parallel Algorithm for Pure LSGP
Method
From the viewpoint of algorithm generation, the pure LSGP can be taken as the special
case of B = I in the LSGP case of Chapter 6. Omitting derivation, the parallel algorithm
obtained is
DOALL $j_0$ := -10 TO 20 STEP 8
DOALL $j_1$ := -10 TO 20 STEP 8
w :=.:' ..
FOR ,~_;.:=;: 0 'ID-1230 .
to := rTl
tl := rWl;tol
w:= w + [ ~1
w/:= 8t-w
$i_0 := -2t_0 + 3t_1$
$i_1 := 5t_0 + -5t_1 - t_2$
$i_2 := -3t_0 + -5t_1 - 4t_2$
IF ($0 \le i_0 \le 20$) $\cap$ ($0 \le i_1 \le 20$) $\cap$ ($0 \le i_2 \le 10$)
$A(i_0, i_1, i_2) := A(i_0 - 1, i_1 + 1, i_2 + 3) + A(i_0, i_1 - 1, i_2 - 1)$
$+ A(i_0, i_1 - 1, i_2 + 1) + A(i_0, i_1 - 1, i_2)$
Data transfer operation
Again, it is checked that the parallel algorithm covers all the 4851 nodes. The Data
transfer operation can also be taken as a special case of that of (N-1)-D SBC case discussed
in Chapter 7, remembering that a supernode is just one node.
Appendix E
Generating Data Flows and Relays
for LSGP
An OP (or IP) may contain a few data vectors, "V()", each of which has a number of information
fields: .n is the number of nodes of the V(); .fb and .fbp indicate which buffer the V()
comes from and the start position in that buffer, respectively; .tb and .tbp indicate
which buffer the V() goes to and the start position in that buffer, respectively; .dep
indicates the processor dependency.
Algorithm E.1 Relationw '-t(t:>.a)
w":= k6a .
FOR t := 0 TO c - 1
FOR i :=0, N-2
W· '- W" ~i-l h t
1'- i - L...Ji=O i,i i
i, := rTl
w" := w" - r
w':= kt-w
mark w' with t
(comment: hi,i and r are defined in Section 6.5)
End of Algorithm
Algorithm E.2 CreateIP;'i' (OP;i)
j' := j + t:>.wpmod c
p/:= -p
nv:= 0
FOR i:= 0 TO Op;j.nv-1
IP;?V(nv).dep:= OP;J.V(i).dep
IP;;'j'.V(nv).n:= OP;j.V(i).n
IF only one non-zero entry in Op;;j .V{i).dep
I P;/. V(nv ).tb := I BWi'
I P;/. V(nv ).tbp := I BWj'.n
IBWj'.n:= IBWj'.n + OP;i.V(i).n
ELSE
IP;'j'.V(nv).tb:= RBp
I p;'j' .V(nv ).tbp:= 0
(comment: if there is only one non-zero entry in Op;j. V(i).dep, the data vector has
arrived at its terminal, so it is put into OB; otherwise, into a relay buffer)
End of Algorithm
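The routing rule at the heart of Algorithm E.2 can be sketched in C as follows. The field names mirror the information fields listed at the start of this appendix (.n, .fb, .fbp, .tb, .tbp, .dep); everything else (`DataVector`, `Buffer`, `route_vector`, the `TO_TERMINAL`/`TO_RELAY` codes and the dependency length `DEP_DIM`) is hypothetical and chosen only for illustration. A vector whose dependency has exactly one non-zero entry is appended to the terminal buffer at its current fill position; any other vector goes to position 0 of a relay buffer.

```c
#include <assert.h>
#include <stddef.h>

#define DEP_DIM 3  /* illustrative length of the processor dependency */

enum { TO_TERMINAL = 0, TO_RELAY = 1 };  /* hypothetical buffer codes */

typedef struct {
    int n;             /* number of nodes in the vector        */
    int fb, fbp;       /* source buffer and start position     */
    int tb, tbp;       /* destination buffer and start position */
    int dep[DEP_DIM];  /* processor dependency                  */
} DataVector;

typedef struct { int n; } Buffer;  /* .n: nodes already stored */

static size_t count_nonzero(const int *dep, size_t len)
{
    size_t k = 0;
    for (size_t i = 0; i < len; ++i) {
        if (dep[i] != 0) {
            ++k;
        }
    }
    return k;
}

/* Set the destination fields of v and update the terminal buffer's
 * fill count when the vector has reached its final processor. */
void route_vector(DataVector *v, Buffer *terminal)
{
    assert(v != NULL && terminal != NULL);
    if (count_nonzero(v->dep, DEP_DIM) == 1) {
        v->tb = TO_TERMINAL;
        v->tbp = terminal->n;
        terminal->n += v->n;
    } else {
        v->tb = TO_RELAY;
        v->tbp = 0;
    }
}
```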
Algorithm E.3 The data flows and relays for LSGP
Relationw'-t(O)
Mark every w' with Wt,
FORi:= 0 TO M-I
Relationw'-t(li) comment!
Taking the t from any w'marked as Wj, let ~Wi := t - j mod c
FOR j := 0 TO c - 1
Take the corresponding w' of the Wj
For all dP E DP
dA := dP ® l(w')
IF dA E DA
p :=dA
Set "0" to all entries of p except the first non-zero entry
comment''
nv := Op;i .nv
Op;;j. V(nv )dep := dA
comment '
OP;' .V(nv).fb:= OB
Op;j. V(nv ).fbp := OP;i. V(nv - l).fbp - OP;i. V(nv - l).n
or: dPPp' .V(nv).n := Q
Op;i .nv := Op;i .nv + 1
For all existing Op;i 's
Createl P;l' (0p;n
FOR i := 1 TO M - 2
For all existing IP;l'. VO's
IF IP;l' .VO.tb is any RBp"
p:= IP;/'.VO.dep
Null all entries of p except the i-th one
IF p # 0
w := W' + 1 mod c
n\· :-= OP;) .nv
OP;;. V(n,,).dep := Ip;';I.VO.dep
OP;;.V(nv).fb:= REp"
Op;i .V(nv).fbp:= 0
Op;i .V(nv).n := IP;/' .VO.n
OP;; .nv := nv + 1
CreateI P;/' (0Pp)
Re-order OP's and lP's according to the ordinal number w
(comment.': I, is a vector such that only the i-th element is "1", and others are "0".
~Wi E ~w)
(comment= see Subsection 6.5.3 for the operator ® and function 1(· .. ).)
(comments: field .nv indicates the number of data vector of a packet. This segment makes
out the initial OP's.)
End of Algorithm
Appendix F
The Collection of Experimental
Results
1-D array with SUC mesh
In the following experiments, a linear array consisting of Np processors is used.
c= p~1 K= [~ o 0] [1 0 0 ~ 1Case lower-D-1: 1 0 D = -1 1 1 Np = 4o 1 -3 1 -1
c 50 100 200 400 800
Ts(ms) 355.8 711.6 1423.2 2846.4 5692.8
Tp(ms) 204.3 372.6 715.4 1401.8 2714.5
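For reading the tables in this appendix, the two derived quantities can be computed as sketched below. The sketch assumes the usual definitions, speedup = Ts/Tp and efficiency = speedup/Np, with Ts the sequential time and Tp the parallel time; the function names are illustrative. For the first column of the table above (c = 50, Np = 4) this gives a speedup of about 1.74 and an efficiency of about 0.44.

```c
#include <assert.h>

/* Speedup of a parallel run over the sequential run (both in ms). */
double speedup(double ts_ms, double tp_ms)
{
    assert(tp_ms > 0.0);
    return ts_ms / tp_ms;
}

/* Efficiency: speedup divided by the number of processors,
 * returned as a fraction between 0 and 1 for sub-linear speedup. */
double efficiency(double ts_ms, double tp_ms, int np)
{
    assert(np > 0);
    return speedup(ts_ms, tp_ms) / (double)np;
}
```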
[Figure: Lower-D Case 1; horizontal axis: Size c]
Case lower-D-2: c = [40,40,200]T, K and D are the same as Case lower-D-1
Np 4 5 6 7 8 10
Ts(ms) 5692.8 5692.8 5692.8 5692.8 5692.8 5692.8
Tp(ms) 2696.7 2266.6 2006.3 1781.4 1570.2 1326.1
[Figure: Lower-D Case 2; horizontal axis: Np]
c 50 100 200 400 800
Ts(ms) 355.8 711.6 1423.2 2846.4 5692.8
Tp(ms) 140.8 248.3 461.8 897.6 1778.7
Case lower-D-4: c = [a, b, c]T, K and D are the same as Case lower-D-3, Np = 4
a = b = c 20 30 40 50 60 70 80
Ts(ms) 142.3 480.3 1138.6 2223.8 3842.6 6102.0 9108.5
Tp(ms) 77.4 196.0 382.9 713.4 1138.7 1793.2 2553.6
[Figure: Lower-D Case 4; horizontal axis: Equal Sizes a,b,c]
Case lower-D-5: c = [80,80, 80]T, K and D are the same as Case lower-D-3
Np 4 5 6 7 8 9 10
Ts(ms) 9108.5 9108.5 9108.5 9108.5 9108.5 9108.5 9108.5
Tp(ms) 2553.6 2097.2 1872.2 1644.0 1414.5 1300.9 1188.6
[Figure: Lower-D Case 5; horizontal axis: Np]
Np 4 5 6 7 8 10
Ts(ms) 5692.8 5692.8 5692.8 5692.8 5692.8 5692.8
Tp(ms) 1596.9 1323.6 1188.6 1053.2 917.3 785.5
[Figure: Lower-D Case 6; horizontal axis: Np]
Case lower-D-7: c = [40,40,c]T, K and D are the same as Case lower-D-6, Np = 4
c 50 100 200 400 800
Ts(ms) 355.8 711.6 1423.2 2846.4 5692.8
Tp(ms) 127.9 230.1 438.1 859.7 1692.5
[Figure: Lower-D Case 7; horizontal axis: Size c]
Case lower-D-8: c = [a, b, c]T, K and D are the same as Case lower-D-6, Np = 4
a = b = c 20 30 40 50
Ts(ms) 142.3 480.3 1138.6 2223.8
Tp(ms) 66.9 180.0 369.5 696.0
[Figure: Lower-D Case 8; horizontal axis: Equal Sizes a,b,c]
Case lower-D-9: c = [ :~ 1 K = [~ ~ ~ 1 D = [!1 ~ ~ o~1
200 -1 -1 1 -3 1 -1
Np 4 5 6 7 8 10
Ts(ms) 5692.8 5692.8 5692.8 5692.8 5692.8 5692.8
Tp(ms) 2683.5 2250.6 1989.4 1763.4 1550.6 1305.3
[Figure: Lower-D Case 9; horizontal axis: Np]
Np 4 5 6 7 8 10
Ts(ms) 5692.8 5692.8 5692.8 5692.8 5692.8 5692.8
Tp(ms) 2715.1 2284.2 2025.5 1800.9 1588.6 1345.7
[Figure: Lower-D Case 10; horizontal axis: Np]
Case lower-D-11: c = [20,20,c]T, K and D are the same as Case lower-D-10, Np = 10
c 50 100 200 400 800
Ts(ms) 355.8 711.6 1423.2 2846.4 5692.8
Tp(ms) 205.4 374.0 718.8 1409.2 2782.7
[Figure: Lower-D Case 11; horizontal axis: Size c]
a = b = c 20 30 40 50
Ts(ms) 142.3 480.3 1138.6 2223.8
Tp(ms) 102.2 287.6 608.6 1142.8
[Figure: Lower-D Case 12; horizontal axis: Equal Sizes a,b,c]
a = b = c 20 30 40 50 60
Ts(ms) 142.3 480.3 1138.6 2223.8 3842.6
Tp(ms) 65.1 177.0 356.2 674.9 1085.9
[Figure: Lower-D Case 13; horizontal axis: Equal Sizes a,b,c]
Case lower-D-14: c = [ 2~ 1 K = [~ ~ ~ 1 D = [!1 ~ ~ ~11 NI' = 4
20 0 0 1 -1 0 1
b 50 100 200 400 800
Ts(ms) 355.8 711.6 1423.2 28748.6 57497.3
Tp(ms) 181.5 315.6 583.5 1119.6 2191. 7
[Figure: Lower-D Case 14; horizontal axis: Size b]
b 50 100 200 400 600
Ts(ms) 355.8 711.6 1423.2 2846.4 4269.6
Tp(ms) 172.4 309.5 592.8 1162.2 1729.1
[Figure: Lower-D Case 15; horizontal axis: Size b]
Case lower-D-16: c = [ ~O 1 K = [~ ~ ~ 1 D = [~ ~ ~ 001 s, = 4
20 -2 0 1 -3 1 -2
b 50 100 200 400 800
Ts(ms) 355.8 711.6 1423.2 2846.4 5692.8
Tp(ms) 164.5 294.4 564.9 1107.4 2191.0
[Figure: Lower-D Case 16; horizontal axis: Size b]
Case lower-D-17: c = [a, b, c]T, K and D are the same as Case lower-D-16, Np = 4
a = b = c 20 30 40 50 60 70
Ts(ms) 142.3 480.3 1138.6 2223.8 3842.6 6102.0
Tp(ms) 96.3 227.6 427.8 773.2 1215.7 1893.6
[Figure: Lower-D Case 17; horizontal axis: Equal Sizes a,b,c]
Case lower-D-18: c = [a,b,c]T, K = [~2 ~ ~ 1 D = [~ ~ !3 o~1 Np = 4
-2 0 1 -3 1 2
a = b = c 20 30 40 50 60 70
Ts(ms) 142.3 480.3 1138.6 2223.8 3842.6 6102.0
Tp(ms) 117.2 262.6 483.5 858.2 1331.2 2051.8
[Figure: Lower-D Case 18; horizontal axis: Equal Sizes a,b,c]
Case lower-D-19: c = [20,20,c]T, K and D are the same as Case lower-D-18, Np = 4
c 50 100 200 400 800
Ts(ms) 355.8 711.6 1423.2 2846.4 5692.8
Tp(ms) 196.9 333.1 612.4 1186.7 2341.0
[Figure: Lower-D Case 19; horizontal axis: Size c]
Case lower-D-20: c = [20,b,20]T, K and D are the same as Case lower-D-18, Np = 4
b 50 100 200 400 800
Ts(ms) 355.8 711.6 1423.2 2846.4 5692.8
Tp(ms) 193.0 349.0 663.5 1293.2 2552.7
[Figure: Lower-D Case 20; horizontal axis: Size b]
Case lower-D-21: c = [40,b,40]T, K and D are the same as Case lower-D-18, Np = 4
b 50 100 200
Ts(ms) 1423.2 2846.4 5692.8
Tp(ms) 581.4 979.3 1855.9
[Figure: Lower-D Case 21; horizontal axis: Size b]
Case lower-D-22: c = [40,40,c]T, K and D are the same as Case lower-D-18, Np = 4
c 50 100 200 400
Ts(ms) 1423.2 2846.4 5692.8 11385.6
Tp(ms) 582.1 1023.9 1919.3 3751.8
[Figure: Lower-D Case 22; horizontal axis: Size c]
Case lower-D-23: c = [40, b, 40]T, K and D are the same as Case lower-D-18, Np = 8
b 50 100 200 400
Ts(ms) 1423.2 2846.4 5692.8 11385.6
Tp(ms) 450.5 655.9 1172.5 2204.5
[Figure: Lower-D Case 23; horizontal axis: Size b]
2-D array with SUC mesh
A 4 x 4 2-D array with SUC mesh is used for the experiments. Thus Np = 16.
c= [; 1 [1 0 0 1 [1 0
0 nCase SUC-1: ~ = 0 1 0 D = -1 1 1 N; = 16001 -3 1 -1
c 50 100 200 400 600 800
Ts(ms) 1423.2 2846.4 5692.8 11385.6 17078.4 22771.2
Tp(ms) 724.5 1038.3 1502.8 2182.2 2860.2 3540.6
Case SUC-2: c = [ ; 1 K = [~ ~ ~ 1 D = [i ~~n N. = 16
c 200 400 600 800 1000
Ts(ms) 5692.8 11385.6 17078.4 22771.2 28464.0
Tp(ms) 684.5 1196.3 1682.4 2176.1 2619.3
[Figure: 2-D SUC Array Case 2; horizontal axis: Size c]
Case SUC-3: K and D are the same as Case SUC-2 Np = 16
c 40 80 160 320 640 1280
Ts(ms) 1138.6 2277.1 4554.2 9108.5 18217.0 36433.9
Tp(ms) 223.6 353.7 578.6 1001.9 1782.1 3274.4
[Figure: 2-D SUC Array Case 3; horizontal axis: Size c]
c 50 100 200 400 800
Ts(ms) 1423.2 2846.4 5692.8 11385.6 22771.2
Tp(ms) 748.9 1133.2 1683.6 2446.8 3815.4
[Figure: 2-D SUC Array Case 4; horizontal axis: Size c]
Case SUC-5: c = [ !~1 K = [~l ~ ~ 1 D = [~l ~ ~ 0011 Np = 16
c 1 1 1 -3 1 -1
c 50 100 200 400 800
Ts(ms) 1423.2 2846.4 5692.8 11385.6 22771.2
Tp(ms) 437.8 672.1 987.9 1409.4 2244.8
[Figure: 2-D SUC Array Case 5; horizontal axis: Size c]
Case SUC-6: c = [a, b, c]T, K and D are the same as Case SUC-5, Np = 16
a = b = c 40 50 60 70 80
Ts(ms) 1138.6 2223.8 3842.6 6102.0 9108.5
Tp(ms) 378.0 685.9 1095.2 1707.4 2457.4
[Figure: 2-D SUC Array Case 6; horizontal axis: Equal Sizes a,b,c]
CMe SUC-7: c = [ ~] K = [~ ~ ~] D = [i : : ~] s, = 16
a = b = c 40 50 60 70 80 90 100
Ts(ms) 1138.6 2223.8 3842.6 6102.0 9108.5 12968.9 17790.0
Tp(ms) 214.3 351.4 500.5 740.0 979.9 1351.2 1677.2
[Figure: 2-D SUC Array Case 7; horizontal axis: Equal Sizes a,b,c]
I a = b = c I 40 50 60 70 80
Ts(ms) 1138.6 2223.8 3842.6 6102.0 9108.5
Tp(ms) 574.3 1041.1 1681.7 2579.5 3738.7
[Figure: 2-D SUC Array Case 8; horizontal axis: Equal Sizes a,b,c]
a = b = c 40 50 60 70 80
Ts(ms) 1138.6 2223.8 3842.6 6102.0 9108.5
Tp(ms) 321.3 572.0 895.6 1378.5 1957.1
[Figure: 2-D SUC Array Case 9; horizontal axis: Equal Sizes a,b,c]
Case SUC-10: c = [ Yo] K = [~ !~]D = [~~ ~ ~ ~] Np = 16
b 50 100 200 400 800
Ts(ms) 1423.2 2846.4 5692.8 11385.6 22771.2
Tp(ms) 512.3 739.6 1212.3 2330.2 3858.7
[Figure: 2-D SUC Array Case 10; horizontal axis: Size b]
b 50 100 200 400 800
Ts(ms) 1423.2 2846.4 5692.8 11385.6 22771.2
Tp(ms) 485.3 827.7 1534.7 2948.7 5783.0
[Figure: 2-D SUC Array Case 11; horizontal axis: Size b]
b 50 100 200 400 800
Ts(ms) 1423.2 2846.4 5692.8 11385.6 22771.2
Tp(ms) 394.4 638.6 1146.6 2161.5 4200.8
[Figure: 2-D SUC Array Case 12; horizontal axis: Size b]
Case SUC-13: c = [a, b, c]T, K and D are the same as Case SUC-12, Np = 16
a = b = c 40 50 60 70 80
Ts(ms) 1138.6 2223.8 3842.6 6102.0 9108.5
Tp(ms) 334.8 531.9 791.2 1177.2 1611.8
[Figure: 2-D SUC Array Case 13; horizontal axis: Equal Sizes a,b,c]
a = b = c 40 50 60 70 80
Ts(ms) 1138.6 2223.8 3842.6 6102.0 9108.5
Tp(ms) 426.1 663.1 953.8 1402.6 1893.4
[Figure: 2-D SUC Array Case 14; horizontal axis: Equal Sizes a,b,c]
Case SUC-15: c = [40,40,c]T, K and D are the same as Case SUC-14, Np = 16
c 50 100 200 400 800
Ts(ms) 1423.2 2846.4 5692.8 11385.6 22771.2
Tp(ms) 437.1 548.7 795.9 1301.4 2336.6
[Figure: 2-D SUC Array Case 15; horizontal axis: Size c]
Case SUC-16: c = [40,b,40]T, K and D are the same as Case SUC-15 Np = 16
b 50 100 200 400 800
Ts(ms) 1423.2 2846.4 5692.8 11385.6 22771.2
Tp(ms) 488.5 825.0 1518.5 2906.8 5525.3
[Figure: 2-D SUC Array Case 16; horizontal axis: Size b]
2-D array with SBC mesh
A 4 x 4 2-D array with SBC mesh is used for the experiments. Thus Np = 16.
c 50 100 200 400
Ts(ms) 1423.2 2846.4 5692.8 11385.6
Tp(ms) 719.4 843.9 1574.8 2649.0
[Figure: 2-D SBC Array Case 1; horizontal axis: Size c]
Case SBC-2: C = [ ~] K = [~ !nD = [l ~ ~ !] N. = 16
c 50 100 200 400
Ts(ms) 1423.2 2846.4 5692.8 11385.6
Tp(ms) 479.2 752.1 1379.8 2415.3
[Figure: 2-D SBC Array Case 2; horizontal axis: Size c]
c 50 100 200 400
Ts(ms) 1423.2 2846.4 5692.8 11385.6
Tp(ms) 774.5 981.2 1190.1 2246.7
[Figure: 2-D SBC Array Case 3; horizontal axis: Size c]
c 50 100 200 400
Ts(ms) 1423.2 2846.4 5692.8 11385.6
Tp(ms) 827.1 1057.4 1497.4 2441.6
[Figure: 2-D SBC Array Case 4; horizontal axis: Size c]
Case SBC-5: c = [ ~ 1 K = [!1 ~ ~ 1 D = [!1 ~ ~ 001 Np = 16
ell 1 -3 1 -1
a = b = c 40 50 60
Ts(ms) 1138.6 2223.8 3842.6
Tp(ms) 680.3 1150.4 1848.1
[Figure: 2-D SBC Array Case 5; horizontal axis: Equal Sizes a,b,c]
a = b = c 40 50 60 70
Ts(ms) 1138.6 2223.8 3842.6 6102.0
Tp(ms) 414.6 660.5 1000.5 1513.9
[Figure: 2-D SBC Array Case 6; horizontal axis: Equal Sizes a,b,c]
a = b = c 40 50 60 70
Ts(ms) 1138.6 2223.8 3842.6 6102.0
Tp(ms) 523.7 1055.4 1505.2 2073.2
[Figure: 2-D SBC Array Case 7; horizontal axis: Equal Sizes a,b,c]
a = b = c 40 50 60 70
Ts(ms) 1138.6 2223.8 3842.6 6102.0
Tp(ms) 545.7 1025.0 1567.9 2143.9
[Figure: 2-D SBC Array Case 8; horizontal axis: Equal Sizes a,b,c]
b 50 100 200 400
Ts(ms) 1423.2 2846.4 5692.8 11385.6
Tp(ms) 518.8 1055.2 2224.3 4071.6
[Figure: 2-D SBC Array Case 9; horizontal axis: Size b]
b 50 100 200 400
Ts(ms) 1423.2 2846.4 5692.8 11385.6
Tp(ms) 759.9 1370.6 2591.6 5038.9
[Figure: 2-D SBC Array Case 10; horizontal axis: Size b]
b 50 100 200 400
Ts(ms) 1423.2 2846.4 5692.8 11385.6
Tp(ms) 530.8 937.5 1723.7 3318.8
[Figure: 2-D SBC Array Case 11; horizontal axis: Size b]
Case SBC-12: c = [a, b, c]T, K and D are the same as Case SBC-11, Np = 16
a = b = c 40 50 60 70
Ts(ms) 1138.6 2223.8 3842.6 6102.0
Tp(ms) 431.2 904.7 1105.1 1629.0
[Figure: 2-D SBC Array Case 12; horizontal axis: Equal Sizes a,b,c]
a = b = c 40 50 60 70
Ts(ms) 1138.6 2223.8 3842.6 6102.0
Tp(ms) 641.5 1157.4 1469.6 2061.0
[Figure: 2-D SBC Array Case 13; horizontal axis: Equal Sizes a,b,c]
Case SBC-14: c = [40,40, c]T, K and D are the same as Case SBC-13 Np = 16
c 50 100 200 400
Ts(ms) 1423.2 2846.4 5692.8 11385.6
Tp(ms) 883.5 1071.5 1936.1 3367.2
[Figure: 2-D SBC Array Case 14; horizontal axis: Size c]
Appendix G
Examples of Automatically
Generated Parallel Codes
G.1 Parallel Codes for Non-LSGP
G.1.1 h.file
static int cont_up_bds[3] = {4, 0, 499};
static int spnd_size[6] = {5, 1, 500, 24000, 500, 1};
static int jump_extra_vol[3] = {25000, 1000, 1};
static int jump_j_2_extra_vol[3] = {124000, 1000, 500};
static int d_q[4] = {-25000, -1002, -1000, -1001};
static int n_cubes_dep_spnd[3] = {1, 1, 1};
static int extra_vol[3] = {10, 2, 1000};
static int delay_data[1] = {0};
static int n_notes[1] = {500};
static int size_cubes[6] = {1, 1, 500, 24000, 500, 1};
static int bgs_to_extra_vol[1] = {101500};
static int l_inc1[3] = {0, -1, 1};
static int l_inc2[3] = {0, 0, -500};
static int l_v_c[12] = {-5, 0, 0, 0, 6, -1, 0, 0, 14, 1, -500, 0};
static int u_inc1[3] = {0, -1, 1};
static int u_inc2[3] = {0, 0, -500};
static int u_v_c[12] = {-5, 0, 0, 19, 6, -1, 0, 19, 14, 1, -500, 399};
static int bgs_out[1] = {226500};
static int para_data_flows[6] = {3, 0, 500, -1, 0, 500};
static int size_data_flows[3] = {500, 0, 500};
static int n_vtes_flows[2] = {1, 1};
static int direct_data_flows[2] = {1, 0};
static int p[2] = {1, 1};
static int time_comsumed[4] = {23, 29, 35, 41};
static int time_low_bds[4] = {0, 6, 12, 18};
static int lowerbds_sizes[12] = {0, 0, 0, 1, 6, 0, 2, 12, 0, 3, 18, 0};
static int order_vtr[1] = {0};
static int E[9] = {1, 0, 0, -1, 1, 0, -3, -1, 1};
static int B_j_2_i[9] = {5, 0, 0, -6, 1, 0, -14, -1, 500};
static int array[1] = {4};
static int i0=0, i1=0, i2=0;
static int j0=0, j1=0, j2=0;
static int pj0=0, pj1=0, pj2=0;
static int vol_elg_spnd=20000, spnd_subpoly_size=250000, time_used=42, min_t=0,
max_t=41, N_DC=1, N_links=1, n_cubes=1, lgst=500, n_para_data_flows=2,
n_dep_array=0, bg_spnd=126500, n_note_data_flows=1500, edge_extra=126500,
n_total_nodes=160000, n_prcsrs=4;
static data_type *elg_spnds[42], *data_flow[3];
static int la100=1,la200=3,la201=1;
static int ua100=1,ua200=3,ua201=1;
static int cl000,cl100,cl101,cl200,cl201,cl202;
static int cu000,cu100,cu101,cu200,cu201,cu202;
static int lsa100=6,lsa200=14,lsa201=1;
static int usa100=6,usa200=14,usa201=1;
static int cls000=0,cls100=0,cls101,cls200=416,cls201,cls202;
static int cus000=19,cus100=23,cus101,cus200=399,cus201,cus202;
static int ll_1[3],ll_2[3];
static int uu_1[3],uu_2[3];
static int *q_ad0,*q_ad1,*q_ad2,*q_ad3;
#define CopyFromBufferToSupernode(a,b,zise_a) {\
tmp_ptr2 = b;\
for(i0=0;i0<*(zise_a+0);i0++)\
{\
for(i1=0;i1<*(zise_a+1);i1++)\
{\
for(i2=0;i2<*(zise_a+2);i2++)\
{\
*tmp_ptr2++ = *a++;\
}\
tmp_ptr2 += *(zise_a+4);\
}\
tmp_ptr2 += *(zise_a+3);\
}\
}
#define CreateDirectInputDBV(b,zise_a) {\
tmp_ptr2 = b;\
for(i0=0;i0<*(zise_a+0);i0++)\
{\
for(i1=0;i1<*(zise_a+1);i1++)\
{\
(data_flow_vec_tmp++)->iov_base = (caddr_t) tmp_ptr2;\
tmp_ptr2 += *(zise_a+2)+*(zise_a+4);\
}\
tmp_ptr2 += *(zise_a+3);\
}\
}
#define CopyFromSupernodeToBuffer(a,b,zise_b) {\
tmp_ptr1 = a;\
for(i0=0;i0<*(zise_b+0);i0++)\
{\
for(i1=0;i1<*(zise_b+1);i1++)\
{\
for(i2=0;i2<*(zise_b+2);i2++)\
{\
*b++ = *tmp_ptr1++;\
}\
tmp_ptr1 += *(zise_b+4);\
}\
tmp_ptr1 += *(zise_b+3);\
}\
}
#define CopyFromSupernodeToBufferO(a,b,zise_b) {\
tmp_ptr1 = a;\
for(i0=0;i0<*(zise_b+0);i0++)\
{\
for(i1=0;i1<*(zise_b+1);i1++)\
{\
for(i2=0;i2<*(zise_b+2);i2++)\
{\
*b++ = *tmp_ptr1;\
}\
}\
}\
}
#define CreateDirectOutputDBV(a,zise_b) {\
tmp_ptr1 = a;\
for(i0=0;i0<*(zise_b+0);i0++)\
{\
for(i1=0;i1<*(zise_b+1);i1++)\
{\
(data_flow_vec_tmp++)->iov_base = (caddr_t) tmp_ptr1;\
tmp_ptr1 += *(zise_b+2)+*(zise_b+4);\
}\
tmp_ptr1 += *(zise_b+3);\
}\
}
#define CreateDirectOutputDBVO(a,zise_b) {\
tmp_ptr1 = a;\
for(i0=0;i0<*(zise_b+0);i0++)\
{\
for(i1=0;i1<*(zise_b+1);i1++)\
{\
(data_flow_vec_tmp++)->iov_base = (caddr_t) tmp_ptr1;\
}\
}\
}
#define InitializeData(bounds) {q_ad0 = elg_spnds[t] + bg_spnd;\
for(q[0]=max_fn(cl000,0),cl101=cl100+q[0]*la100,cl201=cl200+q[0]*la200,\
cu101=cu100+q[0]*ua100,cu201=cu200+q[0]*ua200,q_ad1=q_ad0+q[0]*jump_extra_vol[0];\
q[0]<=min_fn(cu000,*(bounds+0));q[0]++,cl101+=la100,cl201+=la200,cu101+=ua100,\
cu201+=ua200,q_ad1+=jump_extra_vol[0])\
for(q[1]=max_fn(cl101,0),cl202=cl201+q[1]*la201,cu202=cu201+q[1]*ua201,\
q_ad2=q_ad1+q[1]*jump_extra_vol[1];q[1]<=min_fn(cu101,*(bounds+1));q[1]++,\
cl202+=la201,cu202+=ua201,q_ad2+=jump_extra_vol[1])\
for(q[2]=max_fn(cl202,0),q_ad3=q_ad2+q[2];q[2]<=min_fn(cu202,*(bounds+2));\
q[2]++,q_ad3++)\
{\
ct++;\
matrix_int_multiplication(E,q,i_index,dim,dim,1,1,0);\
add_2_int_vtrs(i_index,i_from_j,i_index,N,1);\
matrix_int_multiplication(jump_step,i_index,q_ad3,1,dim,1,1,0);\
}\
}
/*
add_2_int_vtrs(a,b,c, ...) is a function to add vectors a and b to c.
matrix_int_multiplication(A,B,C, ...) is a function to multiply matrices A and B to C.
*/
#define ComputeSupernode(bounds) {q_ad0 = elg_spnds[t] + bg_spnd;\
for(i0=max_fn(cl000,0),cl101=cl100+i0*la100,cl201=cl200+i0*la200,cu101=cu100+i0*ua100,\
cu201=cu200+i0*ua200,q_ad1=q_ad0+i0*jump_extra_vol[0];i0<=min_fn(cu000,*(bounds+0));\
i0++,cl101+=la100,cl201+=la200,cu101+=ua100,cu201+=ua200,q_ad1+=jump_extra_vol[0])\
for(i1=max_fn(cl101,0),cl202=cl201+i1*la201,cu202=cu201+i1*ua201,\
q_ad2=q_ad1+i1*jump_extra_vol[1];i1<=min_fn(cu101,*(bounds+1));i1++,cl202+=la201,\
cu202+=ua201,q_ad2+=jump_extra_vol[1])\
for(i2=max_fn(cl202,0),q_ad3=q_ad2+i2;i2<=min_fn(cu202,*(bounds+2));\
i2++,q_ad3++)\
{\
*q_ad3 = *q_ad3+*(q_ad3+d_q[0])+*(q_ad3+d_q[1])+*(q_ad3+d_q[2])\
+*(q_ad3+d_q[3]);\
}\
}
#define cl_cu_0_set_bef0 j0=*(a_i_to+0),cls101=cls100+j0*lsa100,cls201=cls200\
+j0*lsa200,cus101=cus100+j0*usa100,cus201=cus200+j0*usa200,ll_1[0]=l_0[0],\
ll_1[1]=l_0[1],ll_1[2]=l_0[2],uu_1[0]=u_0[0],uu_1[1]=u_0[1],uu_1[2]=u_0[2]
#define cl_cu_0_set_aft0
#define cl_cu_0_set_bef1 cls202=cls201+j1*lsa201,cus202=cus201+j1*usa201
#define cl_cu_0_set_aft1 ,ll_2[0]=ll_1[0]+j1*l_inc1[0],\
ll_2[1]=ll_1[1]+j1*l_inc1[1],ll_2[2]=ll_1[2]+j1*l_inc1[2],\
uu_2[0]=uu_1[0]+j1*u_inc1[0],uu_2[1]=uu_1[1]+j1*u_inc1[1],\
uu_2[2]=uu_1[2]+j1*u_inc1[2],pj1=p[0]*j1
#define cl_cu_0_set_bef2
#define cl_cu_0_set_aft2 ,cl000=ll_2[0]+j2*l_inc2[0],cl100=ll_2[1]+j2*l_inc2[1],\
cl200=ll_2[2]+j2*l_inc2[2],cu000=uu_2[0]+j2*u_inc2[0],cu100=uu_2[1]+j2*u_inc2[1],\
cu200=uu_2[2]+j2*u_inc2[2]
#define cl_cu_0_increase1 ,ll_2[0]+=l_inc1[0],ll_2[1]+=l_inc1[1],ll_2[2]+=l_inc1[2],\
uu_2[0]+=u_inc1[0],uu_2[1]+=u_inc1[1],uu_2[2]+=u_inc1[2],pj1+=p[0]
#define cl_cu_0_increase2 ,cl000+=l_inc2[0],cl100+=l_inc2[1],cl200+=l_inc2[2],\
cu000+=u_inc2[0],cu100+=u_inc2[1],cu200+=u_inc2[2]
#define condition j2>=l_innerest_bd
#define J_2_J_vtr j_vtr[0]=j0;j_vtr[1]=j1;j_vtr[2]=j2;
#define BREAK if(++t>time_end){if(t<=max_t){j2=l_innerest_bd-1000000;}\
else{j2=u_innerest_bd+1;}}
#define SupernodeSpaceLoops \
for(cl_cu_0_set_bef0,j1=cls101 cl_cu_0_set_aft1;j1<=cus101;j1++ cl_cu_0_increase1)\
for(cl_cu_0_set_bef1,l_innerest_bd=cls202/500,u_innerest_bd=cus202/500,\
j2=t-pj1 cl_cu_0_set_aft2;j2<=u_innerest_bd;j2++ cl_cu_0_increase2)
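The copy macros of this header all walk a three-dimensional block that is embedded in a larger linear array: entries 0 to 2 of the size descriptor give the extents of the block, entry 4 is the gap skipped after each innermost row, and entry 3 is the gap skipped after each plane. The following is a minimal sketch of CopyFromBufferToSupernode rewritten as a function, assuming data_type is int; the function name and the returned pointer are illustrative choices and not part of the generated code.

```c
#include <assert.h>

/*
 * Copy a contiguous buffer (src) into a strided block starting at dst.
 * size[0..2]: extents of the block in the three loop dimensions
 * size[4]:    elements skipped in dst after each innermost row
 * size[3]:    elements skipped in dst after each plane
 * Returns the position in src just after the last element consumed.
 * The caller must make dst large enough for every gap that is skipped.
 */
const int *copy_buffer_to_block(const int *src, int *dst, const int *size)
{
    assert(src && dst && size);
    for (int i0 = 0; i0 < size[0]; i0++) {
        for (int i1 = 0; i1 < size[1]; i1++) {
            for (int i2 = 0; i2 < size[2]; i2++)
                *dst++ = *src++;
            dst += size[4];
        }
        dst += size[3];
    }
    return src;
}
```

Returning the advanced source pointer mirrors the way the macro advances its first argument, so that several blocks can be unpacked one after another from the same receive buffer.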
G.1.2 Parallel Code
#include <stdio.h>
#include <math.h>
#include <csn/csn.h>
#include <csn/csnuio.h>
#include <csn/names.h>
#include <cs.h>
#include "../parameters.h"
#include "../pn.h"
#include "macro_def.c"
#include "prcsr.h"
main (argc, argv)
int argc;
char **argv;
{
Transport transport[D_array*(bi_link_orig+1)];
netid_t net_ID[D_array*(bi_link_orig+1)];
int n_netid_t=0,in,*dep_array,m,a_i[D_array],a_i_to[10*D_array];
int sequenceCount,*sequence,id_pos=2,array[D_array],rx_status=0,tx_status=0;
int i,ii,j,j_vtr[N],jj,r,l,data_flow_bank,n_data_flow_vec=4*n_para_data_flows,
N2=2*N,tmp_i,uni_link=1-bi_link_orig,bi_link=bi_link_orig,dim=N,k=N-D_array;
int t,ptr_elg_spnd=0,ptr_data_flow=0,link_out_valid[D_array*(bi_link_orig+1)],
link_in_valid[D_array*(bi_link_orig+1)];
int *direct_data_flows_out=direct_data_flows;
int *direct_data_flows_in=direct_data_flows+n_cubes;
int x,y,l_innerest_bd,u_innerest_bd,time_end,l_0[N+1],u_0[N+1];
int q[N],n_vec[8*D_array],i_index[N],i_from_j[N],*i_from_j_inc,jump_step[N],
tmp_vtr[N],j_vtr_init[N],ct=0,ctt=0,*ptr1,*ptr2,op_spnd;
data_type *tmp_ptr1,*tmp_ptr2,*tmp_ptr3,*elg_spnds_base,*idle_spnd;
char name[20],here_name[20],there_name[20];
struct iovec *data_flow_vec,*data_flow_vec_tmp;
if (!cs_import ("int vectorSize", &sequenceCount))
debugf("cannot import 'vectorSize'\n", 1);
sequence = (int *) malloc(sequenceCount*sizeof(int));
if (!cs_import ("int *vector", &sequence))
debugf("cannot import 'vector'\n", 1);
cp_vtrs(sequence,array,D_array);
cp_vtrs(sequence+D_array,a_i,D_array);
dep_array = sequence+2*D_array;
m = (sequenceCount-2*D_array)/D_array;
csn_init();
strcpy(name,ints_2_string(a_i,D_array));
strcpy(here_name,"transpot");
strcat(here_name,name);
for(i=0;i<N_links;i++)
{
strcpy(there_name,here_name);
strcat(there_name,int_2_alph(i));
if( csn_open( CSN_NULL_ID, &transport[i] ) != CSN_OK )
debugf( "master: cannot open transport\n", 1 );
if( csn_registername( transport[i],there_name ) != CSN_OK )
debugf( "cannot register Transport \n",1 );
}
r = 0;
for(i=0;i<D_array;i++)
for(j=1;j>=-bi_link_orig;j -= 2,r++)
{
copy_int_array(a_i,a_i_to,1,D_array,1,D_array,0,0);
a_i_to[i] += j;
if(0>a_i_to[i] || a_i_to[i]>=array[i])
link_out_valid[r] = 0;
else
{
strcpy(there_name,"transpot");
strcat(there_name,ints_2_string(a_i_to,D_array));
strcat(there_name,int_2_alph(r));
if(csn_lookupname(&net_ID[r],there_name,1)!=CSN_OK)
debugf("cannot lookup %s to %s\n",a_i,there_name);
link_out_valid[r] = 1;
}
a_i_to[i] -= 2*j;
if(0>a_i_to[i] || a_i_to[i]>=array[i])
link_in_valid[r] = 0;
else
link_in_valid[r] = 1;
}
/* cp_vtrs(a,b, ...) and copy_int_array(a,b, ...) are functions that copy vector a to b. */
if((elg_spnds_base = (data_type *) malloc(spnd_subpoly_size*sizeof(data_type)))
==NULL)
{debugf("Err: no enough memory for elg_spnds_base %d\n",
spnd_subpoly_size*sizeof(data_type));
exit(1);
}
idle_spnd = (data_type *) malloc(lgst*sizeof(data_type));
for(i=0;i<spnd_subpoly_size;i++)
*(elg_spnds_base+i) = 0;
for(i=0;i<lgst;i++)
*(idle_spnd+i) = 0;
for(i=0;i<N_links+2;i++)
{
if(! i )
data_flow[i] = (data_type *) malloc(n_note_data_flows*sizeof(data_type));
else
{
data_flow[i] = data_flow[i-1];
data_flow[i] += 2*size_data_flows[i-1];
}
}
for(i=0;i<n_note_data_flows;i++)
*(data_flow[0]+i)= 0;
data_flow_vec = (struct iovec *) malloc(2*n_para_data_flows*sizeof(struct iovec));
l = 0;
for(t=0,r=0;t<2;t++)
for(i=0,j=0;i<N_links;i++)
for(ii=0;ii<2;ii++)
{
n_vec[l] = 0;
for(jj=0;jj<n_vtes_flows[i*N_links+ii];jj++,j+=3)
{
data_flow_bank = 0;
if(para_data_flows[j]<(N_links+1)&&t==ii)
data_flow_bank = size_data_flows[para_data_flows[j]];
if(0<=para_data_flows[j]&&para_data_flows[j]<N_links+2)
(data_flow_vec+r)->iov_base = (caddr_t)
(data_flow[para_data_flows[j]] + para_data_flows[j+1]
+data_flow_bank);
(data_flow_vec+r)->iov_len = para_data_flows[j+2]
*sizeof(data_type);
if((data_flow_vec+r)->iov_len)
{
r++;
n_vec[l]++;
}
}
l++;
}
l_0[N] = 1;
u_0[N] = 1;
re_order_int_rows(a_i,a_i_to,order_vtr,D_array,1);
time_end = time_comsumed[array_2_ad(a_i,array,D_array)];
ptr2=(lowerbds_sizes+dim*array_2_ad(a_i,array,D_array));
transpose_int(B_j_2_i,B_j_2_i,N,N);
re_order_int_rows(B_j_2_i,B_j_2_i,order_vtr,D_array,N);
transpose_int(B_j_2_i,B_j_2_i,N,N);
cp_vtrs(ptr2,a_i_to,D_array);
cp_vtrs(a_i_to,l_0,D_array);
cp_vtrs(a_i_to,u_0,D_array);
re_order_int_rows(l_0,l_0,order_vtr,D_array,1);
re_order_int_rows(u_0,u_0,order_vtr,D_array,1);
/*
transpose_int(A,B, ...) is a function to transpose matrix A to B.
re_order_int_rows(A,B,c, ...) is a function to permute the rows of matrix A to B
according to vector c.
*/
cp_vtrs(ptr2,j_vtr_init,dim);
for(i=D_array;i<N;i++)
{
l_0[i] = 0;
u_0[i] = 0;
}
matrix_int_multiplication(l_v_c,l_0,l_0,N,N+1,1,1,0);
matrix_int_multiplication(u_v_c,u_0,u_0,N,N+1,1,1,0);
for(i=1,jump_step[N-1]=1;i<N;i++)
jump_step[N-1-i] = 1*jump_step[N-i];
t=0;
SupernodeSpaceLoops
{
if(condition)
{
J_2_J_vtr;
add_2_vtrs(j_vtr_init,j_vtr,tmp_vtr,N,-1);
inner_prct(tmp_i,tmp_vtr,jump_j_2_extra_vol,N);
elg_spnds[t] = (data_type*) elg_spnds_base + tmp_i;
matrix_int_multiplication(B_j_2_i,j_vtr,i_from_j,N,N,1,1,0);
InitializeData(cont_up_bds);
}
else
elg_spnds[t] = idle_spnd;
if(*ptr2==100000)
l_innerest_bd = u_innerest_bd;
BREAK;
}
data_flow_bank = 0;
ii = 0;
t = 0;
SupernodeSpaceLoops
{
for(i=0,r=0;i<N_DC;i++,r+=N2)
if(direct_data_flows_out[i]!=-1)
{
data_flow_vec_tmp = data_flow_vec + ii + direct_data_flows[i+n_cubes];
CreateDirectInputDBV(elg_spnds[t]+bgs_to_extra_vol[i],size_cubes+r)
}
for(i=0,r=0;i<N_DC;i++,r+=N2)
if(direct_data_flows_out[i]!=-1)
{
data_flow_vec_tmp = data_flow_vec + ii + direct_data_flows[i];
op_spnd = t - delay_data[i] - 1;
if (op_spnd<0||elg_spnds[op_spnd]==idle_spnd)
{
tmp_ptr3 = idle_spnd;
CreateDirectOutputDBVO(tmp_ptr3,size_cubes+r);
}
else
{
tmp_ptr3 = elg_spnds[op_spnd]+bgs_out[i];
CreateDirectOutputDBV(tmp_ptr3,size_cubes+r);
}
}
for(l=0,i=0;i<N_links;i++)
{
if(link_in_valid[i]&&n_vec[l])
csn_rxnbv(transport[i],data_flow_vec+ii,n_vec[l]);
ii += n_vec[l++];
if(link_out_valid[i]&&n_vec[l])
csn_txnbv(transport[i],0,net_ID[i],data_flow_vec+ii,n_vec[l]);
ii += n_vec[l++];
}
for(i=0;i<N_links;i++)
if(link_in_valid[i])
csn_test(transport[i],CSN_RXREADY,-1,NULL,NULL,&rx_status);
tmp_ptr1 = data_flow[N_links+1];
if (condition)
for(i=0,r=0;i<N_DC;i++,r+=N2)
if(direct_data_flows_in[i]==-1)
CopyFromBufferToSupernode(tmp_ptr1,elg_spnds[t]
+bgs_to_extra_vol[i],size_cubes+r);
if (condition)
ComputeSupernode(cont_up_bds);
tmp_ptr2 = (data_flow_bank)? data_flow[0] :data_flow[0]+size_data_flows[0];
for(i=0;i<N_links;i++)
if(link_out_valid[i])
csn_test(transport[i],CSN_TXREADY,-1,NULL,NULL,&tx_status);
for(i=0,r=0;i<N_DC;i++,r+=N2)
if (direct_data_flows_out[i]==-1)
{
op_spnd = t - delay_data[i];
if (op_spnd<0||elg_spnds[op_spnd]==idle_spnd)
{
tmp_ptr3 = idle_spnd;
CopyFromSupernodeToBufferO(tmp_ptr3,tmp_ptr2,size_cubes+r);
}
else
{
tmp_ptr3 = elg_spnds[op_spnd]+bgs_out[i];
CopyFromSupernodeToBuffer(tmp_ptr3,tmp_ptr2,size_cubes+r);
}
}
data_flow_bank++;
if (data_flow_bank==2)
{
data_flow_bank = 0;
ii = 0;
}
if(*ptr2==100000)
l_innerest_bd = u_innerest_bd;
BREAK;
}
}
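The link set-up near the start of this program decides, for every dimension of the processor array and for each of the two directions, whether a neighbouring processor exists: the neighbour coordinate must lie in the range from 0 to array[i]-1. The following is a minimal sketch of that test in isolation, assuming bidirectional links (bi_link_orig equal to 1); the function name and its flat output array are hypothetical.

```c
#include <assert.h>

/*
 * For a processor at position coord[] in an array with extent[] processors
 * per dimension, write one flag per (dimension, direction) pair into valid[],
 * in the order (0,+1), (0,-1), (1,+1), (1,-1), ...
 * A flag is 1 when the neighbour in that direction lies inside the array.
 * Returns the number of flags written (2 * n_dims).
 */
int mark_valid_links(const int *coord, const int *extent, int n_dims, int *valid)
{
    int r = 0;
    assert(coord && extent && valid && n_dims > 0);
    for (int i = 0; i < n_dims; i++) {
        for (int step = 1; step >= -1; step -= 2, r++) {
            int neighbour = coord[i] + step;
            valid[r] = (neighbour >= 0 && neighbour < extent[i]);
        }
    }
    return r;
}
```

In the generated code the same test is applied twice per link, once for the outgoing direction and once for the opposite, incoming direction, and a transport name is looked up only for links that pass the test.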
G.2 Parallel Codes for LSGP
G.2.1 h.file
static int cont_up_bds[3] = {3, 3, 11};
static int spnd_size[6] = {4, 4, 12, 7488, 144, 1};
static int jump_extra_vol[3] = {8112, 156, 1};
static int jump_j_2_extra_vol[3] = {31800, -31212, 12};
static int d_q[4] = {-8112, -158, -156, -157};
static int n_cubes_dep_spnd[4] = {1, 1, 1, 1};
static int extra_vol[3] = {8, 8, 24};
static int delay_data[4] = {2, 0, 3, 2};
static int n_notes[4] = {48, 24, 48, 8};
static int size_cubes[24] = {4, 1, 12, 7956, 144, 1, 4, 3, 2, 7644, 154, 1, 1, 4,
12, 7488, 144, 1, 4, 1, 2, 7956, 154, 1};
static int bgs_to_extra_vol[4] = {32928, 33082, 24972, 32926};
static int l_inc2[3] = {0, 0, -12};
static int l_v_c[12] = {-4, 4, 0, 0, 8, -12, 0, 0, 32, 8, -12, 0};
static int u_inc2[3] = {0, 0, -12};
static int u_v_c[12] = {-4, 4, 0, 39, 8, -12, 0, 39, 32, 8, -12, 49};
static int bgs_out[4] = {33552, 33094, 57420, 33562};
static int lower_array_bds[2] = {-11, 0};
static int H[4] = {3, 0, -2, 3};
static int h0[3] = {-1, 0, 0};
static int ptn_array[278] = {9, 1, 2, 1, 0, 0, 48, 0, 0, 0, 1, 2, 0, 0, 0, 1, 0,
0, 48, 3, 1, 2, 3, 0, 2, 0, 0, 24, 0, 72, 8, 0, 1, 0, 24, 48, 1, 2, 1, 0, 0, 48,
0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 48, 2, 1, 3, 0, 2, 0, 0, 24, 0, 24, 8, 0, 0, 3, 0,
2, 3, 1, 0, 48, 48, 0, 3, 0, 0, 48, 0, 96, 8, 1, 0, 8, 0, 2, 0, 3, 0, 0, 2, 0, 0,
48, 0, 48, 8, 0, 3, 0, 1, 3, 0, 2, 0, 48, 24, 0, 72, 8, 1, 0, 0, 48, 0, 9, 1, 2,
-1, 0, 1, 5, 0, 48, 0, 0, 0, 1, 2, -1, 3, 0, 0, 0, 1, 5, 0, 48, 3, 2, 1, 3, -1,
0, -1, 3, 1, 1, 0, 2, 5, 48, 24, 5, 72, 8, 0, 1, 5, 0, 48, 0, 1, 2, 0, 48, 0, 0,
0, 1, 2, 0, 3, 0, 0, 0, 1, 5, 0, 48, 1, 1, 0, 1, 0, 2, 5, 0, 24, 1, 0, 8, 0, 0,
4, 2, 0, 3, 3, -1, 0, -1, 0, 0, 2, 2, 2, 1, 5, 0, 48, 0, 3, 5, 48, 48, 5, 96, 8,
5, 104, 8, 0, 2, 0, 3, 0, -1, 2, 2, 0, 0, 2, 5, 0, 48, 5, 48, 8, 0, 3, 1, 3, 0,
0, -1, 0, 1, 1, 2, 0, 2, 5, 0, 24, 5, 24, 8, 1, 5, 32, 48, 0, 3, 0, 104, 8, 48,
0, 0, 0, 0, 8, 48, 0, 0, 112};
static int cubes_out_group[8] = {0, 1, 1, 2, 2, 3, 3, 4};
static int D[12] = {1, 0, 0, 0, -1, 1, 1, 1, -3, 1, -1, 0};
static int size[3] = {39, 39, 49};
static int lowerbds_sizes[48] = {300000, 200000, 900000, 26, 19, 87, 20, 16, 69,
16, 14, 57, 5, 4, 21, 3, 3, 15, 6, 6, 24, 9, 9, 33, 0, 0, 3, 3, 3, 12, 6, 6, 21,
9, 9, 30, 0, 0, 0, 3, 3, 10, 6, 6, 20, 300000, 200000, 900000};
static int order_vtr[2] = {0, 1};
static int E[9] = {1, 0, 0, -1, 1, 0, -3, -1, 1};
static int B_j_2_i[9] = {4, -4, 0, -8, 12, 0, -32, -8, 12};
static int array[2] = {12, 12};
static int i0=0, i1=0, i2=0;
static int j0=0, j1=0, j2=0;
static int pj0=0, pj1=0, pj2=0;
static int vol_elg_spnd=1536, spnd_subpoly_size=356928, time_used=123, min_t=0,
max_t=122, N_DC=4, N_links=4, n_cubes=4, lgst=12, n_para_data_flows=16,
n_dep_array=0, bg_spnd=33084, n_note_data_flows=600003, edge_extra=33084,
n_total_nodes=80000, n_prcsrs=16;
static data_type *elg_spnds[123], *data_flow[6];
static int la100=1,la200=3,la201=1;
static int ua100=1,ua200=3,ua201=1;
static int cl000,cl100,cl101,cl200,cl201,cl202;
static int cu000,cu100,cu101,cu200,cu201,cu202;
static int ll_2[3];
static int uu_2[3];
static int *q_ad0,*q_ad1,*q_ad2,*q_ad3;
#define CopyFromBufferToSupernode(a,b,zise_a) {\
tmp_ptr2 = b;\
for(i0=0;i0<*(zise_a+0);i0++)\
{\
for(i1=0;i1<*(zise_a+1);i1++)\
{\
for(i2=0;i2<*(zise_a+2);i2++)\
{\
*tmp_ptr2++ = *a++;\
}\
tmp_ptr2 += *(zise_a+4);\
}\
tmp_ptr2 += *(zise_a+3);\
}\
}
#define CopyFromSupernodeToBuffer(a,b,zise_b) {\
tmp_ptr1 = a;\
for(i0=0;i0<*(zise_b+0);i0++)\
{\
for(i1=0;i1<*(zise_b+1);i1++)\
{\
for(i2=0;i2<*(zise_b+2);i2++)\
{\
*b++ = *tmp_ptr1++;\
}\
tmp_ptr1 += *(zise_b+4);\
}\
tmp_ptr1 += *(zise_b+3);\
}\
}
#define CopyFromSupernodeToBufferO(a,b,zise_b) {\
tmp_ptr1 = a;\
for(i0=0;i0<*(zise_b+0);i0++)\
{\
for(i1=0;i1<*(zise_b+1);i1++)\
{\
for(i2=0;i2<*(zise_b+2);i2++)\
{\
*b++ = *tmp_ptr1;\
}\
}\
}\
}
#define InitializeData(bounds) {q_ad0 = elg_spnds[t] + bg_spnd;\
for(q[0]=max_fn(cl000,0),cl101=cl100+q[0]*la100,cl201=cl200+q[0]*la200,\
cu101=cu100+q[0]*ua100,cu201=cu200+q[0]*ua200,q_ad1=q_ad0+q[0]*jump_extra_vol[0];\
q[0]<=min_fn(cu000,*(bounds+0));q[0]++,cl101+=la100,cl201+=la200,cu101+=ua100,\
cu201+=ua200,q_ad1+=jump_extra_vol[0])\
for(q[1]=max_fn(cl101,0),cl202=cl201+q[1]*la201,cu202=cu201+q[1]*ua201,\
q_ad2=q_ad1+q[1]*jump_extra_vol[1];q[1]<=min_fn(cu101,*(bounds+1));q[1]++,\
cl202+=la201,cu202+=ua201,q_ad2+=jump_extra_vol[1])\
for(q[2]=max_fn(cl202,0),q_ad3=q_ad2+q[2];q[2]<=min_fn(cu202,*(bounds+2));\
q[2]++,q_ad3++)\
{\
ct++;\
matrix_int_multiplication(E,q,i_index,dim,dim,1,1,0);\
add_2_int_vtrs(i_index,i_from_j,i_index,N,1);\
matrix_int_multiplication(jump_step,i_index,q_ad3,1,dim,1,1,0);\
}\
}
#define ComputeSupernode(bounds) {q_ad0 = elg_spnds[t] + bg_spnd;\
for(i0=max_fn(cl000,0),cl101=cl100+i0*la100,cl201=cl200+i0*la200,cu101=cu100+i0*ua100,\
cu201=cu200+i0*ua200,q_ad1=q_ad0+i0*jump_extra_vol[0];i0<=min_fn(cu000,*(bounds+0));\
i0++,cl101+=la100,cl201+=la200,cu101+=ua100,cu201+=ua200,q_ad1+=jump_extra_vol[0])\
for(i1=max_fn(cl101,0),cl202=cl201+i1*la201,cu202=cu201+i1*ua201,q_ad2=q_ad1\
+i1*jump_extra_vol[1];i1<=min_fn(cu101,*(bounds+1));i1++,cl202+=la201,cu202+=ua201,\
q_ad2+=jump_extra_vol[1])\
for(i2=max_fn(cl202,0),q_ad3=q_ad2+i2;i2<=min_fn(cu202,*(bounds+2));i2++,\
q_ad3++)\
{\
*q_ad3 = *q_ad3+*(q_ad3+d_q[0])+*(q_ad3+d_q[1])+*(q_ad3+d_q[2])+*(q_ad3\
+d_q[3]);\
}\
}
#define LL_2_CL cl000=ll_2[0],cu000=uu_2[0],cl100=ll_2[1],cu100=uu_2[1],cl200=ll_2[2],\
cu200=uu_2[2]
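The body of ComputeSupernode reads its four operands through the linear offsets d_q. In both header files the printed d_q values are consistent with taking a three-component displacement and multiplying it against jump_extra_vol, for example (-1,0,0), (0,-1,-2), (0,-1,0) and (0,-1,-1). These displacement vectors are inferred from the printed numbers and are an assumption, not a statement from the thesis text. A minimal sketch of that linearisation and of one point update follows, assuming data_type is int; both function names are hypothetical.

```c
#include <assert.h>

/* Linear offset of a 3-component displacement v under the given strides. */
int linear_offset(const int *v, const int *stride)
{
    assert(v && stride);
    return v[0] * stride[0] + v[1] * stride[1] + v[2] * stride[2];
}

/*
 * One point update in the style of ComputeSupernode: the value at p is
 * increased by the values found at the four dependence offsets d_q[0..3].
 * Every p + d_q[k] must point inside the same array as p.
 */
int update_point(int *p, const int *d_q)
{
    assert(p && d_q);
    *p = *p + p[d_q[0]] + p[d_q[1]] + p[d_q[2]] + p[d_q[3]];
    return *p;
}
```

Precomputing the offsets once, as the header files do, keeps the innermost loop free of index arithmetic: each iteration only increments the pointer q_ad3 and adds fixed offsets to it.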
G.2.2 Parallel Codes
#include <stdio.h>
#include <math.h>
#include <csn/csn.h>
#include <csn/csnuio.h>
#include <csn/names.h>
#include <cs.h>
#include "../parameters.h"
#include "../pn.h"
#include "macro_def.c"
#include "prcsr.h"
main (argc, argv)
int argc;
char **argv;
{
Transport transport[D_array*(bi_link_orig+1)];
netid_t net_ID[D_array*(bi_link_orig+1)];
int n_netid_t=0,in,*dep_array,m,a_i[D_array],a_i0[D_array],a_i_to[10*D_array];
int sequenceCount,*sequence,id_pos=2,array[D_array],rx_status=0,tx_status=0;
int i,ii,j,j_vtr[N+1],j_vtr_1[N],j_vtr0[N],jj,r,l,data_flow_bank,
n_data_flow_vec=4*N_links*n_data_vtes,N2=2*N,tmp_i,uni_link=1-bi_link_orig,
bi_link=bi_link_orig,dim=N,k=N-D_array;
int t,s[D_array],s0[D_array],ptr_elg_spnd=0,ptr_data_flow=0;
int link_out_valid[D_array*(bi_link_orig+1)];
int link_in_valid[D_array*(bi_link_orig+1)];
int l_0[N+1],u_0[N+1],ll_1[N],uu_1[N],x,y,l_innerest_bd,u_innerest_bd,time_end;
int q[N],i_index[N],i_from_j[N],*i_from_j_inc,jump_step[N],tmp_vtr[N],
j_vtr_init[N],ct=0,ct0,ctt=0,*ptr1,*ptr2,prct;
int l_v_t[N*D_array],u_v_t[N*D_array];
data_type *tmp_ptr1,*tmp_ptr2,*tmp_ptr3,*elg_spnds_base,*idle_spnd,
*data_flow_out[2*D_array+2],*data_flow_in[2*D_array+2];
char name[20],here_name[20],there_name[20];
struct iovec *data_flow_vec_out,*data_flow_vec_in,*tmp_vec;
int n_spnd_in_prc=*ptn_array,*ptn=ptn_array,ptn_out_inc,ptn_ptr,*size_buffer_out,
*size_buffer_in,*buffer,*delay_minus,*port,active_d_p;
MCC_SBC_PRCSR_DATA_FLOW *ptns_out,*ptns_in,*tmp_S_out,*tmp_S_in;
if (!cs_import ("int vectorSize", &sequenceCount))
debugf("cannot import 'vectorSize'\n", 1);
sequence = (int *) malloc(sequenceCount*sizeof(int));
if (!cs_import ("int *vector", &sequence))
debugf("cannot import 'vector'\n", 1);
cp_vtrs(sequence,array,D_array);
cp_vtrs(sequence+D_array,a_i,D_array);
dep_array = sequence+2*D_array;
m = (sequenceCount-2*D_array)/D_array;
cp_vtrs(a_i,a_i0,D_array);
re_order_int_rows(a_i,a_i,order_vtr,D_array,1);
csn_init();
strcpy(name,ints_2_string(a_i,D_array));
strcpy(here_name,"transpot");
strcat(here_name,name);
for(i=0;i<N_links;i++)
{
strcpy(there_name,here_name);
strcat(there_name,int_2_alph(i));
if( csn_open( CSN_NULL_ID, &transport[i] ) != CSN_OK )
debugf( "master: cannot open transport\n", 1 );
if( csn_registername( transport[i],there_name ) != CSN_OK )
debugf( "cannot register Transport \n",1 );
}
r = 0;
for(i=0;i<D_array;i++)
for(j=1;j>=-bi_link_orig;j -= 2,r++)
{
copy_int_array(a_i,a_i_to,1,D_array,1,D_array,0,0);
a_i_to[i] += j;
if(0>a_i_to[i] || a_i_to[i]>=array[i])
link_out_valid[r] = 0;
else
{
strcpy(there_name,"transpot");
strcat(there_name,ints_2_string(a_i_to,D_array));
strcat(there_name,int_2_alph(r));
if(csn_lookupname(&net_ID[r],there_name,1)!=CSN_OK)
debugf("cannot lookup %s to %s\n",a_i,there_name);
link_out_valid[r] = 1;
}
a_i_to[i] -= 2*j;
if(0>a_i_to[i] || a_i_to[i]>=array[i])
link_in_valid[r] = 0;
else
link_in_valid[r] = 1;
}
cp_vtrs(a_i0,a_i,D_array);
re_order_int_rows(a_i,a_i_to,order_vtr,D_array,1);
ptn = ptn_array;
ptns_out= (MCC_SBC_PRCSR_DATA_FLOW *) array_2_ptns(ptn,ptn_array,
D_array,&ptn_out_inc,0);
ptn = ptn_array + ptn_out_inc;
ptns_in = (MCC_SBC_PRCSR_DATA_FLOW *) array_2_ptns(ptn,ptn_array,D_array,
&ptn_out_inc,1);
/* ptns_out and ptns_in point to the OP's and IP's produced from the data
of ptn_array. */
inner_prct(ptn_ptr,a_i_to,ptn_array + ptn_out_inc,D_array);
ptn_ptr = round_off(ptn_ptr,n_spnd_in_prc);
ptn += D_array;
size_buffer_out = ptn_array + ptn_out_inc + D_array;
size_buffer_in = size_buffer_out + 2*D_array +2;
data_flow_vec_out = (struct iovec *) malloc(2*N_links*N2*sizeof(struct iovec));
data_flow_vec_in = (struct iovec *) malloc(2*N_links*N2*sizeof(struct iovec));
for(i=0;i<N_links+2;i++)
{
data_flow_out[i] = (data_type *) malloc(*(size_buffer_out+i)*sizeof(data_type));
data_flow_in[i] = (data_type *) malloc(*(size_buffer_in+i)*sizeof(data_type));
}
idle_spnd = (data_type *) malloc(vol_elg_spnd*sizeof(data_type));
elg_spnds_base = (data_type *) malloc(spnd_subpoly_size*sizeof(data_type));
l_0[N] = 1;
u_0[N] = 1;
j_vtr[N] = 1;
for(i=0;i<spnd_subpoly_size;i++)
*(elg_spnds_base+i) = 0;
for(i=0;i<vol_elg_spnd;i++)
*(idle_spnd+i) = 0;
ptr2=(lowerbds_sizes+dim*array_2_ad(a_i,array,D_array));
cp_vtrs(ptr2,j_vtr_init,dim);
for(i=0;i<N;i++)
{
l_0[i] = 0;
u_0[i] = 0;
}
matrix_int_multiplication(l_v_c,l_0,l_0,N,N+1,1,1,0);
matrix_int_multiplication(u_v_c,u_0,u_0,N,N+1,1,1,0);
add_2_vtrs(l_inc2,l_0,l_0,N,min_t);
add_2_vtrs(u_inc2,u_0,u_0,N,min_t);
cp_vtrs(l_0,ll_1,N);
cp_vtrs(u_0,uu_1,N);
transpose_int(l_v_c,l_v_c,N,N+1);
transpose_int(u_v_c,u_v_c,N,N+1);
for(i=1,jump_step[N-1]=1;i<N;i++)
jump_step[N-1-i] = 1*jump_step[N-i];
cp_vtrs(lower_array_bds,s,D_array);
scalar_vtr_prct(*H,a_i_to,s,s,D_array);
add_2_vtrs(h0,s,s,D_array,-min_t);
cp_vtrs(s,s0,D_array);
j_vtr[D_array] = min_t;
for(i=0;i<D_array;i++)
{
inner_prct(prct,H+i*(D_array),j_vtr,i);
j_vtr[i] = ceil(1.*(s[i]-prct)/(*(H+i*(D_array)+i)));
}
cp_vtrs(j_vtr,j_vtr0,N);
matrix_int_multiplication(j_vtr,l_v_c,ll_2,1,N+1,N,1,0);
matrix_int_multiplication(j_vtr,u_v_c,uu_2,1,N+1,N,1,0);
cp_vtrs(j_vtr,j_vtr_1,N);
cp_vtrs(ll_2,l_0,N);
cp_vtrs(uu_2,u_0,N);
for(j_vtr[D_array]=min_t,t=0;j_vtr[D_array]<=max_t;j_vtr[D_array]++,t++)
{
add_2_vtrs(j_vtr_init,j_vtr,tmp_vtr,N,-1);
inner_prct(tmp_i,tmp_vtr,jump_j_2_extra_vol,N);
elg_spnds[t] = (data_type*) elg_spnds_base + tmp_i;
matrix_int_multiplication(B_j_2_i,j_vtr,i_from_j,N,N,1,1,0);
add_2_vtrs(j_vtr,j_vtr_1,j_vtr_1,N,-1);
for(i=0;i<N;i++)
if(tmp_i=*(j_vtr_1+i))
{
add_2_vtrs(l_v_c+i*N,ll_2,ll_2,N,-tmp_i);
add_2_vtrs(u_v_c+i*N,uu_2,uu_2,N,-tmp_i);
}
cp_vtrs(j_vtr,j_vtr_1,N);
LL_2_CL;
ct0 = ct;
InitializeData(cont_up_bds);
if (ct0==ct)
elg_spnds[t] = idle_spnd;
add_2_vtrs(h0,s,s,D_array,-1);
for(i=0;i<D_array;i++)
{
inner_prct(prct,H+i*(D_array),j_vtr,i);
j_vtr[i] = ceil(1.*(s[i]-prct)/(*(H+i*(D_array)+i)));
}
}
cp_vtrs(s0,s,D_array);
cp_vtrs(j_vtr0,j_vtr,N);
cp_vtrs(j_vtr,j_vtr_1,N);
cp_vtrs(l_0,ll_2,N);
cp_vtrs(u_0,uu_2,N);
for(j_vtr[D_array]=min_t,t=0;j_vtr[D_array]<=max_t;j_vtr[D_array]++,t++)
{
add_2_vtrs(j_vtr,j_vtr_1,j_vtr_1,N,-1);
for(i=0;i<N;i++)
if(tmp_i=*(j_vtr_1+i))
{
add_2_vtrs(l_v_c+i*N,ll_2,ll_2,N,-tmp_i);
add_2_vtrs(u_v_c+i*N,uu_2,uu_2,N,-tmp_i);
}
cp_vtrs(j_vtr,j_vtr_1,N);
LL_2_CL;
ComputeSupernode(cont_up_bds);
add_2_vtrs(h0,s,s,D_array,-1);
for(i=0;i<D_array;i++)
{
inner_prct(prct,H+i*(D_array),j_vtr,i);
j_vtr[i] = ceil(1.*(s[i]-prct)/(*(H+i*(D_array)+i)));
}
tmp_S_out = ptns_out + N_links*ptn_ptr;
tmp_S_in = ptns_in + N_links*ptn_ptr;
delay_minus = tmp_S_in->nth_d_p + tmp_S_in->n_nth_d_p;
port = tmp_S_in->nth_d_p + 2*tmp_S_in->n_nth_d_p;
tmp_ptr2 = data_flow_out[0];
for(ii=0;ii<tmp_S_out->n_nth_d_p;ii++)
{
active_d_p = *(tmp_S_out->nth_d_p+ii);
for(i=cubes_out_group[2*active_d_p],r=i*N2;i<cubes_out_group[2*active_d_p
+1];i++,r+=N2)
if (elg_spnds[t]==idle_spnd)
{
copy_2_vtrs(elg_spnds[t],tmp_ptr2,n_notes[i]);
tmp_ptr2 += n_notes[i];
}
else
{
tmp_ptr3 = elg_spnds[t]+bgs_out[i];
CopyFromSupernodeToBuffer(tmp_ptr3,tmp_ptr2,size_cubes+r);
}
}
for(i=0;i<N_links;i++)
{
if(link_in_valid[i]&&(tmp_S_in+i)->n_d_p)
{
for(j=0,buffer=(tmp_S_in+i)->buffer;j<(tmp_S_in+i)->n_d_p;j++,
buffer+=3)
{
(data_flow_vec_in+i*N2+j)->iov_base = (caddr_t)
(data_flow_in[*buffer] + *(buffer+1));
(data_flow_vec_in+i*N2+j)->iov_len = *(buffer+2)*sizeof(data_type);
}
csn_rxnbv(transport[i],data_flow_vec_in+i*N2,(tmp_S_in+i)->n_d_p);
}
if (link_out_valid[i]&&(tmp_S_out+i)->n_d_p)
{
for(j=0,buffer=(tmp_S_out+i)->buffer;j<(tmp_S_out+i)->n_d_p;
j++,buffer+=3)
{
if(*buffer)
(data_flow_vec_out+i*N2+j)->iov_base = (caddr_t)
(data_flow_in[*buffer] + *(buffer+1));
else
(data_flow_vec_out+i*N2+j)->iov_base = (caddr_t)
(data_flow_out[*buffer] + *(buffer+1));
(data_flow_vec_out+i*N2+j)->iov_len = *(buffer+2)
*sizeof(data_type);
}
csn_txnbv(transport[i],0,net_ID[i],data_flow_vec_out+i*N2,
(tmp_S_out+i)->n_d_p);
}
}
for(i=0;i<N_links;i++)
if (link_in_valid[i]&&(tmp_S_in+i)->n_d_p)
csn_test(transport[i],CSN_RXREADY,-1,NULL,NULL,&rx_status);
for(i=0;i<N_links;i++)
if (link_out_valid[i]&&(tmp_S_out+i)->n_d_p)
csn_test(transport[i],CSN_TXREADY,-1,NULL,NULL,&tx_status);
tmp_ptr1 = data_flow_in[N_links+1];
for(ii=0;ii<tmp_S_in->n_nth_d_p;ii++)
{
active_d_p = *(tmp_S_in->nth_d_p+ii);
for(i=cubes_out_group[2*active_d_p],r=i*N2;i<cubes_out_group[2*active_d_p
+1];i++,r+=N2)
{
l=t+delay_data[i]-*(delay_minus+ii)+1;
if (link_in_valid[*(port+ii)]&&l<=max_t&&elg_spnds[l]!=idle_spnd)
{
CopyFromBufferToSupernode(tmp_ptr1,elg_spnds[l]
+bgs_to_extra_vol[i],size_cubes+r);
}
else
tmp_ptr1 += n_notes[i];
}
}
ptn_ptr++;
if (ptn_ptr==n_spnd_in_prc)
ptn_ptr = 0;
}
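The loops that recompute j_vtr[i] from s[i], prct and the diagonal entry of H amount to a forward substitution with rounding up: row i subtracts the contribution of the components already fixed and divides by the diagonal entry, taking the ceiling. The following is a minimal sketch of that step in integer arithmetic, assuming that inner_prct(r,a,b,n) stores the inner product of the first n entries of a and b in r, and that H is lower triangular with a positive diagonal, as in the printed H = {3, 0, -2, 3}. The function names are hypothetical.

```c
#include <assert.h>

/* Ceiling of a / b for b > 0, valid for negative a as well. */
int ceil_div(int a, int b)
{
    int q;
    assert(b > 0);
    q = a / b;                  /* C division truncates toward zero */
    if (a % b != 0 && a > 0)    /* only positive inexact quotients round up */
        q++;
    return q;
}

/*
 * For a lower-triangular n x n matrix h (row-major, positive diagonal),
 * set j[i] to the smallest integer with
 *     h[i][0]*j[0] + ... + h[i][i]*j[i] >= s[i],
 * processing the rows in increasing order of i.
 */
void ceil_forward_subst(const int *h, const int *s, int n, int *j)
{
    assert(h && s && j && n > 0);
    for (int i = 0; i < n; i++) {
        int partial = 0;
        for (int k = 0; k < i; k++)
            partial += h[i * n + k] * j[k];
        j[i] = ceil_div(s[i] - partial, h[i * n + i]);
    }
}
```

The generated code obtains the same result with a floating-point division followed by ceil; the integer form shown here avoids the conversion but is otherwise only an illustration of the same computation.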
