A Comprehensive Methodology for Algorithm Characterization, Regularization and Mapping Into Optimal VLSI Arrays. by Barada, Hassan Reda
Louisiana State University
LSU Digital Commons
LSU Historical Dissertations and Theses Graduate School
1989
A Comprehensive Methodology for Algorithm
Characterization, Regularization and Mapping Into
Optimal VLSI Arrays.
Hassan Reda Barada
Louisiana State University and Agricultural & Mechanical College
Follow this and additional works at: https://digitalcommons.lsu.edu/gradschool_disstheses
This Dissertation is brought to you for free and open access by the Graduate School at LSU Digital Commons. It has been accepted for inclusion in
LSU Historical Dissertations and Theses by an authorized administrator of LSU Digital Commons. For more information, please contact
gradetd@lsu.edu.
Recommended Citation
Barada, Hassan Reda, "A Comprehensive Methodology for Algorithm Characterization, Regularization and Mapping Into Optimal
VLSI Arrays." (1989). LSU Historical Dissertations and Theses. 4761.
https://digitalcommons.lsu.edu/gradschool_disstheses/4761
INFORMATION TO USERS
The most advanced technology has been used to photo­
graph and reproduce this manuscript from the microfilm 
master. UMI films the text directly from the original or 
copy submitted. Thus, some thesis and dissertation copies 
are in typewriter face, while others may be from any type 
of computer printer.
The quality of this reproduction is dependent upon the 
quality of the copy submitted. Broken or indistinct print, 
colored or poor quality illustrations and photographs, 
print bleedthrough, substandard margins, and improper 
alignment can adversely affect reproduction.
In the unlikely event that the author did not send UMI a 
complete manuscript and there are missing pages, these 
will be noted. Also, if  unauthorized copyright material 
had to be removed, a note will indicate the deletion.
Oversize materials (e.g., maps, drawings, charts) are re­
produced by sectioning the original, beginning at the 
upper left-hand corner and continuing from left to right in 
equal sections with small overlaps. Each original is also 
photographed in one exposure and is included in reduced 
form at the back of the book. These are also available as 
one exposure on a standard 35mm slide or as a 17" x 23" 
black and white photographic print for an additional 
charge.
Photographs included in the original manuscript have 
been reproduced xerographically in this copy. Higher 
quality 6" x 9" black and white photographic prints are 
available for any photographs or illustrations appearing 
in this copy for an additional charge. Contact UMI directly 
to order.
University Microfilms International 
A Bell & Howell Information Company 
300 North Zeeb Road, Ann Arbor, Ml 48106-1346 USA 
313/761-4700 800/521-0600

Order Num ber 901724S
A  com prehensive m ethodology for algorithm  characterization , 
regularization  and m apping into  optim al V L SI arrays
Barada, Hassan Reda, Ph.D.
The Louisiana State University and Agricultural and Mechanical Col., 1989
U M I
300 N. Zeeb Rd.
Ann Arbor, MI 48106

A COMPREHENSIVE METHODOLOGY FOR ALGORITHM 
CHARACTERIZATION, REGULARIZATION AND 
MAPPING INTO OPTIMAL VLSI ARRAYS
A Dissertation
Submitted to the Graduate Faculty of the 
Louisiana State University and 
Agriculture and Mechanical College 
in partial fulfillment of the 
requirements for the degree of 
Doctor of Philosophy
m
The Department of Electrical and Computer Engineering
by
Hassan Reda Barada 
B.S.E.E. Louisiana State University 1984 
M.S.E.E. Louisiana State University 1986 
August 1989
ACKNOWLEDGEMENTS
I would like to express my sincere appreciation to Dr. A. El-Amawy, my advi­
sor, for suggesting the topic of this research, for being an inexhaustible source of 
ideas and inspiration, and for the benefits I received by working with him during my 
years at LSU. Many improvements in the dissertation are due to his constructive cri­
ticism during its preparation.
I am thankful to Dr. S. Kak for being a good friend and an effective teacher and 
to Dr. J. Aravena for his invaluable help and advice. I would also like to thank the 
rest of my Committee’s members, Dr. B. Jones, Dr. G. Kousik and Dr. G. Ferreyra, 
for taking the time to read this dissertation.
A special note of thanks is extended to Sandra Holmes for her support and for 
going through earlier versions of this dissertation.
Finally, I wish to express my sincere appreciation to my parents and my family 
for their love, encouragement, patience and support.
TABLE OF CONTENTS
LIST OF TABLES .....................................................................................................  v
LIST OF FIGURES ..................................................    vi
ABSTRACT ................................................................................................................. x
CHAPTER 1. INTRODUCTION ............................................................................  1
1.1. Key Issues in Designing VLSI Architecture................................................ 2
1.2. The Systolic Architecture ..............................................................................  3
1.3. Problem Definition and Outline of the Dissertation ...................................  8
CHAPTER 2. LITERATURE SURVEY ...............................................................  13
2.1. Complexity of VLSI Arrays ..........................................................................  13
2.2. Systolic Designs .............................................................................................. 14
2.2.1. Arithmetic Functions........................................................................... 15
2.2.2. Matrix Computations ..........................................................................  16
2.2.3. Signal and Image Processing Applications ......................................  17
2.2.4. Non-Numeric Applications ................................................................  19
2.3. Methodologies for VLSI Arrays Design ......................................................  20
2.3.1. Characterization of Previous Methods .............................................  20
2.3.2. Limitations of Previous Methods ......................................................  34
2.3.3. Overview of the New Methodology.................................................. 43
CHAPTER 3. ALGORITHM REGULARIZATION AND SYSTOLIC 
PRECEDENCE DIAGRAM ........................................................................................ 46
3.1. Algorithm M odel.............................................................................................  47
3.1.1. Pipelining the V ariables...................................................................... 51
3.1.2. Examples to Illustrate Pipelining.......................................................  56
3.2. Data Dependencies and Algorithm Regularization.....................................  61
3.2.1. Data Flow Regularization ..................................................................  63
3.2.2. Dependence Vectors Subsets .............................................................  72
3.2.3. Ordering of Index Points ...................................................................  73
3.2.3.1. Ordering Index Points with Fixed Fan-out........................  76
3.2.3.2. Deadlocks in Index Points Ordering................................... 79
3.3. Algorithm to SPD M apping...........................................................................  84
3.3.1. An Algorithm for Constructing the SPD .........................................  87
3.3.2. Explanation of SPD Construction Algorithm .................................. 89
iii
3.3.3. Examples to Illustrate SPD Construction.......................................  94
CHAPTER 4. CONSTRUCTION OF SDG AND ITS MAPPING INTO 
FULL-SIZE VLSI ARRAYS .....................................................................................  109
4.1. SPD to SDG Transformation ........................................................................  110
4.2. SDG to VLSI Array(s) Transformation ................................................ 114
4.3. Examples to Illustrate Systolic Arrays Design ...........................................  120
4.4. Switchable Arrays ..........................................................................................  135
4.4.1. PE-Switchability..................................................................................  138
4.4.2. Link-Switchability............................................................................... 143
4.4.3. Data-Flow-Switchability ....................................................................  144
4.5. Improving Performance ................................................................................. 146
4.5.1. Data Flow Splitting ............................................................................. 146
4.5.2. Increasing PE Fan-out........................................................................  152
4.5.3. Examples to Illustrate Increasing PE Fan-out...................................  157
CHAPTER 5. DESIGN OF SIZE-INDEPENDENT AND FIXED DEPTH
VLSI ARRAYS ...........................................................................................................  162
5.1. Defining the Problem .....................................................................................  163
5.2. Manipulating the SDG ...................................................................................  165
5.2.1. Graph Stretching .................................................................................  167
5.3. Design of FDML Arrays ............................................................................... 168
5.3.1. SDG Stretching ...................................................................................  171
5.3.2. Mapping Transformed SDG into Hardware......................................  176
5.4. Design of Fixed-Size Arrays ........................................................................  178
5.4.1. Line Numbering ..................................................................................  180
5.4.2. SDG Stretching ...................................................................................  183
5.4.3. Mapping the Stretched SDG into Hardware .....................................  189
5.5. Examples .........................................................................................................  192
CHAPTER 6. CONCLUSIONS AND FUTURE W O R K ....................................  203
REFERENCES ...........................................................................................................  208
VITA .............................................................................................................................  219
iv
LIST OF TABLES
Table 3.1 Dependence Vectors for Matrix-Vector Multiplication 94
Table 3.2 Dependence Vectors for Matrix-Matrix Multiplication 97
Table 3.3 Dependence Vectors for Warshall’s Algorithm 101
Table 3.4 Dependence Vectors for Dynamic Programming 106
V
LIST OF FIGURES
Figure 1.1 The concept of systolic array. 5
a) Conventional processor. 5
b) Systolic processor array. 5
Figure 1.2 Various systolic array configurations. 6
a) Linear array. 6
b) Mesh array. 6
c) Hexagonal array. 6
d) Binary tree. 6
e) Triangular array. 6
f) Cylindrical array. 7
g) Orbital array 7
Figure 2.1 Y-chart for transformation systems. 36
Figure 2.2 Y-charts. 37
a) Gannon’s method. 37
b) S.Y. Kung’s method. 37
Figure 2.3 Y-charts. 39
a) Moldovan and Fortes’s method. 39
b) Li and Wah’s method. 39
c) Cheng and Fu’s method. 40
Figure 3.1 Types of data flow 52
a) Broadcasting of data. 52
b) Pipelining of data. 52
Figure 3.2 Data flow when g is a many-to-one function. 66
a) Irregular data flow. 66
b) Regularized data flow. 66
Figure 3.3 Data flow when w is a many-to-one function. 69
a) Irregular data flow. 69
b) Regularized data flow. 69
Figure 3.4 Data flow when g and w are many-to-one functions. 71
a) Irregular data flow. 71
b) Regularized data flow. 71
vi
78
80
80
80
96
96
96
98
98
99
102
105
108
119
119
119
121
121
121
121
122
124
124
124
124
125
126
126
Index point ordering with fixed fan-out.
Data flow when two dependence vectors have a 
nonconstant component at the same coordinated.
a) Irregular data flow.
b) Regularized data flow using Theorem 1.
SPDs for matrix-vector multiplication (n =  3).
a) Input bandwidth of n+1.
b) Input bandwidth of 2n .
SPDs for matrix-matrix multiplication (n =  3).
a) Input bandwidth of 2n .
b) Input bandwidth of n2.
SPD for LU-decomposition (n = 4).
SPD for transitive closure (n = 3).
SPD for dynamic programming (n = 4).
Ordering of index points that use 
a variable dependence vector.
a) Original ordering.
b) Relaxed ordering.
Matrix-vector multiplication with 
bandwidth of n+1. in =  3)
a) SDG.
b) Array projected along d±.
c) Array projected along d 2.
d) Array projected along d \ + d 2.
Matrix-vector multiplication with 
bandwidth of 2 n . (n = 3)
a) SDG.
b) Array projected along dj.
c) Array projected along d 2-
d) Array projected along d l + d 2.
Matrix-matrix multiplication with 
bandwidth of 2n. (n =  3)
a) SDG.
vii
127
129
130
130
130
131
131
132
133
134
136
136
137
141
142
147
149
149
149
150
156
156
156
158
b) Array projected along d 3.
Orbital array for matrix-matrix multiplication.
Planar array for matrix-matrix 
multiplication. (n = 3)
a) SDG.
b) Array projected along d 3.
LU-decomposition. (n = 4)
a) SDG.
b) Array projected along d j.
c) Array projected along d 2.
d) Array projected along d \ + d 2.
Transitive closure.
a) SDG. in = 3)
b) Array projected, (n = 4)
The mode of operations of the PEs in the 
array shown in Figure 4.7b.
The mode of operations of the PEs in the 
array shown in Figure 4.7d.
The mode of operations of the PEs in the 
array shown in Figure 4.8b.
The concept of splitting data flow.
a) Original data flow.
b) Data flow split into two flows.
c) Data flow is a binary tree.
Ordering of index points covered by a 
variable dependence vector. (5 points)
a) Original ordering.
b) Ordering with all generated points 
have a fan-out of two.
Arrays for matrix-vector multiplication.
viii
a) One PE with fan-out of two. (n = 3) 158
b) One PE with fan-out of four, (n = 5) 158
b) All PEs with fan-out of two. {n = 7) 158
Figure 4.15 Array for matrix-matrix multiplication with 
bandwidth of 2n and one row of PEs
with fan-out of two. 160
Figure 4.16 A cylindrical array for transitive closure 161
Figure 5.1 Graph stretching. 169
a) Original graph. 169
b) Stretched graph. 169
Figure 5.2 Line numbering. 182
Figure 5.3 Subgraph stretching. 187
a) Along d\. 187
b) Along d 2. 187
Figure 5.4 Matrix-vector multiplication (n = 4 and p = 2). 193
a) Stretched SDG along d\. 193
b) Array projected along d\. 193
Figure 5.5 Matrix-vector multiplication (n = 4 and p  = 2). 194
a) Stretched SDG along d 2. 194
b) Array projected along d 2. 194
Figure 5.6 Linear array for transitive closure. 196
a) Stretched SDG. 196
b) Projected array. 197
Figure 5.7 A bilinear array for transitive closure. 199
Figure 5.8 A fixed-size system for transitive closure. 201
a) Block diagram. 201
b) Array of type B. 202
ix
ABSTRACT
This dissertation provides a fairly comprehensive treatment of a broad class of 
algorithms as it pertains to systolic implementation. We describe some formal algo­
rithmic transformations that can be utilized to map regular and some irregular 
compute-bound algorithms into the best fit time-optimal systolic architectures. The 
resulted architectures can be one-dimensional, two-dimensional, three-dimensional or 
nonplanar.
The methodology detailed in the dissertation employs, like other methods, the 
concept of dependence vector to order, in space and time, the index points represent­
ing the algorithm. However, by differentiating between two types of dependence 
vectors, the ordering procedure is allowed to be flexible and time optimal. Further­
more, unlike other methodologies, the approach reported here does not put constraints 
on the topology or dimensionality of the target architecture.
The ordered index points are represented by nodes in a diagram called Systolic 
Precedence Diagram (SPD). The SPD is a form of precedence graph that takes into 
account the systolic operation requirements of strictly local communications and reg­
ular data flow. Therefore, any algorithm with variable dependence vectors has to be 
transformed into a regular indexed set of computations with local dependencies. This 
can be done by replacing variable dependence vectors with sets of fixed dependence 
vectors.
x
The SPD is transformed into an acyclic, labeled, directed graph called the Sys­
tolic Directed Graph (SDG). The SDG models the data flow as well as the timing 
for the execution of the given algorithm on a time-optimal array.
The target architectures are obtained by projecting the SDG along defined direc­
tions. If more than one valid projection direction exists, different designs are 
obtained. The resulting architectures are then evaluated to determine if an improve­
ment in the performance can be achieved by increasing PE fan-out. If so, the metho­
dology provides the corresponding systolic implementation.
By employing a new graph transformation, the SDG is manipulated so that it 
can be mapped into fixed-size and fixed-depth multi-linear arrays. The latter is a 
new concept of systolic arrays that is adaptable to changes in the state of technology. 
It promises a bounded clock skew, higher throughput and better performance than the 
linear implementation.
CHAPTER 1
INTRODUCTION
Get your facts first, and then you can distort them as much as you please.
Mark Twain
In a conventional general-purpose computer or "Von Neumann machine", the 
operation rate is limited by the bandwidth of the processor-memory communication 
link, commonly referred to as the "Von Neumann" bottleneck [1], Moreover, the 
throughput and efficiency of general-purpose array processors are severely limited by 
the high overhead incurred by task allocation, task scheduling, synchronization, etc. 
This makes these machines (including traditional supercomputers) not suitable for 
computationally demanding signal and image processing applications.
The 80’s will be remembered as the decade when researchers looked beyond the 
relatively slow general-purpose machines to solve computationally intensive, real­
time problems. This is due to advances in VLSI technology which makes it feasible 
to build low-cost, high performance, special-purpose peripheral devices. However, 
VLSI technology imposes some limitations on the design of such devices [2] - [4], 
These limitations are discussed in Section 1.1.
Systolic arrays which are examples of specialized VLSI architectures are intro­
duced in Section 1.2. The problem definition and the outline of this dissertation are
1
2presented in Section 1.3.
1.1. Key Issues in Designing VLSI Architectures
To make use of VLSI technology, it is important to design architectures with 
simple, regular and local interconnections. Such interconnections usually lead to 
cheaper implementations and high densities, and hence, undoubtedly, implies high 
performance. In [5], Sutherland and Mead discuss the importance of having simple 
and regular geometries for data flows. For these reasons, we are interested in design­
ing parallel algorithms that have simple and regular data flows, or at least algorithms 
with data flows that can be regularized.
Furthermore, the technological trend clearly indicates a diminishing growth rate 
for component speed. It has been estimated that minimum feature size in VLSI 
might reduce to 0.3 or 0.4 micron within a decade. Reductions below this range 
seem unlikely owing to fundamental quantum mechanical limitations. Therefore, any 
major improvement in computation speed must come from increasing computational 
concurrency. The best-performance can be achieved if both, pipelining and parallel 
processing, are combined in the same design. Pipelined processors are patterned 
after the production line found in manufacturing: a portion of the processing is per­
formed by each of several processors and then handed to the next processor in the 
line. In a parallel processing system more than one line would be working con­
currently.
3Another important issue in designing a special-purpose VLSI device is that 
VLSI structures are only suitable for implementing compute-bound rather than I/O 
bound algorithms. A compute-bound algorithm is characterized by the large number 
of computing operations compared to the total number of input and output operations. 
For example, matrix-vector multiplication algorithm represents a compute-bound 
algorithm because it has 0(n2) multiply-add operations, and only O(n) I/O operations. 
I/O-bound problems are not suitable for VLSI implementation because VLSI packag­
ing must be constrained with limited I/O pins.
1.2. The Systolic Architecture
As a solution to the above challenges, H.T. Kung and colleagues at Carnegie- 
Mellon University introduced the systolic array concept [6] - [11]. Since then, inten­
sive research activity in areas related to the systolic array concept has taken place 
[12] - [81]. This section reviews the basic principle of systolic architectures and 
highlights their salient features and advantages.
A systolic array (or VLSI array), typically, consists of a set of locally intercon­
nected processing elements (PEs), either all of the same type or a mixture of a few 
different types and each capable of performing simple operations. Information in a 
systolic array flows between PEs in a pipelined fashion, and communications with the 
host computer occur only at boundary cells. Once the data is retrieved from memory, 
it passes through the entire array avoiding any delay due to storing and retrieving 
intermediate data.
The basic principle of a systolic array is illustrated in Figure 1.1. By replacing 
a single PE with an array of PEs, a higher computation throughput can be achieved 
without increasing memory bandwidth. This high throughput, accomplished by 
employing the concepts of pipelining and parallel processing, is just one of the many 
advantages of systolic arrays. Other advantages and features which make systolic 
arrays attractive include modular expansionability, regular data and control flows, use 
of simple PEs, elimination of global broadcasting and most significantly compatibility 
with VLSI technology.
Systolic arrays can assume many different geometries for different compute- 
bound algorithms. Figure 1.2 shows various systolic array configurations. While 
most of the research is focused on one-dimensional (linear) and two-dimensional 
(rectangular, triangular, hexagonal and binary tree) arrays, some work has been done 
on the possibility of implementing algorithms using three-dimensional or nonplanar 
(cylindrical and orbital) systolic structures. It has been demonstrated that, for some 
problems, such architectures offer substantial advantages over their 2D counterparts. 
The advantages can be summarized as: use of uniform input data streams, significant 
speed increase, better utilization of processing elements and reduction in I/O 
bandwidth requirements [23] - [30]. Manufacturing implications of 3D structure have 
been tentatively explored by Texas Instrument and Hughes Aircraft Company [81].
All processing elements in a systolic array operate synchronously using the 
same global clock. For very large systems, the clock skew incurred in global clock
„ Memory
PE
a)
Memory
-►  PE PE PE PE PE PE ----
b)
Figure 1.1 The concept of systolic array, (ref. [2])
a) Conventional processor.
b) Systolic procesor array.
(a) Linear array.
(b) Mesh array.
(c) Hexagonal array.
(d) Binary tree. (e) Triangular array
Figure 1.2 Various systolic array configurations. (Cont.)
7Figure 1.2 Continued.
(f) Cynlidrical array.
(g) Orbital array.
distribution is a nontrivial factor, causing unnecessary slowdown in the clock rate 
[82], [83]. For this reason, S.Y. Kung introduced the wavefront processing concept. 
The wavefront array is a variation of the systolic array in which processors operate 
asynchronously. To guarantee operational synchronism, a handshaking protocol is 
often used [70] - [83].
1.3. Problem Definition and Outline of the Dissertation
One of the most challenging problems in systolic processing is the development 
of a systematic methodology for mapping an algorithm into a systolic architecture 
[60] - [80]. Algorithms are best described by high-level language statements while 
systolic arrays are specified by the timing of data movement and the functions and 
interconnection of processing elements.
While many researchers attempted to develop a complete method for transform­
ing algorithms into VLSI arrays, it is possible to pinpoint several limitations to previ­
ous methodologies:
1. Previous methods are, generally, nonconstructive; i.e. the designs are not 
automatically derived.
2. Some transformations are done in an ad hoc fashion.
3. Only regular algorithms are considered.
4. Severe restrictions are imposed on data movement as well as input and out­
put spatial relationships.
95. The resulting designs are often suboptimal.
6. None of the methods can lead to nonplanar architectures.
7. The majority of previous approaches are not easily extendible to encompass 
designing fixed-size systolic architectures.
Chapter 2 contains a thorough review and critique of the most pertinent previously 
proposed methodologies. It also presents an overview of our new method.
Chapters 3 and 4 consist of a detailed description of a constructive graph metho­
dology for transforming regular and some irregular compute-bound algorithms into 
time-optimal size-dependent systolic arrays. A unique feature of this method is the 
use of the Systolic Directed Graph (SDG). The SDG, which can be transformed 
directly into hardware, is an acyclic, weighted graph that describes the data flow as 
well as the timing of the given algorithm. Before reaching this level of representa­
tion, however, the regularized algorithm is first transformed into a Systolic Pre­
cedence Diagram (SPD). The SPD is an adaptation of the familiar precedence graph 
but it takes into account the systolic operation requirement of strictly local communi­
cations and regular data flow. The SPD determines the lower bound on the total exe­
cution time of an algorithm on a systolic array for a given I/O bandwidth and for the 
minimum fan-out.
The most significant contributions of this new graph methodology to the area of 
algorithm to VLSI array mapping can be summarized as follows:
10
1. Defining an algorithm model capable of modeling a wide class of compute- 
bound algorithms that can be transformed into systolic architectures.
2. Transforming an irregular algorithm with variable dependence vectors into a 
regular indexed set of computations with local data dependences.
3. Determining the algorithms absolute lower bound on systolic execution time 
for given I/O bandwidth and minimum fan-out.
4. Determining the best geometry necessary to achieve time-optimality (for the 
given I/O bandwidth and minimum fan-out). The geometry of the arrays can 
be 1-D, 2-D, 3-D or nonplanar.
5. Possibly obtaining different array designs for the same algorithm.
6. Studying the possibility of improving performance with increasing fan-out 
for a given I/O bandwidth.
Chapter 5 focuses on extending the methodology to encompass size-independent 
and fixed-depth multi-linear arrays (the linear array is a special of the latter). As 
Heller [84] stated:
"No matter what special-purpose device is available, there is a prob­
lem too large fo r it. The problem will manifest itself only after the 
device is acquired and can no longer be modified. The problem can­
not be ignored."
Furthermore, problem-size independent arrays and fixed-depth multilinear arrays
T.
11
can offer the desired adaptability to the state of technology. As technology advances 
new structures with extended sizes can be constructed offering higher throughputs 
and better performances. To date, in the most part, size independent arrays for few 
algorithms have been designed in an ad-hoc fashion. Only two formal methodologies 
for their synthesis has been reported [75], [85]. Both of these methods, which are 
incomplete, do not guarantee optimality and are confined to mesh connected arrays. 
Also, both methods consider only regular compute-bound algorithms.
It has been proven that in 2-D arrays with arbitrary size, the clock skew is 
unbounded [82]. Thus globally synchronous arrays with arbitrarily large sizes may 
not be feasible even if other technological limitations such as number of pins, chip 
area, and I/O bandwidth are disregarded. On the other hand, it has been proven that 
for linear arrays, clock skew is bounded regardless of the size [82]. That prompted 
several researchers to pursue linear arrays for several compute-bound algorithms. 
Linear arrays are also less demanding in their I/O and pin count requirements. How­
ever, for many compute-bound problems, linear arrays can not match the performance 
of their full-size 2-D or 3-D counterparts. As technology advances, a viable 
compromise in which arrays consisting of multiple linear arrays will be needed. 
Fixed-depth multi-linear arrays (FDML) would be a good design choice since they 
could offer:
a) bounded clock skew and therefore global synchronism,
12
b) higher throughputs than linear (1-D) implementation, and
c) adaptability to changes in technology.
Chapter 6 concludes with a brief summary of the results of our research and 
some suggestions for future work in related topics.
CHAPTER 2
LITERATURE SURVEY
The farther backward you can look, the farther forw ard you are likely to see.
Winston Churchill
The purpose of this chapter is to give an insight into the various aspects of sys­
tolic or VLSI architectures. The first section deals with the complexity measures for 
VLSI arrays. Section 2.2 classifies and surveys many of the systolic designs found in 
the literature. Section 2.3 reviews, in some detail, the most pertinent methods for 
mapping algorithms into systolic arrays and discusses their limitations. It also 
presents an overview of our new methodology.
2.1. Complexity of VLSI Arrays
Some complexity measures are required for comparing different designs for the 
same algorithm. The most widely used complexity parameters in VLSI arrays are the 
area A, the time T and the period P which are defined below [12], [13]:
A: The area occupied by a chip implementing the algorithm. Since chip area 
is usually a function of the available technology, the number of processing 
elements in a systolic array is often used as an alternative measure.
13
14
T: The total time for computing one instance of the algorithm.
P: The minimal time interval between the input of two consecutive instances
of an algorithm solved by the systolic array.
The objective of any VLSI array designer is to minimize all the above quanti­
ties. Since it is unlikely that all three parameters will be minimized, a function of
A , T and P is often considered. The most widely used functions are T , AT,
AT2, ATP and AP2.
Other array characteristics can also be studied in comparing different designs for 
the same problem. Examples of such characteristics are the input/output bandwidth, 
the dimensionality of the array layout (planar or nonplanar) and the complexity of the 
processing elements.
2.2. Systolic Designs
In this section, the systolic designs are classified according to the type of func­
tions performed by the array [12]. These functions are:
1) Arithmetic functions.
2) Matrix computations.
3) Signal and image processing applications.
4) Non-numeric applications.
15
Other classifications of the systolic designs are possible. H.T. Kung [2], [7] has 
provided a taxonomy of systolic algorithms based on the communication geometry.
2.2.1. Arithmetic functions
The addition of two A/-bit integers is one of the first problems to be investi­
gated in terms of VLSI complexity. The lower bounds were derived by Johnson [14] 
and Baudet [15]. They are:
AT2 = Q( log2N  )
APT = 0 (  log2V )
where N  is the number of bits. An adder was designed by Brent and Kung [16] 
with A = N  logV, T  = logN  and P = 1.
The lower bounds for the general problem of multiplying two -bit integers 
were derived independently by Brent and Kung [16] and Abelson and Andreae [17]. 
These are:
AT2 = Q ( N 2 )
APT = Q ( N 2 )
AP2 = Q ( N 2 )
Optimal systolic designs (with respect to the function AP2) for multiplication over a 
finite field are given in [18].
Systolic arrays for computing the greatest common divisor (GCD) of two 
integers and two polynomials with area 0( N) ,  time 0 ( N )  and period 0 {  1), are 
described in [19]. The lower bounds for these problems have not been derived.
16
2.2.2. Matrix Computations
Some of the most widely used matrix computations are: matrix-vector multipli­
cation, matrix-matrix multiplication, LU-decomposition and QR-decomposition.
The lower bounds for multiplying an N xN  dense matrix with a vector have 
been derived by Vuillenin [20]. The AT2 lower bound is Q( N 2 ). A linear sys­
tolic array for multiplying a band matrix with a vector was first designed by Kung 
and Leiserson [21]. In this dissertation, we describe many different designs for this 
problem. Our systolic arrays which satisfy the AT2 lower bound, include a circular 
array and a tree array.
The lower bounds for the multiplication of square and rectangular matrices have 
been derived by Savage [22]. AT2 = £2( iV4 ) for the multiplication of two N xN  
matrices. A hexagonal systolic array for band matrix multiplication was reported by 
Kung and Leiserson [21]. A cylindrical array and an orbital array for multiplication 
of dense matrices were described by Porter and Aravena [23]. A two-layered mesh 
array was proposed by Kak [29] - [30]. All designs meet the AT2 lower bound but 
differ by a multiple constant of N  insofar as the time parameter is concerned.
The problem of LU-decomposition has been shown to have the same complexity 
as matrix matrix multiplication [86]. A systolic that solves this problem for a band 
matrix was described by Kung and Leiserson [21]. El-Amawy designed an array for 
dense matrices [31]. In this dissertation, we describe two new systolic designs for 
the LU-decomposition . For each of the two arrays, AT2 is lower than that for the
17
previous designs by a constant factor.
The LU-decomposition that implements Gaussian elimination without pivoting 
process is not unconditionally stable, numerically. Systolic arrays that solve the LU- 
decomposition problem using Gaussian elimination with partial pivoting were 
reported by El-Amawy and Barada [32], [33]. Another alternative is to use the QR- 
factorization of a matrix which is unconditionally stable. Systolic arrays for the QR- 
factorization of a band matrix, hence for solving linear least squares problems, have 
been proposed by Heller and Ipsen [34].
A problem related to triangularization of a matrix is the matrix inversion prob­
lem. A systolic array for inverting a dense matrix using LU-decomposition without 
pivoting was described by El-Amawy [35], [36]. Stable matrix inversion based on 
the QR-factorization and on the Gauss-Jordan methods were implemented in systolic 
arrays by Dharmarajan [37].
2.2.3. Signal and Image Processing Applications
The recurrences describing signal and image processing applications have a 
repeated structure. Therefore, perhaps the largest number of systolic designs reported 
in the literature deal with problems in this area. These problems include: one­
dimensional convolution, multi-dimensional convolution, Fast Fourier Transform 
(FFT), and Finite Impulse Response (FIR) and Infinite Impulse Response (HR) filter­
ing.
18
The lower bounds of the one-dimensional convolution problem are:
AT2 = Q( N 2 )
AP2 = Q ( N 2 )
Seven different designs for this problem were described by Kung [2], and two more 
were added in [38].
The multi-dimensional convolutions constitute some of the most compute inten­
sive problems in image and signal processing. Systolic arrays for 2D convolution 
were proposed by Kung et al. [39], [40]. Lower bounds for these problems have not 
been explored.
The FFT problem could be formulated as a matrix multiplication problem, thus 
it has the same lower bounds as matrix multiplication. In addition to the hexagonal 
array designed for matrix multiplication that can be used to solve the FFT problem, a 
2D mesh array was designed by Kung [8].
The two types of filtering operations, namely, FIR and IIR filtering are basic to 
most digital signal processing applications. Their lower bounds are the same as that 
of the one-dimensional convolution problem. A linear array to compute both FIR 
and IIR filtering were reported in [41]. Another systolic architecture for FIR filtering 
is described in [42].
Many more systolic arrays that implement signal and image processing applica­
tion problems can be found in the literature. Examples are: ID and 2D median 
filtering [43] - [44], and relaxation problem [45].
19
2.2.4. Non-Numeric Applications
An increasingly wider range of non-numeric problems such as a stack, a priority 
queue, a dictionary, sorting, searching, transitive closure and shortest paths etc., are 
being implemented on systolic arrays.
Stacks have been implemented in hardware most commonly using one of the 
following methods: a two-way shift register array or a ripple register array. The 
linear array that realizes a stack proposed by Guibas et al. [46] combines the advan­
tage of both methods.
A priority queue is an array that has the ability to insert a data item into a set 
and retrieve the smallest (with respect to a given ordering) data item. Systolic prior­
ity queues were proposed by Leiserson [11].
A dictionary is a data structure capable of storing N  data elements, search for 
a specific element, insert or delete an element and find the minimum (with respect to 
a given ordering) of the stored data elements. A dictionary machine utilizing a sys­
tolic search tree was reported by Ottman et al. [47]. The area of this array is 0 {N )  
and the time for performing each of the above operations in 0{logN).
Sorting N  A:-bit integers in ascending or descending order is by far the most 
widely occurring problem in computer science. It has received a lot of attention both 
in the sequential and parallel environments. Tight lower bounds for sorting have 
been proved by Leighton [48]. The lower bound on A T2 is Q( N 2log2N  ). Eleven 
VLSI sorters are described and compared in [49]. However, in most of these archi­
20
tectures, the N  numbers have to be input in parallel. A systolic sorter which 
operates in a sequential I/O environment is described in [50].
The transitive closure and the shortest path problems are clearly among the most 
important problems in graph theory. They have received the attention of several 
researchers and different structures of systolic arrays have been proposed [51] - [56]. 
A cylindrical array and a fixed size architecture that solve these two problems are 
described in [57].
Many more non-numeric problems have been implemented in systolic arrays, 
such as connected components and biconnected components of graphs [58], pattern 
matching [6] and recognition of formal languages [59], to name a few.
2.3. Methodologies for VLSI Arrays Design
Due to the wide applicability of VLSI arrays as demonstrated in the previous 
section, there has been considerable interest in developing formal methodologies for 
mapping compute-bound algorithms into systolic arrays. This section describes some 
of the most pertinent proposed methodologies. In Subsection 2.3.2, the limitations of 
the surveyed methods are studied and discussed, while in Subsection 2.3.2, the new 
methodology detailed in this dissertation is reviewed.
2.3.1. Characterization of Previous Methods
The common characteristic of previously published methodologies for mapping 
compute-bound algorithms into systolic arrays is the use of a transformational
21
approach; i.e., the systolic architecture is derived by transforming the original algo­
rithm descriptions that are unsuitable for direct VLSI implementation. Different 
transformational systems for systolic architecture design can be characterized by how 
algorithms are described, how the systolic architectures are specified and what types 
of transformations are used on and between these representations [60].
The previous proposed mapping procedures can be classified into three 
categories. In the first category, transformations are performed at the algorithm 
representation level, and a direct mapping is made from this level to the architecture. 
The first two methods described below fall under this type. In the second category, 
transformations are performed on a previously designed architecture to obtain a new 
architecture. A representative of this class is the method due to Leiserson et al. [61]. 
It is the third method to be described below. In a third group of mapping metho­
dologies, transformations are performed at the algorithm-model level, and there are 
procedures for deriving the model from the algorithm representation and for mapping 
the model into hardware. All the other methods surveyed here fall into this last 
category.
Starting from a mathematical expression involving subscripted variables, Cohen, 
Johnson, Weiser and Davis’ method [62], [63] begins by deriving a new expression 
using a special delay operator Z . When x  is a data element at a particular point in 
a computational network, Z[x]  is defined as the data element that was at the same 
point on the previous time step; i.e., Z [x(i)] = x ( i- l ) .  The Z operator corresponds
22
to a delay cell in the circuit. Symbolic manipulation is used to transform the derived 
mathematical expression into equivalent ones by using the properties of the Z 
operator and the functional operators in the expression. From the final expression, 
the ordering of execution of operations and the number, position and interconnection 
of operator cells can be derived. Timing and storage requirements are inferred from 
the placement of delay cells.
Mostow and Lam [64] present a transformational model that is implemented in 
a program called SYS (SYStolic design SYStem). SYS accepts a software algorithm 
suitable for systolic implementation and applies series of transformations to produce a 
functional-level circuit description. The presentation of a design is factored into a 
structure description, which specifies the hardware cells and their interconnections, 
and a driver which relates the input and output data streams of the structure to the 
variables used in the algorithm. Initially, SYS generates a simple minded implemen­
tation of the given algorithm. Systematic and user determined transformations are 
then used to optimize and to obtain new designs.
It is noted that SYS lacks knowledge about implementing control constructs 
other than simple iterations, which limits the domain of algorithms it can handle. It 
also lacks a capability for symbolic manipulation. For example, it would not recog­
nize the equivalence of the data streams Sequence [/ from 1 to n] of xt and 
Sequence [i from 0 to n - 1] of xi+l.
Leiserson, Rose and Saxe [61] begin their approach by designing a synchronous
circuit which is not necessarily systolic. The circuit is modeled as a finite, rooted, 
vertex-weighted, edge-weighted directed multigraph G = (V , E , vh, d,  w). The 
vertices V of the graph model the functional elements of the circuit. Each vertex 
veV  is weighted with its numerical propagation delay diy).  A root vertex vh, 
called the host, is included to represent the interface with the external world, and it 
is given zero propagation delay. The directed edges E of the graph model inter­
connections between functional elements. Each edge e e E  connects an output of 
some functional element to an input of some functional element and is weighted with 
a register count w(e) .  The register count is the number of registers along the con­
nection.
Retiming and k-slowdown transformations are then applied to the original 
design in order to obtain a systolic design without global broadcasts. A retiming of 
a circuit G = (V, E , vh, d z w) is an integer-valued vertex-labeling r  : V —> Z 
such that r (vh ) = 0. The retiming specifies a transformation of the original circuit 
in which the registers are added and removed so as to change the graph G into a 
new graph Gr = {V, E , vh , d, wr), where the edge-weighting wr is defined for an
e
edge u —> v by the equation
wr (e) = w( e )  + r (v)  -  r(« ).
Optimal retiming transformations can be selected so that the transformed circuit 
has the smallest clock period by reducing this problem to a mixed-integer linear pro­
gramming problem.
24
In Gannon’s method [65] a functional specification is derived, from an algo­
rithm, by using four different vector operators where parallelism is explicitly 
represented. These operators are: a product operator which represents the concurrent 
operation of basic functions, data movement operator, chain operator which defines a 
mechanism to iterate basic functions, and the systolic-iteration operator which allows 
one copy of basic functions to be used in "feedback" loop. The functional 
specification of the algorithm is viewed as a data-flow graph which, depending on the 
operators used, can be mapped into a systolic architecture.
In their method, Kung and Lin [66] start by deriving an algebraic representation 
for the algorithm. This representation consists of two expressions of the form
(a) v <— Av + bx
(b) y <- cTv
where x  represents the input, y  represents the output and v represents the vari­
ables generated by implicit functions; functions that are independent of time. The 
matrix A and the column vectors b and c represent the delay cycles between 
the availability and the use of variables, and each entry is either zero or z~k , where 
k corresponds to the number of delays. For example, if at any time t , the value of 
variable v2, v2(t), depends on the values of variable v3 at time t ,  v3(t), and the 
value of input x  at time t —2, x ( t - 2), then the 2nd component of v in Expres­
sion (a) above is
v2 <— z“°v3 + Z ~ 2X
25
and it means that
v2(0  = /  [ v3(t), x ( t - 2) ]
for some implicit function /  associated with node v2. To the defined algebraic 
representation of algorithms, transformations which can be also described algebrai­
cally are then applied. There are two main types of transformations, namely, retim­
ing transformations and "k -slowing" transformations. These transformations deter­
mine the distribution of delays and the input/output (I/O) periods of the systolic 
architecture.
Jover and Kailath’s method [67] is based on the definition of a conceptual tool, 
Lines o f Computation. A LOC is a directional straight line with several nodes 
equally spaced. It can be interpreted either as the history of how one given value 
was computed or as a stream of values in different stages of a computation.
Using LOCs, Jover and Kaitalh defined the concept of Systolic Arrays and 
Systolic-type Arrays. A systolic-type array is the configuration created by intersect­
ing sets of parallel LOCs that have some nodes in common. A Systolic Array is a 
Systolic-type Array with the following constraints:
(1) same number of intersecting LOCs at each node.
(2) no crossing wires and no crossing LOCs.
The topology of the systolic array can be derived from LOCs. In addition,
26
throughput, efficiency, data interval, external delay, pipelinability and the knowledge 
of execution times of basic operations can be found from LOCs.
Schwartz and Barnwell’s method [68] starts with algorithms described by 
shift-invariant fully-specified flow graph. A fully-specified flow graph is a directed 
graph in which all operations occur at the nodes, and in which the branches are used 
exclusively as signal paths.
For a given flow graph, it is possible to derive a bound on the sampling period 
and a bound on the static-pipeline sampling period. The sample period bound 
involves the minimum sampling period at which a particular algorithm can be imple­
mented using a particular constituent processor. The static-pipeline sample period 
bound is the minimum sampling period which can be achieved if the entire graph is 
implemented using a static pipeline. A static pipeline is an implementation in which 
the node operations of the graph are explicitly assigned to individual processors, and 
in which every applications of any particular node operation is always performed by 
the same processor.
Different systolic solutions are generated by distributing delay nodes throughout 
the flow graph so that correctness is preserved and data transfers can be simultane­
ous. The transformations of flow graphs used to achieve this consist of data inter­
leaving and the cut-set or delay transformation that are shown to preserve 
equivalence. The transformed flow graph is mapped into a systolic array by mapping 
nodes into processors and delays and edges into interconnections.
27
Ramakrishnan, Fussell and Silberschatz [69] described a methodology for 
designing linear arrays. The method starts with a data flow description of the algo­
rithm which is partitioned into sets of vertices that are mapped into the same proces­
sor. This partitioning is called diagonalization. Then, a syntactically correct map­
ping is used to map computation vertices into processors and the labels and edges to 
communication delays and interconnections. This method applies only to homogene­
ous graphs with connected subgraphs satisfying certain properties.
The mapping procedure described by S.Y. Kung et al. [70], [71] is performed 
in three stages:
1. Formulate the algorithm as single assignment algorithm and construct its 
corresponding Dependence Graph (DG).
2. Derive an abstract Signal Flow Graph (SFG) implementation by projecting 
the DG onto a processor array and specifying a linear schedule for the com­
putation.
3. Retime the SFG to obtain a systolic array.
A Dependence Graph (DG) is a directed graph which specifies the data depen­
dencies of an algorithm. In a DG, nodes represent computations and arcs specify the 
data dependencies between computations. For regular and recursive algorithms, the 
DGs will also be regular and can be represented by a grid model; therefore, the 
nodes can be specified by simple indices, such as (i ,j ,k ).
28
A regular and temporally localized SFG is derived by applying localization 
rules. The localization procedure consists of selecting cut-sets of the SFG and reallo­
cating scaled delays to edges "leaving" and "entering" the cut-set so that at least one 
unit of time is allowed for communication of a signal between two nodes. Then, 
delays are combined with operational modules to obtain a full description of the 
operation of a basic systolic model. The resulting SFG maps straightforwardly into 
the systolic array.
Li and Wah [71] provide a methodology for the design of pure planar systolic 
arrays for algorithms that are representable as linear recurrence processes. In general, 
linear recurrences for the computation of a two-dimensional result Z from two two- 
dimensional inputs X and Y can be expressed as
zt j  = f  [ zt f>  xik, y/cj ] - 8 = 1 or - 1
where /  is a function to be executed by a processing element and k is a positive 
integer bounded by a linear function of i, j ,  and the problem size. 8 is defined 
within a recurrence to indicate the order of evaluation. When 8 = -1 , the 
recurrence is called a forward recurrence. When 8 = 1 ,  the recurrence is termed a 
backward recurrence.
From the algorithm, this method derives three classes of parameters: velocities 
of data flows, spatial distributions of data and periods of data. The relationships 
among these parameters are represented as constraint vector equations which must be
satisfied in order to obtain correct design. Optimal designs are then searched in the
29
space of solutions satisfying the constraint equations. This search is done by ordered 
enumeration over a limited search space in time polynomial to the problem size. The 
functional description of the cells is derived from the recurrence equations and the 
interconnections among cells are found from the defined parameters.
Cappello and Steiglitz’s method [38] starts with a set of recurrences describing 
the algorithm. A canonical representation is then obtained by adjoining an additional 
index representing time to the definition of the recurrence. Each index (including the 
time index) is associated with a dimension of a geometric space where each point 
corresponds to a tuple of indices on which the recurrence is defined. A primitive 
computation is associated with each point in the space. A computation is called 
primitive if it is assumed that it can be done in constant area and with constant 
latency. The latency of a circuit is defined as the amount of time separating the 
appearance of the first input bit on some port from the appearance of the last output 
bit on same port for one computation of a basic function.
From the geometric representation and ordering rules, the structure and size of 
the systolic architecture and the timing and direction of data flow are derived.
Cheng and F u ’s method [72] starts with a regular loop-program (or recursive 
form) with simple expression. Then, for the indices in the program loops (recursive 
formula), space and time expansion are applied alternately such that each index 
corresponds to one expansion, and the basic cell realizes the simple expression. A 
k  -space expansion along ;q means that the basic cell repeats uniformly k times
30
along the x, direction. If j  events happen sequentially and each adjacent pair of 
the events has equal time interval, then they are called collectively as a "j - time 
expansion". It is obvious that after a time-expansion, a basic cell has a better com­
putational capability to process a sequence of tasks provided that the tasks are well 
scheduled. Moreover, time expansion does not increase the number of processing 
cells provided these tasks can be perfectly scheduled.
There are two rules which are used to simplify the space-time domain expan­
sion:
(1) Rule of Space Expansion: Input data should be symmetric about output 
stream and spatially arranged to keep time consistency.
(2) Rule of Time Expansion: Every input of the cell must be expanded in time 
domain to keep space consistency.
Time-space expansions can be applied in any degree varying from full-time 
expansion (use of single processor) to full-space expansion (fully parallel). Time and 
space expansions implicitly determine data-flow timing and direction as well as pro­
cessing elements interconnections.
Khun’s method [73] starts with a naive high-level language specification of the 
algorithm. Then the following steps are performed:
1. Buffer all variables. This step introduces additional subscripts for variable
31
referencing to eliminate broadcasting and to provide a unique data element 
for each variable for each loop iteration.
2. Determine the processor functions by collecting the assignment statements in 
the loop bodies into m input, n output functions.
3. Apply a linear reindexing transformation R mapping the original loop 
indices to new indices. R transforms serial loops into a form in which the 
inner loops are vectorized and the outer loops are serial. Thus, R is parti­
tioned into two transformations:
The range of T is the serial, outer, reindexed loops (T denotes time). The 
range of S is the vectorized, inner, reindexed loops (5 denotes space).
4. Analyze the data flow of the algorithm to determine the direction, speed, and 
timing of the data through the network.
The algorithm model assumed in Kuhn’s method is a set of computation nodes 
indexed by the vector value of the indices of the iteration when they are computed. 
The structural information is modeled by the dimensionality of the iteration space and 
the dependency vectors which are difference vectors of the indices of dependent com­
putation modes. The geometry of the algorithm is represented by the iteration space.
Moldovan and Fortes [74] - [77] developed a technique which abstracts compu­
32
tations in terms of their data dependencies. Their method is based on a mathematical 
transformation of the index sets and the data dependence vectors associated with the 
given algorithm.
From a set of recurrence equations that describe the algorithm, an algebraic 
model is first derived. The algebraic representation is then transformed by local and 
global transformations. Local transformations are used to rewrite computations 
which are then mapped into the functional and structural specification of the cells of 
the systolic array. Global transformations are composed of time and space transfor­
mations and are used to restructure the algorithm. The following steps summarize 
the method:
1. Pipeline all variables in the algorithm. This step is similar to the first step of 
Khun’s method.
2. Find the set of data dependence vectors, D .
3. Identify a valid transformation T  for the data dependence vectors and the 
index set. The transformation T  is partitioned in two functions as follows:
II maps the index set of the algorithm to the first k coordinates of the new 
index space, which is selected a priori. II is calculated such that I1D > 0 
as this preserves the execution ordering. 5 maps the index set onto the 
remaining coordinates of the new index space. It is chosen to satisfy the
33
expectations about the geometrical properties of the algorithm.
4. Map the algorithm into hardware. The functions performed by the cells are 
derived directly from the mathematical expressions indicated in the algo­
rithm. The geometry of the systolic array is derived from the mapping S. 
S maps the dependence vectors into communication links and the index 
space into processing elements. The array timing is derived from the map­
ping n.
Moldovan and Fortes extended their methodology to encompass fixed-size sys­
tolic arrays [78]. The n -dimensional index set of the algorithm is partitioned in 
bands using (w-l)-dimensional hyperplanes that are linearly independent. They are 
denoted collectively by the set
S p  — {npl, iip2> • • • ’
The mapping of the index points of a band to processors is done by associating 
each coordinate j; in the / ra_1 processor space to one partitioning hyperplane 
n^-, i g [1, . . . , n-1)}. Then, for each point j  e J n , the i th coordinate of 
the processor is simply
Ji = IIpi j  mod mi
Quinton’s method [79] consists of three steps:
1. Finding a Uniform Recurrent System of Equations (URE) that is equivalent 
to the problem to be solved. This system is defined over some convex
34
J
subset D .
2. Finding a timing function that is compatible with the dependencies of the 
URE. This requires the identification of a convex space of feasible timing 
functions from which one can be chosen heuristically. Such space can be 
found systematically from the knowledge of dependency vectors and D .
3. Finding an allocation function that maps the URE onto a finite architecture. 
This function projects D into space along a preselected direction which is 
chosen so that two points in D with the same image under the timing func­
tion do not map into the same point.
Steps two and three are solved using elementary convex analysis techniques 
when considering quasi affine timing functions and quasi linear allocation functions.
2.3.2. Limitations of Previous Methods
All of the methods described in the previous section can be used to design sys­
tolic architectures. However, these methodologies cannot be considered complete, 
meaning that they are not fully constructive and that some of the transformations 
used are done in an ad hoc manner. Furthermore, in most cases, very little can be 
said about the optimality of the resulting design(s).
The transformations used in different mapping procedures were analyzed by 
Fortes at al. [60]. The transformational system is represented as a three dimensional 
space associated with algorithm representation, algorithm model and architecture
35
specification. Different forms in which an algorithm can be presented to the transfor­
mational system are associated to the axis of the algorithm representation. The axis 
of algorithm model shows different levels of abstraction used to represent relevant 
features of the algorithm. The axis of architecture specification is associated with the 
level of design in which the systolic array is described. This three dimensional space 
can be graphically depicted as a Y chart (Figure 2.1) where arcs can be drawn to 
illustrate transformations that map a given representation into another representation 
in the same axis or between distinct dimensions. Arcs drawn in full lines represent 
systematic transformations whereas those drawn in broken lines represent ad hoc 
transformations.
Figure 2.2(a) shows the Y chart corresponding to Gannon’s method [60], [70]. 
All the arcs in the chart are drawn in broken lines indicating that the transformations 
used in this method are done in an ad hoc manner making Gannon’s method very 
primitive. On the other hand, S.Y. Kung’s method [60], [65] does not define a way 
to map the algorithm to its corresponding Signal Flow Graph (SFG). Thus the first 
transformation in the method (algorithm to SFG mapping) is an ad hoc transforma­
tion. This is represented in a broken line arc in the Y chart shown in Figure 2.2(b).
Similar Y charts can be drawn to all the methodologies surveyed in the previous 
section. This analysis of the transformational system shows that all the existing map­
ping methods use at least one ad hoc transformation except for the methodologies 
proposed by Moldovan and Fortes [74] - [77], Li and Wah [71], and Cheng and Fu
Algorithm
representation
Algorithm
model
Function semanticsEnglish-like'
Mathematical
expressions
/Structural or syntactical 
Geometry
Pseudo code
High level 
language
< 'Register Transfer level
• Processor Memory Switch
■ -Functional Blocks
•Array Models
Architecture
specification
Figure 2.1 Y-chart for transformation systems, 
(ref. [60])
Algorithm
representation
Algorithm
model
Functional
operators
"  w . v  dEntry: Algorithms' Function
specification
Cell functions
Processing
cells
Global array y  
structure
Architecture
specification
Algorithm Algorithm
representation model
Entry: Algorithms' 'Signal Flow Graph
Selection o f 
basic operations
.Localization SFG with operation 
modules
IFG with operation 
delay  modules
Systolic array
Architecture
specification
Figure 2.2 Y-charts. (ref. [60])
a) Gannon’s method.
b) S.Y. Kung’s method.
38
[72]. The Y charts of these three systematic mapping procedures are shown in Fig­
ure 2.3.
Constructiveness of the methodologies is an important issue that should be dis­
cussed. A mapping procedure that transforms an algorithm into systolic architectures 
is said to be constructive if the array designs are derived completely automatically 
from the initial algorithm representation; that is without the user’s intervention.
The methods suggested by Quinton [79] and Ramakrishnan et al. [69] are the 
only two methods that can be considered constructive. However, the latter method is 
confined to designing linear arrays. To show how a method can be nonconstructive, 
consider Moldovan and Fortes method [74] - [77]. Their method extracts the depen­
dencies between the variables of a program and models them as points in euclidean 
space. Then they look for a linear transformation T that maps these points onto an 
array. We directly quote: "since (they) do not have a technique which leads directly 
to the optimal transformation, (they) developed a software package which finds many 
valid transformations and then (they) pick the one which best fits (their) needs" [75]. 
On the other hand, Jover and Kailath describe their method by stating [67]: "the 
method is not a panacea; it is not an automatic procedure .. the creativity of the 
designer is still important". Indeed, the designer has something to say and offer 
while using most of the existing mapping procedures.
As stated earlier, for most methods described in the previous section, very little 
can be said about the optimality of the resulting designs. In some cases, like in
Algorithm
representation
Algorithm
m odel
[Algebraic m odelEntry: Program
Local
transformations/
im pu ta tio n s
\  Global 
transformations I
.Dependencies
'Input and output variables
I/O pons i
Size, Data
Communications
Timing
Local cell design
Architecture
specification
Objective
function
Algorithm
model
Algorithm
representation
Parameterization
[Constraint
\equations
Entry: Recurrence 
equations Optimization
Param eter
mapping
Interconnections, velocities 
spatial distributions o f 
data flow
►Basic cell functions
Architecture
specification
Figure 2.3 Y-charts. (ref. [60])
(Cont.)
a) Moldovan and Fortes’ method.
b) Li and Wah’s method.
c) Cheng and Fu’s method.
Figure 2.3 Continued.
Algorithm
model
Symbolic ’ 
expression
Space-Time
expansion
Entry: R ccursr 
formula
.Basic cell
[Register level
Basic cell 
function
Systolic
architecture
Architecture
specification
41
Cohen et al.’s method [62], [63], search for the optimal design is done in an ad hoc 
manner. In Mostow and Lam’s method [64], user determined transformations are 
used to optimize the designs. The study by Li and Wah [71] is the only previous 
study addressing the subject of systematic optimal mapping of algorithms onto sys­
tolic arrays. Although the study proposes a good method for optimal mapping, the 
method suffers some serious limitations. The procedure considers only a subclass of 
planar 2D structures. The approach, as stated before, is based on obtaining a set of 
constraint equations from the defining recurrences. For the 2D case, six equations 
are obtained. Solutions of these equations provides 13 parameter values. These 
values define timing as well as spatial relationships of data which in turn define the 
array structure. Solution of constraint equations, for feedback recurrences is done by 
trial and error and in many cases a solution cannot be obtained.
Besides, the Li and Wah approach imposes restrictions on input and output data 
spatial relationships. For instance, a 2D data array (it could be an input or an output 
data matrix) when passed through a systolic array must have the elements along a 
row or a column arranged in a straight line and equally spaced. Thus often the 
design space is limited to a sub-class of all possible designs. Unfortunately, for 
many algorithms, the time-optimal design may not exist within the limited design 
space.
Furthermore, a typical assignment statement in an algorithm of many existing 
methodologies is of the form:
42
y ( T )  = F  [ y ( T  -  d i ) ,  y ( T -d 2 y ( T - d n ) ]  (1)
where I  are index points in the index space of the algorithm and d±, d 2, ■ ■ . , dn 
are vectors of the index space, called dependence vectors. The dependence vectors 
denote that the computation of the variable y at the index point I ,  depends on the 
computations of the variable y at the points /  -  rfj, I  -  d 2, ■ ■ ■ , /  -  dn. If all the 
components of a dependence vector dL are constants, dt is termed as a fixed 
dependence vector. Otherwise, dx is called a variable dependence vector. An algo­
rithm with all fixed dependence vectors has a regular data flow.
The methods that use algorithms with statements similar to that in (1) make two 
assumptions on the algorithm. The first assumption is that the data flow is regular; 
i.e., all dependence vectors are fixed. There exist algorithms, however, whose depen­
dence vectors have components that are functions of the index variables; i.e., the data 
flow is irregular, but can be mapped to regular architectures such as systolic arrays. 
This will be demonstrated later in the dissertation. These algorithms cannot be han­
dled by any of the methods reported so far.
Dorairaj [80] extended Moldovan and Fortes’ work on mapping algorithms into 
VLSI arrays. He modified their method so that it includes some algorithms with 
variable dependence vectors of one nonconstant component. The variable depen­
dence vector is replaced by three fixed dependence vectors so that the data flow is 
regularized.
43
The second assumption that previous methods make is that the computation at 
all points in the index space of the algorithm use the same set of dependence vectors. 
These methods may not be able to find the timing function if not all dependence vec­
tors are associated with the computation at each point in the index set. This applies, 
for example, to the LU-decomposition problem where not all index points have to 
utilize all dependence vectors of the algorithm, but the previous methods force all 
points to use the same set of dependence vectors. Furthermore, the timing function 
obtained in such cases may be suboptimal. This is because by associating all depen­
dence vectors with all points in the index set, unnecessary ordering between points 
which are not data-dependent may be introduced.
In addition to all these limitations and despite the vast prospects of future non- 
planar and 3D VLSI designs, none of the existing methodologies can produce (or 
map) to nonplanar architectures. Nonplanar arrays have been suggested for some 
problems (matrix multiplication, matrix inversion, etc.) and proven to be superior to 
planar arrays, in many cases. They offered better PE utilization, increased speed and 
reduced I/O bandwidth requirements [23] - [28].
2.3.3. Overview of the New Methodology
The foundation of the mapping methodology described in this dissertation can 
be traced to the work done by Khun [73] and the research reported by Moldovan and 
Fortes [74] - [77]. In our method, we use an algorithm model which is a generalized 
form of the simpler models described by the mentioned two methods. Our model
44
consists of a nested-loops program that includes if-then-else statements. These state­
ments permit the model to divide the index set of the algorithm into subsets each of 
which may use different subsets of dependence vectors. This insures that unneces­
sary ordering between index points that are not data-dependent will not occur.
In addition, as in Khun [73], Moldovan and Fortes [74] - [77] methods, our 
methodology employs the concept of dependence vectors to order the index points of 
the algorithm in space and time. However, by differentiating between two types of 
dependence vectors, we allow the ordering of index points to be flexible in the sense 
that the ordering depends on the I/O bandwidth requirement and PE fan-out which 
are set a priori by the designer, rather than being imposed by the mapping methodol­
ogy. This will permit the ordering procedure to provide an optimal ordering of index 
points since the points may be shuffled and rearranged, in some cases, without 
affecting the final result. The rigidity of the previously proposed index point order­
ing procedures makes them suboptimal. Furthermore, unlike other methodologies, 
the approach reported here does not put constraints on the topology or dimensionality 
of the target architecture. Thus, the methodology produces the "best-fit" design for 
the algorithm whether that design is 1-D, 2-D, or nonplanar.
The ordered index points are represented by nodes in a diagram called Systolic 
Precedence Diagram (SPD). The SPD is a form of precedence graph that takes into 
account the systolic operation requirements of strictly local communications and reg­
ular data flow. Therefore, any algorithm with variable dependence vectors has to be
45
transformed into a regular indexed set of computations with local dependencies. This 
can be done by replacing variable dependence vectors with sets of fixed dependence 
vectors.
The SPD is transformed into an acyclic, labeled, directed graph called the Sys­
tolic Directed Graph (SDG). The SDG models the data flow while keeping the ord­
ering of index points in time intact. In this representation, each index point (or SPD 
node) is represented by a vertex and the flow of data is represented by edges.
The target architectures are obtained by projecting the SDG along defined direc­
tions. The projection is constrained by the assumptions: 1) minimal number of input 
and output links is required; and 2) no more than one operation may be performed by 
a PE at a given time unit. If more than one valid projection direction exists, different 
designs are obtained. The resulting architectures are then evaluated to determine if 
an improvement in the performance can be achieved by increasing fan-out. If so, the 
methodology provides the corresponding systolic implementation.
By employing a new graph transformation, the SDG is manipulated so that it 
can be mapped into fixed-size and fixed-depth multi-linear (FDML) arrays. The 
latter is a new concept of systolic arrays that is adaptable to changes in the state of 
technology. It promises a bounded clock skew, higher throughput and better perfor­
mance than the linear implementation.
CHAPTER 3
ALGORITHM REGULARIZATION AND SYSTOLIC 
PRECEDENCE DIAGRAMS
You see things that are and say, "why?" But I  dream things that never were and say, 
"why not?"
George B. Shaw
This chapter presents the first major step towards designing time-optimal sys­
tolic arrays. Starting from a high level language description of the algorithm, an 
irregular data flow is first regularized by localizing the data dependencies. The algo­
rithm is then mapped into a Systolic Precedence Diagram (SPD) that determines the 
lower bound on the total execution time of the algorithm on a systolic array with 
minimum PE fan-out.
The chapter is organized as follows: in Section 3.1 an algorithm model and a 
technique to transform algorithms to a form suitable for VLSI implementation are 
presented. The described technique pipelines all the variables in the given algorithm 
so that broadcasting is eliminated. In Section 3.2, data dependencies of the algorithm 
are studied and irregular data flows are regularized. Section 3.3 presents an algo­
rithm for constructing the SPD. The steps used towards mapping the algorithm into 
its corresponding Systolic Precedence Diagram are illustrated throughout the chapter
46
47
by using different examples.
3.1. Algorithm Model
Algorithms that can be mapped into systolic arrays are of a repetitive form that 
can be described in a high-level language by using nested loops; for  loops in a Pascal 
like language or do loops in a FORTRAN like language. The model detailed here is
a generalized form of the one described by Moldovan [74] - [76]. When using for
loops, a systolic algorithm can be written in the form shown below:
for / 1 = v j to u i do 
for i 2 = v2 to u2 do
for in = vn to un do
if ( condition (i\, . . .  , in) )  
loop-body (1);
else if ( condition (iv  )
loop-body (2);
else
loop-body (p); (1)
Where loop-body (I) (/ = 1 to p )  contains one or more of the following statements:
J i (  g\(T) ) = F\i  [ J i (  wj OO) , . . . ,  y \ { w 2( T ) y k ( w 3(T) 
yk ( ) > * i( w 5(I) xm( w 6(I) ) ] ;
yk( 8k0~) ) = Fki [ JiC w t t f ) ) , .  . . ,  yi(_w2( 7) ) , . . . ,  yk(_w3(T) ) , . . . ,  
yk( w 4(I) ) , w 5(I) xm( w 6(/) ) ] ;
Vj and Uj (J = 1 to n ) are integer valued linear expressions possibly involving the
48
indices i\, . . . ,ij_\ and I  = The vector /  is represented in space by
a point called the index point. An index point is equivalent to one iteration of the 
nested loops shown in program (1) (one value from each of i l5 i2 in)■
Let Z" denote the set of n-tuples of integers. The index set of program (1) is a 
subset of Z ” and is defined as
J n = {Tj = (  ( ii)j • ■ ■ ■ > (in)j )■ v i ^  0 'i)j ^  « i , • • • , v„ < {in)j < un }.
A sequential execution of the program defines an ordering on the points of the index 
set J n which is known as lexicographical ordering of J n . An index space S n is a 
space that contains all index points I j .
The functions g,- and Wj are two linear functions in J n , i.e. gh wj :
J n —^ J n. Since these functions are linear, they can be either one-to-one func­
tions (one index point is mapped to exactly one point), or many-to-one functions 
(many points are mapped to one point). The latter situation occurs when g,-(/) = a  
or wy(/) = P has many solutions, where a  and P are n-tuple constants.
An output variable appears on the left-hand side of an assignment statement, 
while an input variable appears on the right-hand side of the statement. The same 
variable can be input and output. The variable is said to be used in statement S if it 
appears on the right-hand side of S. It is generated in S if it appears on the left-hand 
side of 5 [73] - [75]. The functions F,-; (in program (1)), i = 1 to k and / = 1 to 
p ,  are functions that map the input variables to output variables. These functions 
can be either linear or nonlinear.
49
Between the variables generated at different index points, there are some depen­
dencies which dictate the algorithm’s communication requirements. These dependen­
cies can be described by what is known as dependence vectors. The concept of 
dependence vector has been used by many researchers as a tool in designing sys­
tematic methodologies for mapping regular algorithms into systolic architectures. It 
was first used, in relation to VLSI arrays, by Khun in [73] and was formally defined 
later by Moldovan in [74] and [75].
Moldovan defines a dependence vector as follows [75]: denote by X  and Y 
two variables whose indices are two integer functions /  and g defined on the 
index set J n; it is written as X ( / ( / ) )  and Y ( g ( I ) ) .  Variables X  and Y are 
generated in statements 5{-(/) and Sj(I), respectively (Moldovan denotes the 
assignment statements by Sk (J), k = 1 to N ,  where N  is the number of state­
ments in the algorithm program). Variable Y ( g ( I ) )  is said to be dependent on 
variable X ( f ( I ) ) if and only if X ( f ( I ) )  is an input in statement S j (/) and 
there exist two index points / [ £  J n and I 2 e J n such that
1) 1 1 < 12 ("<" means "less than" in lexicographical sense)
2) f (T 2) = g(T1).
The vector d = 12 -  11 is called the data dependence vector.
In this work, we use the same dependence vector definition given by Moldovan, 
but in addition we differentiate between two types of dependence vectors: rigid and
50
non-rigid. This new concept is discussed later in this chapter. To apply the 
definition to the model described in program (1), it can be restated as follows: d = 
12  -  1 1 such that
1) y,( w (I2) ) is a variable used at a statement S and yj( g ( I j ) ) is a 
variable generated at the same statement S .
2) w (f2) = § (/;).
Note that y{- and yj can be either the same variable or two different variables. 
Also, it is noted that our definition did not require that y,( w (I ) ) has to be gen­
erated in a statement. The reason is that, in our model, each of the variables y{, I
= 1 to k ,  where k is the number of statement in the algorithm program, are gen­
erated at some statement. The input variables (used only) are denoted by xt , t = 1 
to m .
The important information that need to be included in the algorithm model is 
the algorithm index set, the algorithm input and output variables, the functions that 
map the input to output variables and the data dependencies which dictate the 
algorithm’s communication requirement.
Definition 3.1: An algorithm is a 5-tuple A = (Jn,X,Y,F,D) where:
Jn is the index set of A; n is the number of indices.
X is the set of input variables of A.
Y is the set of output variables of A.
F is the set of functions that map the input variables to output vari­
ables of A.
D is the set of dependence vectors.
The form of program (1) shown earlier is a generalized form of simpler algo­
rithm models described in [73] - [77]. The if-then-else statements in the program 
permit this model to encompass many complex, nonuniform algorithms which may 
lead to PE switchable arrays (arrays with multifunction PEs). This class of algo­
rithms is not covered by previously proposed systolic algorithm models.
3.1.1. Pipelining the Variables
Because an algorithm can be described in many forms, the first step in the map­
ping procedure is to write the algorithm in the form of program (1) which is suitable 
for VLSI implementation. This might require pipelining (or buffering) the variables 
[73] - [78].
The role of this step is to eliminate all possible data broadcasts which may exist 
in the original algorithm. For example, consider a variable v that is generated at 
the index point / 0 e  / "  (n is the dimension of the algorithm) and used at r 
different points /,• e J n , i = 1, 2 r. There are r data dependence vectors for 
variable v, as shown symbolically in Figure 3.1a.
52
lo
a }
l o
%
b ;
Figure 3.1 Types of data flow.
a) Broadcasting of data.
b) Pipelining of data.
53
Maximum parallelism is achieved when the operations at all index points /,-, 
i'=l, 2 , . . . , r are performed in parallel, provided that there exist no data depen­
dencies between them. In such a case, the generated variable v needs to be broad­
cast at once. However, as stated in Chapter 1, broadcasting is not desirable in VLSI 
implementation. Therefore, the solution is then to reduce the number of the original 
data dependencies by pipelining the propagation of variable v for the r usage 
index points, as shown in Figure 3.1b.
If the r+\ points ( / 0 and Ij, j  = I, . . .  , r )  can be aligned along a line
y = a x + c , where y  and x  are two of the n axes in the index space S ", and
a and c are constant integers, the data can be pipelined along this line In this 
case the r dependence vectors of the original algorithm (with broadcasting) are 
replaced by only one dependence vector. The following result, though simple, is 
very useful.
Lemma 1: To pipeline the data along a line in the index space S n given by 
i 2 = ai\  + Cj, i 3 = c 2, • • ■ , in = cn_x (a and Cj, j  = 1 to n - 1, are constant 
integers and ik , k = 1 to n,  are the n axes in S n), the data is sent from the 
point ( ih i 2, ■ ■ ■ , in) to the point (ii+ l, i 2+a, i3, . . . , in).
Proof: We have to prove that points ( ih i 2, . . . , in) and ( i j+1, i 2+a, i 2, ■ ■ ■ , in)
lie on a line parallel to i 2 = a i x + Cj, i 3 = c 2, . . . , in = cn_i. The slope of the
1 Notice that points that lie on parallel lines are considered to be on the same line in the context of 
pipelining variables.
54
line L ,  with respect to axis i h that joins ( /1; i 2, . . . ,in) and
i 2+a -  i2
O'l+l, i 2 +a, 13, . . . ,in ) is equal to - —  ---- — = a .  Therefore L  is parallel to
l !+l -  i !
/ 2 =  a i l  +  c b  h  =  c 2> ■ • • > in =  cn - 1- o
In the case where the r+1 points cannot be aligned, they are divided into 
groups such that all the points in a group can be placed on a line. If the r+1 index 
points lie on a surface S and every point can be mapped to one of I nonparallel 
distinct lines, then the r  dependence vectors are replaced by / dependence vec­
tors. The total number of dependence vectors in an algorithm is one of the important 
factors in determining whether the algorithm can be mapped to a systolic array or 
not. The reason is that dependence vectors of an algorithm are equivalent to links in 
a systolic array and the local interconnection feature of VLSI arrays impose an upper 
bound on the number of links between PEs.
For an algorithm with n loops (n index variables), broadcast typically exists 
if the number of coordinates in a used or generated variable is less than n , where a 
coordinate is an element of the vector g (  I  ) or w ( /  ). Therefore, in order to 
avoid broadcast, all the variables with less than n coordinates are first indexed with 
n expressions, ei (i = 1 to n), and new artificial variables may have to be intro­
duced such that for each generated variable there is only one destination.
An input variable (used only) need not be pipelined if it is indexed with n 
expressions, where n is the dimension of the algorithm (such as the variable a in
55
the vector-matrix multiplication algorithm to be discussed in the next subsection), 
otherwise, it has to be pipelined. This guarantees that at the input of every index 
point all used variables will be available without broadcasting. An algorithm can be 
pipelined by implementing the following steps:
1. Find the dimension of the algorithm which is equal to the number of 
index variables. Let this dimension be n .
2. Identify the variables with number of coordinates less than n. Equate 
each of these coordinates to a constant to obtain a system of equations for 
each variable.
3. If the system of equations describes a line in the n -dimensional space S n, 
lemma 1 is applied to pipeline the particular variable. The variable is 
indexed with n expressions, and new statements are introduced.
4. If the system of equations does not describe a line, it should be first bro­
ken into different systems each of which characterizes a line. Then, 
lemma 1 can be applied to every line separately.
To explain how the above steps are applied to transform an algorithm into a 
form suitable for VLSI implementation, let v be a variable in a 3-dimensional space 
5 3 with axes ij, i 2 and i 3, and let the coordinates of the v be two expressions 
2ii  -  t3 and i2. By equating each of these expressions to constant integers, Cj 
and c 2, the following system of equations is obtained:
56
2*i — / 3 = Ci
«
h  = c2
This system describes a line in a 3-dimensional space. The equation of the line can 
be rewritten as z3 = 2 ix + c l5 i2 = c 2  making the slope of the line, with respect to 
axis i j, equal to two. By applying lemma 1, the data will flow from the index 
point (ij, t 2, 13) to (t i+ l, i 2, i 3+2). Therefore the variable v can be pipelined by 
indexing it with all three index variables, vO'j, i 2, z3), and introducing the new 
statement:
vO'j+l, i'2, z3+2) = v ( ih i2, i 3).
3.1.2. Examples to Illustrate Pipelining
This subsection illustrates the pipelining technique on six different algorithms. 
Some of these algorithms are also used in succeeding sections to demonstrate the 
mapping procedure. The matrix-vector multiplication and the convolution product 
algorithms are two dimensional, while the other algorithms used in this subsection 
are three dimensional.
Matrix-Vector Multiplication: Consider the matrix operation c* = A5*, where ~c 
and 5* are n-dimensional column vectors, and A = [a^ ] is an nxn  dense matrix. 
The typical element of c* is given by
Ci(k) = C;(fc_1) + aik bk
The used variable bk has only one coordinate, while the algorithm is two dimen­
57
sional. Therefore, this variable has to be pipelined along the line k = c , where c 
is a constant. To apply lemma 1, we replace bk by bjf'* and introduce a new 
statement b£‘+^  = b£‘\  This causes all dependence vectors for b to lie on the line 
k = c. The algorithm becomes:
for i = 1 to n do 
for k = 1 to n do 
begin
c .{k) _ c.(k-1) + a .k b£i)
b t l) = b P
end. (2)
Matrix-Matrix Multiplication: Consider the matrix operation C = AB, where A = 
[fly], B = [bij] and C = [c^] are nxn  dense matrices. The matrix C can be 
evaluated according to the following recurrence:
Cif] = Cif~l) + aik bkj
The variables a and b have two coordinates each, yet the algorithm is three 
dimensional. By equating each of the coordinates of variables a and b to a con­
stant, a can be pipelined along the j  axis which can be described by i -  c x, 
k = c 2, where and c 2 are constant integers. Variable b can be pipelined 
along the i axis described by j  = c 3, k = c 4, where c 3 and c 4 are constant 
integers. Thus, according to lemma 1, to pipeline a and b,  the variable a flows 
from the point (i , j , k ) to the point ( i , j+ l ,  k), while the variable b flows from 
(i , j , k ) to (i+1, j ,  k).  The matrix-matrix multiplication algorithm can be
58
rewritten as:
51
52
53
for i = 1 to n do 
for j  = 1 to n do 
for  k = 1 to n do 
begin
c M  = c t "  + a # '  
a i i+l) = a P
^ +1> = W
end. (3)
Convolution Product Algorithm: Given a sequence Jt(0), x (l) , . . . , *(&!), and a 
set of coefficients w(0), w (l), . . . , w (k2), the convolution algorithm given by
for i = 0 to k l do 
for j  = 0 to k 2 do
y ( i j )  =  y ( U - i )  + w 0 )  x ( l - j )
computes the sequence y(0 ,k2) , y ( l , k 2), . . . , y ( k x,k2). The used variable w( j )  
has a missing coordinate. Therefore, it should be pipelined along the line j  = c , 
where c is a constant integer. Applying lemma 1, w (J ) is replaced by w ( i , j ) 
and a new statement, w ( i + l j )  = w (i,j) ,  is introduced.
The value of the used variable x ( i - j ) is pipelined along the line i - j  = c or 
j  = i - c ,  where c is a constant integer. If lemma 1 is to be applied, x { i - j )  is 
changed to x ( i , j ) ,  and the statement x(i+ \,j+ l)  =x ( i , j )  is inserted in the pipe­
lined version of the algorithm. By letting x( i , j )  = x ( i - j ) ,  the convolution algo­
rithm is rewritten as:
59
51
52
53
for i = 0 to k \ do 
for j  = 0 to k2 do 
begin
w(i+lj) = w(ij)
x (i+ lj+ l) = x(ij)
y ( U )  =  y ( h j - l )  +  M U )  x ( i j )
end. (4)
LU-decomposition: Consider the problem of factoring an nxn  dense matrix
A = [aij] into a lower triangular matrix L = [/f-] and an upper triangular matrix 
U -  [Uij] such that A = L U . The factors L and U can be evaluated using 
Doolittle’s method [31]:
a-<*> a t X) ~ Iik ukj
hk ~
0
1
“kk
ukj
i<k
i=k
i>k
k>j
k<j
After pipelining the variable / along the line i = C\, k = c 2 and the variable u 
along the line k = c $ ,  j  = c 4, where c l5 c 2, c 3 and c 4 are constant integers, 
the LU-decomposition algorithm is transformed to the following form:
SI:
for k  = 1 to n-1 do 
for i = k to n do 
for  j  = k to n do
if  ( i = k )
= a « - l)
60
else if ( j  =- -k)
S2: I P  = a t l) / «4f-1}
S3: U&P =
else begin
S4: ukp  = »>
S5: #  = i t i)
S6: = a t "  ~ I t "
end. (5)
Transitive Closure/Shortest Paths Problems: Let G be a directed graph. The 
graph G ' , which has the same vertex set as G , but has an edge from u to w if 
and only if there is a path from u to w in G , is called the transitive closure of 
G [86].
Every simple graph G can be represented by an adjacency matrix. The adja­
cency matrix of an n -vertex directed graph is defined as an nxn  binary matrix 
A = [a-ij ] where = 1 if and only if there is a directed edge from vertex i to 
vertex j  or i = j .  The transitive closure problem is to compute the transitive clo­
sure binary matrix A' = [a'^ ] where a'^ = 1 if and only if vertex j  is reachable 
from vertex i through a sequence of directed edges [87].
The Warshall algorithm for solving the transitive closure problem accepts an 
nxn  binary matrix A and performs the sequence of transformations described 
below [87]:
for k = 1 to n do 
for i = 1 to n do 
for j  = 1 to n do
a k (i , j ) = a k~'(i,j) + a k~l{i ,k).ak~l{k J ) ;  (6)
61
where "+" and are Boolean OR and AND, respectively. The variable a,  which 
at each iteration is used at three index points and generated at one index point, has 
three coordinates at all four positions in the statement described in loop (6). There­
fore, it needs not to be pipelined since the dimension of the algorithm is also three; 
i.e., the algorithm uses three index variables.
Equivalently, the Floyd algorithm for finding the shortest paths of every pair of 
vertices of an n-vertex directed graph can be described using the above algorithm but 
in this case A=[ai;] is the distance matrix, "+" is the minimum operation, and is 
the addition operation. The distance matrix of an n-vertex directed graph G is 
defined as an nxn  matrix A = [ai;] in which
1. atj = the cost of the edge from vertex i to vertex j  if there is an edge.
2 . aij = oo if no edge exists from i to j .
3. aij = 0 if i = j .
3.2. Data Dependencies and Algorithm Regularization
Due to advances in computer architecture which featured multiprocessing and 
vector machines, the analysis of data dependencies received considerable attention in 
the 70’s [88] - [93]. The purpose of the analysis was to detect concurrency of opera­
tions to speed-up programs and to design efficient compilers. These early studies 
suggested transforming programs and algorithms to accommodate the new charac­
teristics of the parallel machines. Although program transformations proposed before
62
contain many basic results, they are not adequate enough for VLSI implementations. 
In addition to a high degree of parallelism, VLSI arrays suggest pipelining and local 
communication [75].
Data dependence vectors which can be found from the pipelined algorithms 
using the definition given in Section 3.1, are very useful in the context of VLSI array 
design. First, all possible pairs of generated and used variables are formed. As 
stated earlier, only data dependence vectors between variables in the same statement 
are considered. Then, for each <generated, used> variable pair such as 
<yi ( g; (/) ), y j(  W[ (I) )>, the following three steps are performed:
1. Change I  in the generated variable to / '.
2. Let gi ( f )  = w, (7).
3. Compute (from 2.) d (I) = I  -  / ' .
If the data in the algorithm flow in a regular manner, all computed dependence 
vectors are independent of / .  Otherwise, at least one dependence vector is a func­
tion of / .  Except for the work done by Dorairaj [80], only dependence vectors that 
are independent of index points were studied in relation to VLSI array design. This 
is due to the misconception that it is not possible to map algorithms with variable 
dependence vectors (dependent on I )  into regular architectures such as systolic 
arrays. In this work, both types of dependence vectors are considered.
63
3.2.1. Data Flow Regularization
A dependence vector whose components are integer constants is called a fixed 
dependence vector [80]. Fixed dependence vectors indicate regularity in the com­
munication structure of an algorithm. However, there are many important algorithms 
which exhibit variable dependence vectors. In many cases the variable components 
of the dependence vector are a function of the size of the algorithm, resulting in an 
irregular data flow.
Dorairaj [80] extended Moldovan’s work on mapping algorithms into VLSI 
arrays by modifying the method so that it includes some algorithms with variable 
dependence vectors. The idea is to replace a variable dependence vector with 
equivalent fixed dependence vectors. In this study, Dorairaj’s work is further 
extended and generalized so that many more irregular algorithms can be encom­
passed.
As stated before, assignment statements of the algorithms in concern are of the 
form:
y i i g i T ) ) = f  [ y i(  wi ( 7) ) , . . . ,  37( w2( 7 ) yk{ w3( f ) ) , . . . ,  
y*( w 4(T)) .  *i (  w5(T) ) . • ■■.  xm( w 6( I ) ) ] ;
Suppose the generated-used pair y, ( g(I )  ) and y; ( w2(/) ) form the dependence 
vector d.  Then, d = 1 - 1 '  such that g (/ ')  = w 2(J), where g and w 2 are
two linear functions in i.e. g , w 2 : J n —> J n . The dependence vector d is 
fixed if both of the functions g and w 2 are one-to-one functions. If any of the
64
following three cases is true, d is variable:
1) g is a many-to-one function; i.e., the equation g(I )  = a  has many solu­
tions, where a  is an n-tuple constant.
2) w2 is a many-to-one function; i.e., the equation w2(/) = P has many 
solutions, where P is an n-tuple constant.
3) both g and w2 are many-to-one functions.
Dorairaj [80] provided a proof that for the second case (w2 is a many-to-one 
function), a dependence vector d with one variable component can be replaced by 
three fixed dependence vectors d min, dR and dL. In the following, this idea is 
extended by proving that for all the three cases mentioned above, a dependence vec­
tor with I variable components is equivalent to 2/+1  fixed dependence vectors.
Theorem 1: A variable dependence vector d of  I variable components can be 
replaced by 2/+1 fixed dependence vectors: d min, dR. and dLj, y = 1 to /, and
^min =  (* i , . . . ,y i , . . . ,y 2,...,y/,...,X n),
where yj (j = 1 to /) takes the position of the variable component xv. of d, 
and
y ;  =
-min  ( abs (xv.) ) if all the values of xv. are negative 
min( abs(xv.) ) otherwise
65
dR. = (0 , . . . , 0 , yj', 0, , 0), j  = 1 to /,
where y d  takes the position of the variable component xv. of d and
y / = i .
dL. = (0 , . . .  , 0 , y " ,  0 , . . .  , 0), j  = 1 to Z,
where y "  takes the position of the variable component xv. of d  and
Proof: The Theorem will be proven correct for each of the three possible cases.
Case 1: In this case, g is a many-to-one function. Let 71? / 2, . . . be the
solutions of g( I )  = a, where a  is a n-tuple constant and k\  is a function of the
size of the algorithm. Let d  = where xv. (J -  1 to /)
are the variable components. As g has k x solutions, the /-tuple of integer vari­
ables (xVi, xV2, . . . ,xVi) g  Z l has k x different values. Then d =
{d i , . . . ,d it . . . ,dki}, where each one of these vectors is a fixed dependence vec­
tor. This situation is depicted in Figure 3.2a.
Since k i  is a function of the size of the algorithm, the number of fixed depen­
dence vectors corresponding to d will not be identical for different algorithm sizes. 
This requires going through the whole procedure of designing an architecture for 
each size of the algorithm. Also, for large sizes, ki  might be a very large number 
making the design of a VLSI array impossible. Theorem 1 replaces d not only by 
a small number of fixed dependence vectors but also by a number that is not a
66
(a)
r ' d L
mm
min
h
( b )
Figure 3.2 Data flow when g is a many-to-one function.
a) Irregular data-flow.
b) Regularized data-flow.
67
function of the size of the algorithm.
As d e  {dp . . . ,di, . . . ,dki}, we have to prove that each of these fixed 
dependence vectors is equivalent to (can be synthesized from) the system of vectors 
d mjn, dR., and (j = 1 to /). Let Im be the index point where the variable is 
used and let / min be one of the solutions to g (I) -  a  such that d min = 
An ~ ^min- We can write
= l\*dRl + + • • • + Jl*dRl + (J.[*dLl ,
where = 0 , j j  > 0 , (iy > 0  (j = 1 to I), and /,• are the k x solutions of
g (/) = a . It is clear that
An — A — d[ — (An — ^min) (^min — A )
= ^min + + ' ' ' +
Therefore, <7, can be represented by a linear sum of d mm, dR. and dL. 
(j = 1 to /). Thus the Theorem is correct for this case. The modified data flow, for 
/ = 2, is shown in Figure 3.2b.
Case 2: In this case w 2 is a many-to-one function. Let the index points I p  
12 > • • ■ >At2 *3e t^e solutions of w 2(I) = |3, where (3 is an n-tuple of constants and 
k 2 is a function of the algorithm size. Let d = (xp...jcVi,...j;V2,...j:Vi,...j:n), where 
xv. (j = 1 to /)  are the variable components. As w 2 has k 2 solutions, the /-tuple 
of integer variables QcVl, xV2, . . . ,xV/) e  Z l has k 2 different values. Then d
68
may be represented by a linear sum of {dy, . . . , dit . . .  , dk2}, where each one of 
these vectors is a fixed dependence vector. This situation is shown in Figure 3.3a.
As in case 1, it will be similarly proven that d is equivalent to a set of fixed 
dependence vectors d min, dRj, and dL. (j = 1 to /). Let 7min be one of the solu­
tions to the equation w 2(J) = P such that d min = 7min -  Im, where Im is the 
index point where the variable is generated. Now, we can write
h  - / min =  S i*dRi +  r\i*dLi +  • • • +  8 / * ^  +  r\{*dLi ,
where 8j*r\j = 0, 8j > 0, r\j > 0  (j = 1 to I) and IL are the k 2 solutions of
w2(7) = p. Also, it is clear that
I i  ~  ~  d’i ~  ( f i  ~  ^m in ) (^m in  — ^ m )
= ^min +  h * d Rx +  r \ l* dL 1 +  ' ' ' +  +  11 l* d L r
Therefore, dt can be represented by a linear sum of d min, dRj, and dL.
(J = 1 to /). Thus the Theorem is correct for this case. The transformed data flow
for 1 = 2  is shown in Figure 3.3b.
Case 3: This is the case where g and w 2 are many-to-one functions. Let 
7i, I 2, . . . ,Iki be the solutions of g(I )  = a, and let 7^  +1, . . . J kl+kz be the
solutions of w2(7) = P, where a  and p are two n-tuple constants. Let d =
(xlv..^:Vi,...^tV2,...^cV(,...^tM), where xv. (y = 1  to 7) are the variable components of 
d. As g has ky solutions and w2 has k 2 solutions, the / -tuple of integer vari-
69
Im
Ik
(a)
Im
'min
!' d L,
h
(b)
Figure 3.3 Data flow when w is a many-to-one function.
a) Irregular data-flow.
b) Regularized data-flow.
70
ables (xVi, xV2, . . . ,xVl) e Z l has k 3 = k^*k2 different values. Also in this case
d can be synthesized from a set of vectors {dx, . . . ,d it . . . ,dki), where each one
of these vectors is a fixed dependence vector. This situation is shown in Figure 3.4a.
This case is a combination of the first two cases. It will be proved that d can 
be replaced by a set of fixed dependence vectors d min, dRj, and dLj, j  = 1 to /. 
Let /OTi be one of the solutions to g (I) = a  and Im2 one of the solutions to 
w 2(I) = P such that d min = l m2 -  Imi. We can write
Tm i- h x = Yi *dRl + + • • •  + 7 i*dRl + \i-t*dLl
and
Ii2~ Im 2 ~  81 *dRi +  t i \*dLi +  ■ • • +  8 [*dRi +  r it *dLl ,
where yj*\ij = 0 , 5; *r\j = 0, yjs [ij t  8j, r\j > 0 , (j = 1 to I), 7h are the k x
solutions of g(I )  = a  and Il2 are the k 2 solutions of w2(/) = (3. This will lead 
to
d i  =  =  <rm 2 - r m i)  +  +  (Ti 2 - r m2)
= ^min + (Yl+8 l)*rf/?, + + ■ • • + (yi+§t)*dRl + (^i+r\i)*dLi.
Therefore dt can be represented by a linear sum of d min, dL. and dR. (j = 1 to 
I). Hence, the variable dependence vector d of / variable components can be 
replaced by a set of 21 + 1 fixed dependence vectors. Thus the Theorem is also 
correct for this case and the proof is complete. The modified data flow, for / = 2, is 
shown in Figure 3.4b. □
Figure 3.4 Data flow when g and w are many-to-one functions.
a) Irregular data-flow.
b) Regularized data-flow.
72
3.2.2. Dependence Vectors Subsets
The if-then-else statements included in the algorithm model described by pro­
gram (1) allow the algorithm to have statements that use only a subset of the index 
set. In addition, as it can be seen from the previous subsection, for algorithms with 
variable dependence vectors, a variable might use different dependence vectors at 
different index points. Referring to Figure 3.2b, it is clear that the index point Ii{ 
uses d i2 and dRi as its dependence vectors while the index point 7min uses d^ , 
and dRz. For these reasons, the set of index points J n of an algorithm
r
may be divided into r subsets J-\ i = 1 to r and L 7 /"  = 7" , such that all
(=1
index points in the same subset use the same dependence vectors. Also, the set of 
dependence vectors D is partitioned into r subsets Dt , / = 1 to r and
r
KJ Dt -  D , such that all index points in / "  use vectors in Di for their data 
i=l
r
dependencies. It is noted that, generally, P i  Di ^  0 .
/=i
If all dependence vectors of an algorithm with no if-then-else statements are 
fixed, the data dependence vectors set D and the index set J n do not have to be 
partitioned. In this case all dependence vectors d e D are applied to all index 
points Ij e J n .
73
3.2.3. Ordering of Index Points
The aim here is to use the notion of data dependence vectors to dictate the com­
munication requirements of an algorithm and to infer information about the time of 
execution of the algorithm’s statements at a given index point. Thus, ordering the 
index points in space and time becomes the main issue. Systolic array features such 
as local communications have to be maintained while providing a time-optimal order­
ing of the index points. In this subsection, the basis on which index points can be 
properly ordered are detailed. The next section provides an algorithm for the optimal 
ordering of all index points /,• e J n, where J n is the index set of the algorithm in 
concern.
As stated earlier, a dependence vector dictates the flow of a data stream between 
index points. The data stream through a sequence of index points can be classified 
as follows:
1. The data stream is just transmitted without alteration.
2. The data stream flow has an accumulating function.
3. The data flow is neither transmitting nor accumulating.
For the first case to occur at least one of the statements in the algorithm has to 
be a simple assignment statement such as a (i , j+ 1) = a(i , j ) .  The second case arises 
when statements such as c ( i , j , k ) = c ( i , j , k - l )  + a ( i , j , k ) appear in the algorithm. 
The flow of variable a in the first case is dictated by the dependence vector d \ =
74
(0 ,1) ', while the flow of variable c in the second case is characterized by the 
dependence vector d 2 = (0,0,1)'. These two dependence vectors have one thing in 
common. Consider, for example, the case where d\ = / 2 -  / 1 and d 2 = 74 -  73, 
where I  h / 2, / 3 and / 4 are index points in the index set of the algorithm whose 
lexicographical order is as given. It is clear that in this case the operations at / 2 
and / 4 can precede (or be concurrent) in time (with) the operations at / j  and / 3, 
respectively without affecting the final result despite the lexicographical ordering of 
the index points, d j and d 2 are referred to as non-rigid dependence vectors. In 
the third case, the order of index points cannot be altered without affecting the final 
result. In such case, the data flow is dictated by a rigid dependence vector.
Definition 3.2: A dependence vector d is called non-rigid if and only if for every 
pair of dependent index points, Ik e J n and 7/ e 7" , where Ik = It + d ,  the 
operation at Ik can precede (or run concurrently) in time (with) the operation at 7/ 
without affecting the final result. Otherwise, d is rigid. We implicitly assume that 
11 is ordered before Ik in the lexicographical sense.
Introducing the notion of non-rigid dependence vectors proves to be an impor­
tant and significant step towards finding time-optimal ordering of index points. Also, 
it permits using the I/O bandwidth and the index point fan-out as variable factors 
which can be fixed by the designer to satisfy the requirements. The fan-out of an 
index point 7; is defined to be the number of index points where a variable v is
75
concurrently used provided that v was generated at 7; . Without differentiating 
between rigid and non-rigid dependence vectors, I/O bandwidth and index point fan­
out are typically determined by the mapping procedure since, in that case, the order 
of index points in space and time is fixed. Therefore, mapping procedures that do 
not distinguish between rigid and non-rigid dependence vectors cannot guarantee 
optimality.
The second issue that should be addressed regarding the ordering problem is 
how to order the index points that use a variable dependence vector. In Section
3.2.1, it was proven that a variable dependence vector d  with / nonconstant com­
ponents can be replaced by 2/+1 fixed dependence vectors rfmin, dR. and dL. (i 
= 1 t o /). It is clear that the 21 vectors dR. and dL. are non-rigid because data is 
sent, from one index point to another index point, without being modified in the pro­
cess. The vector <7min, however, can be either rigid or non-rigid depending on the 
original variable dependence vector d .
Figures 3.2b, 3.3b and 3.4b show possible ways of ordering the index points 
(for 1 = 2 ) using the 2/+1 fixed dependence vectors. Referring to Figure 3.3b, it 
can be seen that the data is sent first to index point / min, then it is transmitted to 
index points 7 ^ + ^ ,  Tmin+dLi, Tmin+dL2 and Tmin+dRi. From Tmin+dRl, the 
data is transferred to index points I mm+2dRi, I m^ - d R +dRi and I min+dR +dL2. In 
this case the fan-out of index point 7min is equal to four while, the fan-out of 
I min+IRl is equal to three. If 1 = 3 ,  the fan-out of 7min would be six, while the
76
fan-out of / min + d.Ri would be five. It is clear that the fan-out of index points 
increases with I. This will perhaps make it impossible to design VLSI arrays for 
algorithms with variable dependence vectors that have more than three nonconstant 
components.
3.2.3.I. Ordering Index Points with Fixed Fan-out
To order the index points that use a variable dependence vector d with I 
nonconstant components while keeping the index point fan-out independent of /, we 
include an additional dependence vector to the set of fixed dependence vectors that 
replace d . This dependence vector which is called the zero dependence vector and 
denoted by dz , has a dimension equal to the dimension of the algorithm and all of 
its components are zeros. As an example, dz = (0  0 0)' for n -  3, where n is 
the dimension of the algorithm. It is clear that by including dz to the set of fixed 
dependence vectors, we are actually adding some dummy index points to the index 
set J n . These index points act as buffers and do not have to perform any opera­
tions.
In Subsection 3.2.1, we have shown that any variable dependence vector d e 
{di, , dt , . . .  , dk } with I nonconstant components can be represented by a 
linear sum of <7min, dR. and dLj, where j  = 1 to /. Then, for all three cases 
described in the mentioned subsection,
di = d m in  +  yi*dRl +  ‘ +  yi*dRl +
77
where j j  *\lj = 0, Jj > 0, \ij > 0  (j -  1 to I). By adding Q*dz to both sides of 
the equation,
^  + Q*dz ~ ^min + J\*dRi + [i\*dLl + ■ ■ ■ + y[*dRi + \ii*dLi + Q*dz ,
where 0 > 0. Knowing that dz is a null vector, then + Q*dz = dt and
di = ^min + Yi*dRl + + ’ ' ‘ + 7i*dR, + V-i*dLl + B*dz .
Therefore dt- can be replaced by 2 1 + 2  fixed dependence vectors d min, dLj, dR. 
and dz , where j  = 1 to /.
The zero dependence vector dz is employed in the ordering procedure as fol­
lows. The index points that use the variable dependence vector d of I noncon­
stant components are divided into I groups. The first group consists of the index 
points that use one of the three vectors d mjn, dLi or dRi as one of their depen­
dence vectors. The second group contains the index points that utilize either dLi or 
dRi as one of their dependence vectors, while the index points in the i th group use 
either dL. or dR;. dz will be used to transfer data from index points of one group 
to the index points of another group. For example, to transmit data from 7niin to I  j 
= / min + dRi + 2dLi, the data will pass through index points / min + dRi, I min+dRi + 
dz and / min + dR l + dz + dLl, before reaching Ij. Figure 3.5 shows this ordering 
symbolically. In the figure, n = 4, 1 = 3  and 7min is assumed to be (0, 0, 0, 0).
It is clear that the ordering method described here increases the execution time 
of the algorithm since some of the index points are repeated; i.e., 7; + dz = 7,.
78
min o
Figure 3.5 Index point ordering with fixed fan-out.
79
However, this method guarantees that all index points will have a fan-out less or 
equal to three (in addition to links corresponding to other dependence vectors in the 
algorithm). Actually, except for index points / min + adz, a  > 0, all index points 
will have a fan-out equal to two; an index point in the i th group transmits data to 
one of the points in the same group and to another point in the group.
This method is equivalent to inserting delay buffers in the data stream to reduce 
fan-out. Thus, many more algorithms with variable dependence vectors can be regu­
larized and eventually be mapped into regular architectures such as systolic arrays.
3.2.3.2. Deadlocks in Index Points Ordering
The above ordering methods have one drawback: an algorithm that implements 
one of these methods might reach a deadlock state when ordering index points that 
use two or more variable dependence vectors that have one or more nonconstant 
component(s) at the same coordinate(s). To show how t he deadlock might occur 
take, for example, two variable dependence vectors d\  and d 2 that have one non­
constant component at the same coordinate. Assume that d\  = {dn , . . .  , d i k ) 
and d 2 = {d2\, • • • , d 2k} dictate the flow of data from index points /,-, i = 1 to 
k,  to index points Imi and Im2, respectively. This situation is shown symbolically 
in Figure 3.6a.
By using Theorem 1, d i  and d 2 can each be replaced by a set of three fixed 
dependence vectors {dmini, dR, dL } and {dmiri2, dR, dL }. By implementing any of 
the ordering methods described earlier, the ordering of /,-, i = 1 to k,  is as
80
22 2k
o
IfTlj
4 o
Figure 3.6 Data flow when tw'O dependence vectors have a 
nonconstant component at the same coordinate.
a) Irregular data-flow.
b) Regularized data-flow using Theorem 1.
depicted in Figure 3.6b. In this case, the operation at / min2 depends on the opera­
tion at / mini ( /min2 -  7mini = a dR, where a  is an integer constant) and the opera­
tion at 7mini depends on the operation at f min2 (fmini - /~min2 = PdL = - a d R). 
Therefore, the operation at index point 7mini cannot precede in time the operation at 
index point / min2 and vice versa. This results in a deadlock situation.
Theorem 1 replaces a variable dependence vector d\  = {dn , , d lk } by
2/+1  fixed dependence vectors d ^ ^ ,  d^L. and d iR , i = 1 to /, where l<7minil = 
k _
Min I d y  I. As for every variable dependence vector, d j is variable if one of the 
following conditions is true:
1. g(I )  is a many-to-one function; a variable generated at many points is 
used at one point.
2 . w(I)  is a many-to-one function; a variable generated at one point is 
used at many points.
3. g{I)  and w(7) are both many-to-one functions.
Assume that the first condition is true and that Imi is the index point where the 
variable whose flow is dictated by is used. Then, d mmi = Im -  7mjni, such that 
7 m in, *s one ° f  the solutions of g (I) -  a; a  is an n-tuple integer constant. Since
82
k,  are all the solutions of g(I )  = a. For the remainder of this discussion, it is 
assumed that the first condition above (g is a many-to-one function) is true for all 
variable dependence vectors mentioned here. However, it is noted that the same rea­
soning apply to all three cases.
Following the argument presented at the beginning of this subsection, the 
deadlock problem may be prevented by not allowing any pair of index points, /, 
and Ij,  to use two conflicting fixed dependence vectors, such as dR and dL, 
where Ij -  Ii = $dR = - 0 dL . Referring back to Figure 3.6b, it is easy to see that a 
deadlock would have not occurred if / mini = / min2. In that case, none of the pair of 
index points would use conflicting dependence vectors.
Therefore, the solution to the deadlock problem is to modify the flow of data 
dictated by the p  variable dependence vectors that have at least one nonconstant 
component at the same coordinate so that only one index point, denoted by I min* 
generates data needed by the p used index points. IM1N can be chosen such:
— ^m in i "h ^mm.i +  " ’ ' +  ^m in p
I  M IN  =  ~
where all operations are n-tuple integer operations; n is the number of coordi­
nates in an index point.
Theorem 2: Any p  variable dependence vectors d 2, ■ ■ . , dp that have / 
nonconstant components at the same coordinates can be replaced by p +21 fixed 
dependence vectors: d ^ ,  d 2f ,  ■ ■ . , dpf ,  dRj and dLj, j  = 1 to / ,  such that
where d min., i = 1 to p ,  and dR. and dLj, j  = 1 to / ,  are as defined in Theorem
1.
Proof: The proof is two parts. Firstly, we have to prove that a variable dependence
vector d-t = {dn , . . . , di f , . . ■ , dip} can be replaced by some of the p +21
fixed dependence vectors mentioned above. By following the proof of Theorem 1 
and using IMIN instead of / mjn and d^  in place of d mm, it can be easily proved 
that di is equivalent to 2/+1  fixed dependence vectors: di f , dR. and dLj, j  -  1 
to /.
Secondly, we have to prove that if
—  ^ m i r i ]  "f" ^ m i n 2 ^ m i n p
I  M IN  =  ~
then,
d \ f  +  ^ 2 /  +  '  '  '  +  d p f  —  d j u j j j j  +  d m i n 2 +  +  d m ; n ^
We know that
^1/ = I m i  ~  I  M I N
d p f  =  ~  I  M I N  ■
Therefore,
It should be noted that the solution described in this subsection, in addition to 
preventing deadlock, guarantees the regularity of the modified data flow by replacing 
the variable dependence vectors with fixed dependence vectors whose number is not 
a function of the size. However, it does not assure locality of communication since 
the components of the dependence vectors d ^ ,  d 2f ,  ■ ■ . , d ^  might be much 
greater than unity.
3.3. Algorithm to SPD Mapping
The Systolic Precedence Diagram (SPD) is a diagram that graphically displays 
the computations in accordance with the algorithmic precedence rules under the sys­
tolic processing requirement. That is, computations are represented by nodes ordered 
in compliance with the precedence requirements of the algorithm and with the sys­
tolic imposed requirement of strictly local communications and regular data flow. 
The SPD provides a model from which parallel operations are identified. The lower 
bound on the total execution time of an algorithm on a systolic array with minimum 
PEs fan-out and for a given I/O bandwidth can be determined from the SPD.
Definition 3.4: An SPD is a 5-tuple (X, Y, N, DJ, 10) where:
85
X is the set of input variables. This set is equivalent to the set of 
input variables of the algorithm.
Y is the set of output variables. This set is equivalent to the set of 
output variables of the algorithm.
N is the set of nodes. The nodes have one-to-one correspondence
with index points e J n . It is assumed that the operation at
each node is equivalent to all computations performed at the 
iteration corresponding to /,.
DJ is the set of pairs (£>,•, 7") where 7 ” is the subset of index
points that use the fixed dependence vectors in D i . 7" = 7"
i
and P i  7" = 0 ,  while KJ Dt = D and P  Di * 0 ,  where
i i i
7" is the index set and D is the set of fixed dependence vec­
tors of the algorithm.
1 0  is the I/O bandwidth restriction. This factor is chosen by the 
designer.
In practical systems there will always be a limit on I/O bandwidth. In many 
cases I/O cost represents a major part of total system cost. Thus a priori cap on the 
maximum allowable I/O bandwidth may need to be defined before further design 
steps can be taken. For instance, consider the matrix-matrix multiplication problem. 
If the matrices involved are nxn  dense matrices, a total of n? multiplications are
86
needed. If the array structure and I/O bandwidth are disregarded for a moment, it 
becomes obvious that all n 3 multiplications can be executed in parallel in one time 
step. This would require, however, a bandwidth of 2n3 elements/step; an intolerable 
requirement especially for large n .
The following are some general rules that the SPD construction algorithm 
should maintain [35]:
1) A task is a simple operation performed by the algorithm at a given index 
point 11 e J n . The operation is equivalent to all computations performed 
at the iteration corresponding to /,•.
2) Each task is represented by a node containing a task number. Tasks are 
placed at time levels in accordance with their order of execution as 
required by the algorithm.
3) Each node has some input arcs and some output arcs. Each input arc is 
labeled with the datum it carries and a number. The number identifies the 
task from which the associated datum emanates.
4) Tasks in the same level (horizontally adjacent nodes) can be executed in 
parallel. The levels are numbered in ascending order L X, L 2, ■ • • in 
accordance with precedence constraints imposed by the algorithm. Tasks 
at Li are executed ahead of tasks at L j , j > i .
5) To guarantee maximal parallelism, a task is initiated as soon as possible.
87
It is placed at the smallest numbered level such that I/O bandwidth 
requirements, data dependencies of the algorithm and minimum node fan­
out are maintained.
3.3.1. An Algorithm for Constructing the SPD
Given the set DJ  and the I/O bandwidth requirement of an algorithm, its SPD 
can be constructed using Algorithm A. The algorithm is divided into two major parts. 
In the first part, the nodes of the first level are chosen, while in the second part, all 
others nodes are assigned a time level. It is assumed that the index set J n is 
divided into r disjoint subsets each associated with a set of fixed dependence 
vectors 2 Ds , s = 1 to r .
ALGORITHM A
BEGIN
I. By executing the program of the algorithm in concern in the given order, assign 
the index point /,■ e / "  to a node at L j if and only if:
1. There exists no index point Ij e J n such that /, = /  ■ + dk , where dk is
i
any rigid dependence vector in Ds .
2. There exists no index point Ij which has already been assigned to a node,
such that 11 = Ij + mdh  where m is an integer constant and dL is any
2 This implies that the algorithm in concern has been regularized and its variable dependence vec­
tors substituted by a set of fixed dependence vectors.
88
non-rigid dependence vector in Ds .
3. I/O bandwidth requirement is not violated.
II. a.  For each node at L j, construct a new program that reorders the index 
points. If the initial algorithm in concern is of the form:
for i j = Vj, u x do
for in = v„, un do 
begin 
loop-body; 
end.
Then the program associated with the node (at L j) corresponding to index point 
Tj = ((ii)y,(/2)y, • • • is
for i l = (.i{)j, . . . , u h v v  . . . -  1
for in = (in) j , . . . , un, v n , , (in)j -  1 
begin 
loop-body; 
end.
where the first iteration of the new program corresponds to I j .
b.  By executing, in parallel, all the new programs that correspond to nodes at
L l5 place the index point /; e  / "  at the lowest time level possible according
to the following:
1. Find the smallest integer t, t> 1, such that all the index points Ij e J n
89
where l x = Ij + dk are assigned to level Lp , p < t , and dk is any rigid 
dependence vector in Ds . If no rigid dependence vectors exist, set t = 2 
(level 2).
2. Find the smallest integer w,  w>0, such that there exists no index point Ij
e  J n assigned to Lt+W, where I x = I j  + mdt , dt is any non-rigid depen­
dence vector in Ds and m is an integer, and such that the VO bandwidth
requirement is not violated. If no non-rigid dependence vectors exist, set w 
=  0.
11 is placed at level Lt+W.
END (Algorithm A).
3.3.2 Explanation of SPD Construction Algorithm
As stated earlier, there are two parts in the SPD construction algorithm. In the 
first part, the index points that can be placed at the time level Ly are chosen. All 
the inputs to these points come directly from the input data to the algorithm. This is 
reflected in statement I  of Algorithm A which states that if an index point is to be 
placed at the first time level, it cannot be rigidly dependent on any other index point 
in the index set. In addition, a node is assigned level L x if its external inputs are 
not needed by another node already assigned to L-y. This is declared in the second 
statement of part I  which states that two index points which are nonrigidly depen­
dent on each other cannot be placed both at L  j. This assures that no broadcasting is 
permitted. If broadcasting were to be allowed, this statement (second statement in
90
part I )  would be omitted. And finally, the I/O bandwidth requirement should be 
maintained. If, for example, the input bandwidth is restricted to n elements, the 
nodes placed at the first time level cannot have more than n inputs corresponding 
to n external input elements.
In the second part, all the remaining index points /,• e J n are assigned to the 
lowest time level possible. To preserve local communications property of VLSI 
arrays and to guarantee the minimum number of time levels (and in turn minimum 
execution time), new programs which are reordered versions of the original 
algorithm’s program are constructed. The first iteration of each program corresponds 
to an index point placed at the first time level making the number of new programs 
equal to the number of nodes at Lj.
To help explain the remaining part of the algorithm, we use three terms: 
HOLD, REJECT, and PROCEED. These terms refer to (imaginary) states that an 
index point might take while looking for a time level to be placed at. By executing 
all the programs in parallel; that is taking one point from each program at a time, an 
index point /,• e / "  can be put in one of three states: HOLD, REJECT, or 
PROCEED.
a. 11 takes a HOLD status if it rigidly depends on another index point Ij 
(d - 11 -  Ij is a rigid dependence vector in Ds) which has not been 
assigned a time level yet.
91
b. It is in REJECT state if it has been already assigned to a level.
c. 11 takes a PROCEED status, otherwise.
if  Ti is in HOLD,  it stays in that state until all the index points in the 
different programs have been given a state. Only at that time, /,• tries to change 
status. The process terminates when all the index points It e J n have been given a 
REJECT state.
When an inde x point I t is given a PROCEED state, the construction algo­
rithm proceeds to find out its time level. It is assigned to the lowest level L[+w 
such that all the index points that it rigidly depends on have been assigned to one of
the levels Lpi, p l < t , and such that the fan-out of all nodes at Lpi> p 2 <t+w,  is
kept to a minimum. If /,• does not rigidly depend on any index point, t is set to 
2 , and if it does not nonrigidly depend on any other index point, w is assumed to 
be 0. In addition, the I/O bandwidth requirement should not be violated; that is if the 
input bandwidth is restricted to n elements, the number of input arcs that are carry­
ing the input data of the algorithm cannot exceed n at any time level.
It is easy to verify that if all the original dependence vectors of a given algo­
rithm are fixed, then all the nodes of the SPD will have a fan-out of one; that is an 
output of a node at Ll will, at most, appear as input at only one node in L} , j>i.  
However, if some of the dependence vectors are variable, the fan-out of the nodes 
can be between one and three, according to the ordering method described in the pre­
92
vious section.
Theorem 3: For a given I/O bandwidth and for the minimum node fan-out, an SPD 
constructed using Algorithm A is time-optimal.
Proof: Time optimality can be proven if the following can be shown to be correct:
1. No more nodes can be placed at the first time level (L x) of the SPD.
2. Given the nodes at L j, the number of time levels in the SPD is
minimum.
The first point will be proven by contradiction. Assume after running part I 
of Algorithm A index point I} was not assigned to level L h Now assume that 
Ij  could be assigned to L }. In order for Ij  to correctly belong to L x, three con­
ditions must be met:
a) the number of external input arcs incident on nodes to L l is less than the 
maximum input bandwidth allowed.
b) Ij does not rigidly depend on any other index point in J n .
c) external input data to Ij is not used by any other index point assigned to
L\-
From part I  of Algorithm A , it can easily be seen that if all these conditions 
were true then I j would have been assigned level L x. This contradicts the above 
assumption and Ij violates at least one of the above three conditions.
93
To prove the correctness of the second point, let Lt be the highest level of the 
constructed SPD, and assume that it is not time-optimal. This means that the SPD 
algorithm had assigned at least one index point /,- to a node x  at Lt , but actually 
this node can be placed at a lower level Zy, t' < t.
To place node x  at L t , the SPD algorithm must have found at least one 
index point Ij assigned to a node at Lt_i such that (step II of the SPD construc­
tion algorithm)
Ij -  Ii -  md[ or (1)
Tj = h - d k (2)
where m is an integer, dk is a rigid dependence vector and dt is a non-rigid 
dependence vector. If /,• is assigned to a node at Lt> ([' < t - 1) and Ij (com­
puted from eq.(l) or eq.(2)) is assigned to Lt_x, then either the minimum fan-out 
requirement (eq.(l)) or a particular data dependency (eq.(2)) is violated. Thus, x 
could have not been placed at Lt>. It immediately follows that the SPD is time- 
optimal and that it thus determines the lower bound on systolic execution time of the 
algorithm for a given I/O bandwidth and for the minimum node fan-out. □
It should be clear that if the non-rigid dependence vectors of an algorithm were 
used as rigid dependence vectors in applying Algorithm A, the ordering of index 
points would be suboptimal. As will be shown and explained in the next chapter, 
doing this is sometimes necessary if, for example, we have to trade time-optimality
94
for a specified array geometry (e.g. planar versus nonplanar).
3.3.3. Examples to Illustrate SPD Construction
Matrix-Vector Multiplication: The pipelined version of the matrix-vector multipli­
cation algorithm is given by loop (2) (Section 3.1.2). There are four <generated, 
used> pairs of variables, but only two of them produce nonzero dependence vectors. 
Let d x be the dependence vector corresponding to <bl +l(k'), b l (k)> and let d 2 
be the dependence vector corresponding to <c^(i'),  c k~l (i)>. According to the 
definition of dependence vectors (see Section 3.2), d x = ( i - i ' , k - k ' ) ‘ such that 
i'+l = i and k ' = k ,  while d 2 = ( i - i ' , k - k ' ) ‘ such that i' = i and k' = k - 1. 
Therefore, d x = (10)' and d 2 = (0 l)r. The two distinct pairs of <generated, 
used> variables and their respective data dependence vectors are summarized in 
Table 3.1.
Table 3.1
Dependence Vectors for Matrix-Vector Multiplication
The pairs of generated-used 
variables
Dependence vectors
i - i '  k - k '
<bi,+\ k ' ) ,  b ‘(k )> 
< c^(/'), ck~\i)>
(i oy = d x
(0 l)r = d 2
It is easy to check that the two dependence vectors of the matrix-vector multi­
plication algorithm are fixed and non-rigid. The DJ set consists of one pair 
(DX, J \ )  where
95
d 1 = { d ^ a o y ,  d 2 = ( o i y  } 
j  l  = j i  = { f  = (j, k ): i ,  * = 1 to n }
If the maximum input bandwidth is chosen to be n + 1 in elements from the 
matrix A and one element from the vector tf) ,  loop (2) can be directly mapped 
into the SPD shown in Figure 3.7a (for n = 3) following the SPD construction algo­
rithm described earlier. The highest time level is L ln-\. However, if the limit on 
the input bandwidth is increased to 2n (n elements from the matrix A and n 
elements from the vector U>), n nodes can be placed at the first time level making 
the number of SPD levels equal to n . The corresponding SPD in = 3) is shown in 
Figure 3.7b.
Matrix-Matrix Multiplication: The matrix-matrix multiplication algorithm is given 
in loop (3) (Section 3.1.2). There exist five different <generated, used> pairs of vari­
ables, but only three distinct nonzero dependence vectors. The variable a which is 
generated and used in S 2, forms the pair <a^ ( i ' ,&'), ,k)>. It follows that d^
= ( i - i ',  j - f , k - k ' ) ‘ such that i' = i, /  = j - l  and k' = k.  Then d\  = (0 1 0)'. 
Similarly, the data dependence vector for variable b which is generated and used in 
statement S3  can be found, d 2 = { i - i ' , j - f , k - k ' )' = (1 0 0)'. The variable c 
is generated and used in 51 forming the pair < c ^ ( i ', / ) ,  c k~1(i, j )>.  This makes 
d 3 = (0 0 1)'. The three distinct pairs of <generated, used> variables and their 
respective data dependence vectors are summarized in Table 3.2.
96
( 2 )
22
'23 '32
33
Figure 3.7 SPDs for matrix-vector multiplication (n = 3).
a) Input bandwidth of n+1.
b) Input bandwidth of 2n .
97
Table 3.2
Dependence Vectors for Matrix-Matrix Multiplication
The pairs of generated-used 
variables
Dependence vectors
i - i '  j - f  k - k '
<a-i (i' ,k'), a y_1(i,fc)> 
< b1’ {k' J ) ,  b l~ \ k , j ) >  
< c ^( i ' , / ) ,  c k~ \ i , j ) >
(0 1 0)' = d x
(i o o y = d 2
(0 0 1)' = d 3
It is clear that all the three dependence vectors are fixed and non-rigid. The set 
DJ consists of only one pair (D l5C j) where
d 1 = { J 1 = ( o i o y , d 2 = ( i o o y , r f 3 = ( o o i ) '  },
/ j  = / 3 = { /  = ( t , j , k ): i, j  and k = 1 to n }.
If the input bandwidth is restricted to 2n ; that is one row of matrix A and one 
column of matrix B ,  n nodes can be placed at level L 1 of the SPD. These 
nodes generate c 1(i ,i), i = 1 to n. Each of these nodes will be assigned a new 
program. The n programs are executed in parallel to generate all other nodes of 
the SPD. The corresponding SPD for n = 3 is shown in Figure 3.8a. The highest 
time level is L 2n-
If all the elements of the two matrices, A and B , were allowed to be input at 
the first cycle (bandwidth = 2n 2), the generated SPD would have n 2 nodes at the 
first time level and the highest level would be L„. The corresponding SPD for 
n = 3 is shown in Figure 3.8b.
LU-Decomposition: The pipelined version for finding the upper and the lower tri-
98
a)
Figure 3.8 SPDs for matrix-matrix multiplication (n = 3). 
(Cont.)
a) Input bandwidth of 2n .
b) Input bandwidth of n 2.
99
Figure 3.8 Continued.
°ll bll °I2 b22 °I3 b33 °22 b2l °23 b32 °2I bl3 °33 b3l °3I b l2 °32 b23
cll CI2 CI3 C2I c22 c 23 °3I c32 c33
2 4 3  5 1  6 5  7 6  8 4  9 8  1 9  2 7  3
°I2 b2l °I3 b32 0|l bl3 0 23 b3l °2I b l2 “22 b23 °3I bll 032 b22 033 b33
(2) <2) (2) (2) (2) (2) (2) (2) (2) 
ell CI2 CI3 C2I c 22 e23 °3I c32 c33
II 13 12 14 10 15 14 16 15 17 13 18 17 10 18 II 16 12
0|3 b3t 0 || b l2 °I2 b23 °2I b ll 022 b22 023 b33 032 b2l 0 33 b32 03 | b|3
CH C|2 C|3 C2 | C 22 c 23 O31 C32 033
b)
100
angular matrices of a dense matrix using Doolittle’s method is given by loop (5). 
There are six <generated, used> pairs of variables, while only three distinct nonzero 
dependence vectors are produced. The pairs <ul (& ',/), a k~x(i,j)> in 51, 
<l1'(k ' , / ) ,  a k~l{i J )> in 52, and <al (£ ', / ) ,  a k~l{i,j)> in 56  produce 
d x = ( k - k ' , i - i ' , j - j ' Y  = ( 1 0  0)'. The pairs < / '( £ ', / ) ,  u l~l(k,j)> in 52, 
<u1’ (Jc , / ) ,  u l~^{k,j)> in 53  and 54, and <al (& ',/), u l~l{k,j)> in 56 pro­
vide d 2 = ( k - k ' , i - i ' , j - / y = ( 0  1 0 ) '. Finally, the pairs <lJ' (i' ,k'), l j ~\i,k)>  
in 55, and <a1’ (k ' , / ) ,  w; - 1(z' ,k)> in 56  result in d 3 = ( k - k ' , i - i ' , j - f ) ‘ = 
(0 0  1)'.
It is easy to verify that both the matrix-matrix multiplication and the LU- 
decomposition algorithms have the same set of dependence vectors. However, all the 
dependence vectors for the LU-decomposition are rigid. In addition, some depen­
dence vectors are applied only to a subset of the index points because of the if-
then-else statement in the algorithm. The DJ set consists of three different pairs 
(Dt , 7t-3), i = 1 to 3:
D ! = { d! = ( 1 0  0)' },
A3 = { I  = (k , i, j ):  k = 1 to « - l ,  i = k,  j  = k to n }.
D 2 = { d x = (1 0 0 ) ' , d 2 = (0 1 0)' },
J 2 = { I = (k, i ,  j ):  k = 1 to n - 1, i = k+1 to n,  j  = k }.
D 3 = { d x = (1 0 0)', d 2 = (0 1 0)', d 3 = (0 0 1)' },
J 3 -  [ I  = (k, i ,  j): k = 1 to n - 1, i = £+1 to n, j  = k+l  to n }.
101
For any input bandwidth greater or equal to n , the constructed SPD would be 
the same. The SPD for n = 4 is shown in Figure 3.9. The optimal time is 3n-3  
units.
Transitive closure/Shortest paths: The Warshall algorithm for solving the transitive 
closure and (equivalently) the Floyd algorithm for solving the shortest path problems 
are described by loop (6). The derived data dependence vectors are shown in Table
3.3. The dependence vectors of the last two pairs have variable components, with
x  andy = 1 - n ,  . . .  ,0, . . .  , n - l .
Table 3.3
Dependence Vectors for W arshall’s Algorithm
The pairs of generated-used 
variables
Dependence vectors
k - k '  i - i '  j - f
<atd( / ', / ) ,  a k~l(i,j)> 
<aK ( / ' , / ) ,  a k~l{i,k)> 
Ka* ( / ' , / ) ,  a k~ \k , j )>
(1 0 0 )' -  d x 
(1 0  x)' = d 2 
(1 y 0)' = d 3
Let the index function associated with the used variable a k~l{i,k) be w(I)  = 
where I = ( k , i , j ) .  As can be verified, w is a many-to-one function. 
This situation falls under Case 2 (described earlier in Section 3.2.1). The same can 
be said about the used variable a k~l(k,j).
The third component of the variable dependence vector d 2 assumes the range 
of values given by x .  The minimum of the absolute values of x  is 0 which occurs
102
131 20
Figure 3.9 SPD for LU-decomposition (n = 4).
103
when j  = f  = k.  Similarly, the minimum of the absolute values of y  in d 3  is 0 
which occurs when i = i = k. By Theorem 1, d 2  and d 3  can be substituted by 
their corresponding fixed dependence vectors d min, dR and dL . So, the equivalent 
set of fixed dependence vectors for the Warshall (Floyd) algorithm is:
d j  =  (1 0  0 y
d 2min = ( 1 0  0 ) ' d 2R = ( 0  0  1)' d u  =  (0  0 - 1 ) '
d 3min = ( 1 0  0)' d 3R = (0 1 0)' d 3L = (0 - 1  0)'
And the corresponding DJ set is given is:
D l  = { J 1 = (l 0 0) ', d 2min = (1 0 o y , d 3min = ( 1  0 0)' },
7" = { I  = (k, i,  j ) \  k = 1 to n, i = k,  j  = k }.
D 2 = { J 1 = ( 1 0  0) ', d 2min=(  1 0  0)', d 3 R = ( 0 1 0)' ),
J 2  -  { /  = (k, i ,  j ): k = 1 to n, i > k,  j  = k }.
D 3  = { d x = (1 0 0)', d 2mm = ( 1 0  0)', d 3L = (0 -1 0)' )
J 3  = { I  = (k, i , j):  k  = 1 to n, i < k,  j  = k }.
D a = [ d x = { 1 0 0) ', d 2R = ( 0  0 1) ', d 3min{ 1 0 0)' }
J 4  = { / = ( & ,  i ,  j ):  k = 1 to n, i = k,  j  > k }.
etc.
The DJ set consists of nine subsets of dependence vectors with the corresponding 
subsets of index points.
For any input bandwidth greater or equal to n , the generated SPD would be the 
same. The reason is that all the dependence vectors of Warshall/Floyd algorithms are
104
considered to be rigid. The SPD for n=3 is shown in Figure 3.10. It is easy to 
verify that some of the nodes have a fan-out of two. The highest time level is L 5n_4. 
This time was proven optimal by other researchers [55], [56].
Dynamic Programming Problem: A string of n matrices are multiplied:
M = M ] X M 2 x M 2 x •••  xM „
Let r 0 ,r\ , . . .  ,rn be the dimensions of the n matrices with rt_j and r (- being 
the dimensions of M,-. Denote by mi;- the minimum cost of computing the product 
M(- x  M(+1 x  • •• M j . The dynamic programming algorithm which finally produces 
m ln is as follows [75]:
for I -  1 to n-1 do 
for i = 1 to n-l do 
for k -  i to i+l-1 do 
begin
m l (i ,k) = m l~l{i,k)
m l {k+l,i+l) = m l~l(k+\,i+l)
m l (i,i+l) = MIN [ml~l{i,k) + m /-1(A:+l,i+/) + r ( i- \) r(k)r( i+ l)}  
end. (7)
Excluding the variable r ,  the data dependence vectors derived from the above 
algorithm are shown in Table 3.4. The vectors corresponding to the last two g e n ­
erated, used> pairs have variable components at the same position, with
x = I - 1, 1-2, . . . , 1 and y = - 1, -2, . . .  , - /+ 1
(2) 20
Figure 3.10 SPD for transitive closure (n = 3).
106
Table 3.4
Dependence Vectors for Dynamic Programming
The pairs of generated-used 
variables
Dependence vectors
i - r i - i k - k '
<mr ( f  ,k'), m l 1 (i ,k )>
<mr (k'+ 1/  + /'), m l~ \k+ \,i  + 1  )> 
<m l (XX+l'), m l~l {i,k)>
<ml (i'X+l'), m l~l (k+l,i+l)>
(1 0 0 ) ' -  d ]
(1 -1 0 ) ' =  d :
(1 0 X )' =  d ,
(1 -1 y)' =  d<
Let the index function associated with the generated variable m l (i 'X+l')  be 
g( I  = (/, i, k) ) = (I 'XX+ l')-  As can be verified, g is a many to one function. 
Therefore, the data flow associated with the dependence vectors d 3 and d 4 is 
similar to the one shown in Figure 3.2a. By applying Theorem 1, the equivalent set 
of fixed dependence vectors for the dynamic programming algorithm is given as:
di
d 2
d“imin
^Amin
(1 0 0)' 
(1 - 1  0 )' 
(1 0 1)' 
(1 -1  - 1)'
d 3R = ( 0  0 l )1
d 4R - (o o i y
d 3L =  (0 o-iy
d AL = (0 0 - 1)'
Looking at Table 3.4, it can be easily verified that d 3min occurs when 
k = i'+l' = i+ l - l ,  while dmin4 occurs when k+1 = i' or k = i ' - l = i .  How­
ever, since d 3R = d 4R -  dR and d 3L = d 4L = dL , a special ordering of index 
points has to be followed to avoid deadlocks. The DJ set for the dynamic program­
ming problem consists of five pairs (Dt-, ,/,-3), i = 1 to 5, where 
D, = {d, = (l 0 0)', d 2 = (-1 -1 0)' },
Ji  = { (l , i , k ): I =  1 to n - l ,  i = 1 to n - l ,  k = 2/+/-1 and k = 2/+/-1
107
d 2 = { ^  = (1 o oy, d2-  ( - i - i  oy,dR=(o o iy },
2 i + l - l
J 2 = { (l , i , k ): I = 1 to n - l ,  i = 1 to « - / ,  £ >
03  ={<*!  = (! o 0) \ d 2= ( - l - l  0 ) ',4 « = (0  0 - 1)' },
J 3 = { I = 1 to n - l ,  i = 1 to n - l ,  k <
D< = M™„3 = (1 0 l ) \ d 2 = ( - l - l  0)' ),
J 4 = { (l , i , k ): / = 1 to n - l ,  / = 1 to n - l ,  k = i + l - l  }.
D 5 = { = (1 -1  -1 )'. rfl = (1 0 0)' ),
J 5 -  { ( l , i , k ) :  1 = 1 to n - l ,  i = 1 to n - l , k =  i }.
For input bandwidth of n , the constructed SPD for n
and& ^i+Z -l ).
and k *  i }.
= 4 is shown in Figure
3.11.
108
m 443322 3322
3423
4433
m13 m24 m 24
■4
m24
m 14
ml3 m44
m 14
Figure 3.11 SPD for dynamic programming (n = 4).
CHAPTER 4
CONSTRUCTION OF SDG AND ITS MAPPING 
INTO FULL-SIZE VLSI ARRAYS
When you can’t change the direction o f the wind, adjust your sails.
Unknown
This chapter exploits the gained knowledge on communication requirements of 
the algorithm and utilizes the index point ordering information provided by the Sys­
tolic Precedence Diagram, to construct a Sytolic Directed Graph (SDG). The SDG is 
an acyclic, directed graph that models the data flow as well as the timing for the exe­
cution of the given algorithm on a time-optimal array. The SDG can then be pro­
jected along defined directions to obtain the systolic array(s) which is (are) then 
evaluated to determine if an improvement in the performance can be achieved by 
increasing PE fan-out.
The chapter is organized as follows: Section 4.1 details a procedure to transform 
the SPD to an SDG, while Section 4.2 discusses mapping the SDG into VLSI arrays. 
Section 4.3 illustrates the mapping procedure using different examples. In Section 
4.4 three different kinds of switchable arrays are defined, and finally, the possibility 
of increasing performance of the designed arrays by rearranging links and delays is 
studied in Section 4.5.
109
110
4.1. SPD to SDG Transformation
The Systolic Directed Graph (SDG) is an acyclic, labeled, directed graph that 
consists of computation vertices and data flow edges. A computation vertex is 
equivalent to a simple arithmetic or logic unit where an output edge carries data that
is a function of the data carried on input edges. An output edge is an edge that is
incident out of the vertex, while an input edge is incident into the vertex. A data 
flow edge e ^  carries data from vertex vx to vertex vy.
Definition 4.1: The SDG is 3-tuple G = (V, E, L) where:
V is a set of computation vertices. Every vertex directly
corresponds to a node in the SPD or to an index point in the 
algorithm’s index set J n. Therefore, the operation at each vertex 
is equivalent to all computations performed at the corresponding
index point. In addition, the set V includes an imaginary ver­
tex, specified by v0, which corresponds to the host computer 
and is called the I/O vertex. If the SDG can be thought as a 
transport network in graph theory, the vertex v0 might be
viewed as the source vertex of the network. Thus, Cardinality
(V) = Cardinality ( /" )  + 1.
E is the set of directed edges. Each edge corresponds directly to
the flow of a datum from one SPD node to another SPD node. A
directed edge, denoted by exy, connects vertex vx to vertex
I l l
if and only if an output of the SPD node corresponding to vertex 
v* is an input to the SPD node corresponding to vertex vy . A 
directed edge is also equivalent to the dependence vector between 
the index points corresponding to vx and vy . It is noted that 
the external input edges that carry data from the outside world are 
considered edges incident out of the source vertex v0.
L is a set of labels. Every vertex vx in V is assigned a unique 
label lx . This label is an n+3-tuple (n is the dimension of the 
algorithm) lx = {ix, . . .  , in , tx , f x , depx ), where
~ 0"l» • • • > in)
is the index point corresponding to vx .
tx is the firing time of the computation at vx . This time is
the same as the time level of the node corresponding to 
vx . It is computed by the SPD construction algorithm (see 
Algorithm A described in Chapter 3). 
f x is a integer label that identifies the operation dt vx . If the
i
algorithm program does not include an if-then-else state­
ment, f x = 1 for all vx e  V. This implies in this case 
that the operations performed at all vertices are the same.
depx is the spatial depth of vx , to be discussed later.
112
The SPD to SDG transformation is carried out in five steps as follows:
1. Replace every SPD node by a computation vertex.
2. Label every vertex with its corresponding index point and its firing time.
3. Find out how many different functions the SPD nodes perform. This 
number is equal to the number of loop-bodies inside the if-then-else state­
ments of the algorithm’s program. Label each function or computation 
with a unique integer number which is then assigned to the vertices 
corresponding to the specific function.
4. Compute the depth of each vertex according to Algorithm B.
5. Connect the vertices by edges according to Definition 4.1.
Let V* = V -  {v0}. V* is divided into k (k is algorithm and problem size 
dependent) disjoint subsets, Vl5 V2, . . . , Vfc, in accordance with Algorithm B 
below. In the algorithm, vertices are removed from V* and assigned to a subset 
Vm, 1 < m <k, at every iteration, nv refers to the total number of vertices origi­
nally in V* and nr is the number of vertices that remained in V* after the previ­
ous iteration.
ALGORITHM B
BEGIN
nk = 0 ; n r = n v ;  V0 = {v0}.
113
S 2 ■ Number the vertices remained in V* from 1 to nr, i.e., V j ,  . . . , vnr.
S 3. nk = nk + 1.
5 4. for x  = 1 to nr do S 5
5 5. I f  (at least one of the edges incident into vertex vx is incident out of a ver­
tex vy g Vnk-l) ^en
include vx in V ^ ; 
delete vx from V*; 
depx = n k ; 
nr = nr -  1;
5 6. if (V* #  0 )  then goto S 2.
END (Algorithm B).
Definition 4.2: Consider the graph G = (V, E). Let Vj and V2 be two disjoint 
subsets (Vj O  V2 = 0 )  of V such that V = Vj LJ V2. Then the set Sj of all
those edges of G having one end vertex in Vj and the other in V2 is called a
cut-set of G.
Algorithm B divides the SDG vertices into k disjoint subsets such that V = 
V i U V 2 • • • LJ V*. Therefore the SDG has, at least, k —1 cut-sets Sj, 
S2 , . . . , Syt_1, where S,- is the set of edges connecting vertices in Vj to vertices 
in Vj+i- Furthermore, the SDG G = (V, E, L) is partitioned into k subgraphs G(- 
= (V,-, Et-, Lt ), where
114
k
U V f  = v * ,
i=1 
k
LJ L, = L , but
;=l
k
U E ;  * E .
;=1
Actually, the set of edges E includes the edges in the subgraphs G(- and the edges 
in the cut-sets Sf, i.e.,
E = U  E,- U  U  Si .
i=l i=1
Algorithm B also assigns to every vertex vx a depth index, depx , which is 
equal to I if vx belongs to the subset Vt . This depth identifies the spatial dis­
tance of vx from the vertex v0, representing the host computer. Assuming that 
each edge in G has a weight of one, the depth of vertex vx is equal to the shortest 
path from v0 to vx .
4.2. SDG To VLSI Array(s) Transformation
A model of the VLSI array structure is needed in order to relate the features of 
an algorithm to the characteristics of the hardware. The operation of the systolic sys­
tem is typically studied at two different levels: the global level and the local level 
[75]. At the global level, the timing of data movement is inspected; while at the 
local level, the activities taking place inside the processing elements are examined. 
In this work, we mainly deal with the global level. The following definition of a 
VLSI array is sufficient for the purposes of this work.
115
Definition 4.3: A VLSI (or Systolic) Array is 3-tuple SA = (I, T, F) where
I refers to the set of interconnections between the processing ele­
ments. These interconnections support the flow of data through 
the array; a link can be dedicated only to one data stream or it 
can be used for the transport of several data streams at different 
times. Since each of the processing elements is equivalent to one 
or more computation vertices, each interconnection link 
corresponds to one or more edges of the SDG.
T is the set of timing vectors. Every processing element is assigned
a binary vector i? = (x^ x2, . . . , x,)* such that if x,- = 1 the 
corresponding PE is ON at the unit time tinit + i , where tinit is 
the initial time when the operation of the array starts, else the PE 
is idle at tinit + i. Notice that x, corresponds to the last level 
of the SPD.
F is the set of function vectors. Every processing element is
assigned a vector of integer values J* = ( f h f 2, ■ ■ . , f t )‘ 
whose dimension equals to the total execution time of the algo­
rithm (t). Each element /,• specifies the function performed by 
the PE at time tinit + i (which is represented by x(- in ~x). For 
example, /,• = 2 denotes that the corresponding PE performs the 
function labeled by 2 at time tini, + / if and only if the i th ele­
116
ment of the time vector is 1.
Before proceeding with SDG to hardware mapping, the following two basic assump­
tions regarding the target array are made:
1. A processing element of the designed architecture should not perform 
more than a single operation at a certain time unit. As stated before, an 
operation is equivalent to all the computations done at one iteration of the 
algorithm program.
2. Given a bound on maximum I/O bandwidth, the goal is to map the SDG 
into an array with minimum number of input and output links.
The simplest computation model to associate with the SDG is one having a sin­
gle processor devoted to each vertex. However, in general it is desirable to imple­
ment the algorithm using a smaller number of processors, by assigning one processor 
to the computations at multiple index points. This necessitates mapping of the com­
putation vertices onto a smaller number of processors.
To derive the array processor structure, the SDG is projected along a defined 
direction. Projections are useful since they preserve the regularity properties of the 
SDG, resulting in a regular implementation. All vertices that lie on a line that is 
parallel to the projection line are handled by the same processor.
Generally, the systolic array obtained by projecting the SDG along a specific
117
direction is unique. This implies that for each possible projection direction a 
different systolic design is obtained. The set of different systolic arrays for a particu­
lar algorithm can be found by projecting the SDG along the direction of the depen­
dence vectors or along the direction of vectors each of which is the summation of a 
pair of dependence vectors. However, two restrictions are applied in choosing the 
projection direction
1. To minimize the number of external input and output links, the SDG to 
architecture transformation is done by projecting each subgraph G(, i = 
1 to k,  independently (onto a distinct set of PEs) along the projection 
line. Thus, to choose a certain dependence vector dt as a projection 
line, none of the edges in the SDG corresponding to dt can be a 
member of any of the cutsets, Sj j  = 1 to k -1 ,  of the SDG. Further, 
these edges have to be members of one of the subgraphs G(-, / = 1 to k .
2. Since the data flow dictated by a variable dependence vector is partitioned 
into many different data flows, only dependence vectors which are origi­
nally fixed, are considered candidates for selection as projection directions.
The second restriction can be relaxed if no dependence vectors satisfying the 
above requirements can be found. Instead of replacing a variable dependence vector 
d  by its corresponding d min, dRi ,..., dRm, dLi ,..., dLm, where m is the number
1 Let the projection direction be the vector along which the SDG is to be projected.
118
of nonconstant components in d , d is substituted by d min, dRi dRm, dLi 
dLm. In this case, the vector dLi is missing and all the index points which use dLi 
as a dependence vector will use dRl instead by employing modulo vector addition. 
Figure 4.1a shows the original ordering of index points covered by a variable depen­
dence vector with one nonconstant component, while Figure 4.1b shows the new ord­
ering under the relaxing method. Once this is done, dR can be used as a projection 
direction.
It is clear that a price has to be paid by, perhaps, increasing the total execution 
time of the algorithm. Thus, the optimality of the relaxing method is not guaranteed. 
However, this new ordering of index points is necessary if the algorithm in concern 
is irregular. Such algorithms do not have projection directions that meet the above 
conditions and are thus unimplementable, using our methodology and maintaining our 
assumptions, without relaxing condition 2 (above). If, for example, a PE in the 
designed array is allowed to perform more than one operation per time unit, the ori­
ginal ordering of index points could be used and dR and d^ could be chosen as 
projection directions, thus guaranteeing optimality. It is worthwhile to note here that 
this relaxing method has been employed in designing an optimal systolic array for the 
transitive closure problem. Time-optimality of that design has been proven by other 
researchers [55], [56]. But, this is only one example.
After selecting a valid projection line, along dt for example, the vertices of the 
SDG are mapped to PEs and the edges are transformed into links. Two computation
‘•mo
I|  Imin+dL| I l min Ik
O r , . -, O t t O ^ O ^  — o
dmin
in
,  W  u R) ^  q R|
a)
Lmo
Imin+d|_,
dmin
Imin
5r
dR
b)
Figure 4.1 Ordering of index points that use 
a variable dependence vector.
a) Orginal ordering.
b) Relaxed ordering.
120
vertices, vx and vy , are mapped to the same PE if and only if Ix -  Iy = a dt , 
where a  is any positive or negative integer, lx is the index point corresponding to 
vx , and Iy is the index point corresponding to vertex . Also, an edge exy is
projected to a link connecting the processing element PEt to PEj if and only if 
vx is mapped to PE,- and is mapped to PEj.  Every processing element PEl 
in the target array is assigned a pair of vectors, a function vector J* and a timing 
vector ~t, where /,• = m and = 1 if and only if there exists a vertex vx that
is mapped to PEt , with f x -  m and tx = i.
4.3. Examples to Illustrate Systolic Arrays Design
Matrix-Vector Multiplication: The SPD with an input bandwidth of to n+l for 
the matrix-vector multiplication algorithm (Figure 3.7a) is transformed to the SDG 
shown in Figure 4.2a, following the rules described in Section 4.1. There are three 
valid projection lines: along d j (flow of variable b), along d 2  (flow of variable
c), and along d\  + d 2. Figures 4.2b, 4.2c and 4.2d show the three time-optimal 
arrays obtained by projecting the SDG along each of the three projection lines, 
respectively. In the array of Figure 4.2b, the elements of vector 5* are static, while 
the elements of vector c* are moving from one PE to another. The array in Figure
4.2c accumulate the results by keeping the elements of the vector c* inside the PEs.
For the array in Figure 4.2d the elements of both vectors are moving. This is the 
same design reported in [3] and used for band matrices. Each processing element in
121
'22
23 32
33
'33
'32‘23
°2I
c )
Figure 4.2 Matrix-vector multiplication with 
bandwidth of « + l. {n = 3 )
(Cont.)
a) SDG.
b) Array projected along d x.
c) Array projected along d 2.
d) Array projected along d  j + d 2.
122
Figure 4.2 Continued
123
the arrays shown in Figure 4.2 is assigned a timing vector and a function vector of 
dimension 5 (2n -  1). The function vectors are uniform since the vertices of the 
SDG are labeled with the same label function. All processing elements of these 
arrays perform the function c 't+1 = cx + a.b . As an example of a timing vector, the 
vector of the PE that accumulates c 2  in the array of Figure 4.2b is:
— >
X =
The SPD shown in Figure 3.7b is for a maximum bandwidth of 2n . Following 
the rules in Section 4.1, this SPD is mapped to the SDG in Figure 4.3a. By project­
ing the SDG along d h d 2  and d x + d 2, three circular arrays can be designed. 
The arrays are shown in Figures 4.3b, 4.3c and 4.3d, respectively. The function and 
timing vectors are of dimension n. The timing vectors consist of all l ’s making the 
PEs on at all time.
Matrix-Matrix Multiplication: The SPD of the matrix multiplication problem with 
input bandwidth of 2n (Figure 3.8a) can be transformed to the SDG shown in Fig­
ure 4.4a following the rules given in Section 4.1. d 3  (flow of variable c)  is the 
only valid projection line since it is the only dependence vector that does not have 
any corresponding edge in the cut-sets of the SDG. The resultant array is shown in 
Figure 4.4b. This is the same cylindrical array reported in [23]. The vertices of the
124
°2I
°3I
°ll
°32
°I2
«?22
°I3
°23
°33
a)
°I3
°I2
°ll
4
a 2 l
a 23
a22
°32
°3I
°33
b) c )
Figure 4.3 Matrix-vector multiplication with 
bandwidth of 2n. (n = 3)
(Cont.)
a) SDG.
b) Array projected along d\.
c) Array projected along d 2.
d) Array projected along d x + d 2.
Figure 4.3 Continued
Q 13 0 a 21 0
0 a 0 a23 31
a 22 0  a 33 0
C
a)
Figure 4.4 Matrix-matrix multiplication with 
bandwidth of 2n. (n = 3)
(Cont.)
a) SDG.
b) Array projected along d 3.
Figure 4.4 Continued
127
° I 3
10
.0
° 2 3 b 32 °33 b33
°I2 b 2) a  2 2 b 22 ° 3 2 cr ro oi
° l l b n ° 2 I b l2 ° 3 I b.3
'33.
b)
128
SDG are all labeled with the same function label and all PEs perform the operation: 
c T+1 = cx + a.b. The dimension of the timing vector is equal to 2n -  1 (5 in Fig­
ure 4.4b). As an example, the timing vector of the PE that accumulate c 23 is:
The SPD with bandwidth of n 2 shown in Figure 3.8b yields the orbital array 
reported in [25] and shown in Figure 4.5. All the n 2 input elements have to be 
preloaded in order for this design to work. The processing element function in the 
orbital array is the same as that in the cylindrical array. However, the timing vectors 
which are of a dimension n consist of all l ’s.
It is clear that the above two designs are time-optimal with respect to their input 
bandwidth, but they are both nonplanar. A planar array for the matrix multiplication 
problem can be designed by forcing all the dependence vectors to be rigid. For this 
case, the constructed SDG is shown in Figure 4.6a, and the resultant planar array is 
shown in Figure 4.6b. It is noted that the price for demanding a planar geometry is 
an increased execution time to 3n -  1.
LU-Decomposition: The SPD for the LU-decomposition shown in Figure 3.9 can be 
transformed to the SDG presented in Figure 4.7a. By projecting the SDG along the 
three valid projection lines (d\, d 2  and d x + d 2), three different arrays are
129
an 112. a13
bn
■22. 123 >21
■34
Figure 4.5 Orbital array for matrix-matrix multiplication.
130
°23 °22 °2I 0
Figure 4.6 Planar array for matrix matrix 
multiplication, (n = 3)
a) SDG.
b) Projected array along d 3.
4.1 V i * *  (Co"1* 
i)SB G
otop0'
da-
da-
Ptojecl
a  p ^ aV ? I
r t ^ v . c ; = « aason81
132
Figure 4.7 Continued
'II
331
u23
U 4 4
JL
u34
*24
UI2 UI3 UI4
°ll 0 0 0
° 2 I 0 12 0 0
°3I
CMCM
O
°I3 0
°4I °32 °23 °I4
O42 °33 °24
°43 °34
°44
b)
133
Figure 4.7 Continued
32
an  0  0  0
Q|2 Qgl ® 0
°I3 °22 °3I ®
°I4 °23 °32 °4I
°24 °33 a42
°34 °43
a44
C)
Figure 4.7 Continued
134
^ n = G i i = j ® = n ^ = 0 z z
? = 3 S = ^ 3 = 0 :;;^ 3 c =0 = '
II.9
a 2 l,g  a 12,g
.9 a 22 .g  a l3, g
“ 3 2 ,g  °23 ,g  °I4, g
>,g “33. g a 24 ,  g
0 4 3 ,g  ° 3 4 .g
° 4 4 , g
d )
135
obtained. Figures 4.7b, 4.7c and 4.7d show the resulting three arrays with their 
respective input sequences. The first array saves the L matrix inside the PEs and 
outputs the U matrix. The second array saves U and outputs L . as can be seen, 
either array has two types of processing elements, the circular PE type and the square 
PE type. These arrays have not been reported before. In the third array, the ele­
ments of both matrices L  and U are output, and the processing elements do not 
save any data. There exist three types of PEs in this design, namely left, central, and 
right PEs. This array is the same array reported in [31]. The operation of these
arrays are detailed in the next section.
Warshall’s algorithm: Warshall’s algorithm has one fixed dependence vector. 
However, after transforming the SPD shown in Figure 3.10 to an SDG and applying
Algorithm B to partition the graph into k subgraphs, all the edges corresponding to
the fixed dependence vector become members of the cute-sets SL, i -  1 to k - \ .  
Therefore, no valid projection line exists. The only solution is to use the relaxed 
projection line selection method described in Section 4.2 to make d 3R = (0  1 0)' a 
valid projection line. With that the constructed SDG is shown in Figure 4.8a and the 
designed array is presented in Figure 4.8b. This time optimal array has been reported 
in [57].
4.4. Switchable Arrays
As stated earlier, the SDG is a labeled graph where every vertex corresponds to
*a)
Figure 4.8 Transitive Closure.
(Cont.)
a) SDG. (n = 3) (Edges between 
subgraphs are not drawn for clarity)
b) Projected array, (n = 4)
137
Figure 4.8 Continued.
41
°3I
°2I
On
42
°32
°2 2
°I2
0
43
°33
°23
qI3
0
0
44 
°34 
°24 
0 14
0
0
0
2D
2D2D
2D2D 2D
b)
138
a computation at a specific index point It e J n (J n is the index set of the algo­
rithm in concern), and every edge corresponds to the flow of a datum (or to a depen­
dence vector). To obtain a systolic array, several vertices of the SDG are typically 
mapped to a single processing element and many edges are usually projected into a 
link. Since the computations at two distinct vertices might be different, the PE might 
have to perform different functions at different time units. As mentioned earlier in 
Section 4.2, the schedule for different functions can be specified by two vectors: the 
function vector and the corresponding timing vector. Also, since the edges of the 
SDG may have different directions, the links of the designed systolic array might 
have to change directions with time. An array that has the above characteristics is 
called a switchable array. In the next three subsections, three types of switchn^'lity 
are discussed: PE-switchability, link-switchability, and data-flow switchability.
4.4.1. PE-Switchability
In general, two different vertices may have two different types of computations. 
A computation type can be as simple as an addition operation or as complex as a 
nonlinear arithmetic expression. To differentiate between computation types, the ver­
tices of the SDG are labeled by a function number such that two vertices of different 
computation types are labeled with two distinct function integers.
Definition 4.4: An array with at least one multi-function computing PE is called 
PE-switchable, otherwise, the array is PE-nonswitchable.
139
Lemma 4.1: If for every vertex vx e V* have f x = c , then the corresponding 
array is PE-nonswitchable, where c is a positive integer and V is the set of ver­
tices of the SDG.
Proof: The proof is straightforward since no matter how the SDG is projected, all 
the vertices assigned to a PE perform the same computation type. □
Lemma 4.2: If every connected subset of vertices whose edges are parallel to the 
selected projection line in an SDG are labeled with the same computation type, then 
the corresponding array is PE-nonswitchable.
Proof: The proof follows immediately by considering that all the vertices along a 
projection line are mapped to the same PE. □
It is noted that PE switchability of the array depends on the projection direction. 
According to Lemma 4.2, the resulting array is PE nonswitchable if all the vertices 
along the projection line have the same computation type. However, by selecting a 
different projection direction (if possible), the resulting array may be PE switchable.
To illustrate PE-switchability, consider the LU-decomposition arrays designed in 
the previous section and shown in Figures 4.7b, 4.7c and 4.7d. It is clear from the 
LU-decomposition algorithm (described in program (5)) that at different index points 
different operations might have to be performed. Therefore, by projecting the SDG 
(Figure 4.7a) along a particular projection line, some of the PEs in the designed array
140
will have to execute different functions at different time units resulting in a PE- 
switchable array.
It is evident that a PE-switchable array requires some control mechanism to 
operate correctly. For example, the array shown in Figure 4.7b has two different 
types of PEs, namely a circular PE and a square PE. The circular PE is a simple 
buffer where the output link at cycle % has the value of the input link at cycle 
x -  1. The square PE operates in two different modes: a special mode and a normal 
mode as shown in Figure 4.9. One way to differentiate between the two modes is to 
associate a tag bit t with each input element. A set tag bit triggers the special 
mode, while a reset t switches the PE to the normal mode. To operate correctly, 
every input element in the first row has a set t ,  while all other elements have a 
reset t, and each PE delays the tag bit by one time unit. This causes each PE to run 
the special mode for only one cycle which is the first time unit it receives an input 
link.
For another example of PE switchable array, in the array shown in Figure 4.7d 
and reported in [31], the processing elements are divided into three groups: right, 
central and left. Each PE performs a special function during a single time unit. The 
special function performed by a certain PE is defined for and executed by all PEs 
belonging to the same group. The modes of operations for all PEs are shown in Fig­
ure 4.10.
Control of array operation can be easily implemented using tag bits. A tag bit
QS p e c i a l  M o d e  
C= a / b i
N o r m a l  M o d e  
Oq ■ Qj — c b 
bo = bj
Qo
bi
Figure 4.9 The modes of operation of the PEs in 
the array shown in Figure 4.7b.
SPECIAL MODE
Right  Group
<ti> c = c
Central Group
C*i) c'= l/c (»i+i)
L e f t  G ro up
( t j ) c'= be t »i +  |)
Figure 4.10 The modes of operation of the PEs 
the array shown in Figure 4.7d.
143
is associated with every datum moving on a north-bound link. The tag bit g is set 
to 1 before being appended to an A matrix element upon entering the array. 
Thus, storing of the tag bits in memory is not necessary.
The arrival of the first north-bound input with g =1 to a certain PE triggers the 
special mode operation for one time unit. During the special mode time unit, the PE 
performs a special function and complements g (to 0) before transmitting it north­
ward. Before and after this special mode time unit the tag bit is transmitted north­
ward unchanged. This mechanism guarantees that every PE will be in the special 
mode for one and only one time unit during the decomposition of A . Otherwise the 
PE is in the normal mode performing c' = c - a b . More details regarding array 
operation and the control mechanism of this particular array can be found in [31].
4.4.2. Link-Switchability
Definition 4.5: An array with links that change directions with time is called link- 
switchable array, otherwise, the array is link-nonswitchable.
Lemma 4.3: Any link-switchable array can be transformed to a link-nonswitchable 
array.
Proof: The proof is direct since any bidirectional link in a link-switchable array can 
be replaced by two unidirectional links resulting in a link-nonswitchable array. □
144
Control of switchable links should be simple. Further, given Lemma 4.3, it is 
evident that no special attention need to be given to link-switchable arrays.
4.4.3. Data-Flow-Switchability
Definition 4.6: An array with at least one PE that saves a datum for some time units 
and then outputs it at a latter time s called data-flow-switchable, otherwise, the array 
is data-flow-nonswitchable.
Lemma 4.4: If an SDG has at least two valid projection lines along two different 
dependence vectors, it is possible to produce a data-flow-nonswitchable array.
Proof: From Section 4.2, it is clear that if there are at least two dependence vectors, 
d i  and d 2  (for example), that can be used as projection lines, then at least one 
additional valid projection line exists; along d \ + d 2. Knowing that a dependence 
vector dictates the flow of a datum, if the SDG is projected along d x + d 2, none of 
the PEs in the designed array would have to save any of the variables thus resulting 
in a data-flow-nonswitchable array. This is clear since a datum v remains static 
inside a PE only if the array is designed by projecting the SDG along the flow of v 
which is dictated by a dependence vector. □
Lemma 4.5: A necessary condition for an algorithm to produce a data-flow-
switchable array is for that algorithm to have at least two distinct dependence vectors 
that dictate the flow of the same variable.
145
Proof: According to Definition 4.6, a data-flow-switchable array saves a variable for 
certain number of time units then sends it on an output link. Knowing that if the 
SDG is projected along the flow of a variable which is dictated by a single depen­
dence vector, the resulting array would have the variable static inside the PE at all 
time. Therefore, the flow of a variable that it is saved inside a PE at some time unit 
Xj and is flowing along a link at another time unti x2, has to be dictated by at least 
two distinct dependence vectors. □
Warshall’s algorithm for solving the transitive closure problem is an example of 
algorithms which have at least two dependence vectors dictating the flow of one vari­
able. The array for implementing this algorithm is shown in Figure 4.8b. The opera­
tion of the array is mainly divided into two major phases. In the first phase, input 
data is partially processed to varying degrees depending on which row it is in, and 
then loaded into the registers of the respective rows of the array. Row 1 is simply 
stored in the first row of PEs. The second row of inputs is processed by the first PE 
row and then loaded into the registers of the second PE row, and so on, until the nth 
row of inputs is updated by the PE rows 1 through n — 1 and stored in the registers 
of the n th row of PEs.
The second phase starts when the last row of inputs crosses the first row of PEs. 
At this time, the PEs switch modes, and the stored elements begin to propagate 
southward. Consecutive PE rows will then switch modes in a staggered fashion, one 
per time unit. The input data which was partially processed, and stored in the
146
registers (in phase 1) now move downwards and get completely transformed by the 
time they get out of the array.
From the above discussion it is evident that a control mechanism is needed. All 
PEs should be able to store the first element they receive. This can be done by 
including a flag bit in the PEs. This bit is represented by a flip-flop that is initially 
cleared (=0). If the flag is not set yet, the PE saves the input element and sets the 
flag. Once the flag bit is set (=1), the PE will operate in normal mode. PE functions 
are described in Figure 4.11.
4.5. Improving Performance
Since systolic arrays designed in accordance with the methodology reported in 
this dissertation thus far have been proven to be time-optimal for a given I/O 
bandwidth and minimum fan-out, one potential way to further improve the perfor­
mance is to increase PE fan-out. The fan-out of a processing element is defined as 
the number of links incident out of the PE. The number of output links may be 
increased by splitting the flow of some data into two or more different flows without 
affecting the final result. This section introduces and describes the concept of split­
ting data flows and answers questions such as: Which data flow can be split? How 
many times can it be split? Is increased fan-out beneficial in all cases?
4.5.1. Data Flow Splitting
Definition 4.7: A data flow corresponding to a dependence vector d is a line L
147
Circular PE
Mode I :
out
aout
'out
a out= a in 
Square PE a in
Tin
ta + 1 Mode I:
'out
aout
aout =a in OR t Tjn AND R ) 
Tout = Tn
Mode 2:
aout
aout“ R 
reset flag
Mode 2:
JL
—  R —
T
aout
flout = R 
reset  flag
Figure 4.11 The modes of operation of the PEs in 
the array shown in Figure 4.8b.
148
of index points such that an index point /; lies on L  if and only if there exists a 
point Ij where /,• -  Ij = d or Ij -  /, = d.
The concept of splitting a data flow can be better understood through an exam­
ple. Consider a data flow which consists of index points / j ,  / 2 , . . . , and is 
dictated by a dependence vector d as shown in Figure 4.12a. It is clear that 
11 -  Ii-\ = d , the fan-out of every index point in the flow is one, and k time units 
are needed to execute the computations at all index points. If the fan-out at point I  j 
is increased to two, for example, the flow of data is split (provided that data depen­
dencies between index points are not violated) into two concurrent data flows. In 
such case, every index point that relies on an output of 1 1 would a dependence vec­
tor dm that is a multiple of d ; i.e., dm = a d , a  is a strictly positive integer. 
Figure 4.12b shows one way of ordering the index points after splitting the data flow. 
It can be seen that k —1 time units are saved in this case. Furthermore, if the fan-out 
at each index point is increased to two, the ordering of the points would resemble a 
binary tree making the total execution time at all k index points O ( logfc ), as 
shown in Figure 4.12c. The idea of this example may be extended so that the fan-
r
out can be increased to more than two. Of course, splitting a data flow cannot be 
achieved unless the data dependencies between index points do not conflict with the 
new ordering. In order to determine whether a certain data flow may be split or not, 
the notion of rigid data flow must be introduced.
149
9da
a)
2 d 2 d
4 \
Ik.
b ‘
Figure 4.12 The concept of splitting data flow. 
(Cont.)
a) Original data flow.
b) Data flow split into two flows.
c) Data flow is a binary tree.
Figure 4.12 Continued.
150
II
c )
151
Definition 4.8: A rigid data flow is a flow of data between index points interrelated 
by a rigid dependence vector. Otherwise, the data flow is nonrigid.
Lemma 4.6: Assuming that the algorithm cannot be modified, only a nonrigid flow 
which is simply transmitting data may be split into two or perhaps more flows.
Proof: As stated earlier in Chapter 3, a data flow can be transmitting, accumulating, 
or neither of the two. By definition, the third is a rigid data flow, while the first two 
flows are nonrigid. According to Definition 4.7 and 4.8, any index point that falls on 
a rigid data flow line uses at least one rigid dependence vector, making the ordering 
of the points fixed and unchangeable. Therefore, only accumulating and transmitting 
data flows need to be considered further. As for an accumulating nonrigid data flow, 
the flow can be separated but new index points have to be added to the index set of 
the algorithm so that the final result is computed from the accumulated partial results. 
This requires modifying the algorithm which is not possible according to the lemma. 
Therefore, only transmitting data flows may be split without affecting the final result 
and without altering the original algorithm. □
It is easy to see that a nonrigid, transmitting data flow can be split into any 
desired number of flows. However, the fan-out of index points eventually is reflected 
in the PEs. The amount of PE fan-out implementable depends on the technology, the 
size of the algorithm, the dimension and even the structure of the array. If, for 
example, an array is linear, it might be possible to increase the fan-out of a PE to
152
four, six, or eight making the array two-dimensional or even nonplanar. However, if 
the designed array is already nonplanar with a PE fan-out greater than four (as an 
example), it might be not practical to increase the fan-out of any processing element 
in the array. Also, we have only considered splitting one data flow, but one should 
not forget that changing the order of index points might affect the flow of many vari­
ables. Thus, although a flow may be nonrigid and is only transmitting, the achiev­
able increase in PE fan-out may be limited or even impossible to implement. In the 
next subsection, the fan-out increase at the array level is studied in more detail.
4.5.2. Increasing PE Fan-out
Since the Systolic Directed Graph (detailed in Section 4.1) describes the data 
flow as well as the timing for the execution of the given algorithm on the target 
array, in many cases it can be successfully used to infer information about the possi­
bility of increasing the fan-out of some PEs as well as the amount of increase. The 
following results establish the basis for improving performance of the designed arrays 
through increasing PE fan-out.
Lemma 4.7: An array designed by projecting an SDG in which all the edges 
correspond to fixed and rigid dependence vectors cannot benefit from increasing PE 
fan-out.
Proof: The proof is straightforward since the SDG corresponding to an algorithm 
characterized by fixed and rigid dependence vectors does not include any nonrigid
153
data flow. Thus, according to Lemma 4.6, no data flow can be split making PE fan­
out increase impossible. 13
Lemma 4.8: Any data flow that is parallel to the chosen projection direction cannot 
be split.
Proof: We know that all index points that lie on a line L  parallel to the projection 
line are mapped to the same PE. Therefore, by splitting L , two or more index 
points that are on L would fall on the same time level making some of the PEs in 
the designed array perform at least two operations at the same time unit. This con­
tradicts our initial assumption of one operation per PE per unit time, as stated in Sec­
tion 4.2. □
Lemma 4.8 excludes any array that is designed by projecting the SDG along its
only nonrigid and transmitting data flow, from benefiting from any PE fan-out 
increase. At this point, we can summarize the characteristics an algorithm (or its
corresponding SDG) should possess in order for increased PE fan-out to be
beneficial. To benefit from increased fan-out, the algorithm (or its SDG) has to pos­
sess at least one of the following characteristics:
1. All the edges of the SDG that are members of the cut-sets Si , / = 1 to 
k - 1, where k is the nunfber of subgraphs formed after applying Algo­
rithm B to the directed graph describing the algorithm (Section (4.1),
154
correspond to nonrigid, transmitting data flows.
2. All the edges of the SDG that are members of the sets Et , i = 1 to k, 
and that are not parallel to the chosen projection line, correspond to nonri­
gid, transmitting data flows.
3. The algorithm has at least one variable dependence vector that does not 
have any nonconstant component at the same coordinate as that of any 
other variable dependence vector.
In the first case, each of the subgraphs G,-, i = 1 to k , can be thought of as a 
supervertex and the edges in the cut-sets Sj ,  j  = 1 to k -1 , are data flows along the 
supervertices. If there exists only one nonrigid and transmitting data flow along the 
supervertices (all edges in Sj  are parallel), the fan-out of these supervertices can be 
increased to any desired number /. Thus, the fan-out of every vertex in the sub­
graphs Gi, i = 1 to k, that is incident into an edge in Sj ,  j  = 1 to k -1 ,  can be 
increased to /. However, if there are more than one nonrigid and transmitting data 
flow along the supervertices, the fan-out can only be increased to two. The reason is 
that, in this case, to increase the fan-out of any vertex, all data flows of different 
directions must be split making the interconnections of the designed array nonlocal.
The same can be said about the second case. But here, instead of using the data 
flows between subgraphs, the flows inside a subgraph G(-, i = 1 to k , are considered. 
If there is only one nonrigid, transmitting data flow that is not parallel to the projec­
tion line, the fan-out can be increased by any desired number without affecting the
155
final result. However, if there are two or more flows, the increase in fan-out is lim­
ited to two for the same reason as in the first case.
A variable dependence vector with m nonconstant components is replaced by 
2 m +1 fixed dependence vectors: d min, dL. and dR., i = 1 to m , where dL. and 
dR. correspond to nonrigid, transmitting data flows. For every nonconstant com­
ponent at the position i , there are two groups of index points: group A consists of 
points that use dL, as dependence vector and group B consists of points that use 
dR. as dependence vector. This situation is shown symbolically in Figure 4.13a 
where / l5 / 2 , . . . , /5  are generated index points, Im is the used index point, 
and / 2 -  lm = d min- Generally, the cardinality of group A (number of index points 
in the group) is not equal to the cardinality of group B , and the number of time units 
needed for the data to reach the last index point is equal to the higher cardinality. 
An improvement in performance can be achieved by dividing the index points into 
two groups of equal cardinality and setting
= dL. if Card(A) < Card(5)
7"i -  Tk = dR. otherwise
where Ik is the highest index point (in lexicographical sense) in the two groups and 
Card(A) and Card(B) are the cardinalities of groups A and B , respectively. This 
situation is shown symbolically in Figure 4.13b, where Card(/4) < Card(B).
Any algorithm that possesses any of the three characteristics mentioned above is
156
i-mo
■mm
' r(DHyKy
*1 *2 *3
dR
I4  *5
a)
• m
O
,<yKyHyK) o~,
V 11 I2 I3 l 4  15__ ✓
~  * 1
b)
Figure 4.13 Ordering of index points that use a
variable dependence vector. (5 points)
a) Original ordering.
b) Ordering with all generated points 
have a fan-out of two.
157
guaranteed to benefit from increasing PE fan-out with one more restriction. The per­
formance cannot be improved for arrays that have already reached the absolute lower 
bound on the parallel execution time of the algorithm. This time is defined to be the 
ratio of the number of index points in the algorithm’s index set and the number of 
processing elements in the designed array. The reason is that in this case the pro­
cessing elements are 100% utilized and cannot support more parallelism if the pro­
perty of performing one computation at a time unit, is to be kept.
4.5.3. Examples to Illustrate Increasing PE Fan-out
For the sake of examples, this subsection studies the possibility of improving the 
performance of systolic arrays designed in Section 4.2. All dependence vectors of 
the LU-decomposition problem are fixed and rigid. Thus, none of the designed 
arrays for this problem qualifies for increasing PE fan-out.
The SDG for the matrix-vector multiplication algorithm has one nonrigid, 
transmitting data flow; the flow of variable b along the dependence vector d x. 
Therefore, only the array shown in Figure 4.2c could benefit from increasing PE fan­
out because it uses ^2 as projection line. Since there is only one data flow which is 
nonparallel to the projection line, the fan-out of any PE in the array in Figure 4.2c 
can be increased to any desired number. Figure 4.14a shows an array (n = 3) with 
one PE that has a fan-out of two, while Figure 4.14b shows an array (n = 5) where 
one of the PEs has a fan-out of four. The total execution time of such arrays with
one PE having a fan-out of I is n + -y. For the case where all the PEs have a
158
a)
°23
°22
°2I
°I3
0|2
b z
b3
°33
°32
°3I
b2
b)
c )
b3
b z
b|
Figure 4.14 Arrays for matrix-vector multiplication.
a) One PE with fan-out of two. (n = 3)
b) One PE with fan-out of four, (n = 5)
c) All PEs with fan-out of two. (n = 7)
159
fan-out of /, the execution time is n + log in  units. Figure 4.14c shows an array 
with n = 7  and I = 2, for the same problem.
The SDG for the matrix-matrix multiplication algorithm with I/O bandwidth res­
tricted to 2n, does not have any nonrigid, transmitting data flow inside any of its 
subgraphs. But, it has two such flows between subgraphs. Therefore, the fan-out 
along the flows of the a and b variables can be increased to just two. The 
modified array for n = 5 is shown in Figure 4.15. The number of time units saved
by this array (which was reported by El-Amawy in [28]) is - j .
The Warshall algorithm for solving the transitive closure problem has no nonri­
gid dependence vectors, but it has two variable dependence vectors. In Section 4.3, 
one of the variable dependence vectors was used as a projection line. Therefore, the 
other vector can be used to improve the performance of the planar array shown in 
Figure 4.8b. This calls for a higher dimension topology. Following the rules
reported earlier in this chapter, a cylindrical array that saves - j  time units can be
designed. The resulting array is shown in Figure 4.16 for n = 5 . The array has 
been reported in [57].
Figure 4.15 Array for matrix-matrix multiplication with 
bandwidth of 2n and one row of PEs
with fan-out of two
1 6 1
Figure 4.16 A cylindrical array for transitive closure. 
(n =5)
CHAPTER 5
DESIGN OF SIZE-INDEPENDENT AND 
FIXED DEPTH VLSI ARRAYS
Man’s mind once stretched by a new idea, never regains its original dimension.
Oliver W. Holmes
This. chapter focuses on extending the graph methodology for mapping algo­
rithms into size-dependent systolic arrays, detailed in the previous two chapters, to 
encompass size-independent and fixed-depth multi-linear (FDML) arrays. A size- 
independent systolic array is an array with a fixed number of processing elements 
M , while the number of PEs in an FDML array of depth dp is a function of the 
size of the algorithm with the restriction that the depth of the array does not exceed 
d p . The depth of the array (which is equivalent to the number of subgraphs of the 
corresponding SDG representation) is defined as the weight of the longest of the 
shortest paths from the host computer (or its memory) to any PE in the array, assum­
ing that the weight of all links in the array is unity. We measure the size of the 
algorithm N  by the number of index points in the index set J n of the algorithm; 
that is iV = Card ( J n ), where Card refers to the cardinality of a set. Other 
measures of the problem size are possible, for instance the number of input elements, 
the number of rows or columns in a matrix, etc.
162
163
From an implementation view point, fixed-size (used interchangeably with size- 
independent) and fixed-depth multi-linear arrays may offer the desired adaptability to 
the changes in the state of technology. As technology advances, new structures with 
extended sizes may be constructed offering higher throughputs and better perfor­
mances. Furthermore, the significance of FDML arrays is attributed to their close 
relationship with linear arrays which are known to be globally synchronizable even 
when large propagation delays exist [82]. On the other hand, globally synchronous 
arrays with arbitrarily large sizes may not be feasible even if other technological lim­
itations such as number of pins, chip area, and I/O bandwidth are disregarded. This 
accounts for the importance of designing small, fixed size arrays that may be used to 
run problems of large sizes.
The rest of this chapter is organized as follows: Section 5.1 defines the problem 
of designing fixed-size and FDML arrays. Section 5.2 discusses how an SDG can be 
manipulated and presents a new graph transformation. Section 5.3 uses the procedure 
of manipulating the SDG to design FDML arrays, while Section 5.4 applies the pro­
cedure in designing fixed-size arrays. Section 5.5 contains some examples.
5.1. Defining the Problem
In Chapter 4, the algorithm was represented by a directed, weighted graph called 
the SDG. By projecting the SDG along a defined projection line, the N  vertices (N  
is the total number of vertices in the SDG) are mapped into M  processing elements, 
where M  is a function of N . Now, suppose that we have designed a systolic array
164
of M' PEs for a problem of size N' and we want to run a problem of size N" , 
N " ~3> N ', on the same array. The only way to do this is to partition the problem 
into smaller subproblems of size N' each, and map these subproblems into the fixed 
systolic array of M' PEs. During the partitioning process, all side effects such as 
external communication overhead (between different parts), additional hardware and 
processing elements complexity, should be minimized. Therefore, the goals are as 
follows [78]:
1. The total execution time of the partitioned algorithm is proportional only 
to the product of the number of partitions and the time to process one par­
tition. Therefore, delays caused by the partitioning process must be kept 
to a minimum, to guarantee time-optimality.
2. Partitioning should not result in any significant increase in the complexity 
of processing elements.
3. The amount of overhead in external hardware and external communication 
is as small as possible.
It is noted that the partitioning process is never safe from side effects. For 
example, a fixed-size systolic system will always need extra memory to save the 
intermediate results. Furthermore, a full-size array of more than one type of process­
ing elements might lead to a fixed-size array with more complicated PEs, where 
some PEs would have to perform different operations at different time units. The
165
objective of the designer is then only to minimize the additional hardware and control 
needed for the correct operation of the fixed-size sy stem and to minimize the execu­
tion time required to run a large size problem on a smaller size array.
Our approach to the partitioning problem is to use the SDG algorithm represen­
tation and then divide it into subgraphs that can be mapped directly into the proces­
sor space. The reason for using the SDG model is that it is powerful and yet flexi­
ble. The graph is regular even though the original algorithm may not be. This pro­
vides the opportunity to manipulate the algorithm easily and formally at a rather 
abstract level while maintaining algorithm integrity and time optimality.
5.2. Manipulating the SDG
As stated earlier in Chapter 4, the SDG is divided into k subgraphs G l5
G 2, ■ ■ . , Gk , where k is a function of the algorithm size. In a brute-force
approach, each vertex in each subgraph could be mapped to a distinct processing ele­
ment. But this results in a very inefficient solution. Instead, each subgraph G ,, 
1 < i< k , actually represents a conflict-free set of operations that may directly be 
mapped onto a set of PEs. All vertices along the same projection line are assigned to 
the same PE. Projecting the SDG along a different projection line will change the 
set of operations assigned to a certain PE.
Since the basic SDG subgraphs evolve naturally from the algorithm model 
transformations, they correspond to groups (rows of PEs, columns of PEs, or even 
mesh connected PEs *) of the "best fit" size-dependent architecture, whether the
166
resulting architecture is 1-D, 2-D, or 3-D. It is thus clear that manipulating the SDG 
will invariably manipulate the final architecture. For example, if an SDG G with 
k-subgraphs, k > 1, is transformed into G' with a single subgraph Qc' = 1), then it 
is possible to map the algorithm in concern into a linear array. Similarly transform­
ing an SDG G into G ', where G' has fewer subgraphs (k' < k),  necessarily 
results in a smaller size array with each PE doing more computations.
In addition, each subgraph can be divided into bands Bj ,  j  = 1 to / ,  where a 
band is a subgraph of G;, i e { 1 , 2 , . . . ,  k) .  It is noted that / may not be the 
same for all G(- ' s . By transforming a subgraph G(- into Gt-', where G ,' has less 
bands than G,-, the size of the corresponding group of PEs decreases making the 
designed array smaller.
Let A be a transformation of an SDG G (V , E , L )  such that
A : G (V , E , L ) —> G' ( V ' ,E ' ,L ' )
If G' is to correspond to the algorithm in concern and is to be time optimal then 
the following set of requirements must apply:
1. The data dependencies represented by G' must conform to those 
specified by the algorithm.
2. The transformation A must guarantee time optimality of G' (minimum
number of time levels).
1 as in the orbital array designed for the matrix-matrix multiplication problem shown in Figure 4.5.
167
3. G' must maintain the systolic requirement of local data dependencies and 
the assumption of one operation per PE per time unit, where an operation 
consists of all computations performed by a single vertex. Recall that this 
assumption has been suggested and observed throughout this dissertation.
In the next subsection an SDG —> SDG transformation that is useful towards 
the synthesis of fixed size and FDML arrays is defined.
5.2.1. Graph Stretching
As stated before, the graph in concern is a labeled acyclic directed graph G = 
(V, E ,L ) .  A graph transformation that will be used in the synthesis of fixed-size 
and FDML arrays is referred to as graph stretching.
Definition 5.1: Graph stretching is a transformation a  of the graph G (V, E , L)  
such that
a : G ( V , E , L )  —» G ' ( V ' , E ' , L ' )
where Card (V' ) = Card (V), Card (E ') = Card (E ) and v* e  V is mapped to v'x 
g V' such that
r*  = Tx
t'x *  tx 
f ' x  =  f x
dep'x = depx
and if there is an edge connecting vx to vy in G , G' will have an edge
1 6 8
connecting v'x to v 'y . In addition, in our apj. ’ication a restriction is put on t'x . It 
requires (t'x -  tx) to be as small as possible while accomplishing the task.
Stretching an SDG graph corresponds to shifting some of the vertices of G 
along the time axis in the positive direction. The number of vertices and edges as 
well as the depth of each vertex remain unchanged. Also, the transformation a  
keeps the data dependencies intact. Figure 5.1a shows an SDG G with two sub­
graphs G j = (F j, F j, L j) and G 2 = (F 2, E 2, L 2). Figure 5.1b shows the stretched 
SDG G' where for every vx e  V2, there exists a vertex v'x e V' 2 with t'x -  tx 
=  1.
The next two sections employ the graph stretching transformation to project the 
full-size SDG into fixed-depth multi-linear and size independent systolic arrays.
5.3. Design of FDML Arrays
We characterize an array of any dimension (i.e. ID, 2D, 3D or nonplanar) by 
two coordinates: its width wd and its depth dp . For the remainder of this chapter, 
we will assume that GPEi is a group of PEs (can be rows, columns, or a combina­
tion of both) that consists of PEs constructed by projecting the SDG subgraph G,-, i 
e  {1, 2,.., k }, along the chosen projection direction. Let /V,- be the number of PEs 
in GPEi and FA/?, be the distance between the two farthest PEs in G FF,, where 
the distance between two connected neighbor PEs is defined to be unity. If all PEs 
in GPEi are not connected, then FA/?, = 0. The width of an array is defined as:
169
Li
1 2
1 3 
U
l 5
Figure 5.1 Graph stretching.
a) Orginal graph.
b) stretched graph.
170
wd =
k k
Max ( FAR -. ) if Max ( FAR: ) > 0 
i=1 i=l
k
Max ( N: ) otherwise
»=1
The depth of an array is equal to the number of subgraphs in the SDG as 
defined in Chapter 4. It also equals the longest of the shortest directed paths from of 
any input port to any PE in the array. As examples, a linear array with M  PEs has 
a depth dp = M  and a width wd = 1 if only one of its PEs receives data from the 
outside world, while dp = 1 and wd = M  if all PEs act as input ports. A mesh 
connected array with M 2  PEs has dp = M  and wd = M  if the input bandwidth 
is either 2M  or M , while dp = 1 and wd = M 2  if all of the M 2  PEs are input 
ports.
An FDML array is an array with a width that is a function of the algorithm size 
and a depth that is chosen a priori by the designer. Expectedly, the depth will be a 
function of the state of technology. As technology advances, the depth can be 
increased to provide more processing power. This feature relates particularly to tech­
nological progress in the pin-count and I/O fronts.
To design an FDML array of depth dp that implements an algorithm of size 
N , the following steps are suggested:
1. Map the algorithm into its corresponding SDG following the procedure
discussed in Chapters 3 and 4.
171
2. If the SDG consists of k subgraphs G,-, i = 1 to k , where k < dp, 
use the method discussed in the second section of Chapter 4 to map the 
SDG into hardware. The design is complete. This case is unlikely how­
ever.
3. If the SDG consists of k subgraphs, where k > dp, the SDG is 
transformed into G' that consists of dp subgraphs.
4. The transformed SDG G' is then mapped into hardware.
In the remainder of this section, Steps 3 and 4 are discussed thoroughly. It is 
assumed that we are dealing with an SDG of k subgraphs, where k > d p ; dp is 
the depth of the array to be designed. Also, without loss of generality, k can 
assumed to be a multiple of d p ; that is k = m.dp, where m is an integer greater 
than unity. The subgraphs G, of the SDG will interchangeably be referred to as 
hyperplanes HPt , i = 1 to k , and a group of hyperplanes will be referred to as a 
hypersurface.
5.3.1. SDG Stretching
To transform an SDG of k subgraphs into a graph of dp subgraphs, the k
khyperplanes, HPt , i = 1 to k , are grouped into —  hypersurfaces HSj, j  = 1 to
k— , where each hypersurface consists of dp hyperplanes. If the resulting hyper- 
dp
surfaces were to be mapped directly into hardware, then some PEs would have to
172
perform at least two operations per global clock period. The resulting array would 
be as fast as the full-size array (where dp = k), but such array would, in most 
cases, exhibit some unattractive features such as large number of links per PE and 
more complicated PEs. To remove such drawbacks, the hypersurfaces are stretched 
along the direction of the time axis which enables us to decouple (in time) the opera­
tions that will eventually be assigned to a certain PE. If this is the only goal, the
fctask can be easily accomplished by shifting the hypersurfaces HSj ,  j  = 2 to -j~,
so that no two vertices of different hypersurfaces have the same firing time. How­
ever, the task is much more involved and graph stretching should be performed care­
fully so that the following three conditions are satisfied:
a) No two vertices at same time level are mapped to the same PE 2
b) The amount of stretching is minimal (time optimality).
c) Data dependencies remain intact.
The first condition suggests that the stretching procedure should be able to 
recognize the vertices that are mapped to the same PE. This forces the stretching 
procedure to depend on the selection of the projection line. Therefore, the projection 
direction should be chosen prior to applying any transformation on the SDG.
Assuming that d is the vector along the chosen projection line (it can be any
2 This is to satisfy the assumption of one operation per PE per time unit.
173
valid projection line used in producing full-size arrays) and that d sum is the summa­
tion of all dependence vectors that correspond to edges in the cut-sets Sj, j  = 1 to 
k - 1 (k is the number of subgraphs in the SDG), the following algorithm stretches
the SDG to make it projectable into a fixed depth systolic array of depth d p .
ALGORITHM C
BEGIN
For i = dp + 1 to k in steps of dp do 51 to S 1 3
51. Find vx at lowest time level with depth depx = i .
5 2. Place tz of all vertices vz such that vz -  v* = Id, in an array denoted by
X ; I is a positive integer.
5 3. Find vy such that Ix -  Iy = m d sum and depy = i -  dp, where m i s a  
positive integer.
5 4. Place tz of all vertices vz such that vz - v y = Id, in an array denoted by
Y ; I is a positive integer.
S 5. nshift = 0;
S 6 . I f  (at least one element of X  is equal to an element in F) do
tx = tx + 1; (shift vx by one time unit)
Increment by 1 all elements in X ;
3 Recall that k is assumed a multiple of d p .
174
nshift = nshift + 1; 
go to S 6 .
else
Continue;
S I .  tp = tp + nshift, for every vp such that there exists a directed path from vx 
to vp .
END (Algorithm C).
fcAlgorithm C divides, first, the k hyperplanes into —  hypersurfaces, anddp
then starts shifting the vertices along the time axis until no vertices that are in 
different groups and that will be mapped into the same PE fall on the same time 
level. Step 1 of the algorithm identifies the vertex vx that has the smallest time
klevel in each of the subgraphs Gmdp+i, where m e  {1, 2, , —  -  1}. If vx
€ HSf,  Step 3 looks for a vertex vy e  that will eventually be mapped to the
same PE as vx . If all the edges between the subgraphs Gt , i = 1 to k , correspond 
to only one dependence vector d±, a sufficient condition so that two vertices vx 
and vy of different subgraphs are mapped to the same PE is that Iy -  Ix = m dx\
that is vx and vy are connected through edges corresponding to d\.  Now, if the
edges between the subgraphs correspond to more than one dependence vector, a 
sufficient condition is that ly -  lx = mdsutn, where d sum is the summation of 
dependence vectors that correspond to edges in the cut-sets Sj, j  -  1 to k - l .
175
Step 2 of Algorithm C saves, in an array denoted by X , the time levels of the 
vertices that lie on the same projection line as vertex v*, while Step 3 stores, in 
array Y , the time levels of the vertices that share a projection line with vertex v ,^. 
Step 6 starts shifting the elements in X  by one time unit until there is no time over­
lap between the elements in X  and the elements in Y. To keep the data dependen­
cies intact, all vertices that can be reached from vx through directed edges are also 
shifted by the same amount. For the SDG shown in Figure 5.1a, there exist three 
valid projection directions; along d x, d 2  or d j + d 2, as shown in the figure. 
According to Algorithm C, if the projection direction is along d l or along d 2, the 
stretched SDG is equivalent to the one in Figure 5.1b. However, if the projection 
direction is along d 1 + d 2, the SDG does not have to be stretched since none of the 
lines that will eventually be mapped to the same PE have a time conflict.
Theorem 4: Algorithm C stretches the SDG by a minimum time t such that no 
two vertices that are mapped to the same PE have the same firing time.
Proof: Algorithm C searches for two vertices v* e  //S, and vy e  HSt_}, i e  {1
Ic, , — }, that will eventually be mapped to the same PE, PE/. All the vertices
dp
that lie on the same projection lines as v* and vy are identified and their firing 
times are stored in two arrays denoted by X  and Y, respectively. The vertices 
whose firing times are in X  are shifted by t time units; where t = nshift is 
computed by Step 6 of the algorithm.
176
Suppose that the stretching procedure was not optimal, then the amount of time 
shifting can equal t ' , where t' < t. This means that the loop in Step 6 would stop 
when nshift reaches t' < t. In this case, at least one element of the array X 
would be equal to an element in the array T, according to the condition in Step 6 . 
Therefore, the processing element PE[ would have to perform two operations at the 
same time which contradicts our assumption of one operation per PE per time unit. □
5.3.2. Mapping Transformed SDG into Hardware
At this point, we have an SDG which is divided into hypersurfaces HSi, i = 1
kto — , where k is the number of hyperplanes in the SDG and dp is the depth of 
dp
the systolic array to be designed. Each hypersurface consists of dp hyperplanes 
and no two vertices that are in different hypersurfaces and that will eventually be 
mapped into the same PE, have the same firing time.
To map the transformed SDG into hardware, each hypersurface can be treated as 
a complete, full-size SDG and be projected into a systolic array following the pro-
kcedure outlined in Chapter 4. The result, in this case, is —  arrays that are con-
dp
nected to each other, each has a depth of d p .
By stretching the SDG, as outlined above, we are guaranteed that for each PE,
kPEi, constructed using vertices in HS:, i e  {1, 2, . . .  , — there exists at least
dp
one PE constructed using vertices in HSt_j that does not have a time conflict with 
PE[. A time conflict between two PEs might occur if both PEs perform an operation
177
at the same time unit. Therefore, all PEs in the group corresponding to hypersurfaces 
HS2, H S3, . . . , HSk/dp may be merged with the PEs corresponding to HS j 
without violating the assumption of one operation per PE per time unit.
More formally, to transform the stretched SDG into hardware, the vertices of 
different hypersurfaces that will eventually be projected to the same PE are first 
identified and placed on the same projection line. As stated earlier, two vertices vx 
and vy are mapped into the same PE if and only if one of the following two condi­
tions is satisfied:
1. vx and vy are in the same subgraph G ,, i e  {1, 2, . . . , k }, and 
Ix -  Iy -  d or Iy -  Ix = d , where d is the vector along the chosen 
projection direction.
2. vx and vy belong to two different subgraphs Gpi and Gpi such that 
P i  ~ Pi  + I-dp and Ix -  Iy = mdsum, where d sum is computed by vec- 
torially adding the dependence vectors that correspond to edges in the 
cut-sets Sj, j  = 1 to k -1 ,  and m and I are two nonzero integers.
The second condition above suggests that any two subgraphs Gpi and Gpi such 
that p 2  = p  i + I dp can be mapped to the same group of PEs causing the final array 
to consist of only dp groups.
Since the edges incident into the vertices of subgraphs G i+Ldp, / = 1 to
k k 1, are incident out of vertices of Gidn, 1 = 1 to —— 1, the final
dp p dp
178
architecture will have wrap around links connecting the last group of PEs to the first 
group of PEs. Also, since the SDG was stretched, the difference between the firing 
time of two vertices that are in two different hypersurfaces and that are terminal to 
the same edge, may be greater than unity. Therefore, to guarantee proper operation 
of the designed FDML array, in such a case, an extra buffer memory is placed 
between the last and first group of PEs. This buffer memory is used only to intro­
duce appropriate delays in the path of data streams, thus no complicated memory 
management is necessary. However, the size of the memory is, in general, related to 
the size of the problem.
Before concluding this section, it should be noted that, as for the full-size array, 
every PE in the FDML array is assigned two vectors, a function vector f*  and a 
timing vector if, that will eventually control the timing of operations at the process­
ing element. In addition, the last elements to be input from the outside world should 
be able to switch the communication links to allow the PEs of the first group to 
receive data from the buffer memory instead of the host computer.
5.4. Design of Fixed-Size Arrays
The procedure to design fixed-size arrays discussed in this section applies only 
to algorithms whose index points do not have to be rearranged or reordered to obtain 
optimal execution of the algorithm. The reason is that the reordering procedure 
described in the second part of Algorithm A (Chapter 3) depends on the problem 
size. A primitive solution applicable to algorithms that do not possess this property
179
is given at the end of this section.
A fixed-size or size-independent systolic array is an array of fixed depth dp 
and fixed width w d ; that is the number of PEs in the array is chosen, a priori, by 
the designer. Fixed-size arrays are practically attractive since the size of a VLSI chip 
is constrained by technological factors. There are also limitations imposed by the 
number of I/O pins. While the level of integration is expected to continue to grow 
for sometime, the number of pins will be limited to several hundreds making VLSI 
devices I/O bounded [78].
The following steps can be followed to design a fixed-size array of depth dp 
and width wd that implements an algorithm of size N :
1. Transform the algorithm of size N  into its corresponding SDG following 
the procedure outlined in Chapters 3 and 4.
2. Number the lines parallel to the chosen projection line in each subgraph 
G, , i = 1 to k , according to Algorithm D, to be discussed later. Let hi, 
be the highest line number in Gt , i = 1 to k .
k
3. If Max (hii) < wd and k < dp, then the SDG can be mapped directly
*’ =  1
into hardware following the procedure described in Chapter 4 4. Max is 
an integer function that finds the maximum of the integer arguments.
4 This case is unlikely, however. We assume that an array smaller than the fixed-size one is ac­
ceptable.
180
k
4. If Max (hli) > wd, every hyperplane (subgraph) of the SDG is divided
i =  1
into / bands f i j ,  B 2, • • • > &!•
5. Stretch the SDG by shifting the bands B 2  to Bt along the time axis.
k6 . If k > dp, group the k stretched hyperplanes into —  hypersurfaces
dp
HS\, HS2, ■■■ i HS/c/dp •
7. Stretch the SDG by shifting the hypersurfaces HS 2  to HSk/dp, along the 
time axis.
Steps 1 and 3 use procedures discussed in Chapter 3 and 4, while Steps 6 and 7
were described in the previous section. Steps 2, 4 and 5 are the only steps that need
k
to be detailed. For the remainder of this section, it is assumed that Max (hlt ) s> wd
i=1
and k »  dp.
5.4.1. Line Numbering
Before dividing the subgraphs of the SDG into bands, the lines that are parallel 
to the projection line are numbered so that the vertices in each band can be 
identified. Assuming that d  is the vector along the chosen projection direction and 
di is any dependence vector of the algorithm with dL d , Algorithm D labels 
each vertex vg with a positive number plg .
181
ALGORITHM D
BEGIN
51. Find the vertex vx e  G i at the lowest time level. Let plx = 1.
52. Let ply = 1 for every vertex vy e  G x such that Iy -  Ix = m d,  where m is 
a positive integer.
5 3. Let plz = I for every vertex vz e  G j such that Iz -  Iy -  Id, , where ply =
1.
5 4. For j  = 2 to k do 5 5
55. Set pi, = plz for every v, e Gj such that vz e  Gj_x and I, -  l z -  d,. If
more than one vz satisfies the above condition pi, = Max (plz).
END (Algorithm D).
Algorithm D, first, assigns a label equal to unity to all vertices that share with
the vertex vx (at the first time level) a line parallel to the projection line. Then, any
vertex in G j that is directly dependent on a vertex that lies on the line numbered 
I - I  is labeled with the number /. And any vertex v, in G, is assigned the 
highest label of vertices vz in such that v( is directly dependent on the ver­
tices vz. Figure 5.2 shows an SDG of one subgraph and three valid projection 
directions; along d j, d 2  or d j + d 2, as shown in the figure. Each vertex is 
labeled with a 3-tuple Qc, y , z ), where x  is the line number that the vertex lies on 
if the projection direction is along d\, y corresponds to the projection direction d 2
182
(  111 )
( 122 ( 212 )
( 221 )
d l  +  d 2
Figure 5.2 Line numbering.
183
and z is the label according to the direction d j + d 2.
Once the lines parallel to the projection line are numbered, we are ready to 
group them into bands that will eventually be mapped into groups of PEs where each 
group has a width equal to wd. The next subsection deals with constructing the 
bands and stretching the SDG so that all the bands in one hyperplane can be mapped 
to the same group of PEs.
5.4.2. SDG Stretching
As stated earlier, each hyperplane in the SDG HPit i e  {1, 2, . . . , k ,  is
divided into / bands Bj ,  j  = 1 to /, and the bands B 2  to B t are shifted along the
time axis such that if all the bands in a hyperplane are mapped to the same group of
PEs, each PE would only perform one operation per time unit. Then, the k
kstretched hyperplanes of the SDG are grouped into —— hypersurfaces HSs , 5 = 1
dp
kto — , and then the hypersurfaces HS 2  to HSk/dp are shifted along the time axis 
dp
so that if all hypersurfaces are mapped to the same set of PEs, no time conflict 
occurs. A time conflict might occur if two vertices with the same firing time are pro­
jected to the same PE.
A hyperplane Pt , i e  { 1 , 2 , . . . ,  k} is divided into bands as follows:
1. A vertex vx is assigned to the band B j if and only if 
(m -l)w d  <plx <m.wd, and there exists no vertex vy such that
184
ply < (m -l)w d ,  where m is positive integer.
2. After forming f ij, the bands B 2  to Bt are each assigned wd lines
that are consecutively numbered.
It should be noted that the number of lines in B j does not have to equal w d ;
it actually might be less or greater than wd since two lines in the same hyperplane
might have the same label. The only restriction, however, is that the difference 
between the highest and lowest labels in the band does not exceed w d .
Since all bands in a hyperplane will eventually be projected to the same group 
of PEs (or more precisely all parallel lines that are numbered n^,
+ wd, . . .  , + i.wd, where i is a positive integer, are mapped to the same
PE), the SDG is stretched, using Algorithm E, such that no two vertices vx and
vy with \ply -  plx \ = i.wd, are at the same time level. It is assumed that d  is
the vector along the chosen projection direction.
ALGORITHM E
BEGIN
FOR i= w d + \  to hi\ in steps of wd do S I  to S 9
51. Find v* e G x at lowest time level with plx =/ ' .
5 2. Place tz of all vertices vz such that vz - v x = Id, in an array denoted by
X ; I is a positive integer.
185
53. Find at highest time level with ply = 1 .
54. Place t2 of all vertices vz such that vy -  vz = Id, in an array denoted by 
Y ; / is a positive integer.
5 5. nshift = 0;
56. / /  (at least one element of X  is equal to an element in Y ) then
= tx + 1; (shift vx by one time unit)
Increment by 1 all elements in X ; 
nshift = nshift + 1; 
go to 5 6 .
else
Continue;
5 7 . tp = tp + nshift, for every vp such that there exists a directed path from vx 
to vp .
58. Let plz = 1  for every vz such that 7Z -  Ix -  md, where m > 0.
59. If there exists a vertex vx with plx = i,  go back to 51.
510. For every vertex vz do
• plz -  pl2 MOD wd
• If pl2 = 0 then pl2 -  wd 
END (Algorithm E).
186
Algorithm E shifts, first, all the vertices vz that belong to the line numbered 
wd+ 1 along the time axis by t time units so that none of them would be assigned 
the same time level as the vertices of the line labeled 1, where t -  nshift is com­
puted by Step 6 . Then all the vertices vp that can be reached from vz through a 
directed path are also shifted by t units. The same process is performed for the 
vertices that belong to the line numbered 2wd+l, then the line labeled 3wd+l, etc. 
In addition to stretching the SDG, Algorithm E renumbers the lines parallel to the 
projection line so that two lines with the same number are eventually mapped to the 
same PE. This is reflected by Step 7 of the algorithm. If the SDG shown in Figure 
5.2 has to be projected into one PE, it is stretched according to Algorithm E to yield 
the two graphs shown in Figure 5.3. Figure 5.3a shows the stretched SDG if the pro­
jection directions is along d j, while Figure 5.3b shows the stretched SDG if the 
projection direction is along d 2. Both graphs are valid if the projection direction is 
along d j + d 2 -
Theorem 5: Algorithm E stretches the SDG by a minimum time t such that no 
two vertices in the subgraph that are mapped to the same PE have the same firing 
time.
Proof: Algorithm E searches for two vertices v* e HP j at the lowest time level 
and vy g HP j at the highest time level such that plx = i.wd + 1 and ply = 1 that 
will eventually be mapped to the same PE, PE[. All the vertices that lie on the
187
Figure 5.3 Subgraph stretching.
a) Along d v
b) Along d 2.
188
same projection lines as v* and vy are identified and their firing times are stored 
in two arrays denoted by X  and Y, respectively. The vertices whose firing times 
are in X  are shifted by t time units; where t -  nshift as computed by Step 6 of 
the algorithm.
Suppose that the stretching procedure was not optimal, then the shifting time 
could equal t ' , where t' < t. This means that the loop in Step 6 would stop when 
nshift reaches t' < t. In this case, at least one element of the array X  would be 
equal to an element in the array Y , according to the condition in Step 6 . Therefore, 
the processing element PEt would have to perform two operations at the same time 
which contradicts our assumption of one operation per PE per time unit. □
It is noted that Algorithm E shifts the bands in a subgraph so that all bands 
can be mapped into the same group of PEs without any time conflict. Shifting the 
hypersurfaces HS2 to HSk/dp along the time axis can be accomplished by applying 
Algorithm C (described in the previous section) with minor changes. In designing an 
FDML array, the lines parallel to the projection line in a hyperplane are projected 
into different PEs which may not be the case when designing fixed-size arrays. 
Therefore, Steps 2 and 4 in Algorithm C would be changed such that they find the 
vertices vz with the same line number as v* and vy , respectively. The modified 
steps should be stated as follows:
S 2. Place tz of all vertices vz such that plz = plx and depz = depx , in 
an array denoted by X .
189
S 4. Place tz of all vertices vz such that plz = ply and dep2 = depy , in 
an array denoted by f .
5.4.3. Mapping the Stretched SDG into Hardware
At this point, the algorithm of size N  is represented by an SDG that consists
k kof k hyperplanes that are grouped in —  hypersurfaces HSh i = 1 to —, and
each is partitioned into / bands Bj, j  = 1 to / ,  of widths not exceeding w d . For 
the sake of discussion, we define a frame  to be a subgraph of the SDG that con­
sists of all bands with the same subscript in the same hypersurface. The total 
number of bands in a frame is dp which equals the number of hyperplanes in a 
hypersurface. Therefore, if a frame is projected along the chosen projection direc­
tion, it leads to a systolic array with depth dp and width w d.
Algorithm C stretches the SDG such that no two vertices that belong to the 
same hyperplane but different bands and that will eventually be mapped to the same 
PE, are placed at the same time level. Algorithm E stretches the already stretched 
SDG such that no two vertices that belong to the same hyperplane but two different 
hypersurfaces and that will eventually be mapped to the same PE if the two hypersur­
faces are projected to the same processor space, will have the same time level. 
Therefore, the stretched SDG can be divided into frames such that if two frames are 
mapped into the same systolic array, no PE would have to execute two different 
operations at the same time, where an operation consists of all computations per­
190
formed by a single vertex.
Remembering that Algorithm E also renumbers all lines parallel to the projec­
tion line such that two lines in the same hyperplane have the same number or label if 
they will eventually map to the same PE. Therefore, the stretched SDG can be pro­
jected into a fixed-size systolic array such that vertices vx and are mapped into 
the same PE if and only if one of the following two conditions is satisfied:
1. p l x -  p ly  and depx = depy ; that is vx and vy belong to lines that have 
the same label and that are in the same hypeiplane.
2. vx and belong to two different hyperplanes HPpi and HPpi such 
that p  2 = p  i + I.dp and that lx - l y = mdsum, where d sum is com­
puted by vectorially adding the dependence vectors that correspond to 
edges in the cut-sets Sj, j  = 1 to k - \ ,  and m and I are two nonzero 
integers.
After mapping all vertices into PEs, the PEs are connected to each other using 
directed links. A link would connect PEt to PEj if and only if there exists at 
least one edge e 6 E (E is the set of edges in the SDG) that is incident out of a 
vertex that is mapped to PEi and incident into a vertex vy that is mapped to 
PEj.
It is clear that an extra buffer memory is needed for the proper operation of a 
fixed-size array. The buffer memory will simply introduce delays in the path of data
191
streams. It is placed between the group of PEs of depth dp and the PEs of depth 
one, and between the group of PEs constructed using vertices that lie on lines labeled 
wd and PEs assigned to vertices on the lines numbered 1.
It should be noted that All the procedures in this chapter do not apply to SDGs 
where the index points were reordered to obtain optimal execution of the algorithm. 
The reason is that the reordering procedure, due to the use of nonrigid dependence 
vectors, depends on the size of the algorithm. This can be seen from Algorithm A 
that constructs the SPD representation of the algorithm in concern. Whereas the 
stretching transformation employed in this chapter is rigid in the sense that it does 
not permit any reordering of the index points (vertices). An example of such cases is 
the SDG of Figure 4.4a corresponding to the matrix-matrix multiplication problem 
(C = AB) with an input bandwidth of 2n , where n is the number of elements in 
a row or column of the dense matrices A and B . In this SDG, there are n ver­
tices at the first time level where n - 1 of the vertices correspond to index points 
that are less than (in lexicographical sense) index points at higher time levels.
A primitive solution to the above problem is to divide the index set of the algo- 
Nrithm of size N  into —  subsets of index points, where N' is the size of the
algorithm that can be mapped directly into the fixed-size array to be designed. Each 
subset is treated independently and transformed into the SPD representation using 
Algorithm A described in Chapter 3. The SPD is then transformed into the SDG 
representation which is mapped into hardware following the procedure discussed in
192
Chapter 4. All the designed arrays can be mapped to one array which can be used to 
implement the subsets of index points in sequential order.
5.5. Examples
Matrix-Vector Multiplication: The SDG for this problem with input bandwidth of 
n + 1 is shown in Figure 4.2a (n = 2). Now assume that the problem has n = 4 
and the available number of processing elements is p  = 2 .
It is easy to see that the SDG for the matrix-vector multiplication algorithm con­
sists of one hyperplane. Following the SDG transformation proposed in Section 5.3,
the SDG can be divided into — bands and then stretched along the time axis such
P
that all the bands can be mapped into a single band with no time conflict. Figures 
5.4a and 5.5a show the stretched SDGs if the chosen projection lines are either 
or d 2  as indicated. Figures 5.4b and 5.5b present the two fixed size arrays (n = 4 
and p = 2).
The communication between computations in adjacent bands is performed via an 
external buffer memory operating in First-In-First-Out (FIFO) mode. It should be 
noted that external data communication is performed in an orderly manner, and no 
complicated outside control or memory management in necessary. In general the 
maximum number of locations in the buffer memory is related to the size of the 
problem.
The Transitive Closure Problem: The SDG for this problem is shown in Figure
L|
1-2
L4
L5
l 7
1-8
L9
b4 b3b2b,
b)
° 4 4
'34 a43
'33 °4 2
'32 a 4)
'31 a 24
'14 a 23
113 a 22
'12 a 2l
'II 0
M r F I F O  Buffer
Figure 5.4 Matrix-vector multiplication (« = 4 and p  = 2).
a) Stretched SDG along d\.
b) Array projected along d x.
b)
O 0 4 3  0  0 3 4
0  C I 3 3  0 024
0 ° 2 3  >^4 0 | 4
t>3 0 1 3  0  0 4 2
0  a 4 | 0  0 3 2
0  0 3 1  0  0 2 2
0 0 21 b2 a 12
F IF O  Buffer
Figure 5.5 Matrix-vector multiplication (n = 4 and p  = 2).
a) Stretched SDG along d 2.
b) Array projected along d 2■
195
4.8a (n -  3). In this example we assume that n = 4. Therefore, according to the 
SDG construction rules described in Chapter 4, the SDG will consist of four hyper­
planes f j t o P ^
To design a linear array, the hyperplanes P 2  to P 4  are shifted downward along 
the time axis such that all four hyperplanes can be mapped into one hyperplane with 
no time conflict. The stretched SDG is shown in Figure 5.6a (for simplicity, the 
edges between hyperplanes are not shown). The final linear systolic array and its 
input sequence are presented in Figure 5.6b. In addition to the linear PE array, the 
figure also shows a buffer array. The buffer array (which is a FIFO) simply intro­
duces appropriate delays in the paths of different data streams such that the updated 
matrix can be directly fed back to the array to avoid costly memory I/O interaction.
All PEs in the linear array operate in two different modes. To control the 
switching time and to differentiate between the two functions, a tag bit is inserted 
into the leftmost PE from the left and three control bits called header (h ), trailer (t) 
and delay (d ) accompany the matrix elements. For notational purposes, such bits are 
shown only if they are set. If a PE receives an element with a set header, it saves 
the element in an internal register and then sets the header of the next incoming ele­
ment. A set trailer forces the PE to release its saved element and to transmit it 
southward with a set trailer.
The third control bit is called delay. This bit serves two purposes. First, it is 
used to control the buffer array. Figure 5.6b shows a switch after two (or n-2) buffer
196
2
3
4
5
a)
Figure 5.6 Linear array for transitive closure. 
(Cont.)
a) Stretched SDG.
b) Projected array.
197
Figure 5.6 Continued
°44 t
°43 t °34
a42 t °33
CM
o
°4I t a32 °23 °I4 h
°3I a 2 2 °I3 h 0
a2| °I2 h 0 0
°ll h 0 0 0
r — r — r ~ “  
— □ -!-  *
b)
198
registers. If d is set the element takes the long path (n buffers), otherwise the 
shorter feedback path is taken. Second, the delay bit is also used to indicate a 
change in the direction of the PEs horizontal links. Before sensing a set delay bit, a 
PE receives its horizontal elements from the west-bound input link. After it senses a 
set bit, the PE starts receiving the elements from the east-bound input link.
Designing a fixed-depth array with depth p *  2 is similar to designing a linear 
array. However, instead of shifting the hyperplanes, we first group the hyperplanes
into — hypersurfaces and then shift the hypersurfaces along the time axis such that 
P
all hypersurfaces can be mapped into one hypersurface with no time conflict. Figure 
5.7 shows a bilinear (p = 2) array that implements Warshall’s algorithm. Controlling 
this array is similar to controlling the linear array.
To design a P \ X p 2  fixed-size systolic system, we first group the n hyper­
planes into —  hypersurfaces, then divide each hyperplane into —  bands. Finally, 
P i P i
the bands and the hypersurfaces are shifted and the SDG is stretched such that all 
bands can be mapped into one band and all hypersurfaces can be mapped into one 
hypersurface without any time conflict.
As can be verified, the SDG for Warshall’s algorithm is heterogeneous in a 
sense that some computational vertices run different functions and some bands have 
different geometric forms. Two different kinds of bands exist; that is why it is more 
convenient to stretch the SDG such that all bands can be mapped into two bands with
199
a4 l t
a 3l
a 2i
aii h
a 4 2 t
a 32
a 22 
d | 2 h
0
a4 3 t
° 3 3
a 23
d | 3 h
0
0
d4 4 t
a 34  
d 24  
d | 4  h 
0 
0 
0
I— r~
I___ \
I____ I___ 1 I__
Figure 5.7 A bilinear array for transitive closure.
no time conflict. The diagram of the size-independent systolic system is presented in 
Figure 5.8a (pj = m and p 2 = m )- The system consists of two types of arrays: 
Type A and Type B. Type A array is similar to the full size array shown in Figure 
4.8b, while Type B array consists only of square-type PEs. Type B array is shown 
in Figure 5.8b. Operation and functional description of this fixed-size system can be 
found in [57].
TYPE B 
ARRAY
m Xm
TYPE A 
ARRAY
mXm
BUFFER
MEMORY
a)
Figure 5.8 A fixed-size system for transitive closure. 
(Cont.)
a) Block diagram.
b) Array of type B.
2 0 2
Figure 5.8 Continued.
0 4 4
° 4 3 ° 3 4
° 4 2 ° 3 3 ° 2 4
° 4 I CMro
o
° 2 3 ° I 4
° 3 I ° 2 2 ° I 3 0
° 2 I ° I 2 0 0
° l l
1
0 0 0
b4 l b 3 l b 2 l b l l  ~ ”
b4 2 b 3 2 b 2 2 b 1 2 °  — -
b43b33b23bi30  0  — *
b)
CHAPTER 6 
CONCLUSIONS AND FUTURE WORK
We don’t know one millionth o f  one percent about anything.
Thomas Edison
In this dissertation, we have described a comprehensive, constructive, graph 
methodology for regularizing data flow, determining the absolute lower bound on 
systolic execution time, and mapping algorithms into full-size, fixed size and fixed- 
depth VLSI arrays. The contributions of this work are summarized as follows:
1. We have defined an algorithm model which is a generalized form of 
simpler models proposed previously. It consists of a nested-loops program 
that includes if-then-else statements which partition the index set of the 
algorithm into subsets each of which may use different subsets of depen­
dence vectors. This guarantees, unlike other models, that unnecessary ord­
ering between index points that are not data-dependent will not occur. It 
also makes the model capable of modeling a wide class of compute-bound 
algorithms that may include relatively complex, nonuniform and multi­
phase algorithms.
2. We have used the concept of dependence vectors, as some other metho-
203
204
dologies, to order the index points of the algorithm in time and space. 
However, by distinguishing between rigid and nonrigid dependence vec­
tors, the index point ordering procedure becomes flexible and time 
optimal. The maximum I/O bandwidth and the index point fan-out could 
be chosen by the designer rather than being imposed by the ordering pro­
cedure. Time optimality is due to the freedom given to the ordering pro­
cedure to shuffle and rearrange the index points without affecting the final 
result. The ordered index points are represented by what we called a 
Sytolic Precedence Diagram (SPD). The SPD is an adaptation of the fami­
liar precedence graph but it takes into account the systolic operation 
requirements of strictly local communications and regular data flow. The 
SPD determines the lower bound on the total execution time of the algo­
rithm on a systolic array for a given I/O bandwidth and minimum fan-out.
3. Since the SPD should be regular, irregular algorithms with variable depen­
dence vectors must be transformed, first, into a regular indexed set of 
computations with local data dependencies. This is done by replacing a 
variable dependence vector with a set of fixed dependence vectors whose 
number is not a function of the problem size. We have shown ways of 
ordering index points that use variable dependence vectors. These tech­
niques prevent deadlocks that might occur in the ordering procedure and 
insure that node fan-out is independent of the number of nonconstant com­
ponents in the variable dependence vector.
205
4. A procedure to transform the SPD into a graph called Systolic Directed 
Graph (SDG) has been described. An SDG is an acyclic, labeled, directed 
graph that models the data flow as well as the timing for the execution of 
the given algorithm on a time-optimal array. Vertices are labeled by the 
firing time of the operation at the corresponding index point (iteration) and 
by an integer label that identifies the operation performed at that iteration.
5. Systolic arrays are designed by projecting the SDG along defined direc­
tions. The projection is done under the self-imposed requirements of 
minimizing input and output links and limiting each PE to one operation 
per time unit. With the above requirements, if more than one valid pro­
jection direction exist, different designs are obtained. In case a projection 
direction is not found under the given conditions, a relaxing technique is 
used to define a valid projection direction. However, this technique does 
not guarantee time-optimality. Every processing element of the designed 
systolic array is assigned a timing vector and a function vector. These 
vectors are used to control the operation of the architecture.
6 . We have identified algorithms whose optimal systolic implementation 
require switchable structures. Switchability in this context may imply PE 
switchability, link switchability or data flow switchability. These classes 
have been defined formally. ...
7. Algorithms have been characterized in terms of potential benefits from
206
increased PE fan-out. If such benefits exist, the methodology provides the 
corresponding systolic implementation.
8. Systolic arrays of any dimensionality have been characterized by two 
coordinates: depth and width. By using a new graph transformation 
termed graph stretching, we were able to extend our methodology to 
encompass designing fixed-size and fixed-depth multi-linear (FDML) 
arrays. A fixed-size array is an array of depth and width independent of 
the size of the problem, while only the depth in an FDML array is not a 
function of the algorithm size. An FDML array represents a new class of 
systolic arrays that has not been explored before.
Throughout the dissertation, we have demonstrated our methodology using 
different examples. We have designed many arrays, planar and nonplanar, for the 
vector-matrix multiplication, matrix-matrix multiplication, LU-decomposition and 
transitive closure problems. Many of these arrays have not be reported before.
As far as future work is concerned, the following research is suggested:
1. Development of systematic schemes for devising control mechanisms for 
switchable systolic architectures using externally supplied bits. This 
includes finding the lower and upper bounds on the number of such bits 
for a given algorithm.
207
2. Study the possibility of extending ojur methodology to encompass design­
ing VLSI arrays that utilize the concept of data set multiplexing. This 
concept was first reported by Dharmarajan and El-Amawy [94] in the 
context of a particular design.
3. Examine, mathematically, the clock skew problem in FDML and non­
planar arrays.
4. Study the trade off between the throughput and the depth of FDML 
arrays.
Another dissertation comes to an end but as Thomas Edison said a century ago, 
we still "don’t know one millionth o f one percent about anything"!
REFERENCES
[1] J. Backus, "Can programming be literated from the Von Neumann style? A 
functional style and its algebra of programs", ACM, vol. 21, Aug. 1978, pp. 
613-641.
[2] H.T. Kung, "Why systolic architectures?", Computer, vol. 15, Jan. 1982, pp. 
37-46.
[3] C.A. Mead and L.A. Conway, Introduction to VLSI Systems, Addison-Wesley, 
Mass., 1980.
[4] K. Hwang and F. Briggs, Computer Architecture and Parallel Processing, 
McGraw-Hill, N.Y., 1984.
[5] I.E. Sutherland and C.A. Mead, "Microelectronics and computer science", 
Scientific American, vol. 237, no. 3, Sept. 1977, pp. 210-228.
[6] M.J. Foster and H.T. Kung, "The design of special purpose VLSI chips", Com­
puter, vol. 13, no. 1, Jan. 1980, pp. 26-40.
[7] H.T. Kung, "Let’s design algorithms for VLSI systems", Proc. Conf. Very 
Large Scale Integration: Architecture, Design, Fabrication, Cal. Tech., Jan. 
1979, pp. 65-90.
[8] H.T. Kung, "Special-purpose devices for signal and image processing: an 
opportunity in VLSI", Proc. SPIE, vol. 241, July 1980, pp. 76-84.
208
209
[9] H.T. Kung and P.L. Lehman, "Systolic (VLSI) arrays for relational database 
operations", Proc. ACM-Sigmod 1980 Int’l Conf. Management o f Data, May 
1980, pp. 105-116.
[10] H.T. Kung and C.E. Leiserson, "Systolic arrays (for VLSI)", SIAM, 1979, pp. 
256-282.
[11] C.E. Leiserson, "Systolic priority queues", Proc. Conf. Very Large Scale 
Integration: Architecture, Design, Fabrication, Cal. Tech., Jan. 1979, pp. 199- 
214.
[12] S. Manohar, Systolic architectures: a critical survey, Tech. Report no. CS-86- 
05, March 1986.
[13] C.E. Leiserson and J.B. Saxe, "Optimizing synchronous systems", Proc. 22nd 
Annual FOCS Symp., 1981, pp 23-36.
[14] R.B. Johnson, "The complexity of a VLSI adder", Information Processing 
Letters, vol. 11, no. 2, October 1980, pp 92-93.
[15] G.M. Baudet, "Design and complexity of VLSI algorithms", Foundations o f 
Computer Science IV, Part 1: Algorithms and Complexity, J.W. de Bakker and 
J. Van Leeuwen (eds.), The Netherlands, 1983, pp. 49-74.
[16] R.P. Brent and H.T. Kung, "The chip complexity of binary arithmetic", Proc. 
12th Annu. STOC, April 1980, pp. 190-200.
[17] H. Abelson and P. Andreae, "Information transfer and area-time tradeoffs for 
VLSI multiplication", CACM, vol. 23, no. 1, Jan. 1980, pp. 20-23.
210
[18] C.S. Yeh, I.S. Reed and T.K. Truong, "Systolic multipliers for finite fields", 
IEEE Trans, on Computers, vol. C-33, no. 4, April 1984, pp. 351-356.
[19] R.P. Brent and H.T. Kung, "VLSI arrays for linear-time GCD computation", in 
VLSI 83, F. Anceau and E J. Aas (eds.), North Holland, 1983.
[20] J. Vuillemin, "A combinatorial limit to the computing power of VLSI circuits", 
IEEE Trans, on Computers, vol C-32, no. 3, March 1983, pp. 294-300.
[21] H.T. Kung and C.E. Leiserson, "Algorithm for VLSI processor arrays", in 
Introduction to VLSI Systems, C. Mead and L. Conway, Ca, 1980.
[22] J. Savage, "Are-time tradeoffs for matrix multiplication and related problems", 
Journal o f Computer and System Sciences, vol. 22, no. 2, April 1981.
[23] W.A. Porter and J.L. Aravena, "Cylindrical arrays for matrix multiplication", 
Proc. 14th Allerton Conf., Allerton, IL, Oct. 1986, pp. 595-692.
[24] J.L. Aravena and W.A. Porter, "Nonplanar switchable arrays", Circuits Systems, 
Signal Processing, vol. 7, no. 1, 1988.
[25] W.A. Porter, "Nonplanar array processing", Proc. 20th Asilomar Conf, Ca, 
Nov. 1986.
[26] J.L. Aravena, "Triple matrix product architectures for fast signal processing", 
20th Asilomar Conf, Nov. 1986, pp. 53-57.
[27] A. El-Amawy, W.A. Porter, and J.L. Aravena, "Array architecture for iterative 
matrix calculations", Proc. IEE, vol. 134, no. 3, May 1987, pp. 149-154.
211
[28] A. El-Amawy, "Improving the performance of nonplanar systolic architectures 
through increased fan-out", Electronics Letters, vol. 23, no. 5, July 1987, pp. 
765-766.
[29] S. Kak, "A two-layered mesh array for matrix multiplication", Parallel Com­
puting, vol. 6 , 1988, pp. 383-385.
[30] S. Kak, "Multilayered array computing", Information Sciences, vol. 45, 1988, 
pp. 347-365.
[31] A. El-Amawy, "Fast LU-decomposition of dense matrices on an optimal array", 
20th Asilomar Conf, Nov. 1986, pp. 518-522.
[32] H. Barada and A. El-Amawy, "Systolic architecture for matrix triangularisation 
with partial pivoting", IEE Proc., vol. 135, pt. E, no. 4, July 1988, pp. 208- 
213.
[33] A. El-Amawy and H. Barada, "A systolic architecture for matrix triangulariza- 
tion with partial pivoting", Proc 21th Asilomar Conf. Comp. Syst. and Sig., 
CA, 1988, pp. 757-761.
[34] D.E. Heller and I.C.F. Ipsen, "Systolic networks for orthogonal equivalence 
transformations and their applications", Proc. Conf. on Advanced Research in 
VLSI, 1982, pp. 113-122.
[35] A. El-Amawy, "A systolic architecture for optimal filter design support", Cir­
cuits Systems Signal Processing, vol. 7, no. 2, 1988, pp. 151-172.
212
[36] A. El-Amawy, "A systolic architecture for fast dense matrix inversion", IEEE 
Trans, on Computers, vol. 38, no. 3, March 1989, pp. 449-455.
[37] K.R. Dharmarajan, Systolic Algorithms For Stable Matrix Inversion, M.S. 
Thesis, LSU, Dec. 1988.
[38] P.R. Cappello and K. Steiglitz, "Unifying VLSI array designs with geometric 
transformations", Proc. Conf. Parallel Processing, 1983, pp. 448-457.
[39] H.T. Kung and S.W. Song, "A systolic 3-d convolution ship", in Multicomput­
ers and Image Processing: Algorithms and Programs, K. Perston and L. Uhr 
(eds.), Academic Press, 1982, pp. 373-384.
[40] H.T. Kung, L.M. Ruan and D.W.L. Yen, "A two-level pipelined systolic array 
for convolutions", Proc. CMU Conf. on VLSI Systems and Computation, Oct. 
1981, pp. 255-264.
[41] P.R. Cappello and K. Steiglitz, "Digital signal processing applications of sys­
tolic algorithms", Proc. CMU Conf. on VLSI Systems and Computation, Oct. 
1981, pp. 245-254.
[42] M.A. Haddad and P.A. Ramamoorthy, "A systolic architecture based on adders 
digital signal processor", Proc. 22nd Annu. Allerton Conf. on Comm. Control 
and Computing, 1984, pp. 940-945.
[43] G.R. Arce and PJ. Warter, "A median filter architecture for VLSI implementa­
tion", Proc. 22nd Annu. Allerton Conf. on Comm. Control and Computing,
1984.
213
[44] K. Oflazer, "Design and implementation of a single chip 1-d median filter", 
IEEE Trans, on ASSP, vol. 31, no. 5, Oct. 1983.
[45] C. Guerra, "Systolic algorithms for local operations on images", Proc. 22nd. 
Annu. Allerton Conf. on Comm. Control and Computing, 1984, pp. 965-974.
[46] J.L. Guibas and F.M. Liang, "Systolic stacks, queues and counters", Proc. 
Conf. on Advanced Research in VLSI, M.I.T., 1982.
[47] T. Ottmann, A.L. Rosenberg and L.J. Stockmayer, "A dictionary machine (for 
VLSI)", IEEE Trans, on Computers, vol. C-31, 1982, pp. 892-897.
[48] T. Leighton, "Tight bounds on the complexity of parallel sorting", IEEE Trans, 
on Computers, vol C-34, no. 4, April 1985, pp. 344-354.
[49] C.D. Thompson, "The VLSI complexity of sorting", Proc. CMU Conf. on VLSI 
Systems and Computation, Oct. 1981, pp. 108-118.
[50] S.G. Akl and H. Schmeck, "Systolic sorting in a sequential I/O environment", 
Proc. 22nd Annu. Allerton Conf. on Comm. Control and Computing, 1984.
[51] L.J. Guibas, H.T. Kung, and C.D. Thompson, "Direct VLSI implementation of 
combinatorial algorithms", Proc. Caltech Conf. VLSI, Jan 1979, pp. 509-525.
[52] F.C. Lin and Wu, "Systolic arrays for transitive closure algorithms", Proc. Int. 
Symp. VLSI Sys., Nov. 1986.
[53] Y. Robert and D. Trystram, "An orthogonal systolic array for the algebraic path 
problem", Res. report 553, TIM3/IMAG, France, 1985.
214
[54] G. Rote, "A systolic array algorithm for the algebraic path problems (shortest 
paths, matrix inversion)", Computing, vol. 34, pp. 192-219.
[55] S.Y. Kung, S. Lo, and P. Lewis, "Optimal systolic design for the transitive clo­
sure and the shortest path problem", IEEE Trans, on Computers, vol. C36, no. 
5, May 1987, pp. 603-614.
[56] S.Y. Kung and S. Lo, "A spiral systolic architecture/ algorithm for transitive 
closure problems", Proc. IEEE Int. Conf. on Computer Design, 1985.
[57] H. Barada and A. El-Amawy, "A fixed-size systolic system for the transitive 
closure and shortest path problems", Proc. Int. Conf. on Advances in Communi­
cation and Control Systems, Oct. 1988, pp. 623-633.
[58] J. Savage, "A systolic data structure chip for connectivity problems", Proc. 
CMU Conf. on VLSI Systems and Computation, Oct. 1981, pp. 296-300.
[59] M J. Foster and H.T. Kung, "Recognize regular languages with programmable 
building blocks", VLSI 81, J.P. Gray (ed.), Academic Press, Aug. 1981.
[60] J.A.B. Fortes, K.S. Fu and B.W. Wah, "Systematic approaches to the design of 
algorithmically specified systolic arrays", Proc. ICCD, 1985, pp. 300-303.
[61] C.E. Leiserson, F.M. Rose and J.B. Saxe, "Optimizing synchronous circuitry by 
retiming", Proc. 3rd Caltech Conf. on VLSI, ed. R. Bryant, Computer Science 
Press, 1983, pp. 87-116.
[62] V. Weiser and A. Davis, "A wavefront notational tool for VLSI array design", 
in VLSI Systems and Computations, ed. H.T. Kung et al., Computer Science
Press, Oct. 1981, pp. 226-234.
[63] L. Johnson and D. Cohen, "A mathematical approach to modeling the flow of 
data and control in computational networks", in VLSI Systems and Computa­
tions, ed. H.T. Kung et al., Computer Science Press, Oct. 1981, pp. 213-225.
[64] M. Lam and J. Mostow, "A transformational model of VLSI systolic design", 
Proc. IFIP 6th Int. Symp. Comput. Hardware Description Lang. Appl., Pitts­
burgh, PA, May 1983.
[65] D. Gannon, "Pipelining array computations for MIMD parallelism: a functional 
specification", Proc. 1982 Int. Conf. Parallel Processing, 1982, pp. 284-286.
[66] H.T. Kung and W.T. Lin, "An algebra for VLSI algorithm design", Proc. Conf. 
Elliptic Problem Solvers, Monterey, CA, 1983.
[67] J.M. Jover and T. Kailath, "Design framework for systolic type arrays", Proc. 
ICASSP, 1984, pp. 851-854.
[68] D.S. Schwarts and T.P. Barnwell III, "A graph theoretic technique for the gen­
eration of systolic implementations for shift-invariant flow graphs", Proc. o f 
ICASSP 84, 1984, pp. 8.3.1-8.3.4.
[69] I.V. Ramakrishnan, D.S. Fussell and A. Silberschartz, "Mapping homogeneous 
graphs on linear arrays", IEEE Trans, on Computers, vol. C-35, no. 3, March 
1986, pp. 189-209.
[70] S.Y. Kung, "On supercomputing with systolic/wavefront array processors", 
Proc. IEEE, vol. 72, no. 7, 1984, pp. 867-884.
216
[70] S. Y. Kung et al., "Mapping digital signal processing algorithms onto VLSI 
systolic/wavefront arrays", Proc. 20th Asilomar Conf. on Signal, System & 
Computers, pp. 6 , 1986.
[71] L. Guo-Jie and B. Wah, "The design of optimal systolic arrays", IEEE Trans. 
Computers, vol. 34, no. 1, Jan. 1985, pp. 66-77.
[72] H.D. Cheng, W.C. Lin and K.S. Fu, "Space-time domain expansion approach 
to VLSI and its application to hierarchical scene matching", Proc. 8th Int. 
Conf. Pattern Recognition, Montreal, Canada, Aug. 1984.
[73] R.H. Khun, "Optimization and interconnection complexity for parallel proces­
sors, single stage networks and decision trees", Ph.D. dissertation, Univ. of Illi­
nois, Urbana-Champaign, 1980.
[74] D.I. Moldovan, "On the analysis and synthesis of VLSI algorithms", IEEE 
Trans, on Computers, vol. C-31, no. 11, 1982, pp. 1121-1126.
[75] D.I. Moldovan, "On the design of algorithms for VLSI systolic arrays", Proc. 
o f the IEEE, vol. 71, no. 1, Jan. 1983, pp. 605-612.
[76] D.I. Moldovan, "ADVIS: A software package for the design of systolic arrays", 
84 Int’l Conf. on Computer Design: VLSI in Computers, Oct. 1984, pp. 158- 
164.
[77] J.A. Fortes, "Algorithm transformations for parallel processing and VLSI archi­
tecture design", PhD dissertation, University of Southern California, Los 
Angeles, CA, Dec. 1983.
217
[78] D.I. Moldovan and J.A. Fortes, "Partitioning and mapping algorithms into fixed 
systolic arrays", IEEE Trans, on Computers, C-35, no. 1, Jan 1986, pp. 1-12.
[79] P. Quinton, "Automatic synthesis of systolic arrays from uniform recurrent 
equations", Proc. 11th Ann. Sym. Comp. Architecture, 1984.
[80] R. Dorairaj, "Design of algorithm transformations for VLSI array processing", 
M.S. Thesis, Texas Tech. Univ., 1987.
[81] J. Grinberg, G.R. Nudd and R.D. Etchells, "A cellular VLSI architecture", 
Computer, vol. 17, Jan. 1984, pp. 69-81.
[82] A.L. Fisher and H.T. Kung, "Synchronizing large systolic arrays", Proc. o f the 
10th Annual Int. Symp. on Computer Architecture, 1983, pp. 54-58.
[83] S.Y. Kung, K.S. Asrun, R.J. Gal-Ezar, and D.V. Bhaskar Rao, "Wavefront 
array processors language, architecture and applications", IEEE Trans, on Com­
puters, vol. C31, no. 11, 1983, pp. 1054-1066.
[84] D. Heller, "Partitioning big matrices for small systolic arrays", in VLSI and 
Modern Signal Processing, pp. 185-199, ed. T. Kailath, Prentice Hall, NJ,
1985.
[85] H.W. Nelis, "Automatic design and pardoning of systolic/wavefront arrays for 
VLSI", Circuit, Systems and Signal Processing, vol. 7, no. 2, 1988.
[86] A.V. Aho, J.E. Hopcroft, and J.D. Ullman, The design and analysis o f com­
puter algorithms, Addison-Wesley, MA, 1974.
218
[87] M.N. Swamy and K. Thulasiraman, Graphs, Networks and Algorithms, Wiley, 
New York, 1981.
[88] Y. Muraoka, "Parallelism exposure and exploitation in programs", Ph.D. disser­
tation, Univ. of Illinois, Urbana-Champaign, Feb. 1971.
[89] D.J. Kuck, Y. Moraoka, and S.C. Chen, "On the number of operations simul­
taneously executable in Fortran-like programs and their resulting speedup", 
IEEE Trans, on Computers, vol. C-21, Dec. 1972, pp. 1293-1310.
[90] R. Towle, "Control and data dependence for program transformations", Ph.D. 
dissertation, Univ. of Illinois, Urbana-Champaign, Nov. 1976.
[91] U. Barnergee, "Data dependence in ordinary programs", M.S. thesis, Univ. of 
Illinois, Urbana-Champaign, Nov. 1976.
[92] U. Barnergeeer al., "Time and parallel processor bounds for Fortran-like loops", 
IEEE Trans, on Computers, vol. C-28, Sep. 1979, pp. 660-670.
[93] L. Lamport, "The parallel execution of DO loops", Communication ACM, Feb. 
1974, pp. 83-93.
[94] K.R. Dharmarajan and A. El-Amawy, "A parallel VLSI algorithm for stable 
inversion of dense matrices", accepted Proc. I EE, pt. E.
VITA
Hassan Reda Barada was born in Beirut, Lebanon on October 25, 1962. He 
earned his Bachelor of Science and his Master of Science degrees in Electrical 
Engineering from Louisiana State University, in December 1984 and August 1986, 
respectively. He was a teaching assistant (Jan.’85 to Aug.’87) and later a research 
assistant (Aug.’87 to Aug.’89) in the department of Electrical and Computer 
Engineering at LSU.
He is a member of Phi Kappa Phi and Eta Kappa Nu, and a student member of 
IEEE and the IEEE Computer Society.
219
DOCTORAL EXAMINATION AND DISSERTATION REPORT
Candidate: Hassan Reda Barada
Major Field: E le c t r i c a l  E n g in eerin g
Title of Dissertation: A Com prehensive M ethodology fo r  A lgorith m  C h a r a c te r iz a t io n ,
R e g u la r iz a t io n  and Mapping In to  Optim al VLSI A rrays
Approved:
Major Professor and Chairman
Dean of the Graduate S
EXAMINING COMMITTEE:
Qft
Date of Examination:
J u ly  17, 1989
