Systolic Array Implementations With Reduced Compute Time. by Barbir, Abdulkader Omar
Louisiana State University
LSU Digital Commons
LSU Historical Dissertations and Theses Graduate School
1989
Systolic Array Implementations With Reduced
Compute Time.
Abdulkader Omar Barbir
Louisiana State University and Agricultural & Mechanical College
Follow this and additional works at: https://digitalcommons.lsu.edu/gradschool_disstheses
This Dissertation is brought to you for free and open access by the Graduate School at LSU Digital Commons. It has been accepted for inclusion in
LSU Historical Dissertations and Theses by an authorized administrator of LSU Digital Commons. For more information, please contact
gradetd@lsu.edu.
Recommended Citation
Barbir, Abdulkader Omar, "Systolic Array Implementations With Reduced Compute Time." (1989). LSU Historical Dissertations and
Theses. 4830.
https://digitalcommons.lsu.edu/gradschool_disstheses/4830
INFORMATION TO USERS
The most advanced technology has been used to photograph and 
reproduce this manuscript from the microfilm master. UMI films the 
text directly from the original or copy submitted. Thus, some thesis and 
dissertation copies are in typewriter face, while others may be from any 
type of computer printer.
The quality of this reproduction is dependent upon the quality of the 
copy submitted. Broken or indistinct print, colored or poor quality 
illustrations and photographs, print bleedthrough, substandard margins, 
and improper alignment can adversely affect reproduction.
In the unlikely event that the author did not send UMI a complete 
manuscript and there are missing pages, these will be noted. Also, if 
unauthorized copyright material had to be removed, a note will indicate 
the deletion.
Oversize materials (e.g., maps, drawings, charts) are reproduced by 
sectioning the original, beginning at the upper left-hand corner and 
continuing from left to right in equal sections with small overlaps. Each 
original is also photographed in one exposure and is included in 
reduced form at the back of the book.
Photographs included in the original manuscript have been reproduced 
xerographically in this copy. Higher quality 6" x 9" black and white 
photographic prints are available for any photographs or illustrations 
appearing in this copy for an additional charge. Contact UMI directly 
to order.
University Microfilms International 
A Bell & Howell Information C o m p an y  
3 0 0  North Z e eb  R oad . Ann Arbor, Ml 4 8 1 0 6 -1 3 4 6  USA  
3 1 3 /7 6 1 -4 7 0 0  8 0 0 /5 2 1 -0 6 0 0
O rd er N u m b er 9025290
Systolic array im plem entations w ith  reduced com pute tim e
Barbir, Abdulkader Omar, Ph.D.
The Louisiana State University and Agricultural and Mechanical Col., 1989
U MI
300 N. Zeeb R&
Ann Aibor, MI 48106
SYSTOLIC ARRAY IMPLEMENTATIONS 
WITH REDUCED COMPUTE TIME
A Dissertation
Submitted to the Graduate Faculty of the 
Louisiana State University and 
Agriculture and Mechanical College 
in partial fulfillment of the 
requirements for the degree of 
Doctor of Philosophy
in
The Department of Electrical and Computer Engineering
by
Abdulkader Omar Barbir 
B.S.E.E. Louisiana State University 1982 
M.S.E.E. Louisiana State University 1983 
December 1989
DOCTORAL EXAMINATION AND DISSERTATION REPORT
Candidate: Abdulkader 0 . Barbir
Major Field: E le c t r ic a l  E ngineering
Title of Dissertation: S y s to l ic  Array Im plem entations With Reduced Compute Time
Approved:
Major Professor and Chairman
Dean of the Graduate^-Sc]
EXAMINING COMMITTEE:
\J)vAa>1vA-A
k —
Date of Examination:
ACKNOWLEDGMENTS
I would like to express my sincere appreciation to Dr. J. L. Aravena, my advisor, for 
suggesting the topic of this research. He has worked closely with me. His support and 
encouragement have proven to be an inexhaustible source of ideas. I would like to thank 
him for providing a firm mooring while still allowing me the freedom to explore. His vast 
knowledge and incisive questions have been vital to this piece of work.
I am thankful to Dr. A. El-Amawy for being a good friend and for his generous support 
without which I would have been unable to come to Louisiana State University. A special 
note of appreciation is extended to Dr. S. Kak for his invaluable help and advice. I would 
also like to thank the rest of my Committee members, Dr. B. Jones, Dr. J. Trahan and Dr. 
W. Waggenspack for taking the time to read this dissertation. A special note of thanks is 
extended to Dr. W. Porter for his constructive ideas and to Dr. A. Marshak for his 
encouragement
Finally, I wish to express my sincere appreciation to my parents and my family for their 
love, support, encouragement and above all patience.
TABLE OF CONTENTS
L I S T  O F  T A B L E S  .................................................................  vi
L I S T  O F  F I G U R E S  ...............................................................  vii
A B S T R A C T  ..................................................................................... x
C H A P T E R  1. INTRODUCTION ................................................ 1
1.1. The Systolic Arrays ............................................................  1
1.2. Limitations of Systolic Arrays ...........................................  4
1.3. An Overview of Our Approach .......................................... 7
1.4. Outline of the Dissertation .................................................  8
C H A P T E R  2. SYSTOLIC ALGORITHMS ....................................  10
2.1. Algorithm Description ........................................................  11
2.2. Data dependency and Pipelining .........................................  13
2.3. Examples of Systolic Algorithms ........................................  16
2.3.1. Matrix-Vector Multiplication ...............................  17
2.3.2. Matrix-Matrix Multiplication ..................................  18
2.3.3. LU Decomposition .................................................  19
2.3.4. QR Decomposition Algorithm .................................  21
2.4. Handling of Input Data .......................................................  23
C H A P T E R  3. MATHEMATICAL REPRESENTATION OF ARRAY 
OPERATIONS AS A FUNCTION OF THE COMPUTE CYCLE ..............  26
3.1. Array Hardware Model ......................................................  27
3.1.1. Modeling Array Connectivity .................................  27
3.1.2. Processing Element Model ...................................... 29
iii
3.2. Description of Array Operations .........................................  32
3.2.1. Example to Illustrate Algorithm Generated Tuples   3 4
3.3. Array Data Paths ............................................................... 35
3.4. Modeling Array Operations as a Function of the Compute Cycle .... 36
3.5. Timing Functions ................................................................  38
3.6. Modeling of Data Flow in the Array ...................................  40
3.7. Precedence Timing Diagram .............................................. 43
3.7.1. A Procedure for Constructing the PTD ...................  43
3.8. Timing Graph Representation of Array Operations ................  45
3.9. Examples to Illustrate the Introduced Procedures ...................  49
3.9.1. Matrix-Vector Array ............................................ 49
3.9.2. Matrix Multiplication on Planar Mesh Array .............  57
3.9.3. Matrix Multiplication on Cylindrical Array ..............  65
3.9.4. LU Decomposition Array .................................... 71
C H A P T E R  4. FAST ALGORITHM REALIZATIONS ...................  83
4.1. Timing Graph Compression  ...... 85
4.2. Tools for Modifying the Timing Graph ...............................  87
4.3. Fast Realizations of Matrix Arrays ......................................  91
4.4. Examples to Illustrate Procedures for Determining Fast Realizations 97
4.4.1. Matrix-Vector Array .............................................  97
4.4.2. Matrix-Matrix Multiplication on the Cylindrical array ...  98
4.5. Timing Graph Cut and Paste .....       99
4.5.1. Examples to Illustrate Timing Graph Cut and Paste ..... 105
4.6. Fast Realizations of Non-Linear Arrays ......................... 106
4.6.1. Timing Graph Decoupling ...................................  106
iv
4.7. Examples to Illustrate Non-Linear Array Decomposition ........... 113
4.7.1. LU Decomposition Array ......................................  113
4.7.2. QR Triangularization Array ....................................  118
4.8. Applications: Design of Smart Structures ...........................  122
4.8.1. Band Matrix Multiplications on the Mesh Array ........... 122
4.8.2. Multiplication of Triangular Matrices on the Mesh Array .. 124
4.8.3. Random Sparsity in Matrices .................................  129
C H A P T E R  5. FAST LINEAR ALGORITHM IMPLEMENTATIONS .... 132
5.1. The Class of Algorithms ........................................................ 133
5.2. Fast Realizations of Bilinear Algorithms .................................  136
5.2.1. Example to Illustrate Fast Bilinear Algorithms ........... 137
5.3. Parallel Implementation of Linear Algorithms ...................... 140
5.4. Time Constrained Implementations ........................................  142
5.4.1. The Basic Cases ....................................................  143
5.4.1.1. Mesh Array Partial Transformations ............  144
5.4.1.2. Cylindrical Array Transformations ....    144
5.4.2. Triple Matrix Product ............................................ 145
5.4.3. Time Constrained Forms of the Triple Matrix Product .... 149
5.4.4. Fast Realizations of the Triple Matrix Product Algorithm .. 153
C H A P T E R  6. CONCLUSIONS AND FUTURE WORK .....................  161
R E F E R E N C E S  ....................................................................................  165
V I T A  ..............................................................................................................  178
V
50
51
52
58
59
61
67
68
69
74
74
80
LIST OF TABLES
Input Nodes for Matrix-Vector Array 
Output Nodes for Matrix-Vector Array 
Connectivity Matrix for Matrix-Vector Array 
Mesh Array Input Nodes 
Mesh Array Output Nodes 
Connectivity Matrix for Mesh Array 
Cylindrical Array Input Nodes 
Cylindrical Array Output Nodes 
Cylindrical Array Connectivity Matrix 
LU Array Input Nodes 
LU Array Output Nodes 
Connectivity Matrix for LU array
vi
LIST OF FIGURES
Figure 3.1. PE Model. 30
Figure 3.2. Matrix-Vector Array, n =3. 49
Figure 3.3. AG Matrix-Vector Array. 50
Figure 3.4. PTD Matrix-Vector Array. 55
Figure 3.5. TG Matrix-Vector Array. 56
Figure 3.6. Mesh Array, n = 2 . 58
Figure 3.7. AG for Mesh Array. 59
Figure 3.8. PTD for Mesh Array. 64
Figure 3.9. TG for Mesh Airay. 66
Figure 3.10. Cylindrical Array, n=2.  66
Figure 3.11. AG of Cylindrical Array. 68
Figure 3.12. PTD Cylindrical Array. 72
Figure 3.13. TG Cylindrical Array. 73
Figure 3.14. a) LU Array. 75
b) PE Models. 75
Figure 3.15. AG for LU Array. 76
Figure 3.16. PTD for LU Array. 81
Figure 3.17. TG for LU Array. 82
Figure 4.1. LU Array, case n = 4. 115
Figure 4.2. Partial TG for LU Array, case/* =4. 116
Figure 4.3. Fast LU Array. 117
Figure 4.4. a) QR Array. 120
b) PE mode Definitions. 120
vii
Figure 4.5. Partial TG for QR Array. 121
Figure 4.6. Fast QR Array. 121
Figure 4.7. 3x3 Mesh Array. 126
Figure 4.8. Partial TG for the 3x3 Mesh Array. 126
Figure 4.9. Partial TG for Band Matrix Multiplication. 127
Figure 4.10, Band Matrix Mesh Array. 127
Figure 4.11. Partial TG for Triangular Matrix Multiplication. 128
Figure 5.1. Mesh Array Y = AX Transformations. 146
a) k = 0. 146
b) k = 1. 146
c) k = 2. 146
d) k = 3. 146
Figure 5.2. Cylindrical Array Y = AX Transformations. 147
a) k=  0. 147
b) k = 1. 147
c )k=  2 . 147
Figure 5.3. Cylindrical Array Y = XA Transformations. 147
a) k = 0 . 147
b) k = 1. 147
c) k = 2 . 147
Figure 5.4. Cylindrical Array for TMP. 151
Figure 5.5. Cylindrical Array Transformations for Z=XR. 152
a) k = 0 . 152
b) k = 1. 152
c) k = 2 . 152
Figure 5.6. Cylindrical Array Transformations for Y=LZ. 152
a) k = 3 . 152
b) k = 4 . 152
c) k = 5 . 152
viii
Figure 5.7. Orbital Array for TMP. 154
Figure 5.8. Orbital Array Transformations for Z=XR. 155
a) k = 0 . 155
b) k = 1 . 155
Figure 5.9. Orbital Array Transformations for Y=LZ. 155
a) k -  2 . 155
b) k = 3 . 155
Figure 5.10. Orbital Array Final Transformation with
Alternating PE Modes of Operations. 155
Figure 5.11. Partial TG for TMP Cylindrical Array. 158
Figure 5.12. Fast Forms for TMP Cylindrical Array. 159
a) Constraining matrix L. 159
b) Constraining L and R matrices. 159
ix
ABSTRACT
The goal of the research is the establishment of a formal methodology to develop 
computational structures more suitable for the changing nature of real-time signal 
processing and control applications. A major effort is devoted to the following question: 
Given a systolic array designed to execute a particular algorithm, what other algorithms can 
be executed on the same array? One approach for answering this question is based on a 
general model of array operations using graph-theoretic techniques. As a result, a 
systematic procedure is introduced that models array operations as a function of the 
compute cycle.
As a consequence of the analysis, the dissertation develops the concept of fa st 
algorithm realizations. This concept characterizes specific realizations that can be evaluated 
in a reduced number of cycles. It restricts the operations to remain in the same class but 
with reduced execution time. The concept takes advantage of the data dependencies of the 
algorithm at hand. This feature allows the modification of existing structures by reordering 
the input data. Applications of the principle allows optimum time band and triangular 
matrix product on arrays designed for dense matrices.
A second approach for analyzing the families of algorithms implementable in an array, 
is based on the concept of array time constrained operation. The principle uses the number 
of compute cycle as an additional degree of freedom to expand the class of transformations
generated by a single array. A mathematical approach, based on concepts from multilinear 
algebra, is introduced to model the recursive transformations implemented in linear arrays 
at each compute cycle. The proposed representation is general enough to encompass a 
large class of signal processing and control applications. A complete analytical model of 
the linear maps implementable by the array at each compute cycle is developed.
The proposed methodology results in arrays that are more adaptable to the changing 
nature of operations. Lessons learned from analyzing existing arrays are used to design 
smart arrays for special algorithm realizations. Applications of the methodology include the 
design of flexible time structures and the ability to decompose a full size array into sub- 
arrays implementing smaller size problems.
CHAPTER 1
INTRODUCTION
1.1. The Systolic Arrays
The increasing demand for highly efficient computing structures for real time 
applications has prompted researchers to consider alternatives to the relatively slow general 
purpose computer. Recent advances in VLSI technology have made it feasible to build 
low cost specialized architectures for compute intensive applications. The concept of 
systolic arrays introduced by H. T. Rung and colleagues [l]-[4] has stimulated a great deal 
of interest in the scientific community as an attractive option for obtaining efficient and fast 
structures for a class of compute bound algorithms.
Systolic arrays have been proposed for a wide range of applications [5]-[39]. Examples 
include arrays designed for arithmetic functions such as computing the greatest common 
divisor, arrays designed for matrix computation, such as dense /band matrix-matrix and 
matrix-vector multiplications. One area, in particular that has benefited from the speed 
improvement offered by systolic arrays is digital signal processing and control applications. 
The recurrences describing these applications have repeated structures that make them good 
candidates for VLSI implementation.
1
2Typically, an array consists of a set of locally connected processing elements. Local 
interconnection among the elements leads to cheaper implementation and higher densities 
on wafers. Processing elements in the array are assumed to perform simple operations 
such as add /  multiply and may be of the same type or a mixture of few different types. 
All processing elements in a systolic array operate in a lock-step fashion synchronized by a 
global clock. Clock skew in large two-dimensional arrays might cause a slow down in the 
system speed [40]. However, it has been shown that 1-D linear arrays are not very 
sensitive to the problem and can be synchronized by the same global clock for relatively 
large systems [40].
In a systolic array, information flows between processors in a pipelined fashion. 
Communication with the host computer is performed by boundary elements. Pipelining the 
information flow in a systolic array solves the Von Neumann bottleneck problem [41], 
where the operation rate is limited by the bandwidth of the processor-memory 
communication link. By pipelining the information flow in the array, once a data element is 
retrieved from the host memory, it will pass through the entire array, contributing to all 
the partial results that are dependent on it, before it is returned back to memory. This 
feature of multiple usage of the same data entry gives rise to high computational 
throughput. In a typical application, systolic array would be attached as a peripheral 
device to the host computer that supply the array with input data and receives the processed 
data from i t
Systolic arrays can have different configurations depending upon the problem being 
solved. Systolic structures can be one-dimensional ( linear ) or two-dimensional 
( planar ). Planar arrays can assume different shapes such as mesh, hexagonal, binary
3trees and triangular. Recently Porter and Aravena proposed non-planar structures [11] 
that offer advantages over their 2-D planar counterparts, for the same type of 
applications. The improvement is in the form of uniform input data rate to the array, better 
processing elements utilization and increased throughput.
Due to the popularity of the array concept in the scientific community, several 
methodologies have been proposed for the automatic mapping of algorithms into VLSI 
array structures. These methodologies can be divided into three basic categories. The first 
approach uses the concept of data streams to identify the operations that can be performed 
in parallel at the algorithm level. Proper operators are then used to obtain the systolic 
structure. Methods that fall under this category include the work reported by Cohen et al 
[42], Weiser and Davis [43],
The second approach modifies existing designs to obtain new architectures. The work 
of Leisersons et al [44] is a good representative of this class. The final approach is 
based on transformations performed at the algorithm model level. In addition, systematic 
procedures are introduced that describe the steps necessary for performing the 
transformations. The modified algorithm model is then expressed in a format directly 
mappable into VLSI arrays [45]-[68]. In this area the work of Barada [59] deserves 
special attention and is described in chapter 2 , section 2.1, when the notion of a systolic 
algorithm is introduced.
41.2. Limitations of Systolic Arrays
The available methodologies for array design consider the generic representation of an 
algorithm. This approach leads to structures that are incapable of adapting to the changing 
nature of the application. A natural conclusion is that no single array will be suitable over 
the entire range of the application. Sooner or later a problem will arise that requires 
modification in the course of the solution.
As a solution to the above problem, several techniques have been proposed in the 
literature that search for the possibility of increasing the flexibility of the amay structures. 
The work of Snyder [69] suggested the use of the CHIP architecture which is a 
reconfigurable layout of processors. However, no systematic procedure was discussed 
for automatically switching the array from one mode of operation to another. Furthermore, 
the class of implementable algorithms is limited by the available interconnection network.
Annaratone et al [70] proposed the programmable systolic chip as an alternative to the 
problem of designing several arrays for different problems. This approach is confined to 
problems that can be performed on 1-D arrays, and the class of implementable algorithms 
is restricted by the available interconnection network. In addition, programming the 
various cells slows down the speed of computations.
Chuangand He [71] proposed a versatile systolic array for matrix computations. Their 
work was a generalization of the work proposed by Nash et al [72]. The architecture is 
based on an array system derived by implementing the Faddeeva algorithm. It is composed 
of a triangular array cascaded with a mesh planar array, which was referred to as the
5trapezoidal array. Problems that can be implemented on the array include: LU
decomposition, inner product, matrix-multiplication and convolution. However, the 
array suffers from unnecessary delays in extracting the output data elements. The array is 
sub-optimal and may require several passes before computing the solution.
Several authors [73]-[76] have suggested methods for the automatic reconfiguration of 
the array, however, their emphasis was focused on the design for testability and fault 
tolerance. Other methods for increasing the versatility of arrays structures includes the 
design of size independent and fixed depth arrays. The literature js full of techniques that 
are proposed for partitioning large size algorithms into problems of smaller dimension, 
suitable for the available size of the array. The interested reader is referred to the references 
in [77]-[83] for a comprehensive overview of the topic.
Current thrusts in the design of systolic arrays consider the generic representation of an 
algorithm [84-85]. The resultant structures execute the indicated operation in a fixed 
number of compute cycles. No allowance is made for special entries in the data elements 
that might be omitted and could lead to more efficient implementation [85]. As an example, 
an array designed to perform dense matrix multiplication will take the same time to compute 
the product of triangular matrices even though half of the computations are multiplications 
by zero. In addition, none of the techniques is suitable for applications that may vary in 
time, such as time varying filters.
Real time applications, such as on line signal processing and control, are significant 
examples of compute intensive application. They have received ample attention in the array 
community. Specialized architectures for 1-D and 2-D convolutions [17], fast Fourier
6transform ( FFT), and other transforms [31-34], Kalman filters [35], digital filters and 
controllers [36], have been published in various professional journals. However, the 
severe timing constraints of these high performance applications have received very little 
attention. For these applications, the compute time may be significant and should be 
deducted from the time available to reach a decision. Failure to do so may render the 
system useless. Critical real time applications may present an interesting example of the 
paralysis by analysis syndrome [86]. An excessive use of time in computing a response 
may preclude the possibility of implementing the response.
The efficient use of systolic arrays in real time applications, requires the capability of 
adjusting the arrays to different realizations. Alternatively, it will be desirable to have 
arrays capable of recognizing, and using to advantage special realizations of a particular 
algorithm, that can be executed in a reduced number of cycles. Examples include 
structures that are capable of omitting multiplications by zero or one.
Signal processing and control problems have characteristics that are not shared by 
standard numerical algorithms. In particular, different algorithms may be acceptable to the 
designer. The usual tradeoffs during the design stage are a clear example of the principle. 
For example, a linear filtering problem may be constrained to banded matrices and still 
yield acceptable results ( FIR filters). A specialized architecture for this implementation 
displays clear advantages over a general matrix multiplication array [43]. This additional 
degree of freedom can be used to advantage to improve on the arrays computational speed 
for specific realizations of an algorithm.
71.3. An Overview of Our Approach
The foundation of our approach can be traced to the work done by Aravena [87-88] that 
introduced the concept of array constrained design with applications to digital filters. The 
work is motivated by the observation that the compute time and the computing device 
determine the class of algorithms that can be used in a given application. The research 
views the array operation as a transformator parameterized by the compute cycle number. 
The array generates a collection of transformations on the processed data elements as time 
progresses. The class of algorithms implementable on a single array is determined during 
the course of evolution of the partial results in the array.
The first step for analyzing array operations as a function of the compute cycle is to 
establish a proper model of the phenomena of data flow in an array. In this dissertation, an 
array is viewed as a directed weighted computing graph, with nodes corresponding to the 
array processing cells, and the arcs as the medium of transporting the processed data. 
Based on this model, a general graph theoretic approach is proposed that models the flow 
of data and the interconnection network in the array. The procedure is flexible enough to 
encompass a large class of planar, non-planar, as well as switchable arrays.
To express the evolution of partial results in an array as a function of the compute cycle, 
the precedence timing diagram ( PTD ) is used. The PTD is a modification of the 
familiar precedence diagram. It is a valuable tool that displays the array operations at each 
compute cycle. The PTD can be mapped into what is called a timing graph ( TG ). 
The TG provides a visual aid to the user of the evolution of computations in the array as a 
function of the compute cycle.
8A mathematical approach is then introduced that models the recursive transformations 
implemented in linear arrays at each compute cycle. In addition, a constructive graph 
theoretic approach is discussed that analyzes the array TG, that leads to the determination 
of the collection of algorithms implementable on a single array. Novel features of the 
methodology include the ability of taking advantage of the nature of data dependencies of 
the algorithm at hand. Based on this observation, data reordering is proposed to speed up 
execution of algorithms characterized by random sparsity in their matrices on a given array.
1 .4 . Outline of the Dissertation
The dissertation is organized as follows: chapter 2 describes the basic characteristics in 
the mathematical description of a systolic algorithm. It provides techniques for recognizing 
such an algorithm, and its properties. Chapter 3 develops a formal definition and model 
of a systolic array based on a graph theoretic approach. The array is viewed as a directed 
weighted computing graph, with nodes corresponding to the array processing elements, 
and the arcs as the medium of transporting the processed data. In addition, the chapter 
introduces to the reader the systolic precedence timing diagram ( PTD ), which is a 
valuable tool that expresses the state of an array at a given time. A procedure is developed 
that maps the PTD into a timing graph (TG). The timing graph representation of an array 
operation offers a visual aid to the user of the evolution of partial results in the array as a 
function of time. Proper analysis of the TG allows the determination of all possible fast 
realizations executable on the array in a reduced number of cycles.
9Chapter 4 introduces a new general graph theoretic approach that can be used to 
analyze the timing graph representation of the array. The significance of the methodology 
to the subject of modification of array operations can be summarized as follows:
1- It can be used to determine the collection of all possible fast foims implementable 
on a single array.
2- It can be used to determine the best results of array constrained design.
3- It uses to advantage the nature of algorithm data dependence vectors to allow 
reordering of array operations to achieve fast realization.
4- It allows the design of what are called smart structures ( flexible time 
varying arrays ), as a means of increasing the class of implementable 
algorithms on a single array.
Chapter 5 develops a mathematical model of the evolution of partial results in arrays 
implementing a class of linear algorithms that can be expressed in the form of matrix- 
matrix or matrix-vector operations. The concept of array constrained design is formalized. 
Necessary and sufficient conditions for the existence of fast realizations for bilinear 
algorithms are established. The methodology is also extended to arrays implementing the 
triple matrix product algorithm.
Chapter 6 concludes with a brief summary of the results of the research and some 
suggestions for future work in related topics.
CHAPTER 2
SYSTOLIC ALGORITHMS
Conventional high level language ( HLL) description of an algorithm is, in general, 
not useful for the direct design of systolic architectures. This description must be modified 
to make efficient use of concurrences and to satisfy technological constraints of the 
implementation. An algorithm that has been transformed to a form directly mappable into 
a systolic array will be called a systolic algorithm.
This chapter describes the basic characteristics in the mathematical description of a 
systolic algorithm. The objective is to provide techniques to recognize a systolic algorithm 
and use its basic properties. The chapter is divided into four sections. Section 2.1 
outlines a general high level description of algorithms suitable for VLSI implementation. 
Section 2.2 discusses data dependencies and pipelining. Section 2.3 gives several 
examples of systolic description of some of the most common algorithms. Section 2.4 
discusses handling of input data to an array.
10
11
2.1. Algorithm Description
An algorithm can be viewed as a structured set of computations which operate on a set of 
input variables ( and /  or a set of variables generated as a result of other operations) to 
produce a set of output variables. A large class of algorithms that are suitable for systolic 
array implementation can be represented in a HLL as a set of nested loops characterized by 
a set of assignment statements and possibly some conditional branch statements. The 
following HLL description of an algorithm has been considered in [59]:
for ij = lj to Ui do 
for i2 = l2 to u2 do
* * *
for »rn = L  to um do 
if (condition ( j )  )
Sj1
•  4k
Sjk!
(1)
elseif( condition ( j  ) )
•V
S2kz
elseif ( condition ( j ) )
Sq1
• 4  •
S fo
w here,
12
• m is the depth ( number of nested loops) of (i ).
• j  = ( i; , . . . ,  im ) is an integer m-tuple. The range of all loops specifies its 
domain, Jm.
• /,• and Ui t i =1 , , m are integer valued linear expressions of outer
indices, representing the initial and final values of the indices of each loop. Sjk, 
k = 1 , ,  kj , j  = 1, . . . ,  q, are assignment statement of the form
ak iyj  ( j ) ) =  Fjk [ f l j ( y / ( j ) )  ....» (ym(J ))>••■*
<*k(yn ( j ) )  ......  («fc(3>(j))  »••• »
M M J ) )  b j ( c h ( j ) ) , . . . ,
^ ( c r ( j ) )  bx (cu Q ) ) ] ,
where,
• Jk ( j )  » Ce(j) are integer valued linear expressions of j  e Jm , 
aq(ys ( i ) )  and br (ce ( j  ) )  arc variables described by the indicated 
integer expressions, and
• Fjt is an algebraic ( memoryless) function that involves operations such 
as multiplication, division, and addition.
13
On both sides of the assignment statements there are some input and some output 
variables specified by the algorithm requirements. A variable is defined as input if it 
appears on the right hand side of the assignment statement, and is defined as output if it 
appears on the left hand side of the assignment statement. A variable can be both an input 
variable and an output variable, meaning that it may appear on both sides of the assignment 
statements. To differentiate between the states of the variables, a variable is defined as 
used if it appears on the right hand side of the assignment statement, and is generated if 
it appears on the left hand side of the assignment statement
As an example, variable aq ( ys ( j  ) )  in assignment statement S f  appears on both 
sides of the assignment statement described by two different linear expressions of j  . 
Thus, in the right hand side of the assignment statement it is an input variable used, and 
on the left hand side of the statement it is an output variable generated. On the other hand, 
variable br described by ( ce ( j  ) )  appears on the right hand side of the assignment 
statement only. In this case, this variable is only an input variable ( used ).
A sequential ordered execution of the loops defines a lexicographical ordering of the 
index space, Jm in loop (7 ). This is an induced ordering imposed by the way the loops 
are arranged. This ordering is not unique and may be changed by rearranging the order of 
the loops.
2.2. Data dependency and Pipelining
Between variables generated at different index points in loop (7 ), there are some 
dependencies that describe the algorithm's internal structure and dictate its communication
14
requirements. These dependencies are described in what is known as data dependence 
vectors.
A formal definition of data dependence vectors can be found in [59]. For the sake of 
clarity, a brief description is provided here. For this, let X  and Y be two variables 
whose indices are described by two integer functions f j  and f 2 respectively. 
Furthermore, let the variable X  ( /7 ( j  ) )  be generated in statement S; and variable 
Y {f2 ( j  ) )  be generated in statement Sjk . Let j : , j 2 e Jm , be two index tuples 
that belong to the algorithm domain. In this case, variable Y ( f2 ( j 2) )  is said to 
be dependent on variable X  (/} ( jj ) )  if and only if the following conditions hold:
1- j j  is less than j 2 in the lexicographical sense, and
2- f j  CJj W 2 (j2  )•
The data dependence vector is then defined as the difference between the two tuples so that
d =  h  ■ i i  •
There are several types of data dependence vectors. A data dependence vector is defined 
as fixed if all its components are constant, otherwise, it is defined as variable. A data 
dependence vector is defined as flexible if it allows a computation performed at index 
point j ; to precede a computation performed at index point jj , for all jj , jj e Jm , 
where jj > jj in the lexicographical order of the index space. Otherwise, the vector 
will be defined as inflexible.
15
In some instances, the algorithm data dependence vectors need to be modified to avoid 
broadcasting of data variables. For an algorithm with dimension m ( depth of the loops 
in loop (i ) ), broadcasting typically exists if the dimension of the tuple describing a used 
or generated variable is less than m. Variable pipelining is a technique that is used to 
eliminate all possible data broadcasts which may exist in the original algorithm [59]. This 
step is very essential for systolic arrays, since locality of communication among the PEs 
is always required [1-4]. To pipeline the variable, artificial statements are introduced to 
complete the dimension of the variable describing tuple. According to [59], an algorithm 
can be pipelined by implementing the following steps:
1- Find the dimension m of the algorithm.
2- Find the dimension, n , of the tuple describing the variable that need to 
be pipelined.
3- Equate each of the tuple coordinates to a constant to obtain a system of equations.
4- If the system of equations describes a straight line in the m-D space, the new 
tuple describing the variable will be defined by that line.
5- If the system of equations does not define a line, break it into sub-systems such 
that each describes a line. The tuple is replaced by the set of new tuples 
described by the set of equations defining the lines.
6- Repeat steps 2 - 5  for all variables with n<m.
16
For some algorithms, the components of the tuples describing their variables are a 
function of the algorithm size. These algorithms will result in variable data dependence 
vectors. In [59] it was shown that a data dependence with / variables can be replaced 
by a set of 21 +1 fixed dependence vectors. This leads to the regularization of data flow in 
the algorithm.
An algorithm described in HLL will be suitable for systolic array implementation if its 
variables are pipelined and has all of its data dependencies transformed into a set of fixed 
data dependence vectors. In this case, the algorithm is referred to as a systolic algorithm. 
The previous observations are summarized below.
Lemma 2.1: An algorithm written in HLL is systolic if
i) The dimension of the tuples describing each generated or used variable in 
the assignment statements is equal to the algorithm dimension.
ii) The data dependence vectors are fixed.
2.3. Examples of Systolic Algorithms
This section presents several common algorithms and their corresponding systolic 
representation.
17
2.3.1. Matrix-Vector Multiplication
Consider the operation y = Ax , x , y are nxl vectors, and A is an nxn 
matrix. The i th component of y can be found by using the recurrence
y£  = y f l + ai k xk t k =1 ,  ... , n.
In this recurrence, the used variable jc* has the index i missing. To pipeline the 
variable, an artificial new statement xtf = x f 1 is introduced into the recurrence to 
eliminate the need of broadcasting x*. The new form of the pipelined recurrence will be
for i =1 to n do 
for k - 1  to n do 
begin
SI: x  (i,k) = x ( i- l ,k )
S2: y  (ijc) = y  (ijc-l) + a (itk) x (i,k) 
end.
In the HLL program, there are two distinct pairs of < generated, used > variables. 
One involving x  , and the other involving y. Let dx be the dependence vector 
corresponding to the pair < x (  i , k) , x  ( i-1, k ) > and dy be the dependence vector 
corresponding to the pair <y ( i , k )  , y ( i , k-1) >. The data dependence vectors can be 
found by subtracting the coordinates of the < generated, used> pair describing tuples, 
in the same order as they appear in the HLL nested loops. Therefore, the two data 
dependence vectors are given by
dx = ( i - ( i - l ) , k - k ) = { 1 , 0 )  ,
18
dy = ( i - i  , k - ( k - 1 ) )  = ( 0 , 1 ) .
The components of dx and dy are constant. The order of execution of the index 
points in HLL can be changed without affecting the overall final values. Hence, the 
resulting data dependence vectors of the algorithm are fixed and flexible. The resulting 
statements defines a systolic algorithm for matrix-vector multiplication.
2 .3 .2 . Matrix-Matrix Multiplication
Consider the problem of multiplying two dense matrices A = [ ay ] ,  X = [ xy ], to 
get a matrix Y = [ yy ] , where A , X and Y are n x  n matrices. A typical 
element of Y , yy can be found by using the following recurrence
y ;f = y if'1 + <*ik *kj •
In this recurrence, the two variables a# and x^  have missing indices. The index j  is 
missing in and the index t is missing from j<jk. To pipeline the algorithm, two new 
statements have to be introduced to compensate for the missing indices in the recurrence. 
The resulting pipelined version of the algorithm can be expressed in HLL as
for i -  1 to n do 
for j  = 1 to n do 
for k -  1 to n do begin 
SI: a (ij,k) -  a(ij-l,k)
52: x (ij,k) = x  (i-lj,k)
S3: y  (ij,k ) -  y  (ij,k-l) + a (ifik) x(ij,k) 
end.
19
In these assignment statements, there are three distinct pairs of < generated, used > 
variables. Mainly, <a  ( i , j , k  ) , a ( i , j - l  , k )  > , <x  ( i , j , k ) , x  ( i - l , j , k ) > 
and <y  ( i  J  , k ) , y  ( i  , j  , k-1 ) >. The data dependence vectors are found by 
subtracting the coordinates of the < generated , used > pair describing tuples, in the 
same order as they appear in the loops. The data dependence vectors are given below
da = ( i - i j - ( j - l ) , k - k )  = ( 0 , 1 , 0  ) ,
dx = ( i - ( i - l ) J - j , k - k )  = ( 1 , 0 , 0  ) ,
dy = ( i - i , j - j t k - ( k - l ) ) = ( 0 , 0 , 1 )  .
The components of da , dx and dy are constant. The order of the execution of the 
index points can be changed without affecting overall values. Hence, the resulting data 
dependence vectors of the algorithm are fixed  and flexible. The statements define a 
systolic algorithm for the matrix-matrix multiplication.
2.3.3. LU Decomposition
Consider the problem of factoring an nxn matrix, as the product of a lower triangular 
matrix L = [ ly ] and an upper triangular matrix U = [ uiy ] .  The entries of the two
matrices Uj and «,y can be found using the following recurrence based on Dolittle's 
method
a /  = a f 1 - I* ukj ,
20
he =  1
<*ikkukkJ
0 i < k 
i = it 
i > k
i > k 
i < k ,
The pipelined version of the LU decomposition algorithm is given by
for k -  1 to n-1 do 
for i = k to ft do 
for j  = Jt to n do 
begin 
i f ( i =j  = k)  then 
SI: u(k,ij) = 1 !a(k,ij) 
elseif ( i  = k )  then 
S2: u(k,ij) = a(k,ij) 
elseif( j  = k )  then 
S3: I ( k,ij) = a( k,ij) * u(k,ij) 
else 
begin
S4: u(k,ij) = u(k,i-lj)
S5: l(k,ij) = l(k,ij-l)
S6: a(k,ij) = a(k-l,ij) - l(k,ij-l) u(k,i-lj) 
end
end
There are three < generated, used > pairs, for the variables a , I and u . The data 
dependence vectors of the representation are fixed and given by
21
di = ( i - i , j - j , k - ( k - l ) )  = ( 0 , 0 , 1 )  ,
du = d - i , j - ( j - l ) , k - k )  = ( 0 , 1 , 0 )  ,
da = ( i - ( i - l )  J - j , k - k )  = ( 1 , 0 , 0 )  .
The resulting data dependence vectors are fixed and inflexible.
2.3.4. QR Decomposition Algorithm
Given an nxn matrix A , this algorithm finds an nxn orthogonal matrix Q , such
that QA = R , an upper-triangular matrix. The matrix Q is formed as the product of
plane rotations. Each plane rotation Q,y, with i < j ,  is an orthogonal matrix with 
all its elements identical to those of the unit matrix except that q& = q^= c and 
<7ij = - qyt = s , where
d = a-2 + ay? • 
c = an f  Vd , and
s = aji H d  .
Qy will eliminate ay(- below the diagonal. It will only affect rows i , and j . The 
modified rows will be given by
aik -  c aik + s ajk 
ajk ~ c aik “ s ajk
22
The pipelined HLL description of the algorithm is given by
for i = 1 to n-1 do
for j  = i+1 to n do
for k = i to n do 
begin 
i f ( i - k )  then
c (ij,k) = a  Cij-l,k) I [ { a  (ij-l,k) ) 2 + ( a  (ij,k) )2 ] «
5 a i k )  =  a  / [ ( a  (ij-l,k) )2 +  ( a  ( /</ ,*)  ) 2 )0J
a OV,*) = [ («  0 > U )  )2 + ( a 0V,fc) )2 F  
a (i+ ljjc) = 0 
end 
else 
begin 
c (ij,k) = c ( )  
s ( i j ,k ) = s ( i j ,k - l )
a OV,*) = c (ij,k-l) a ( i ) + * (i ) a (ij,*)
a (i+ lj,k ) = -s (ij,k-l) a (ij- l,k ) +c (ij.ft-7) a (ij,*) 
end.
The QR decomposition algorithm results in five < generated, used > pairs. The two 
variables c and s produce the same data dependence vector. The variable a results in 
three distinct pairs. The data dependence vectors of the QR systolic algorithm are
dc = ds = ( i - i , j - j , k - ( k - l ) )  = ( 0 , 0 , 7 )  ,
da = ( i - i  , j - U - l ) ,  k - k )  = (0 , 2 , 0 )  ,
da = ( / - (  i - 2  ) ,  j - j , k - k  ) = ( i , 0 , 0 )  ,
23
da = ( i ‘ ( i - l ) J - ( j - l ) , k - k )  = ( 1 , 1 , 0 )  .
All the dependence vectors of the QR factorization algorithm are fixed and 
inflexible.
2 .4 . Handling of Input Data
At each index point j in loop (1), a subset from the set of all variables will be 
involved. The elements can be partitioned into two different groups. The first group 
corresponds to the used (input) data elements, namely those elements appearing on the 
right hand side of the assignment statements. The second group corresponds to the 
generated output data elements, basically, those elements that appear on the left hand side 
of the involved assignment statements.
In some cases, a variable will only appear on the right hand side of the assignment 
statement, hence it is always used. A variable that is always used and is never generated 
will be called a free variable. A dummy data dependence vector can be introduced to 
describe the contribution of a free variable at each point j  e Jm. This will be useful later 
in the modeling of data flow in an array.
To introduce a dummy data dependence vector to the HLL representation of the 
algorithm, simply re-index the nested loops in (1). The steps necessary to obtain a 
dummy data dependence vector are:
1- Find the number of free variables in the algorithm, let this number be u.
24
2- Increase the depth of the nested loops in the algorithm to m' =m + u.
3- For each variable with a tuple dimension n< m' do
i) Equate each of the tuple coordinates to a constant to obtain a system 
of equations.
ii) If the system of equations describes a straight line in the m' -D space, 
the new tuple describing the variable will be defined by that line.
iii) If the system of equations does not define a line, break it into sub-systems 
such that each describes a line. The tuple is replaced by the set of new 
tuples described by the set of equations defining the lines.
To illustrate the principle, consider again the matrix-vector algorithm. The variable aik 
is a free variable. A dummy dependence vector is introduced that results in this HLL form 
of the algorithm:
for i = 1 to n do 
for k ~  1 to n do 
for j=  k to k  do
SO: a ( i ,k j)  = a (  i ,k j - l ) 
SI: x  ( i ,k j ") —x  ( i - l ,k j ) 
S2: y  ( i ,k j ) = y ( i ,k - l j ) + a ( i , k j ) x  ( i,k j ) 
end.
The resulting data dependence vectors are given by:
25
da = k - k j - u - i  ) )  = ( o , o , n  , 
dy = ( / -  i , k -  ( k - l ) J - j  ) = ( 0 , 1 , 0 )  , 
dx = ( i  - ( i - / )  , k - k , j - j  ) = ( 1 , 0 , 0 )  .
The addition of dummy data dependence vectors to a systolic algorithm ensures the 
pipelining of all the variables in the HLL language representation of the algorithm. As a 
result, data elements of each variable will be pipelined along lines specified by the tuple(s) 
describing each variable. Arrays constructed by this methodology will be characterized by 
a uniform data flow.
CHAPTER 3
MATHEMATICAL REPRESENTATION OF 
ARRAY OPERATIONS AS A FUNCTION OF 
THE COMPUTE CYCLE
Chapter 2 gives a high level language ( software description ) of a systolic algorithm. 
This chapter assumes that the algorithm is implemented in a given systolic array. The most 
significant result of this chapter is the modeling of the computations in an array as functions 
of the compute cycle. The technique is intended for analysis up to the normal stop; i.e ., 
for compute cycle k < T , where T is the total number of cycles needed by the array 
to produce the results of the computations.
The chapter is organized as follows: Section 3.1 introduces a hardware model of an 
array. Section 3.2 discusses the notion of space-time transformation implemented by an 
array on the input data elements. Section 3.3 derives the concept of data paths in the 
array. Sections 3.4 and 3.5 formalize the modeling of array operations as a computing 
graph. Section 3.6 introduces a procedure that models the flow of data in an array, while 
in sections 3.7 and 3.8 the precedence timing diagram and the timing graph are discussed. 
Section 3.9 models the operations of several examples of switchable as well as non- 
switchable arrays.
26
27
3.1. Array Hardware Model
A systolic array consists of a set of processing elements ( PE) operating in a lock-step 
fashion, and an interconnection network that binds the various PEs together. This 
section formalizes a model of the array connectivity network based on a graph theoretic 
approach. This model is independent of the type of the algorithm being evaluated or the 
functions performed by the PEs.
3.1.1. Modeling Array Connectivity
In order to view a general model of the array connectivity network, it is convenient to 
depict the array as a directed graph, whose nodes correspond to the array processing 
elements, and the arcs correspond to the network links. With this representation, 
techniques from graph theory can be used to analyze the properties of the array.
Definition 3.1: A systolic array hardware model is defined as a directed graph. 
The nodes of the graph represent the array processing elements, and the arcs correspond 
to the links between the processing cells.
Let N  be the total number of processing elements in the array. To construct the 
graph representation of the array, the following steps are needed:
28
Procedure Graph
1- Enumerate the PEs in the array with unique integer numbers that are greater 
than zero.
2- In the graph, connect node i to node j  if and only if their corresponding 
PEs are connected by a link.
Once a graphic description of an array is determined, its interconnection network can be 
described by the directed graph’s adjacency matrix. In graph theory, the adjacency 
matrix of an N  - node graph, is defined as long as the graph contains no parallel 
edges [89].
Definition 3.2: The connectivity network P of a systolic array is defined as the 
adjacency matrix of the resulting graph representation of the array. P = [ p;j ] is an 
NxN matrix whose element
Pij = 1 , if there is a directed arc from node i to node j ,
= 0 , otherwise.
For non-switchable arrays, the interconnection matrix P is fixed. However, for 
switchable arrays it will be a function of the compute cycle.
29
3.1.2. Processing Element Model
A typical processing element ( PE ) in an array consists of the following: a set of 
input links, a set of output links, and a set of memoiy registers. In addition, each PE is 
characterized by a set of memoryless functions that define its operation. The set of 
algebraic functions associated to each processing element defines the arithmetic and logic 
expressions that the element is capable of performing.
A processing element in the anay acts on data that is either stored in its memory registers 
or received on its input links from neighboring PEs. The results of the operations are 
delivered by the PE after one unit time delay. The results may be either stored inside the 
PE or transmitted on a set of output links to neighboring PEs. The element is assumed to 
have a set of temporary buffer registers to hold partial computations.
For the analysis, it is convenient to assume that all the data available to the processing 
element on the input links are stored in internal memory registers. Furthermore, it is 
useful to think of all the PE memory registers as a stack of registers. At the start of the 
compute cycle, all data elements are stored in proper locations on the stack. The PE 
performs an algebraic function that may involve all or a subset of the registers. Data that 
must be transmitted by the PE is then selected from appropriate locations on the stack and 
sent on the output links.
A typical processing element can be represented as in Figure 3.1.
30
" T O
FjTl | g
lIl
PTI
ITTv,T v j  v j
Figure 3.1. PE Model
Referring to Figure 3.1, g is the identification number of the processing element, n 
is the number of input links, q is the number of storage registers, and m is the 
number of output links. For convenience, data elements on the links are represented as 
vectors; thus r (g  ,k )  is the q -vector of data available in the PE memory registers at 
the start of the compute cycle, u ( g , k ) is the n -vector of data available on the PE 
input links at the start of the compute cycle and v (g  ,k )  is the m - vector of data that 
must be transmitted on the output links.
The above notations allows the establishment of a hardware model of a typical 
processing element irrespective of the algorithm being evaluated. Hence, the processing 
element hardware model is given by
31
v ( g , k )  = Fg (n  ( g , k ) , r  ( g , k ) ,  k  ) , 
where, Fg is a memoryless function and k is the compute cycle number.
However, if the array is executing a particular algorithm then the processing element 
will map a unique used element belonging to one variable that is available on an input link, 
to a unique generated element belonging to same variable and transmits it on a particular 
output link, at a certain compute cycle. This operation will be referred to as the PE 
mapping function.
Definition 33:  The PE mapping function is defined by the map yr: Ii A given
by
yr(z  ( k ) )=  a{ j k ) ,
where, z represents data elements belonging to algorithm A that are available on the PE 
input links and memory registers at a unique compute cycle k . The PE will transmit a 
( j k ) on an output link to a neighboring PE .
32
3.2. Description of Array Operations
The HLL description of a systolic algorithm introduced in chapter 2, is static, 
meaning that it does not provide any information about the execution time of the 
computations in the loops. An ordered execution of a systolic algorithm generates the 
following 5-tuple
A ( J m , J I , J O , F I , D )
where,
• J m is the algorithm domain.
• JI is the set of the algorithm used data elements.
• JO is the set of the algorithm generated data elements.
• FI = { f j  , f 2 , ... , f r }; is a set of memoiyless functions needed
to map the algorithm input data to the algorithm output data; jfj is a 
memoryless function such as multiply, add,.. etc; r = total number of 
computation functions needed to map the input data to the output data.
• D = { dj , d2 , ... , dy } is the set of data dependence vectors of the 
systolic algorithm.
33
From the description of the algorithm model, the sets defining the tuple can be generated; 
however, the converse is not necessarily true.
An ordered execution of the HLL description of a systolic algorithm produces a set of 
used elements and a set of generated elements at each index point of the algorithm space. 
A systolic array implementing the algorithm must produce the same sets, with the 
exception, that the sets are generated at different locations in space and are distributed in 
time. The array imposes an execution ordering on the algorithm variables characterized by 
a time parameter. This ordering is a function preserving transformation, meaning that 
the following must hold:
i) If a ( j )  e JI , j  e J m is an algorithm used element then, it must
be an input element to an appropriate PH in the array at a unique compute
cycle k .
ii) If a O') e JO , j  e 3m is an algorithm generated element then, it must be
an output element from an appropriate PH in the array at a unique compute 
cycle k .
iii) To any operation in the original algorithm such as addition, multiplication, 
etc., an identical operation must be performed by the array processing elements.
iv) Data dependencies of the systolic algorithm are preserved in the array.
34
3.2.1. Example to Illustrate Algorithm Generated Tuples
Consider the HLL program covered in chapter 2 , for the matrix-vector multiplication 
algorithm, for the case when m = n=2. The algorithm index space J 2 is given by
J 2 = { ( 1 , 1 , 1 ) , ( 1 , 2 , 2 ) , ( 2 , 1 , 1 ) , { 2 , 2 , 2 ) ) .
At index point ( 1 , 1 , 1 ) ,  data element x ( 1,1,1 ) is generated in assignment 
statement S I , and is used in statement 52. On the other hand, y ( 1,1,1) is generated 
in statement 52 , a ( 1,1,1 ) is generated in SO and used in 52. At index point 
( 1 , 2 , 2 ) ,  data element x  (1,2,2 ) is generated in statement SI , and used in 52 . 
Output data element y  ( 1 ? 2  ) is generated in statement 52 , While, data element 
a ( 1,2,2 ) is used in statement 52. Similarly, at index point ( 2 , 1 , 1 ) ,  x  ( 2,1,1), 
is generated in S I ,  y (2, 1, 1)  is generated in 5 2 , and a (2, 1, 1)  is used in 
statement 52. At index point ( 2 , 2 , 2 )  , x ( 2 , 2? )  is generated in 51, 
y (2 ? ? )  is generated in 52 , and a ( 2, 22)  is used in 52.
From the example, the set of input data elements to the algorithm is given by
JI = { ( a  (1,1,0 ), x (0,1,1) ,  a (1,1,1), x (1,1,1), y (1,0,1) ) ,
( a (1,2,0), x (0 2 2  ), a (1 2 2  ) *  (1 2 2  ), y  (1,1,2)) ,
( a (1,1,0), x (1,1,1 ) ,  a (2,1,1 ), x  (2,1,1 ), y  (2 ,0 ,1) ) ,
( a (1,2,1), x (1,2,2 ), a (22,2 ), x (222  ), y  (2,1? ) )  I
Similarly, the set of output data elements of the algorithm is given by
35
JO = ( a (1,1,1 ) ,x  (1,1,1) ,y  (1,1,1) , a  (1,2,2), 
x (1X 2  ), y (1,22 ), a {2,1,1) , j: (2,1,1),
y  (2,1,1) ,  a (2,2,2 ), jc (2,2,2 ), y (2,2,2 ) }.
At each index point in assignment statement S2, there is a multiplication and an addition 
involved. Therefore, the set of memoryless functions is composed of
FI = { ( * .  + ) ) .  
where, ' * ' indicates multiplication and ' + ' the addition operation.
3,3. Array Data Paths
Let C = [ 1 , 2 ,  ..., r } be a linear chain of connected PEs in the array. In this 
chain PE ( i ) is connected to PE (i+I ) by a link. The boundary PEs are connected to 
external sources; i.e ., PE (1 ) receives data from an input source and PE ( r ) transmits 
data elements to a receiving source. A linear chain where every data element belong to the 
same variable defines a data path for that variable.
The collection of all linear chains in the array defines the data paths for the variables. In 
the array, data elements belonging to the same algorithm variable, will follow non­
intersecting paths. To distinguish between the paths of elements belonging to different 
variables in the array, assign a weight w , which is a unique integer number greater than 
zero, to the path(s) of the elements of each variable in the array. To assign weights to the 
data paths, the following must be performed:
36
Procedure Data Path
1- Identify the collection of linear chains in the array that transports the data elements 
of variable a ( . ) .
2- In the chain, identify the input and the output links in each PE that canies the 
elements of the same variable.
3- Assign the same weight w for those links.
3.4. Modeling Array Operations as a Function of the Compute Cycle
This section formalizes a general model of the array operations as a function of the 
compute cycle, irrespective of the algorithm being evaluated. In order to view a general 
model of the array operations as a function of the compute cycle, it is convenient to think 
of the array as a computing graph [89], with nodes that perform computations on data 
elements that are transported on the arcs. The arcs of the graph correspond to the array 
connectivity network.
A computing graph is one that is capable of transporting and processing data elements. 
The nodes ( vertices ) in the graph are computational nodes that perform algebraic 
functions on the data present on the input arcs. The arcs arc the means by which data is 
transmined between the nodes. The graph starts with a certain state at period k , which is 
characterized by data associated with all the nodes. In the next period, processed data 
elements on node i are conveyed to node j  on arc ( i j ) . The graph enters a new state
37
typified by the new data elements in all the vertices. The process may be repeated again. A 
directed computing graph is a computing graph with arcs that assume a specific direction 
between the nodes at a particular period. A weighted directed computing graph is a 
directed computing graph with weights assigned to the arcs. This leads to the following 
definition of a systolic array operation.
Definition 3.4: A systolic array operation as a function of the compute cycle, is 
defined as a directed weighted computing graph ( DWCG). The nodes of the DWCG 
correspond to the processing elements and the weighted directed arcs correspond to the 
array network links.
The weight of an arc in the DWCG , is a unique integer greater than zero, that is equal 
to the weight assigned to the data path(s) of the elements of an algorithm variable, in the 
array. To construct the DWCG representation of an array, the following steps must 
performed:
Procedure DWCG
1- Enumerate the PEs in the array with unique integer numbers that are greater than 
zero.
2- Draw the graph of the array with each node representing a processing element. In 
this graph connect node i to node j  if and only if their corresponding PEs are 
connected by a link. The direction of the arc must be consistent with the direction 
of data flow from PE i to PE j .
38
3- Assign weights to the arcs in accordance with the weights assigned to the link's 
data paths.
During the array operations, its connectivity is defined by the DWCG weighted 
adjacency matrix. The definition of the connectivity matrix is given below.
Definition 3.5: The connectivity matrix P of a systolic array during its operation, 
is defined as the weighted adjacency matrix (WAM) of its resulting DWCG, where, 
P is an NxN matrix whose element ptj  is given by
p,j = u , u >0, is the weight of the arc from node i to node j  
= 0 , otherwise .
For non-switchable arrays, the interconnection matrix P is fixed; However, for 
switchable arrays it will be a function of the compute cycle. In this case, several matrices 
will be needed to describe the array connectivity at different compute cycles.
3.5. Timing Functions
In the DWCG , a node is a conceptual representation of a systolic processing element. 
A node can have a set of internal memory registers, but in no event can it be a generator. 
This means that the node must have inputs in order to generate outputs. The degree of the 
node is defined as the number of weighted arcs incident at the node. Boundary nodes in 
the graph are those nodes that receive data from external sources, or transmit data to the 
outside world. The completion of the degree of the boundary nodes in the graph, requires
39
the assumption that external data elements are supplied to those nodes by user induced 
input nodes. Similarly, boundary output nodes are assumed to feed data to user induced 
receiving nodes. Hence, the DWCG representation of an array can be viewed as a sub­
graph of a more general graph that includes the user induced nodes. The complete graph 
will be referred to as the array graph ( AG).
In the AG description, it is convenient to think that the user induced input nodes supply 
their data elements to the array, in the form of a timing function. Let 71 kw (n) be the 
timing function of input boundary node n , that supplies data elements corresponding to a 
data path of weight w . The timing function is defined by the following map:
Definition 3.6: Let a ( . )  be a variable of algorithm A , belonging to a data path 
of weight w . Furthermore, let T be the total number of compute cycles for the 
array to output the results of the computations. The map 7Zkw given by
7tkw : k -» A , such that
7r*w ( n )  = a ( / j f c ) , a ( . ) e A ,  * ,
defines the timing function of the user induced input node n .
To distinguish between the user induced nodes and the rest of the nodes, in the AG 
description of the array, separate input ( output) weighted tables must be constructed to 
describe the connectivity of those nodes to the rest of the nodes in the DWCG.
40
3.6. Modeling of Data Flow in the Array
Introduced in this section is a systematic procedure that can be programmed to model the 
flow of data in a systolic array. The procedure models the flow of data elements, with no 
regard to the flow of control signals. The control signals are assumed to be generated by a 
proper unit. In order to incorporate the change of operation modes in the array PEs, one 
must impose limits on the compute cycle, and the tuples describing the input data elements. 
The steps necessary for modeling the flow of data in an array are summarized below:
Procedure Data Flow
1- Let N  be the total number of PEs in the array. Enumerate the PEs with unique 
integer numbers that are greater than zero. This numbering process may be 
arbitrary, but once a configuration has been decided, it is assumed to be fixed 
thereafter.
2- Construct the DWCG representation of the array.
3- Complete the graph representation of the array by constructing the AG.
4- Derive the weighted lookup tables for the user induced input and output nodes.
5- Derive the timing functions for each user induced input node.
6- Derive the node (PE)  mapping functions, for all nodes in the DWCG
41
representation of the array. These are the memoryless functions performed by the 
node at each compute cycle, that relate the inputs to the outputs from the node.
7- Derive the connectivity matrix P of the DWCG representation of the array.
8- To model the flow of data in the array, execute the following:
For k = 0 to T  do in parallel 
begin
a) For n = 1 to N  do in parallel
begin
i) From the weighted lookup I /O  tables, check if node n 
has arc(s) of weight w receiving input from a use induced 
node. If yes , then evaluate the proper timing function for that 
node.
ii) Use column i of P of node n as a lookup table to identify the 
node originating the input arc to node n incident weighted 
arc.
iii) Evaluate the originating nodes mapping functions to identify the 
data elements that node n receives at compute cycle £.
iv) Evaluate node n mapping functions.
42
end
end.
Remark: The procedure introduced is flexible enough to model the flow of data in 
dynamically switchable arrays. In these arrays, the following may be a function of the 
compute cycle: i) the connectivity matrix, ii) the node mapping functions, and, iii) the 
input timing functions. Procedure data flow  can be used to model switchable array 
operations with some minor modifications. The changes required are summarized below:
1- Define a lookup function M(k) that identifies the node that changes its mode of 
operations at compute cycle k. M(k) must indicate the mode of operation of 
node n at a particular compute cycle.
2- Derive the different connectivity matrices P (k) for the total allowable 
number of array operations.
3- Derive the input timing functions for the user induced input nodes, in 
this case n j 1 ( n ) could be a function of the compute cycle.
4- In procedure data flow, before evaluating any of the steps (8  i) to 5 iv)) ,  
use the proper tables for that particular compute cycle.
43
3.7. Precedence Timing Diagram
The operations of the PEs in the array can be described by the precedence timing 
diagram (PTD). The PTD is an adaptation of the familiar precedence diagram [90]. The 
PTD is a timing diagram that consists of a total of T time levels, where T is the 
allowable number of compute cycles. Each time level in the PTD consists of a number of 
nodes. A node in the PTD represents the state of its corresponding PE at that compute 
cycle. The state of a PE includes all the data elements stored in its stack of registers. Only 
active nodes will be represented at each level in the PTD, i.e. those nodes in the DWCG 
that change state for that specific cycle.
In the PTD each node is assigned a number that is equal to the number of the node it 
relates to in the DWCG. Furthermore, since each node in the PTD represents the state of 
a node in the DWCG, all the information needed to define the state of that node must be 
included in the representation. The PTD must also provide information about the flow of 
data elements in the DWCG. To incorporate that, the arcs of the nodes are assigned 
weights. The weight of an arc incident to a node is equal to the node number it originated 
from. Similarly, the weight of an arc originating from a node is equal to the number of 
the destination node. The weights are found by using the weighted connectivity matrix of 
the AG as a lookup table.
3.7.1, A Procedure for Constructing the PTD
Given the array DWCG, which includes its timing functions, the nodes mapping 
functions, the connectivity matrix P , and the allowable number of compute cycles T ,
44
the PTD of the array can be constructed. The following are general rules that a PTD 
should adhere to
i) Each node in the PTD represents the state of a DWCG node for a particular 
compute cycle.
ii) Each node in the PTD is assigned an identification number that is equal to the 
node number to which in the DWCG it relates. Each node must consist of input 
and output arcs in accordance with the prospective DWCG node. All data 
available to the nodes are stored in internal registers.
iii) Input and output arcs to each node are assigned weights that are function of the 
compute cycle. The weight of the node's incident arc at cycle k , is equal to the 
number of the node from which it originated. Similarly, the weight of the node's 
output arc at cycle k , is equal to the number of the node on which it is incident
Using the concepts developed in procedure data flow  , the necessary steps for 
constructing the PTD are listed below:
Procedure PTD
1) Use the I/O lookup tables from procedure data flow to identify the boundary 
nodes in the DWCG representation of the array.
2) Use procedure dataflow to identify the timing functions of the input nodes.
45
3) For k = 0 to T , do in parallel
begin
i) Execute step 8 a) of procedure dataflow for each node in the DWCG 
representation of the array.
ii) Place a DWCG node in the PTD at time level k  if and only if it changes 
state during that compute cycle.
iii) Assign the arcs of each node the proper weights.
iv) For every node added to the PTD, each internal register must indicate the 
data element it holds.
end.
4) The construction of the PTD stops at time level T. This indicates that all nodes 
in the DWCG becomes inactive.
3.8. Timing Graph Representation of Array Operations
The previous section introduced the concept of precedence timing diagram . In this 
section, a method for mapping the PTD into an acyclic timing graph (T G ) is derived. 
The approach represents the evolution of the array computations as a connected graph. The 
representation will prove to be beneficial for the analysis of the array operations that is 
pursued in the next chapter.
46
The timing graph represents the states of the DWCG of an array as a function of the 
compute cycle. The graph displays the evolution of partial results in the array as time 
progresses. The TG will consist of a total of T time levels. Each time level depicts the 
change in the state of a DWCG node, and the direction of data movement from (in to) 
the node for that compute cycle. Given the PTD of an array, the TG can be easily 
constructed.
The mapping of the PTD to the TG should retain the structure of the array operations, 
its connectivity and the integrity of the flow of data elements between the PEs. In the TG , 
a time axis must be defined that connects the state of each node in the PTD at different time 
levels. This axis will be referred to as the node direction vector and for node n is denoted 
by dn . Furthermore, all data elements a ( . )  that belong to the same algorithm 
variable, follows lines that are parallel to a direction vector denoted by Va .Let q be the 
total number of data paths in the DWCG, then there must be q direction vectors in the 
TG. The‘direction vectors can be assigned by the user at his own convenience. The 
mapping of the PTD into TG consists of the following steps:
1) Construct the set W = {Wt- }, i = 1 , . . . ,  T  , here, Wt represents the
the nodes in the PTD at time level i .
2) From the I /O  lookup tables in procedure data flow, identify the input/output
nodes in each set W ;.
3) Define the direction vector Va for each of the direction functions describing the
47
flow of data elements between the PTD nodes. All elements belonging to the 
same data path, move on lines that are parallel to that direction vector. 
Furthermore, let V = { Vi ,. . . ,  Vq } be the set of all direction 
vectors.
4) For each node j  e N ( number of nodes in the DWCG), define a direction 
vector dj. Here, dj will connect node j  at time levels i and i+1 if 
j  £ Wj and Wi+I . This will be the time axis for the node in the graph. All 
time axes for the nodes must be parallel to each other.
In what follows, a procedure is presented that constructs the TG from the PTD.
Procedure TG
1) For the boundary input/output nodes, assign a unique direction vector Vw 
for each weighted input link.
2) Place the nodes of Wq at time level 0 in the TG, with the input edges to each 
node on lines that are parallel to the corresponding weighted direction vector Vw.
3) For any node « in W^, derive the node direction vector dn.
4) For k -  1 to T  do in parallel
i) If node i e Wk has a boundary input /output edge, connect the weighted
48
edge to the node on a line that is parallel to Vw.
ii) If node j  e Wk-i and Wk , place node j  in time level k  on the line
defined by the direction vector dn, else, define a direction vector that is 
parallel to dn for the node.
iti) Choose a specific direction vector Va from the set V .
iv) Connect node i e  Wk,} to node j  e Wk on a line that is parallel to
Va , if the nodes have arcs holding elements of variable a (.). If this is not 
possible, then
a) For kk = 0 ,  k do
b) Renumber the nodes in each time level kk such that 
if node j e  then it is connected to node 
j  e W/&+J on the line that is parallel to Va.
5) For I =1 , q do
a) For k = 1 , T-l do in parallel
b) Connect node i £ Wk to node j  £ Wk + 1  on line that is parallel 
to V ,.
6) Connect the nodes at time level T  to the user induced output nodes on edges
49
that are parallel to the vector Vw , where w is the weight of the output arc 
connecting the output boundary nodes in the DWCG to the user induced output 
nodes.
3 .9 . Examples to Illustrate the Introduced Procedures
This section applies the introduced procedures to model the operations of several arrays 
as a function of the compute cycle. Examples include switchable as well as non- 
swithchable arrays.
3.9.1. Matrix-Vector Array
The example considers an array designed to perform matrix-vector multiplication of the
nxn n
form y = Ax , where A e R , x and y e R . The array for n =3 is depicted 
in Figure 3.2. The array total execution time is ( 3 n -1 ) cycles in general.
o
o
a13
a12
0
a 23
a 22
a21
a 33
» 32
a31
0
Figure 3.2. Matrix-Vector Array, n -  3.
50
From procedure data path , there are three data paths in the array corresponding to the 
elements of the variables a , x  , and y . Assign the weight w = 1 for jc,* data 
elements, the weight w =2 for the a,j elements, and the weight w =3 for the 
yik elements. The resulting data paths are given in Figure 3.3.
Figure 3.3. AG Matrix-Vector Array.
node / node 1 2 3
4 1 0 0
6 2 2 2
Table 3.1. Input Nodes for Matrix-Vector Array.
51
node / node 5
3 1
Table 3.2. Output Nodes for Matrix-Vector Array.
From procedure DWCG , there are three PEs in the array, that are assigned the 
numbers { 1 ,2 ,3  } . The resulting DWCG and AG of the array are shown in Figure 
3.3. From step 4, of procedure data flow, the weighted I/O lookup tables are given 
in tables 3.1 and 3.2 respectively. Utilizing step 5 of the procedure, the boundary 
nodes in the DWCG are nodes { 1 , 2 , 3  }. The input timing functions for the 
nodes are listed below
J i ( 1) - x  U . k + 1 )  , 0 < j <4 ,  
iz2k (1 ) ( 0 , k + 1 ) , 0  < j < 4 ,
k2 H 2 )  = a ( 2 ,  k )  , 0 < j < 4  ,
7t2 k (3 ) = a ( 3 ,  k - 1 )  , 0 < j <4 .
From step 6, of the procedure, the resulting node mapping functions are given by
52
v *  ( n ) = u f  ( n )  
m3 k + 1  ( n ) = ( /i ) * u k ( n ) + ( n ) .
From step 7 of the procedure, the resulting connectivity matrix of the array is given in 
table 3.3.
node / node 1 2 3
1 0 1 0
2 0 0 1
3 0 0 0
Table 3.3.
Connectivity Matrix for Matrix-Vector Array.
Utilizing procedure PTD , the array total execution time is 5 cycles, thus the final 
value of T is 4 .  The range of allowable number of compute cycles is 0 <  k <4. 
From step 3 of the procedure, at k = 0 the mapping functions for node n = l  are 
only defined, thus, it is the only node in the graph that changes its state. There are two 
data elements on the node that are given by
53
Ih° 0  ) = a ( 1 ,1 ) = an , Jtj° (1 ) = x ( 0 , 1)  = x} .
From the node mapping functions the content of the internal register of the node is given 
by
m3 0  ( 1 ) = yj 1 .
At compute cycle 1,  nodes 1 and 2 change their states. For node 1 one can write
f y 1 (7 ) = X2 , f y 1 U  ) = a-12 •
The content of the internal register of node 1 is given by
m31 ( J )  = yi 2  ,
and for node 2 ,
JU21 (2 )=  021 , Uj1 (2 ) xi  .
The content of the internal register of node 2 is given by
m31 ( 2 ) = yz1
54
In a similar manner, the rest of the time levels in the PTD can be constructed The final 
result is depicted in Figure 3.4.
Utilizing step 1 of procedure TG , the nodes 1 1 , 2  ,3 } are boundaiy input nodes, 
that receive data elements on weighted input link w =2 carrying a (.). Let the direction 
vector Va be the direction vector of this data path.
In addition, node 1 is a boundary node that receives Xik elements on input link of 
weight w =1 , that is assigned the direction vector Vx . From step 2, reading time 
level zero in the PTD produces the set W0 = {1  }, hence, node 1 is placed at
level k =0  in the TG. The input data element ajj is placed on an edge that is parallel 
to Vfl and the input data element xj  is placed on an edge that is parallel to Vx . From 
step 3 , let d1 be the node direction vector.
The PTD at time level 1 has two nodes, hence, Wj = [ 1 , 2  }. From step 4, choose 
Vx as the direction vector. Node 1 is placed in time level 1 in the TG , on the line 
defined by d2 . Node 2 at level 1 receives xi from node 1 at level 0 ,  from step 
4, the input edge holding xj  must be parallel to Vx . To meet this condition, node 2
must be placed on the right of node 1 in time level 1 .
Utilizing step 5 of the procedure, every node at each level receives an input data element 
belonging to the variable a ( . ) ,  thus, all the edges must be parallel to V . From step 6,
node 3 at time level 4 has an edge that is incident on a user induced output node. This 
edge must be parallel to Vx, the resulting TG is depicted in Figure 3.5.
55
**12 XI
a22 Xj
®3 3
K
Figure 3.4. PTD Matrix-Vector Array.
Figure 3.5. TG Matrix-Vector Array.
57
3.9.2. Matrix Multiplication on Planar Mesh Array
Consider a planar mesh array designed for matrix-matrix multiplication of the form 
Y = AB , where Y , A , B are nxn matrices. The array total execution time is ( 3/i-i ) 
in general. The array is depicted in Figure 3.6 for n = 2.
From procedure data path , there are two data paths in the array. The first path consists 
of the fly data elements, while the second path consists of the by data elements. The data 
elements corresponding to y y  are accumulated inside the processing elements. Let the 
weight w - 1  be assigned to the data paths corresponding to ay elements and the 
weight w =2 be assigned to the data path corresponding to by elements. The weight 
w = 3 is assigned to the path of yy elements. The resulting data paths and their 
weights are given in Figure 3.7.
From step 1 of procedure DWCG , there are 4 PEs that are assigned the numbers 
shown in Figure 3.7. Steps 2 and 3, result in the DWCG and AG representation of the 
array, depicted in Figure 3.7.
From procedure dataflow , the input nodes in the DWCG are nodes ( 1 , 2 , 3 ) ,  
while the output nodes are ( 2 , 3 , 4 ) ,  the weighted lookup tables of the I/O nodes are 
given in Tables 3.4 and 3.5 respectively.
*
58
b i i
bu  o
a22 a 21 0 2221
12
Figure 3.6. Mesh Array, n = 2 .
node /  node 1 2 3
5 1 0 1
8 2 2 0
Table 3.4. Mesh Array Input Nodes.
Figure 3.7. AG for Mesh Array.
node / node
Table 3.5. Mesh Array Output Nodes.
60
Step 5 of procedure dataflow results in the input timing function for node n = 1 ,
on link of weight w - 1  given by
3Cjk (n ) = a ( i , j ) ,
= a ( n , n  + k ) , 0  <i  £ 1  , 1  £ j  < 2  ,
= nil , otherwise ,
and for node 3,
(n ) = a ( /
= a ( n  -1 , n ) , 1  < i < , 2  , 1  ^  £ 2 ,
= nil , otherwise .
Similarly, the input timing function of node 1 of weight w =2 is given by
zr2* («) = b ( i , j ) ,
-  b ( n + k , n ) ,  1  £ i £ 2 , 0 < j < l ,
= nil , otherwise ,
and for node n = 2  ,
%2k («) = b ( i , j  )
= b ( k , n  ) , 1  £ i  <2 , 1  < j < 2 .
= nil , otherwise.
61
Applying step 6 of the procedure, the corresponding node mapping functions are listed 
below
tfi3 k ( n )  = Ujk (tt )*u2k ( n ) + m3kml ( n ) ,
Vj = Uj '
v2 = «2 .
Step 7 produces the connectivity matrix depicted in table 3.6.
node / node 1 2 3 4
1 0 1 2 0
2 0 0 0 2
3 0 0 0 1
4 0 0 0 0
Table 3.6,
Connectivity Matrix for Mesh Array.
To model the state of node 4 at compute cycle k one must applying step 8 i). From 
the connectivity matrix P node 4 has two weighted links incident on it, namely, w =1 
and w =2 . Using entry P3 4  as a lookup table, one finds
62
«;* (4  )=  v ,* (J  ) ,
«2* ( 4  ) =  v2* ( 2 )  .
The content of the memory register is
m3k (4  )=  ujk (4  )u 2k (4 ) + mikml (4  )
The outputs from the node are
vIk ( 4  ) = ujk (4  ) ,
v2k ( 4  ) = u2k (4  ) .
The array total execution time is 5 cycles. The the final value of T  is 4, and the 
range of the allowable number of compute cycles is 0 £ k  £ 4  . From step 3 of 
procedure PTD , at k = 0 , node 1 changes its state. There are two data elements on 
the node that are given by
%}° (1 ) -  an  , it2° V  ) = b n  .
From the PE mapping functions, the internal register content of the node is given by
A t cycle k = 1 , nodes 1 , 2  and 3 change their state. For node 1 this is  given by
jcj1 ( / )  =a 1 2  , K2 1 0 0  = b2 1  .
The content of the internal register of node 1 is
W  ( 1 ) = y n 2  ■
for node 2 ,
Uj  ^ ( 2 )  =^ u  » ^2^  (2 ) ~ ^ 2 1  » (2  ) = »
and for node 3,
( 3 ) =bn  , (3 ) =n2/ , m3J (3 ) = y2/  .
Continued evaluation of the procedure produces the rest of the time levels in the FED. The 
final result is depicted in Figure 3.8.
Utilizing step 1 of procedure TG, nodes { 1 , 2 , 3 }  are boundary input nodes. 
They receive elements from the data paths corresponding to the variables a (.) and b (.). 
The vectors Va and Vb are assigned for those paths. From step 2 , the PTD at time 
level 0 results in W0  = { 1 }. Hence, node 1 is placed in the TG at time level 0 .
64
Ko
K,
K2
K,
b n
12 22,22
22
Figure 3.8. PTD for Mesh Array.
65
From step 3 , let dx be assigned as the node vector. In step 4 , choose Va as the 
direction vector. Reading level 1 in the PTD , Wj = { 1 , 2 , 3 } ,  hence level 1 , in 
the TG will consist of three vertices. Continued evaluation of the procedure produces the 
TG depicted in Figure 3.9.
3.9.3. Matrix Multiplication on Cylindrical Array
This example considers the problem of multiplying two matrices of the form 
Y = AB, where, Y , A and B are nxn matrices , on a cylindrical array. The 
array total execution time is 5 cycles ( ( 3 n - l) in general). The array for n = 2 is 
depicted in Figure 3.10.
Procedure data path results in two data paths for the elements of the variables a^ 
and bij , with weights w =1 and w =2 respectively. The elements of variable y,y 
are assigned the weight w = 3 . The resulting paths are displayed in Figure 3.11. From 
procedure DWCG, the resulting AG of the array is depicted in Figure 3.11.
Procedure data flow  results in nodes mapping functions that are the same as the mesh 
array node mapping functions. The I/O lookup tables are given in Tables 3.7 and 3.8 
respectively.
66
Kb
Ki
k 2
*5
Figure 3.9. TG for Mesh Array.
b2i aj2
* > n  a u
± y
i is
21
S T
*>22 a 22  
*>21 a 2?
J S
22
12
S T
Figure 3.10. Cylindrical Array, n = 2 .
67
From step 5, the input timing functions on nodes n =1 and n - 2  
w - 1  , are given by
Kik («) =b ( i j )
= ( n ,k  +1 ) ,  1 £  i <2 ,7  <>j <, 2 .
= nil , otherwise .
The input timing functions on nodes ( 7 , 2 )  of weight w =2 are
jt2k («) = f l ( i j )
= ( n,  Jfc+7), 7 S i  £  2 , l < t j < 2  .
= nil , otherwise.
Step 7 results in the connectivity matrix given in Table 3.9.
node / node
, of weight
Table 3.7 Cylindrical Array Input Nodes .
Figure 3.11. AG of Cylindrical Array .
node / node 7 8
3 2 2
4 2 1
Table 3.8 Cylindrical Array Output Nodes .
69
node / node 1 2 3 4
1 0 0 1 2
2 0 0 2 1
3 0 0 0 0
4 0 0 0 0
Table 3.9.
Cylindrical Array Connectivity Matrix.
To find the state of node 3 at compute cycle k , then applying step 8 i) of procedure 
dataflow, using column 3 of P , there are two links incident on the node. Using entry 
Pj3  as a lookup table, one finds that the input on link of weight w = 1 originates 
from node 1 . Thus,
«j* U ) =  Vj* ( 1  ) , 
u2k ( 3 ) =  v2* (1 ) .
The content of the memory register is
k ( 3 )  = Ujk (3 )u2k ( 3 )  + m3kmI ( 3 ) .
The output from the node is given by
70
Vi* (3  ) = «;* ( 4 ) t
v2k (3  ) = «2* ( 4 ) .
From step 3 of procedure PTD t at k = 0 t nodes 1 and 2 change their state. 
There are two elements on each node, hence
%x° ( 1  ) = b1} , jt2° (1 ) = fl77 , %i° (2) = bI 2  and n2° (2 ) = a2 1  .
The content of the memoiy registers of the nodes are
m3° ( 1 ) = yn * andm3° ( 2 ) = y2 2 1 .
At compute cycle 1 , all nodes in the array change their state. Hence,
*il U ) = b1 2  , n2l (1 ) = a2 1  , Kj1 (2) = b2 2  and 7i2J (2 ) = a2 2  .
From the mapping functions,
uj1  ( 3 )  =fli y , «2; (-? ) = b1 2  , u i H4  ) =  a2 1  and n27 (4  ) =  bn  .
The content of the memory register of each node is given by
m3 1 V  ) = y n 2  . W  (2 ) =y222 , m3 1 (3 ) =y 2 1 1 and m3J {4) =y 1 2 1 .
71
At time level 2 , nodes 3 and 4 change their state. The data on the nodes are given by 
Uj2  ( 3 ) = aI2 ,u 2 2 ( 3 )  = b2 2  , ut2 {4  )=  a2 2 ,u 22(4  )=  .
The content of the memory registers are
m3 2 (3 )= y 2 J 2  and m3 2  (4 ) = y1 2 2  .
The FIT) of the array is depicted in Figure 3.12. Procedure TG results in the array 
timing graph depicted in Figure 3.13.
3.9.4. LU Decomposition Array
Consider an array designed to decompose an ran dense matrix of the form A = [ a,y ] ,  
into an upper triangular matrix U = [ tty ] and a lower triangular matrix L = [ ] such
that
A = LU .
The array for performing the LU decomposition fort n = 3 is depicted in Figure 3.14. 
There are two types of processing elements in the array. The array total execution time is 
(3n-l ) cycles in general.
72
22
12
Figure 3.12. PTD Cylindrical Array.
73
*0 
K i
k 2
Figure 3.13. TG Cylindrical Array.
74
Applying procedure data path , assign the weight w =1 for the ay data path, the 
weight w  = 2  for the ly  datapath, and the weight w = 3  to the u-y data path.
Utilizing procedure DWCG, from step 1 , there are 6  processing elements in the 
array that are enumerated as in Figure 3.15. The array AG is displayed in Figure 3.15. 
From procedure dataflow , nodes ( 1 , 2 , 3 )  are input nodes. They receive data from 
the user induced node number 8  , on input arc of weight w = 1 . On the other hand, 
nodes ( 3 , 5 , 6 )  are output nodes. They transmit data elements to the user induced 
output node 7 . The resulting I/O tables are given in Tables 3.10 and 3.11.
node / node 1 2 3
8 1 1 1
Table 3.10. LU Array Input Nodes.
node / node
Table 3.11. LU Array Output Nodes.
75
23
0 0
a2 1 
a3 I
al 2 
a 2 2 
a3 2
0
ai3
a 23a3 3
a  )
r A
*1 m J-*~x 2 © ~ x
*y
for first yj then for first y then
S it &
>*iiE
e lse e lse
y2= yx - Xjin x = y /  m
*2= *1 end
end
b )
Figure 3.14. a) LU Array, b) PE Models.
Figure 3.15. AG for LU Array.
77
Step 5 results in the input timing function for node 1 , on link of weight w = 1 , given 
below
Kjk ( / )  -  a ( l  + k , l )  , 0 < i  <S 3 ,
-  0 , otherwise.
For node n - 2  , the timing function is
Jtjk (2 ) = a ( k , 2 ) , 0 < i <= 3 ,
~ 0  , otherwise,
and for node n = 3 , the timing function is
7tjk { 3 ) =a(k- l  ,3 )  , 0 < i  £ 3  ,
= 0 , otherwise.
The PEs have two modes of operation. Hence, a lookup table must be defined to 
distinguish between the modes of operations of the PEs. Let M ( k ) , with rttj = i if 
PE j  is in mode t of operation, be the lookup table. For nodes n = 1 , 2 , 3  , this is 
given by
mn = 1  , if k = n - 1 , 
mj = 2  .otherwise,
and for nodes n = 4 , 5 , 6  ,
78
mn = 1  , if k = n - 2 , 
mj = 2  , otherwise.
From step 6, the mapping functions for nodes { 1 , 4 , 6 } ,  for mode 1 of operation are
m3k ( n )  = Ujk ,
and for mode 2 of operations
v2k( n ) = u 4 ,
For nodes [ 2 , 3 , 5  }, the mapping functions for mode 1 of operations are
m3k ( n  ) = u j * ( n ) ,  
vjk ( n ) = nil , 
v2k ( n )  = u4k ,
and for mode 2 of operations,
Vjk ( n ) = u4k , 
v2k ( n ) = Ujk .
79
From step 7, the resulting connectivity matrix of the array is given in Table 3.10. As 
an example, to simulate the operation of node 2 , at compute cycle 1 , then, from the 
input nodes lookup table, using mode 1 of operation, one finds
Uj1 (2  ) =  Jtj1 =  aj2  •
Column 2 of P indicates that u2] is originating from node 1. Evaluating node 1
mapping function at k = 0  results in u2} = hi ■ The content of the memory register is
given by
m 3 2  ( 2  )  =  « J 2  •
From the definition of mode 1 of operations, the outputs are given by
vj* ( 2 )  = v2'  (2 )  = nil.
Utilizing step 3 of procedure PTD , at k  = 0 , node 1 changes its state, hence,
V ) = •
From the node mapping functions, the memory register content is given by
m3 ^  (  ^) = au  •
At compute cycle 1 nodes ( 7 , 2 )  change their states. Hence,
80
Tij1 Cl ) = a 2 1  and t t /  (2 ) = a1 2  , 
and from the mapping function
U2 1 (1 ) = l2I and (2 ) = u1 2  .
The PTD of the array is depicted in Figure 3.16. Procedure TG results in the timing 
graph depicted in Figure 3.17.
node / node 1 2 3 4 5 6
1 0 2 0 0 0 0
2 0 0 2 1 0 0
3 0 0 0 0 1 0
4 0 0 0 0 2 0
5 0 0 0 0 0 1
6 0 0 0 0 0 0
Table 3.12 
Connectivity Matrix for LU array.
81
K|i
Kt
K,
Ki
K„
K r
Ki
Si­
lt
S
• i , ,
a
&
M ,
§
2 t 12
31 0 2 2
8V v f
«I3
*21
31 a i F  h i  
°3  2^  * il
© 
i f sv
fl3 2 121
4
“ i f  i
>21 /» 2
O  O
X 13V ‘2 1 “ 22^ 1  O 1
3^V* ^ ‘31 J * ”
® V O  G)
1/*  t* « - J 5fl /33 t 5  fM-* 23'
*3 2  1 >32
"3 3 \ 3 / 4
©,
“ l l  >31 s / 33
G )
“ 33f6
Figure 3.16. PTD for LU Array.
Figure 3.17. TG for LU Array.
CHAPTER 4
FAST ALGORITHM REALIZATIONS
This chapter introduces the concept of fast algorithm realizations. These realizations 
are forms computable in a reduced number of cycles. The objective is to increase the 
efficiency of an array, and eventually enhance its flexibility. The concept is formally 
identified in the following:
Definition 4.1: Given a systolic array designed to evaluate an algorithm A in T  
compute cycles, a fast realization for A is a special realization ( specified by a 
particular set of parameters) that can be evaluated in k < T  cycles.
To find a fast realization one must determine the collection of operations that, for a given 
set of parameters, can be omitted without affecting the final numerical value. This objective 
is accomplished by analyzing the timing graph ( TG ) representation of the array that 
was introduced in the previous chapter. The restriction of operating an array for k cycles 
requires that the total number of levels in the TG be reduced to k .
83
84
Section 4.1 discusses the technique of timing graph compression used to reduce the 
number of time levels in the TG to k . Section 4.2 develops the basic tools needed to 
manipulate the timing graph. Sections 4.3 and 4.4 determine the fast forms for matrix 
arrays, with examples of some of the most popular arrays. Sections 4.5 and 4.6 
introduce the concept of timing graph cut and paste. Section 4.7 develops fast forms for 
non-linear arrays. Section 4.8 applies the developed techniques to modify the operations 
of some generic arrays, such that they can handle band matrix multiplications, triangular 
and randomly sparse matrices.
Relevant features of the approach proposed here include:
• It can be used to determine the collection of all possible fast forms implementable 
on a single array.
• It uses to advantage the nature of algorithm data dependence vectors to allow 
rearrangement of operations in the array to achieve fast realizations.
• It can be used to enlarge the class of algorithms implementable in a 
given array.
85
4.1. Timing Graph Compression
The TG representation provides important information about the evolution of partial 
results in the array. Each time level in the graph displays the nodes that change their state at 
that instant of time. Reading the input arcs of the boundary nodes in the TG produces the 
timing functions of the boundary nodes.
The timing graph of an array consists of a total of T time levels. To obtain exact results 
from an array in k <T  cycles, the number of time levels in the TG must be reduced by 
q = T -k  levels. The concept of TG compression can be used in this case as the vehicle 
for determining fast forms of linear arrays. It is possible to distinguish between the 
following basic compression strategies.
Bottom-up compression of the TG corresponds to the identification of the collection of 
data elements that do not contribute to the partial results in the later cycles of the array 
operation. This is equivalent to determining the collection of data elements that are 
irrelevant to the final results for the time interval [k  +1 ,T  ] in the TG. In this case, 
the array can be stopped at cycle k . The results available at that time are considered as the 
final results.
Stopping the array before its normal execution time may result in improper draining of 
the array, where some of the data elements remain in the PEs. Bottom-up compression of 
the TG does not require any change in the timing functions of the input boundary nodes. 
However, adjustments might be needed in the way the output data elements are extracted 
from the array.
86
Top-down compression of the TG corresponds to the determination of the collection of 
entries that can be skipped from the input data to the array to meet the deadline of k 
cycles of operation. This can be accomplished by compressing the TG from the top by 
a total of k levels. Hence, time levels 0 to k-1 in the TG will contain the elements 
that are irrelevant to the partial results. The omission of data in early levels in the TG 
must not violate the available bandwidth of the array. Top-down compression of the TG 
does not cause improper draining of the array. However, it requires the modification of 
the input timing functions that supply data elements to the boundary PEs . In addition, 
adjustments might be needed in interpreting the output results from the array.
Top-bottom compression of the TG is a combination of the above two methods. This 
might be required in some instances to meet the deadline of k cycles. In this case, the 
collection of data elements that must be skipped in the early stages as well as in the later 
stages of the array operation needs to be identified. Top-bottom compression of the TG 
requires adjustments in the input timing functions as well as modifications in the way 
output data from the array are extracted.
The compression of the TG may require a modification in the way the data elements are 
fed to the array. This will lead to a change in the timing functions of the user induced input 
nodes. In the next section, proper tools are introduced that allow the determination of the 
form of the modified input timing functions for those nodes.
87
4.2. Tools for Modifying the Timing Graph
In the TG , elements that belong to a data path travel on lines that are parallel to the 
vector Va defining the path direction. Consider a node g , represented in the TG at 
time levels i , i + 1 , ..., j . The collection of elements available on the node's input 
arc q , that cany elements belonging to the same data path ( specified by the direction 
vector V a ) for those time levels , defines a timing sequence, denoted 
by X  ( Va , g , j - i  ) .  •
Consider for example, the TG of the matrix-vector array, depicted in Figure 3.5. 
Reading the elements available on node 1 , on lines that are parallel to the x  data path 
direction vector, for time levels [ 0 , 1 , 2 , 3  , 4  }, the timing sequence { Xj , x2 , x3  , 
0 , 0  } results. The sequence indicates that at time 0 the input node contains data 
element X j , at time 1  it contains data element x2 , and so forth.
Timing sequences on all lines must be of length T  , where T  is the total number of 
time levels in the TG . If one sequence is shorter than the other, zero entries must be 
appended to the sequence in the proper time slots, such that the resulting sequence is of 
length T . In what follows, the necessary steps for deriving the timing sequence on a 
line parallel to a particular direction vector are presented.
Procedure T
This procedure constructs the timing sequence of node g of data elements a ( . )  
belonging to the data path specified by the direction vector Va . In this procedure , N
88
is the number of nodes in the DWCG, T  is the total number of time levels in the TG ,.
and L  ( n ) is a vector that holds the time of the initial element in the sequence. The
procedure generates sequences of length T  for all the nodes.
1- Construct the array TG.
2- For n = /  to N  do 
Begin
Initialize count to zero.
For i = 0 to T do 
Begin
If node n has an input arc q parallel to Va then include the 
element a ( . )  available on that arc to X  ( . ) .  
count = count + 1.
L  ( n ) = i , if count = 1 .
End.
2- For n = 1 to N  do 
Begin
j  - L  (n  ) , jj  = j  +count .
i) For kk — 0 to j-1 do 
Begin
Insert zeros to X  ( . )  .
End.
89
ii) For kk = count to T  do 
Begin
Append zeros to X ( . )  .
End.
E nd .
For the purpose of this study it is fundamental to notice that timing sequences can be 
manipulated. It is possible to slide the sequence up or down on the time axis. For 
example, consider the sequence {X j , x2, x3  , 0 , 0  } derived earlier, for the TG of the 
matrix-vector array. By sliding the sequence up by one cycle on the time axis, the 
sequence {xj  , x 2 ,x 3  , 0 , 0 , 0  } results. The interpretation is that at time slot -1 the 
data element xj  is available, at time slot 0  the data element x2  is available, and so 
forth. However, to keep the sequence of the same length as the original the Xj data 
element must be deleted. Hence, the new sequence is given by [x2 ,x3  , 0 , 0 , 0  ]. The 
notation U (X  ( . ) ) ,  will be used to indicate sliding the sequence up, on the time axis by 
one cycle. Sliding a sequence up on the time axis can be used in a cascaded fashion. To 
slide the sequence up n times, one can write
U a ( X ( . ) )  = U ( V n-} ( X ( . ) ) )  .
Sliding up a sequence of length T  by one cycle is equivalent to deleting the first 
element in the sequence and appending a zero to its last entry. To track those elements that 
are deleted from a sequence as a result of sliding it up , one must construct a set that stores 
those entries. Let, X  ( . )  be the new sequence resulting from sliding up the sequence
90
X ( . )  n times. Then the collection of elements x ,• that are deleted from X  ( . )  to 
generate X  ( . )  form the set p. ( X  ( . )  , n ) .  Hence,
P U ( . ) , »  ) = U -  : xi e X ( . )  a  Xi « U n ( X ( . ) )  }.
Timing sequences can also be slided down . Sliding a sequence down by one cycle is 
equivalent to inserting a new element in the first time slot, and deleting the last element in 
the sequence. The notation D (X  ( . ) )  will be used to indicate sliding a sequence down 
on the time axis by one cycle. Sliding a sequence down can be used in a cascaded fashion. 
To slide a sequence down n times, one can write
D « ( X ( . ) )  = D ( D * ' 1  ( X ( . ) ) )  .
%
Let, X  ( . )  be the new sequence resulting from sliding down the sequence X  ( . )  n
\
times. Then the collection of elements *,* that are deleted from X  ( . )  to generate X  ( . )  
form the set B ( X  ( . ) ,  n ). Hence,
B ( X ( . ) , / i  ) =  [xi : Xi e X ( . )  a  xi « D n ( X ( . ) )  }.
The introduced tools can be used effectively to derive the modified timing functions of the 
user induced input nodes, when the TG of the array is compressed.
91
4.3. Fast Realizations of Matrix Arrays
In this section, the collection of the fast realizations allowable on arrays implementing 
matrix operations of the form Y = AX or Y = XA , is determined. Here, it is 
assumed that the designer has the ability to choose the structure of the linear transformator 
specified by A .
The first stage of the analysis proceeds with the assumption that the designer does not 
have any control over the choice of the data entries xtj  e X. If those entries can also be 
controlled, a broader selection of fast forms can be determined. Utilizing the concept of 
TG compression of an array, a complete description of the fast forms of A that admits 
exact results on an array in k cycles, is now possible. In what follows, procedures 
are introduced that determine the nature of the matrix A that allow exact computations on 
linear arrays in a reduced number of cycles. The procedures are applicable to matrix- 
matrix and matrix-vector arrays.
Procedure ABU
The procedure is based on the concept of bottom-up compression of an array TG. It 
produces the fast form of A for k cycles of operations.
1- Construct the TG of the array .
2- Identify the direction vector Va that defines the movement of the a# data 
paths.
92
3- For j = 0 to k do 
Begin
Construct the set M ( i ) = { n j} , where nij are the nodes at time 
level i in the TG.
End.
4- Determine the set of all vectors that arc parallel to Va in the TG . Let r  be 
the number of vectors in the set
5- Using the array TG , execute the following :
i) For j  - 1  to r  do 
Begin
ii) For i = 0  to T do
Begin
EvaluateX(Va, g ,  w ) for all nodes «• e M ( /  ) , 
at each time level i .
E nd .
End.
6- To determine the collection of the elements afy £ A that need to be zero, 
one must execute the following:
i) For i = T down to fc+1 do 
Begin
ii) For j  =1 to r  do
93
Begin
For each node nt e M  ( i  ), evaluate its set
End.
End.
Procedure ATD
This procedure is based on the concept of top-down compression of the TG . It finds 
the fast form of A that admits exact results in k  cycles. The procedure modifies the data 
paths corresponding to the xtj  entries for matrix-matrix arrays. However, this will not be 
the case for matrix-vector arrays, because the a# entries are free elements. The necessary 
steps for the procedure are:
1- Construct the TG of the array.
2- Identify the direction vector Va of the atj  datapaths.
3- For i -  0 to k do 
Begin
Construct the set M ( i ) = { n} } , where nj are the nodes at time level i 
in the TG.
End.
4- Determine the set of all vectors that are parallel to Va in the TG. Let r be
94
the num ber o f  vectors in the s e t
5- Using the TG of the array execute the following:
i) Initialize count to zero.
ii) For / = 0 to k  do 
Begin
Check if any node nt e M ( i  ) attune level / is not a boundary 
node, if yes go to step 6, else increment count by 1.
End.
6 - If count £ k ,  go to step 7 , else the procedure is not applicable for the 
current value of k , stop.
7- Using the array TG execute the following:
i) For j  =1 to r  do
Begin
ii) For / = 0 to T do 
Begin
Evaluate X ( Va , n , w ), for all nodes n; e M ( t ) .
E nd .
End.
8- To determine the collection of ay elements that need to be skipped from
95
the input data, execute the following:
i) For i =0 to k do 
Begin
ii) For j - 1  to  r  do 
Begin
For each node »■ e M ( /  ), evaluate the set (i =U k (X  ( . ) ) .  
End.
End.
9- If the a-tj  entries are not free elements, then the new input timing functions for 
the Xij entries are given by :
For i = 0 to k do 
Begin
for each node n shifted down, evaluate 
ffx ' ( . )  = D * ( ( f f x (.)).
End.
Procedure A TB
This procedure is based on a combination of procedures ABU and ATD to produce the 
required fast forms of A . The necessary steps are listed below:
1- Construct the TG of the array.
2- Identify the direction vector Va of the a;j datapaths.
96
3- For / = 0 to k do
Begin
Construct the set M { i )  -  { nj } , where nj are the nodes at time 
level i in the TG.
End.
4- Determine the set of all vectors that are parallel to Va in the TG . Let r be 
the number of vectors in the set.
5- Using the array TG execute the following:
i) Evaluate steps 4 and 5 , of procedure ATD, save the value of count.
ii) For i = 0 to count do 
Begin
a) Evaluate step 7 of procedure ATD .
b) For kk = i to k do
Begin
Evaluate step 6 of procedure ABU .
End.
c) The new input timing functions for nm z M ( i  ), is given by 
it x ' ( . )  =D j ( (  tcx ( . )) ,  where j  = count - i .
End.
97
The collection of all matrices that meet the deadline of k cycles, are determined by 
applying procedure ATB for all the combinations of q = T - k .
4.4. Examples to Illustrate Procedures for Determining Fast Realizations
To illustrate the use of the procedures introduced in the previous sections, two examples 
are considered. The arrays are the matrix-vector array and the cylindrical array for matrix- 
matrix multiplications.
4.4.1. Matrix-Vector Array
Consider the 3x3 matrix-vector array, depicted in Figure 3.2. The number of compute 
cycles for the array is 5 cycles, that ranges from [ 0 , 4  ]. The TG of the anay is given 
in Figure 3.5. Assume, that exact results must be delivered in 4 cycles, then it is 
required to determine the fast form of A that makes this possible.
Applying procedure ABU , from step 1 the TG of the array is depicted in Figure 3.5.
From the TG at time level k4  node 3 is operational. From step 2 of the procedure
node 3 receives the ay data entries on vectors that are parallel to Va of weight w = 6 . 
From steps 3 ,4  and 5 the following timing sequence results.
x  ( • ) = t 0 , 0 , OjI , » a 33 }  ■
Evaluating step 6 generates the following set
98
Hence, the fast form of A is given by
Procedure ATD does not apply in this case, because at time level 1 , node 2 is a 
boundary node.
4.4.2. Matrix-Matrix Multiplication on the Cylindrical array
In this example, the collection of fast forms implementable on a cylindrical array 
designed to perform matrix-matrix multiplication is determined. The array execution time is 
3 cycles that ranges from [ 0 , 2  ]. The TG of the array is given in Figure 3.13. 
Assume, that the array has to output the results in k = 2 cycles. Procedures ABU and 
ATD result in the following collection of fast forms for A
99
4.5. Timing Graph Cut and Paste
The procedures covered in the previous sections, considered limiting matrix-vector 
arrays operations for a certain number of compute cycles. The analysis was based on the 
concept of compressing the TG to meet the deadline of k  cycles. The results were based 
on those entries in the data elements, that could be skipped in the early ( or ) later stages 
of operations. To increase the class of the constrained forms of the matrix A , a 
combination of the two methods was also used.
In what follows, new procedures are introduced that take advantage of the nature of the 
data dependence vectors of the algorithm at hand. In particular, the method solves the 
problems that arise when the data entries that are irrelevant to the final results are located in 
non boundary nodes in the AG, at the constrained cycle. This problem was encountered 
in procedure ATD , where the procedure was deemed inapplicable when that condition 
was detected.
Even though these procedures are established to the particular algorithm, they indicate a 
possible alternative for a more general problem. The generalization of the results will be 
given in a subsequent section when the problem of random sparsity in matrices is 
considered.
To illustrate the idea, consider the TG representation of the matrix-vector array that 
was analyzed earlier. Assume, that the array operations are restricted to 4  cycles, with 
the requirement that the entries {ajj, , a 1 3 , a3 1 , a3 2 , 0 3 3  } must not be zero. In
100
this case, none of the compression techniques will produce the fast form of A that 
admits exact realization in the required number of cycles.
Careful analysis of the TG indicates that if the operations performed by nodes 2 and 3 
are interchanged, exact results can still be achieved in 4 cycles. Hence, the scope of fast 
realizations implementable in the array is increased. Here, interchanging the operations 
of the nodes is possible because the algorithm at hand is characterized by fixed-flexible data 
dependence vectors. Hence, the fast form of A is given by
* 1 1 0 * 1 ,
* 3 1 0 * 1 1
* 3 1 0 O n
The way the input data is fed to the array must be modified, in particular columns 2 
and 3 of A need to be interchanged. This will modify the timing functions that generate 
the fly data elements. For the example under study, the elements corresponding to the 
aij data paths are characterized by a dummy data dependence vector . Hence, in this 
case, it is possible to interchange the nodes’ operations without imposing any restrictions 
on the data paths corresponding to the x tj  elements. However, this is not true for 
matrix-matrix arrays.
Interchanging the nodes'operations in the TG of a matrix-vector array can be pursued 
in two different ways. The first approach interchanges the nodes on lines that are parallel 
to the node direction vectors. The second approach interchanges the operations on lines
101
that are parallel to the direction vector of the X;j data elements. The combination of the 
two methods expands the collection of all fast realizations implementable on the array in k 
cycles. Interchanging the nodes' operations must not violate the available bandwidth of the 
array. This condition is satisfied by restricting the internal nodes that can be interchanged 
with the boundary input nodes. Next, two procedures are introduced that interchange the 
nodes' operations based on the two options.
Procedure MV
This procedure interchanges two nodes' operations in the TG of a matrix-vector array, 
on lines that are parallel to the nodes' direction vector. Here, dn is the node direction 
vector and Vx is the direction vector of the x  data paths. The necessary steps are :
1- Construct the TG of the array.
2- For i = 0 to k do
Begin
Construct the set M  (i ) = { nj } , where nj are the nodes at time 
level i in the TG.
b ( i  ) holds the number of nodes at time level i .
End,
3- Using the TG of the array execute the following:
For k =0 to T  do
102
Begin
For kk = 1 to b(k  ) do 
Begin
count = 0 .
For j  = k to T  do
Begin
If nj zM  ( j  ) then increment count.
End.
Evaluate X  (Va Jck .count) and X (Vx,kk ,count).
End.
End.
4- To interchange the operations performed by nodes nt- and nj, then
i) Let the first occurrence of node nt- be at time level i and the first 
occurrence of node nj be at time level j  . Node can replace node 
n4- if and only if j  > i .
ii) The modified timing sequences on node nj are given by: 
1 7 ^ ( X ( V a ,n. ) )  and Ui'* ( X ( Vx ,ny- , .  ) ) .
iii) The modified timing sequences on node n(- are given b y :
D '"  (X  ( Va,n,., . ) )  andD #  (X  ( Vx,n4. , . ) ) .
103
Procedure M V1
This procedure interchanges two nodes’ operations in a matrix-vector array on lines that 
are parallel to the xy elements direction vector. The steps are given next:
1- Construct the TG of the array.
2- For i = 0 to & do
Begin
Construct the set M {i ) = { nj ) , where nj are the nodes at time 
level i in the TG.
b ( i  ) holds the number of nodes at time level i .
End.
3- Let Vj and V2 be the two direction vectors, parallel to Vx , specifying the 
nodes to be interchanged. Using the array TG perform the following:
i) Using Vx ,
For k -  0 to T  do 
Begin
Form the set Ml (Jt ) = { n -} , nj is a node on Vj .
Save / , which is the number of the time level of the first 
node occurrence.
End.
104
ii) U sing V 2 ,
For k = 0 to T  do 
Begin
Form the set M2(k ) -  { nm ) , nm is a node on V2 .
Save o , which is the number of the time level of the first 
node occurrence.
End.
ii) count 1= 0  , count2 = 0
For k = 0 to T  do 
Begin
If n; e Mj (k ) is a boundary node, then count 1 = count 1+1
If nm e M2 (fc ) is a boundary node, then count 2 = count 2 +1
End.
4- To interchange the modes of Afy (k ) and M2 (k ) then
i) If o > i and count 2 £  count 1 then
ii) Interchange the nodes of M2 (k ) with the nodes of Mj (k ) .
5- The new input timing function for the set Mj (k ) is Jt' (.) = U ° * ( 7t ( .)).
6- The new input timing function for the set M2 (k ) is %' (.) = D °l ( tc (.)).
105
4.5.1. Examples to Illustrate Timing Graph Cut and Paste
Consider the TG of the matrix-vector array that was analyzed in a previous section . 
The array total execution time is 5 cycles. Assume, that 4 cycles are available for the 
array to output its final results. Procedure MV leads to the following collection of 
matrices that admits exact results in 4 cycles.
~ a :i 3i: a “ *4
1
*11 *\7
a» a , , a j , /
0 0 0
0 0 0 a ». a , .
— — — —
“ o 0 0“ _ a n c
®ll a n a »
/
a n an a »
a *l a , . a,* a u a n 0
— mm
Procedure MV1 produces the following collection of fast forms.
_ a n a ,. 0~ *11 0 ” 0 *11
an a n 0 Bn 0 ajj 0 ®J2/ /
a , i a „ 0 a >. 0 Bn 0 a ,  * a ,j
— —
106
4.6. Fast Realizations of Non-Linear Arrays
This section considers arrays implementing systolic algorithms characterized by fixed- 
nonflexible data dependence vectors. Such algorithms are sensitive to the order in which 
the computations are performed. Hence, data re-arrangement is not possible. One class of 
algorithms that is of particular interest contains unary algorithms that include: LU  
decomposition, QR triangularization and transitive closure. Arrays implementing such 
algorithms will be referred to as non-linear arrays.
The approach pursued for determining the collection of fast realizations executable on 
linear arrays is not appropriate for arrays implementing such algorithms. These algorithms 
are sensitive to the order in which computations are performed because their data 
dependence vectors are non-flexible. However, the TG can still be used to determine the 
collection of fast forms allowable on a single non-linear array. Here, a different course of 
action must be pursued. The approach is based on decomposing the array operations into 
sub-arrays implementing problems of smaller sizes. The concept of TG decoupling will be 
used to identify the collection of fast realizations executable on a single non-linear array.
4.6.1. Timing Graph Decoupling
Timing graph decoupling is defined as the process of determining those entries in the 
input data to non-linear arrays that can be omitted or forced to a particular pre-set value, 
that lead to decomposing the array operations into sub-arrays implementing smaller size 
problems. Array decomposition speeds up the response time of the generic array in two 
different ways. First, it skips those entries from the input data that are irrelevant to the
107
final results. Second, it allows the trading of a delay element in the input data with an 
entry from the input elements to be processed.
Due to the nature of the algorithms at hand, it is important to note that if an entry is 
omitted or forced to a particular value, such as zero, care must be taken to insure the 
numerical correctness of all elements that are dependent on i t . As an example, suppose the 
array to be analyzed is implementing QR triangularization of a matrix. In this case, none 
of the entries on the diagonal of the matrix, can assume the zero value because the QR 
algorithm would fail.
Analyzing the TG of such arrays is not as simple as the analysis of the TG of linear 
arrays. To determine the collection of entries that can be omitted from the array, one must 
ensure that the weight of each node that lies on the direction vector carrying those entries is 
defined. Once this condition is satisfied, the TG can be properly analyzed and the 
collection of all fast forms can easily be determined. In the next section, a procedure is 
introduced that marks all the nodes of the array that can skip their input data. Once the 
collection of those elements are determined, another procedure is presented that 
decomposes the array into sub-arrays.
Procedure D
This procedure determines the collection of entries in a nonlinear array that can be 
assigned a particular value , such as zero, and skipped from the array input . The 
necessary steps are listed below:
108
1- For i = 0 to k do
Begin
Construct the set M (i ) = { «■} , where nj is a boundary node at 
level i in the TG.
b ( /  ) holds the number of nodes at time level i .
End.
2- Let Va be the direction vector that carries the input elements to the 
boundary nodes.
3- For kk = 0 to it do
Begin
a) For j  = 1 to b ( kk ) do
i) For each n- b M( kk)  do
Begin
Evaluate the node mapping functions at cycle kk for 
the modified numerical value of the element. If the 
weight of the node is not defined, mark the node as 
ineligible for omission. Go to step 3 a ) .
End.
ii) Let m be the total number of nodes that lie on a line parallel 
to Va and starting at node nj .
109
For i = 1 to m 
Begin
Evaluate the mapping functions of node .
If the weight of the node is not defined, mark the node 
as ineligible for skipping.
End.
End.
Procedure N SA
This procedure determines the structure of the modified sub-arrays and their resulting 
input timing functions. The procedure depends on the array connectivity matrix P that 
was introduced in chapter 3. The necessary steps are summarized below:
1 - For each boundary node marked as eligible for skipping in the TG identify the 
vector Va that carries the input data elements.
2- If all nodes belonging to Va are eligible for omission, then find i and j
the first and last time levels of the nodes.
i) For k = i +1 to j  do 
Begin
a) Move the boundary node and all the nodes that lie on Va to 
time level k-l if and only if all the nodes along the lines stay
110
properly connected as specified by the connectivity matrix P.
b) The modified input timing function of the boundary node is 
(.) =U  W (7t ( . ) ) .
End.
3- Repeat step 2 , for all boundary nodes that are eligible for omission.
The collection of fast realizations executable on non linear arrays can be enhanced if the 
delay entries in the input can be traded with entries from the elements to be processed. This 
objective is achieved by the recursive decoupling of the TG. The technique identifies the 
collection of fast realizations allowable on a single non-linear array.
Recursive decoupling of the TG is defined as the process of decomposing a non-linear 
array operation into sub-arrays and the determination of the collection of fast forms of the 
sub-arrays. Next, a procedure is presented that determines the collection of fast 
realizations implementable on a given non-linear array. The necessary steps are:
Procedure TGD
1- Invoke procedure D for the original full size array.
2- Invoke procedure NSA for the original full size array.
3- Re-label the nodes in the TG into separate sets, such that each set
I l l
corresponds to  the nodes o f  one sub array. L et /  be the num ber o f  the sets.
4- For U = 1  to / do 
Begin
i) For i = 0 to k do
Begin
Construct the set M (i ) = { nj } , where n;- is a boundary
node at level t in the TG, of the set 11.
b ( i  ) holds the number of nodes at time level i .
End.
ii) For each boundary node, identify the vector Va that carries the input data 
elements.
iii) For kk = 0 to k do
Begin
For j  = 1 to b ( k k  ) do 
Begin
For each /iy- eM  (kk) do 
Begin
Evaluate the node mapping functions for cycle kk 
for the modified numerical value of the element. If 
the weight of the node is not defined, mark the node 
as ineligible for skipping.
End.
112
End.
End.
iv) Let m be the total number of nodes that lie on a line parallel 
to Va and starting at node nj .
For i -  1 to m 
Begin
Evaluate the mapping functions for node nt .
If the weight of the node is not defined, mark the node 
as ineligible for skipping.
End.
v) If all nodes belonging to Va are eligible for skipping, then find i 
and j  , the first and last time levels of the nodes.
For k = i +1 to j  do 
Begin
a) Move the boundary node and all the nodes that lie on Va 
to time level k-1 if and only if all the nodes along the 
line stay properly connected as specified by the 
connectivity matrix P.
b) The modified input timing function for the boundary node 
is n' (.) =U k' \ n  ( . ) ) .
113
End.
c) Repeat step 2 for all boundary nodes that are eligible 
for skipping.
End.
4.7. Examples to Illustrate Non-Linear Array Decomposition
In this section two examples of non-linear arrays, namely, the LU decomposition 
array and the QR factorization array , are considered. The developed concepts are 
applied to decompose these arrays into sub-arrays implementing problems of smaller 
dimension. In addition, the collection of all fast forms executable on the arrays are 
determined by the recursive decoupling of the TG.
4.7.1. LU Decomposition Array
In this example, the collection of data entries that decompose the LU array into sub­
arrays for the case n - 4  are determined. The array is depicted in Figure 4.1 and the 
partial TG is given in Figure 4.2. Procedure D produces the nodes that can skip a zero 
entry in their input data. The collection of nodes that permit the omission of their input data 
entries are shown in the dotted boxes in the figure. Procedure NSA produces the modified 
array depicted in Figure 4.3. The array is decomposed into two sub-arrays. The first sub­
array consists of the PEs accumulating { un  ,u 12 , u22 }• The second sub-array 
consists of the PEs accumulating { u33, u34, }. The modified array produces results
114
in 2n cycles , while the generic airay requires 3n cycles. The fast form of A is given 
by
* x t 0 0
S j i 0 0
a , . a » i a * .
j
Recursive decoupling produces the following collection of fast forms for the first 
sub-array:
1
A i a _ a l l t
i
1
a n
f
**x
® 3 1 0 0
_ 0 0  _ _  0 0  _
1
t* O
1 1
fi M o
1
® j i t 0  a n
0  0 0  0
- 0  0  _
_ 0  0
115
u ? 4
r n S - h A  U l4  [
a u 0 0 0
a 21 a 12
0 0
a 3 I fl22 °13 0
a 4 I *32 a 2 3 fl14
a 4 2 a 33
a 43
a 2 4 
a3 4 
a 44
Figure 4.1. LU Array , case n = 4 .
116
Figure 4.2. Partial TG for LU Array .
a u 0 a3 3
a u a l 2 ^43
a 31 a 22
a 41 B3 2
»4 2
Figure 4.3. Fast LU Array .
118
Similarly, the collection o f  fast form s for the second sub-anay arc given by
* *
a » 0
0 * 4 4
/
0
_
4.7.2. QR Triangularization Array
Consider the QR array for n =4 depicted in Figure 4.4. The partial TG of the array 
is given in Figure 4.5. Procedure D results in the collection of nodes that can skip a 
zero entry. The nodes are identified in Figure 4.5 in dotted boxes. The decomposed array 
is depicted in Figure 4.6 with the modified input timing functions. The array consists of 
two sub-arrays. The first sub-array is composed of the PEs accumulating {q1} yq12 J * 
The second sub-array consists of PEs accumulating { q33 , q34 , q ^  }. The fast form of 
A is given by
® n
0 0 B .1
- 0 0 * 4 »
119
Recursive decoupling produces the following collection of fast forms for the first sub- 
array
— — **
a t. a „ a n a il 0 0
9 9
a i, a» 0 0 a n a „ 0 0
— — — —
— — — —
an 0 0 0 a*. 0 0 0
9 •
a
1 1 a „ 0 0 0 a „ 0 0
- ! - --
For the second sub-array the following collection of fast forms are obtained
0
9
An 0
a„ a4 4 0 a 4 4
L_ _
120
2221
3231 ,33
41 42 43 44
4 3 
42 
4134>3 3‘2 4 
*23'14
« f  y  d ®
d = (xj + y i } |
c = yi /d  y
s = *  x l  d  y  = x
else *
X, = c x j * s y t { y  i f  x )
y1 = s * i + c j ,
endif
Figure 4.4. a) QR Array , b) PE mode Definitions .
121
Figure 4.5. Partial TG for QR Array .
14 13 12 u
2221
3231 .3334 33
41 42 43 44
‘3 3
Figure 4.6. Fast QR Array .
122
4.8. Applications: Design of Smart Structures
The techniques presented in this chapter can be used to design more flexible arrays. The 
approach expands on the concept of using the compute cycle number as an additional 
degree of freedom to design a new breed of arrays that are better adapted to the changing 
nature of the application. Here, the main thrust of the work focuses on the expansion of 
the class of algorithms executable on a single array.
The concepts introduced in the previous sections are applied to implement different 
forms of band matrix multiplications on the generic mesh array in a reduced number of 
cycles. The modified array can evaluate band matrices with execution time that is equal to 
the time of a specialized architecture. The problem of multiplying randomly sparse 
matrices more efficiently on a generic array is also considered.
4.8.1. Band Matrix Multiplications on the Mesh Array
The mesh array for matrix multiplication can be modified to implement band matrices. 
Here, the objective is to trade a delay element from the input data with a zero entry from 
the band matrix. The collection of all band matrices executable on the array is determined 
by employing procedure ATD . The execution time of the modified array is a function of 
the bandwidth of the matrices to be multiplied
For example, consider the mesh array for the case n =3 depicted in Figure 4.7. The 
partial TG of the array is given in Figure 4.8. Compressing the TG by one level from the 
top and one level from the bottom leads to the following band matrix multiplication:
123
0 i ” 0 b » b , T
A  =
a . t ^ 1 1 s :
b „ b u b u
a „ 0 b n b u 0
Compressing the TG by two levels from the top and one level from the bottom, results in 
the following structure of band matrix multiplication:
0 0 “ 0 0 b "
A  s
0 * 1 1
/ B =
0 b n b > i
8 | 1 a » , 0 b i t b n 0
—
Compressing the TG by one level from the top and two levels from the bottom, leads to 
this form of band matrix multiplications:
" o < 7 “ 0 b ta b T
A  =
a *« i n 0
/ B =
b , t b u 0
8 j i 0 0 b n 0 0
— _
124
The modified array execution time is equal to the time of an array specially designed to 
solve the problem [43]. The specialized structure is a hexagonal array with PEs of fan-in 
and fan-out of 5 ,  while the mesh array PEs are characterized by fan-in and fan-out 
of 2 . To illustrate the concept, the modified partial TG for the third case is given in 
Figure 4.9. The resulting array is depicted in Figure 4.10. The execution time of the 
modified array is 5 cycles, while the generic array requires 9 cycles.
4.8.2. Multiplication of Triangular Matrices on the Mesh Array
The generic mesh array can be modified to implement triangular matrix multiplications 
more efficiently. This objective is easily met by using the TG to identify the time level k 
at which all the entries in the triangular matrices have contributed to their partial results. 
Time constraining the array operations for k cycles generates the modified array. The 
steps necessary for evaluating triangular matrices on the mesh array are given in the next 
procedure.
Procedure Triangular
1- Construct the TG of the array.
2- Apply steps 2 ,3 ,4  and 5 of procedure ABU , with the constraint on the 
nonzero entries in the triangular matrices.
3- Determine the time level & in the TG that the nonzero entries contributes to 
the partial results. Stop the array operations at that time level.
125
The mesh array depicted in Figure 4.7 can be modified to implement triangular matrix 
multiplication. The new form of the array operations is determined from the partial TG 
shown in Figure 4.11. In the figure, the TG consists of a total of six levels.. Hence, 
stopping the array at compute cycle k = 6 generates the proper results of the 
multiplication. The generic array requires 9 cycles to perform the same computations.
The previous examples modified the mesh array to implement band ( triangular) matrix 
multiplication. However, some arrays will not admit such modifications. For example, the 
orbital array [11] does not allow for any zero entries in the matrices that can produce 
results in a reduced number of cycles. The next lemma provides necessary conditions to 
determine if an array admits faster realizations.
Lemma 4.1: A linear array designed to implement matrix-matrix multiplication will 
not admit fast realizations if the resulting TG of the array has nodes corresponding to all 
the DWCG nodes at all time levels.
Proof : If the TG has nodes corresponding to all PEs in the array then, all data 
elements are involved at each cycle. Constraining the operations at any time level produces 
the zero matrix as the solution for achieving exact results in a reduced number of cycles.
%*»
15.1
?a*n # '1 0
fot tW«
3*3
Ko
*2
•=3
Figure 4.9. Partial TG for Band Matrix Multiplication.
bit bu
Figure 4.10. Band Matrix Mesh Array .
128
Figure 4.11. Partial TG for Triangular Matrix Multiplication .
129
4.8.3. Random Sparsity in Matrices
This section considers the problem of multiplying two randomly sparse matrices on a 
generic array. This problem has received considerable attention [91-92]; several solutions 
have been proposed, however, all the methods assume a special form of sparsity in the 
matrices.
In this dissertation a method based on fast realizations is proposed. First, the collection 
of fast forms executable on the given array is determined. Each form has an associated 
sparsity pattern. Second, row /  column permutations are used to attempt the embedding of 
the given patterns into one associated to a fast form. The technique assumes that reordering 
is feasible and can be accomplished off-line. The questions related to on-line shuffling will 
not be discussed here.
Each row/column permutation is characterized by an orthogonal permutation matrix. If 
the product to be computed is
Y  =  A B  ,
and all row/column changes in the output are feasible, the collection of possible outputs is 
given by
P Y P 1 =  ( P A Q  ) ( Q * B P j  ) .
130
In this expression , P , Pj and Q denote permutation matrices. The next result 
follows immediately.
Lemma 4.2: Two randomly sparse matrices can be multiplied faster in a generic 
array if and only if there exist permutation matrices , P , Pj and Q , that map the
sparsities in A and B to forms compatible with fast realizations.
Before establishing necessary conditions for the existence of such permutation matrices 
for the mesh array, the need arise to define the image of a matrix. This is given next
Definition 4.2 : The image of an nxn matrix A is an nxn matrix A' , whose 
element a)j is given by
a'- = 1 if ay *  0 ,and 
= 0 , otherwise.
The fast forms of the mesh array consist of a collection of band matrices, with a 
minimum of n entries on the opposite diagonal of the matrix. Let A be a fast form of 
the mesh array and A* be its corresponding image. Furthermore, let B be a randomly 
sparse matrix whose image is B \ Let the n -  vectors ra and rb be the vectors formed
by finding the summation of entries in each row of A' and B* respectively. 
Similarly, the n - vectors ca and are the vectors formed by finding the summation of
the entires in each column of A' and B* respectively. Necessary conditions for the 
existence of the permutation matrices that map B1 to A1 are in lemma 4.3.
131
Lemma 4.3 : The following are necessary conditions for the existence of
permutation matrices that embed a randomly sparse matrix into its fast form
1- There must be a match between the entries of the vectors ra and rb . This
means that one vector must be a permutation of the other.
2- There must be a match between the entries of the vectors ca and cb . This
means that one vector must be a permutation of the other.
3- The absolute value of the determinant of A' and B* must be equal.
Proof: Let P and Q be two permutation matrices, then one can write PA'Q = B '.
*
Each permutation matrix is orthogonal, hence Q Q  = I ,  where I is the identity matrix.
*
Conditions 1 and 2 follows immediately from the identity PA1 = B'Q . In addition, 
the absolute value of the determinant of a permutation matrix is equal to one, hence 
condition three results.
The above conditions must be satisfied before an attempt can be made to find the 
permutation matrices P ,  Pj and Q.
The chapter has formalized the concept of fast realizations. Systematic procedures have 
been introduced to analyze the operations of a large class of arrays. The concept enlarges 
the class of algorithms executable on a single array. The modified arrays are more adaptable 
to the changing nature of the applications.
CHAPTER 5
FAST LINEAR ALGORITHM 
IMPLEMENT AION S
The previous chapter used the concept of Timing Graph to analyze the operation of an 
array as a function of the compute cycle. In particular, several fast realizations have been 
found for linear arrays . Our interest in matrix-matrix arrays results from the observation 
that many signal processing and parameter estimation algorithms utilize those operations. 
For example , a linear filtering problem can be expressed in the form of a matrix product. 
The filter can be constrained to different forms of band matrices and still yields acceptable 
results (FIRfilters).
The concept of fast realizations restrict the operations to remain in the same class. A 
matrix-matrix multiplication remains as such; Only selected entries are omitted to speed-up 
the execution time. This chapter takes the concept of fast realizations one step further. The 
basic constraint is the number of compute cycles. As long as that constraint is satisfied 
there is no condition on the actual mathematical representation. For example, it has been 
shown that an orbital array with complete preloading does not have fast realizations of 
matrix products. It is clear, however, that the operation of the array can be stopped earlier. 
The resulting computation is a linear transformation but it is not a matrix product.
132
133
A mathematical model of a transformation is presented. The proposed representation is 
general enough to encompass a large class of signal processing and control applications. At 
the same time, it provides sufficient structure for a systematic treatment. The array 
implementation is considered and the operations are parameterized by the number of the 
compute cycle. A model of the evolution of partial transformations in such arrays, as time 
progresses, is developed.
The chapter is organized as follows : Section 5.1 introduces the class of algorithms. 
Section 5.2 develops the necessary and sufficient conditions for the existence of fast 
realizations of bilinear algorithms. Section 5.3 reviews the results of parallel 
implementation of linear algorithms. Section 5.4 analyzes the operations of time 
constrained arrays. Examples include structures for matrix-matrix computations and the 
triple matrix product. Section 5.5 gives examples of fast forms for the triple matrix 
product.
5.1. The Class of Algorithms
This section considers a class of compute bound algorithms that can be modeled by the 
map
y = A ( p , x ) .  (5.1 )
In this expression, x e£ "  represents external data to be processed. The vector p z E m 
corresponds to the parameters defining the map. The proposed map is general enough to 
encompass a large class of algorithms arising from signal processing applications. The
134
family of algorithms has sufficient structure for a systematic mathematical treatment. This 
class of algorithms admits implementations in regular arrays.
Clearly, algorithms based on matrix-vector and matrix-matrix multiplications are a 
special case of the class introduced. For example, if in the above equation , A 
corresponds to matrix-multiplication then, the vectors p and x would contain the 
entries of the matrices to be multiplied.
Assume that A ( p , x ) is a bilinear transformation, i.e. , the maps A ( . ,  x ) 
and A ( p , . )  are both linear. It is well known [93-96] that in this case there exists a
m  n
unique linear transformation L , defined on the tensor product space E ® E , 
satisfying
y = L ( p ® x ) . (5.2 )
The notation p ® x indicates the tensor product of the corresponding vectors. Using 
standard properties of the tensor products one can write
p ® x -  (p ® I ) (1  ® x ) ,
where I is the nxn identity matrix. Replacing the expression for y in Equation 
(5.2 ) gives
y = [ L ( p  ® I ) ] x (5.3 )
135
The matrix L is of dimension nxttm , while p ® I is of dimension mnxn . Hence, 
the expression in the square bracket is an nxn matrix. Thus, the following is 
established.
Lemma 5.1: The collection of bilinear algorithms of the form
n  m
y = A ( p ,x  ), x , y  in E , p in £
corresponds to the class of matrix-vector multiplications of the form
y = A ( p ) x .
Eachentiy of the matrix A ( p ) is a linear function of the vector p.
To allow for a systematic treatment of the representation, one must determine the general 
expression of the entries in the matrix A . From Equation (5 .3 ),
A ( p ) = L (p  ® I  ) . (5.4 )
The mnxn matrix p 0  I has the representation
p ® In = Diag ( p , p , ... , p ) .
136
If the nxnm matrix L is written in a partitioned form using lxm  row matrices 
I- , then the indicated product in equation ( 5.4 ) enables one to show the following
result.
Lemma 5.2: For the bilinear algorithms described in Lemma 5.1 , the matrix 
A ( p  ) has entries
( P ) = < lij . P > •
Remark: The result in this lemma is particularly useful for the analysis of possible fast 
forms . It is apparent that the constraints required by the fast realization define a linear set 
of equations on the parameter vector p .
5.2. Fast Realizations of Bilinear Algorithms
The previous section has established that the collection of bilinear algorithms defined by 
the map A ( . , . )  corresponds to a class of matrix-vector multiplication. In this regard, 
it is possible to assume that a matrix-vector array is the vehicle used to implement the 
computations. The array can be thought of as an architecture characterized by factors such 
as : the allowable number of compute cycles ( T ) ,  the possible functions executable in 
each processing element (F I  ) ,  the array connectivity matrix ( P  ) and the available 
bandwidth (10  ) capabilities.
In view of lemmas 5.1 and 5 2 , for the case of bilinear algorithms, the determination 
of fast realizations is reduced to the problem of finding fast forms for the matrix-vector
137
array executing the multiplication . The conditions for finding fast forms for bilinear 
algorithms are summarized in the next theorem.
Theorem 5.1: Assume that the matrix vector product y = Ax can be executed
faster in a given array if
ai} = 0 for ( i j  ) e S ,
where S defines the sparsity pattern associated to the fast form. The bilinear algorithm 
y = A ( p , x ) will have a fast realization in such an architecture if and only if there 
is a nontrivial solution to the set of equations expressed by
<1;. , p > = 0 , ( i , j  ) e S .
Proof: The proof follows directly from Lemmas 5,1 and 5 2  .
In this case any nontrivial solution for p leads to a fast bilinear realization executable 
on the architecture in the same number of compute cycles. The vectors 1- are determined
as in Lemma 52  .
5.2.1. Example to Illustrate Fast Bilinear Algorithms
In this example, Lemmas 5 .1 , 52  and Theorem 5.1 are applied to derive a fast 
form of a typical bilinear algorithm. Given the results in Lemma 5 .1 , one can assume a 
matrix-vector array as the vehicle to execute the computations. One possible array that
138
performs the operation y =TS.x for n = 3 , was introduced in chapter 3 , Figure
3.2. The execution time of the array is 5 cycles.
Assume that the bilinear transformation is of the form
* ,y = c  x b ,
m
where, c , x , b e E and' * ' means the vector transpose. The transformation can 
be expressed in this form
y,
v,
y,
c,  b,  c,  b,  c,  b,
c i  b , c i  bi  e , b a
c ,  b,  c ,  b , c . b ,
P.
P
P,
P. P.
P.
If 4 cycles are available to output the results, then a fast realization for the array is 
possible provided that a33 = 0 . Depending on the number of parameters one may have a
subspace of solutions. However, if each entry in L ( p )  can be selected independently 
9
then, p e E , and p is given by
139
P Pj  *P2 » P3 * P4 * P5 *P6 * P? 'P& *P9 )
1*27
Using the properties of the tensor product the matrix L e E is given by
L =
100 000 000 010 000 000 001 000 000
000 100 000 000 010 000 000 001 000
000 000 100 000 000 010 000 000 001
27x3
From equation 5.4 , ( p ® I  ) e E and from lemma 52  , the following can 
be established
a33 ~  < *33 > P >  “  ^
The row vector 133 has the following form
133 — ( 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 1 ) .
Then, if p9 =c3 b3 = 0 ( either c3 =0  or b3 = 0 ) a fast realization of the 
bilinear algorithm can be implemented exactly on the matrix-vector array in 4 cycles.
As a second example, assume that a 3x3 cylindrical array is implementing the 
operation. Here, the computations of different vectors could be interleaved. Assume that
140
4 cycles must produce the final results. The fast form of the array is achieved provided 
that a1 3  = a2 3  = <133 -  0 . Here, 113 , 1^ and 133 are given by
I13 = ( 0 , 0 , 0 , 0 , 0 , 0 , 1 , 0 , 0 ) ,
123= ( 0 , 0 , 0 ,  0 , 0 ,  0 , 0 ,  1 , 0  ) ,  
133 = ( 0 , 0 , 0  , 0 ,  0 ,  0 , 0 ,  0 ,  1 ) .
From Theorem 5.1 , solving the system of equations results in
^  I13 1P ^  = Cj b3  ~ 0  ,
< *23 ’ P > = c2 b3  = 0  , and
< I33 , p > = c3  b3  -  0  .
Thus, if b3  = 0 a fast realization of the bilinear algorithm does exist in 4 cycles.
5.3. Parallel Implementation of Linear Algorithms
Many signal extraction, parameter estimation and automatic control algorithms utilize 
matrix-vector operations. The most efficient arrays perform matrix-vector operations in 
O ( n ) time units. For very critical real time applications, and large value of n , even
141
this response time may not be sufficient. This section presents a procedure to enhance 
concurrency and increase compute speed
Recently, in [39] it has been shown that every linear algorithm can be decomposed 
as a finite sum of a special class of algorithms. This special class of linear algorithms 
permits computations in 0  (Vn ) time. The approach considers the matrix-vector 
operation
y = A x , ( 5.5 )
fJXft
where, A e E and y , x are n - vectorswith n =p q  .
pxq
Let X e E be a matrix formed by mapping the vector x on a two dimensional 
array. This can be done by using segments of x as columns of X . Similarly, let
pxq
Y e E be a matrix formed from the vector y. In [39], it is shown that equation 
5 J  can be represented as a parallel implementation of triple matrix products in the form
Y = £  LjXRj  , / = 1 ...... / . (5 .6 )
2 2 i t o j  pxp
with / 5 min(p , q ) , R. e E and L. z E  . Inequation 5 .6 , the
matrices L. and R£ are determined by the the linear transformator A. In many
important cases the number of terms I is small and all triple products can be executed 
concurrently.
142
If pq multipliers are available, and all the data elements can be loaded simultaneously to 
the array, equation 5 5  can be evaluated in Tj = pq  cycles. On the other hand, the 
triple matrix product can be executed in parallel in a total of T2 ~ q +2p -1 cycles. 
However, if only one array is used then the total execution time is T2 -  I ( q + 2p -1 )
cycles. Even the sequential execution of the triple matrix product may be faster than the 
matrix-vector case. If p = q =^ln , then T2 is less than T1 provided that
/ < Vn/J . The representation will be time efficient if / (q  +2p -1 ) «  p q  .
The speed of the sequential execution of the triple matrix product representation of the
matrix-vector multiplication is a function of the parameter I . A smart choice of / , fora
given p  and q , leads to a faster execution. In [39], an algorithm is presented that 
finds the minimum number of left matrices { L; }.
5.4. Time Constrained Implementations
This section develops a recursive model of the evolution of linear transformations in 
linear arrays, as a function of the compute cycle. The objective here is the determination of 
the form of the transformation implemented by the array at each cycle.
The initial development will consider the simple case of linear transformations modeled 
as matrix-matrix products. The analysis will then be expanded to model the evolution of 
transformations represented by triple matrix products. Since these later maps can be used 
for a parallel representation of general linear transformations, it will be possible to 
implement time constrained operations for the general case.
143
5.4.1. The Basic Cases
In this section , our attention is focused on arrays implementing linear transformations 
that can be described as the product of two matrices. Utilizing the fact that matrices can 
also be viewed as elements in their corresponding vector space [96], it is convenient to 
represent a matrix as an n vector by arranging the data columnwise. Using the 
properties of the tensor products [96], the linear transformation expressed in equation
5.6 can be described as
y°= ( I  0  A ) x° , (5.7)
© o
= A x ,
where A is an n 2x n  2 matrix , '® ' indicates the tensor product, I  is the nxn
* O O m2
identity matrix and x , y e E are the vector representation of the matrices X , Y 
by arranging the data elements columnwise.
A similar equation can be derived for the linear transformation of the form Y = X A, 
which is given by
y‘ = ( A ® I  ) x , (5.8)
o o
= A x .
Given the time sequential operations of the systolic array, it is possible to obtain a
recursive representation of the evolution of partial transformations in the array,
parameterized by k . Given the linearity of the transformations it is possible to write
144
O o 0
y k = A k x
Here, A k represents the transformation on the input data after k  cycles. The final 
transformation performed by the array on the input data elements is given by A T = A,
where T  is the total number of allowable compute cycles. The evolution of entries in the 
selection matrix A k ,k  = 0 , 1 , ,T  , is a function of the architecture evaluating the
linear transformation. In the following, the partial transformations implemented by the 
mesh and the cylindrical arrays are considered.
5.4.1.1. Mesh Array Partial Transformations
The example presents the partial transformations of a mesh array evaluating the product 
Y = AX , where A represents a transformation to be performed on the elements of the 
data matrix X . For simplicity the case n ~ 2 is considered. The mesh array is depicted 
in Figure 3.6 . The PTD of the array is given in Figure 3.8, Reading the arcs of the 
nodes holding the a-t- entries at levels { 0 , 1 ,2  , 3 }, results in the partial
transformations listed in Figure 5.1 ( a , b, c , d ) .  The transformation implemented at 
compute cycle k =3 corresponds to the matrix product operation.
5.4.1.2. Cylindrical Array Transformations
This example derives the partial transformations of a cylindrical array designed to 
implement the linear transformation Y = A X, for n =2 . The array is depicted in 
Figure 3.9 and the resulting PTD is given in Figure 3.10. Reading the arcs of the nodes 
holding the a- entries at levels [ 0 , 1 , 2  J produces the total possible transformations
145
executed by the aiTay. This is illustrated in Figure 5.2 ( a , b , c ) .  The transformation 
generated at cycle k = 2 corresponds to the matrix product.
A similar analysis leads to the collection of partial transformations depicted in Figure 
5.3 ( a , b , c ) , for the case when the array is implementing the linear transformation 
Y = XA . The transformation at k = 2 is equal to the indicated matrix product
Remark : A simple examination of the partial transformations implemented by the 
arrays in the previous examples for lc < T  , where T  is the number of compute cycles, 
indicates that none of the transformations can be expressed as a tensor product Hence, if 
the operations are constrained to k cycles , where Jfc < T  , a transformation different 
from the matrix product is performed by the array . If exact results are required in k 
cycles, then the transformation A k describes the structure of the matrix A that makes
this possible.
5.4.2. Triple Matrix Product
In this section, the triple matrix product identity of the form Y = LXR is considered. 
Here, the two matrices L and R e E represent a linear transformation to be
performed on the nxn data matrix X .
146
0 0 0 ®i 1 ®1. 0 0
Ao =
0
0
0
0
0
0
0
0 > l
l
®*i
0
0
0
0
®11
0
0
a)
0 0 0 0
b )
0 0 0 0
"a,i ®.2 0 0 "»1, ®,» 0 0
A2 =
a*i
0
®32
0
0
®,1
0
*.*
I As s
® 2 1 
0
® 2 2 
0
0
«»1
0
®1 3
c)
0 0 ®310
d)
0 0 a2, ®23
Figure 5.1. Mesh Array Y = AX , Transformations
a) k = 0 , b) k = 1 , 
c) k s  2 , d) k = 3 .
a)
a „  0
0 0
0
0
0 0
0 0
0 o
8,1 0
, A , =
®t 1 an  0 0
a , ,  0 0 0
0 0 a , , 0
0 0 8*i 8ia
b)
A , =
a,, a^ 0 0
an a* a 0 0
0 0 ®11
0 0 ®n a« c)
Figure 5.2. Cylindrical Array Y = AX 
Transformations . a) k = 0 , 
b) k = 1 , c) k = 2 .
A . -
a)
a ,i 0 
0 0
0
0
0 0 
0 a , ,
0 0 
0 0
A , =
> A, =
a , ,  0 a , , 0
0 a , ,  o a , ,
a ,, o 0 0
0 0 0 a , ,
b )
a,, 0 a,, o
0 a „  0 a , ,
a*, 0
0 a
aii
n
0 a G)
Figure 5.3. Cylindrical Array Y = XA 
Transformations . a) k = 0 ,
b) k = 1 , c) k = 2 .
148
The parallel representation of the matrix-vector identity requires efficient architectures for 
the computation of a triple matrix product. The most convenient form of implementing the 
product consists of spliting the operations into two stages. In the first stage, the product
Z =X R ,
is performed. The second stage implements the product
Y =L Z  .
Using the same convention of representing the matrices as column vectors by arranging 
the data entries columnwise and utilizing the properties of the tensor product, the 
sequential algorithm for evaluating the triple matrix product ( TMP ) identity can be 
expressed as
z  = ( R ® I ) x° , (5.9)
y = ( I 0  L ) (R  0  I )  x° ,
o o o
y = A x ,
°  2  2where A is an n x n  matrix , ' 0 '  indicates the tensor product, I is the nxn
O O O n2
identity matrix and x , y , z e E are the vector representation of the matrices 
X , Y and Z by arranging the data elements columnwise. After k cycles the recursive 
evolution of partial results in the airay is expressed by
o o o
y k = A k x •
149
The form of A k is architecture dependent. Special cases will be analyzed in the next 
secdon.
5.4.3. Time Constrained Forms of the Triple Matrix Product
The cylindrical array can be modified to implement the sequential TMP algorithm. 
Minimization of I/O operations is achieved by using PEs with two distinct modes of 
operations . In the first mode, the PE is capable of multiplying the incoming data and 
accumulates the partial results. In the second mode, the PE multiplies the incoming data 
by a stored constant and transmits the partial results. The array for n = 2 and the PE 
modes of operation are depicted in Figure 5.4 .
The operation can be pipelined for better PE utilization. This is achieved by switching 
the first row of PEs in the array to mode 2 as soon as the first data elements from the first 
row of Z are computed. Similarly, row 2 is switched to mode 2 as soon as it finishes 
contributing to the partial z,y results. In total, it takes n cycles to operate the array in 
mode 2 . However, to keep the architecture in its simplest form, it is assumed that the 
array will operate in mode 1 for 2n-l cycles to compute the matrix Z, and then it will 
switch globally to mode 2 to compute Y. In this case, the total execution time is equal to 
4n-2 cycles.
The evolution of partial transformations in the array for Z = XR as a function of the 
compute cycle is depicted in Figure 5.5 ( a , b , c ), for compute cycles k = 0 , 1 , 2 .  
Similarly, the evolution of partial transformations for Y = LZ is depicted in Figure
5.6 ( a , b , c ), for compute cycles k = 3 , 4 , 5 .  If the array operations are to be
150
constrained, then by a simple inspection one can conclude that none of the partial results 
for k < 5  can be expressed in the form of a tensor product. Hence, a transformation 
different from the triple matrix identity is performed by the array.
In a similar fashion, the orbital array can be modified to implement the sequential TMP 
algorithm. The array for computing the TMP for n - 2  is depicted in Figure 5.7. The 
PEs will have two mode of operations. The PEs will operate in mode 1 for the first n 
cycles to evaluate the product Z = XR. The PEs will switch globally to mode 2 of 
operations for the next n cycles to evaluate the product Y = LZ. In general, it takes the 
preloaded array 2n cycles to evaluate the sequential TMP algorithm.
The partial transformations for the Z = XR case are depicted in Figure 5.8, and the 
partial transformations for the case Y =LZ are depicted in Figure 5.9. Similarly, by a 
simple inspection one can conclude that none of the partial results for k < 3 can be 
expressed in the form of a tensor product Hence, a transformation different from the triple 
matrix identity is performed by the array.
Two common techniques to increase the power of an array are reconfiguration capability 
and multi-function PEs. Further use of these techniques provides greater flexibility. The 
ability of the PE to change its function can be used to advantage in generating new forms 
of transformations. As an example, for the orbital array implementing the sequential TMP 
algorithm, if it is possible to alternate the PE mode of operation at any time, a large number 
of new transformations can be generated.
151
r n  x  11 * i i  *  i i
H i  x  i i  * i i  * 1 1
12
22
21
a) 0 £  k  £  2
m ode  1
z =  z  + x  f  , 
x , =  x , 
r , =  r  .
0 1 h  0 1 i i
b)
22
21
I  i  
m ode 2
y , = y + i f  
»,= • .
Figure 5.4. Cylindrical Array for TMP.
152
■* IU —
0 0 0 Ii. 0 r., 0 fi. 0 r., 0
0 0 0 0
i A i =
0 •I, 0 0 • A 2 —
0 r.< 0 r..
0 0 0 0 r., 0 0 0 r., 0 I , 0
0 r,. 0 0 0 r,. 0 r.. 0 r,i 0 ra>■_J — —
a )  b )  c )
Figure 5.5. Cylindrical Array Transformations for ZsXR, 
a) k = 0 , b) k -  1 , c) k = 2 .
r „ l „ 0 fill,, 0 • i . l , , fit 111 fill,, f illll
0 fi. 0 1 ,
'  A« =
fiilt , •ii f t,l i, rl ,
r< .I .  i r » fl.l ii •ill ii f i l l l l fit!,. 1 ,1 .,
0 r , . 0 ri> 0 f i . l . l 0 f i l l . ,
b )
I*. 1111 1*11 l it fill,, f ill . .
a s =
Tiil,, •ii l « fill 11 f ill. . ■
Til ii Till ii f i J n ft.l It
e  )
JJ>|I|1 fill i i fillll T.iIh .
Figure 5.6. Cylindrical Array Transformations for Y=LZ,
a) k = 3 , b) k = 4 , c) k = 5 .
153
Assuming that the array is operated in such a way that the PEs will alternate their mode 
of operations every other cycle, the transformation depicted in Figure 5.10 will be 
generated. This transformation is different from the ones generated previously. Hence, 
the power of an array can be increased by taking advantage of the ability of the processing 
elements to change their mode of operations.
5.4.4. Fast Realizations of the Triple Matrix Product Algorithm
The matrix-vector identity can be expressed as the summation of small TMP products. 
The speed of the implementation depends on the number of products to be performed and 
the number of arrays that are performing the operation. Let p  be the time required to 
compute one TMP and m be the number of available arrays. Furthermore, let Tp be
the time of computing the I TMPs. Then, the lower bound for Tp is p  for
m = 1 . However, when / > m , the required time is Tp = [I /m ] (3 , where
[ .  ] indicates the largest integer function. The upper bound is Tp = I (3 for m -  I.
Hence, in general P < Tp < I (3 for 1 <, m £ I .
If the operations are to be constrained to k < T p cycles, a simple inspection of the 
partial transformations presented in the previous section indicates that a transformation 
different from the TMP is instead executed by the array. However, if the transformation 
at compute cycle k must be exact, then it is possible to find the fast forms of the 
TMP. In this regard, the TG can be used to derive the fast forms of the array. The 
objectives here are the determination of those special forms of the left and right matrices 
(L and R) that admits computations in P' < P for each TMP.
154
■W
21
x 21
12
ii mode 1z = z + x , r 
> ■
X . =  X ,  
r ,=  r .
2 £ k £ 3
b)
21
12
22 mode 2
Figure 5.7. Orbital Array for TMP.
155
r„ 0 0 0 f„ 0 r.. 0
► o II
0 0 0 r.,
> A , =
0 f,, 0
0 0 rn 0 9 1 r,, 0 r» 0
0 r,i 0 0 0 r.i 0 r>>
a )  b )
Figure 5.8. Orbital Array Transformations for Z=XR, 
a) k = 0, b) k = 1 .
r, , 1 , , 0 rnl ii 0 r . , l„ •ii l i t ruin fnlii
0 •ill ,1 0 •‘. . I . . » a 3 = •iil*t r.i l .i ful n fiiln
r„ r.ilu I'll t i l . > ft J n rul ii ftiln t . I . i
r» •ill ti r .. j J l t •iilii fiiln r.ilji
a )  b )
Figure 5.9. Orbital Array Transformations for Y=LZ, 
a) k = 2 , b) k = 3 .
In In 11* ®ii
A = o 
0
',1 "tl
0
0
o 0
lit I1,, III In r,, 
I it In  Tit l>> rn
Figure 5.10. Orbital Array Final Transformation with 
Alternating PE Modes of Operations.
156
Consider the partial TG of the cylindrical anay for the TMP depicted in Figure 5.11. 
The array requires 6 cycles to evaluate the product for n =2 . Assume that only 5 
cycles are available to perform the computations. Due to the sequential nature of algorithm 
evaluating the indicated product, the saving in time can result from constraining the Z 
matrix operations and ( o r ) constraining the Y matrix operations. The analysis is based 
on the assumption that the designer does not have control over the data matrix entries.
Utilizing the concept of TG compression, the fast forms obtained by constraining the 
matrix R for compute cycles 0 £ k £ 2 are given by
In the TG, time levels 3 < k <> 5 represent the contribution of the e L 
entries to the e Y partial results . The PEs at these time levels are operating in 
mode 2. To constrain the Y matrix operations , two choices are available. First,
compressing the TG by one level from the bottom results in the following two conditions:
0 h i -  0 or Zjj = 0 and y2/  = ^ * an(*
ii) l21 ~ 0 or z12 = 0 and y2 2  = 0 .
157
However, y2} = 121 zu  implies that y2] is zero provided that l21 = 0 or
zn  = 0. Similarly, y22 = l22 z22 and y22 are zero provided that l22 = 0
or z22 = 0. In this case, the two conditions can be restated as
i) l22 = 0 or zu  = 0 and l2l = 0 or zn  -  0 , and
ii) l2J = 0  or z12 = 0 and l22 = 0 or z22 = 0 .
The condition zu  = 0 is satisfied provided that ru  = r2l -  0. The entries zI2
and z22 will be zero if r2i — r22 = 0.
The second option considers compressing the TG by one level from the top. In this 
regard, the following conditions must hold:
i) ln  = 0 and l12 = 0 or ~ 0 and z22 = 0 ,
ii) z21 = 0 and yu * = 0 , and
iii) z12 = 0 and yI2 =0 .
The collection of L and R matrices that results in realizations in 5 cycles is
determined by taking the combinations of all the above conditions. Certainly, a wide range
of choices is obtained. The same analysis can be carried for different values of the 
constrained cycle k . The overall savings in time is a function of I , m and /?'. 
Examples of fast foims for bottom up compression of the TG are given in Figure 5.12.
# 3
s '* l
fop ?Afp
Ofi
C#J 4 /* i
» V
Figure 5.12. Fast Forms for TMP Cylindrical Array , 
a) Constraining matrix L , b) Constraining L and R matrices.
160
The chapter has derived a recursive model of the transformations generated in a class of 
linear arrays as functions of the compute cycle. It was shown that, by taking advantage of 
the compute cycle as an added degree of freedom, different families of transformations can 
be generated by an array. This technique enhances the power of the array as a 
transformation machine.
CHAPTER 6
CONCLUSIONS AND FUTURE WORK
The dissertation has established the basis of a formal methodology that can be used for 
the development of computational structures more suitable for the changing nature of real 
time signal processing and control applications. The work views an array as a 
transformational device as opposed to a computational one. The contributions of the 
dissertation to the area of more flexible arrays can be summarized into two basic concepts:
1- Algorithm fa s t realizations
This concept determines special realizations of an algorithm that can be executed more 
efficiently on a generic array. The concept restricts the operations to remain in the same 
class, but with reduced execution time. The dissertation developed the necessary tools 
needed to model the evolution of partial results in the array, parametrized by the number of 
the compute cycle. As a consequence of the analysis, a proper model of the phenomena of 
data flow in an array was developed. The array is viewed as a directed weighted 
computing graph, with nodes conesponding to the array processing cells, and the arcs as 
the medium of transporting the processed data. The model is general enough to encompass 
a large class of planar, non-planar, as well as switchable arrays. The precedence timing
161
162
diagram ( PTD ) was used to express the evolution of partial results in an array as a 
function of the compute cycle. It is a valuable tool that displays the array operations at each 
compute cycle. Systematic procedures were introduced to construct the PTD of an array. 
The PTD was mapped into a timing graph (TG ). The TG is a valuable visual aid of 
the evolution of computations in the array as a function of the compute cycle. A 
constructive graph theoretic approach was introduced that analyzes the array TG. This 
approach determines the collection of fast realizations implementable on a single array. The 
methodology takes advantage of the nature of data dependencies of the algorithm at hand. 
Data reordering is used to achieve faster execution of algorithms characterized by random 
sparsity in their matrices on a given array.
2- Time Constrained Operation
This principle takes the concept of fast realization one step further, where the basic 
constraint is the compute cycle number. As long as that constraint is satisfied their is no 
condition on the actual mathematical representation. A mathematical approach based on 
concepts from multilinear algebra, is introduced that models the recursive transformations 
implemented in linear arrays at each compute cycle. The proposed representation is general 
enough to encompass a large class of signal processing and control applications. A 
complete analytical model of the linear maps implementable by the array at each compute 
cycle is developed. By taking advantage of the number of the compute cycle as an 
additional degree of freedom and the ability of the processing element to change its mode of 
operations, it is determined that an array generates a large number of transformations 
during it course of operations.
163
Throughout the dissertation, we have demonstrated our methodology by using different 
examples of existing arrays. We have analyzed many airays, planar and nonplanar, for 
the matrix-vector multiplication, matrix-matrix multiplication, LU decomposition, 
QR factorization and the triple matrix product The methodology resulted in the ability 
of modifying the operations of those arrays to achieve faster execution for special 
realizations of an algorithm. Our emphasis was focused on analyzing full size arrays.
Lessons learned ffom analyzing existing arrays were used to design more efficient 
arrays for special realizations of the algorithm. By using the number of compute cycle as an 
extra dimension, the class of transformations that can be generated by a single array is 
enlarged. The principle results in the ability of performing triangular and band matrix 
product in an optimum time on a class of airays designed to perform dense matrix 
multiplication. Application of the methodology includes the design of flexible time varying 
structures, and the ability to decompose a full size array into sub-arrays implementing 
smaller size problems.
With regard to future work, the following topics are suggested:
1- Extension of the concepts to fixed-size and fixed-depth arrays.
2- The development of a mathematical model for the class of algorithms generated by 
alternating the modes of operations of the processing elements as a function of the 
number of the compute cycle.
164
3- The development of efficient algorithms that embed a randomly sparse matrix 
into its fast form.
4- The establishment of a global model that relates the < generated, used > elements 
available on the FE input / output links at the constrained cycle back to the high 
level language description of the algorithm. This will allow a systematic treatment 
of the fast forms.
5- Establishing the relation between the array data paths and the algorithm data 
dependence vectors and using it for the automatic reconfiguration of the airay such 
that it can executes several algorithms.
REFERENCES
1- H.T. Kung," Why systolic architectures ? ", Computer, vol. 15, Jan. 1982, 
pp. 37-46.
2- H.T. Kung, " Let's design algorithms for VLSI systems ", Proc. Conf. Very 
Large Scale Integration: Architecture, Design, Fabrication, Cal. Tech., Jan. 1979, 
pp. 65-90.
3- H.T. Kung and C.E. Leiserson," Systolic arrays ( for VLSI )" , SIAM, 1979, 
pp. 256-282.
4- M.J. Foster and H.T. Kung, " The design o f special purpose VLSI chips ", 
Computer, vol. 13, no. 1, Jan. 1980, pp. 26-40.
5- H.T. Kung," Special-purpose devices for signal and image processing : an 
opportunity in VLSI", Proc. SPIE, vol. 241, July 1980, pp. 76-84.
6- H.T. Kung and P.L. Lehman," Systolic ( VLSI) arrays for relational database 
operations", Proc. ACM-Sigmod 1980 Int’l Conf. Management of Data,
May 1980, pp. 105-116.
165
166
7- C.E. Leiserson, " Systolic priority queues ", Proc. Conf. Very Large Scale 
Integration: Architecture, Design, Fabrication, Cal. Tech., Jan. 1979,
pp. 199-214.
8- C.S. Yeh, I.S. Reed and T.K. Truong, ” Systolic multipliers for finite fields ", 
IEEE Trans, on Computers, vol. C-33, no. 4, April 1984, pp. 351-356.
9- R.P. Brent and H.T. Kung, " VLSI arrays fo r linear-time GCD computation", in 
VLSI 83, F. Anceau and E J . Aas (eds.), North Holland, 1983.
10- W.A. Porter and J.L. Aravena," Cylindrical arrays for matrix multiplication ", 
Proc. 14th Allerton Conf., Allerton, IL, Oct. 1986, pp. 595-692.
11- J.L. Aravena and W.A. Porter,1 Nonplanar switchable arrays ", Circuits Systems, 
Signal Processing, vol. 7, no. 1, 1988.
12- W.A. Porter," Nonplanar array processing ", Proc. 20th Asilomar Conf., Ca, 
Nov. 1986.
13- J.L. Aravena, " Triple matrix product architectures for fast signal processing ", 
20th Asilomar Conf., Nov. 1986, pp. 53-57.
14- A. El-Amawy, W.A. Porter, and J.L. Aravena, " Array architecture for iterative 
matrix calculations ", Proc. IEE, vol. 134, no. 3, May 1987, pp. 149-154.
167
15- A. El-Amawy," Fast LU-decomposition o f dense matrices on an optimal array", 
20th Asilomar Conf., Nov. 1986, pp. 518-522.
16- A. El-Amawy," A systolic architecture for fast dense matrix inversion", IEEE 
Trans, on Computers, vol. 38, no. 3, March 1989, pp. 449-455.
17- H.T. Kung and S.W. Song, " A systolic 3-D convolution chip ”, in 
Multicomputers and Image Processing: Algorithms and Programs, K. Perston and 
L. Uhr (eds.), Academic Press, 1982, pp. 373-384.
18- H.T. Kung, L.M. Ruan and D.W.L. Yen, ” A two-level pipelined systolic array for 
convolutions ", Proc. CMU Conf. on VLSI Systems and Computation, Oct. 1981, 
pp. 255-264.
20- P.R. Cappello and K. Steiglitz," Digital signal processing applications o f systolic 
algorithms", Proc. CMU Conf. on VLSI Systems and Computation, Oct. 1981, 
pp. 245-254.
21- G.R. Arce and P.J. Warter, " A median filter architecture for VLSI 
implementation ", Proc. 22nd Annu. Allerton Conf. on Comm. Control and 
Computing, 1984.
22- K. Oflazer," Design and implementation o f a single chip 1-D median filter", IEEE 
Trans, on ASSP, vol. 31, no. 5, Oct. 1983.
168
23- J.L. Guibas and F.M. Liang," Systolic stacks, queues and counters ", Proc. 
Conf. on Advanced Research in VLSI, 1982.
24- F.C. Lin and W u," Systolic arrays for transitive closure algorithms ", Proc. Int. 
Symp. VLSI Sys., Nov. 1986.
25- Y. Robert and D. Try strain," An orthogonal systolic array for the algebraic path 
problem ", Res. report 553, TIM3/IMAG, France, 1985.
26- S.Y. Kung, S. Lo, and P. Lewis," Optimal systolic design for the transitive 
closure and the shortest path problem", IEEE Trans, on Computers, vol. C36, 
no. 5, May 1987, pp. 603-614.
27- S.Y. Kung and S. Lo, " A spiral systolic architecture/ algorithm for transitive 
closure problems", Proc. IEEE Int. Conf. on Computer Design, 1985.
28- H. Barada and A. El-Amawy, " A fixed-size systolic system for the transitive 
closure and shortest path problems ", Proc. Int. Conf. on Advances in 
Communication and Control Systems, Oct. 1988, pp. 623-633.
29- S. Y. Kung et al.," Mapping Digital Signal Processing Algorithms onto VLSI 
Systolic/Wavefront Arrays ", Proc. 20th Asilomar Conf. on Signal, System & 
Computers, pp. 6,1986.
169
30- K. Y. Liu, " A Pipelined Tree Machine Architecture for Computing a
Multidimensional Convolution ", IEEE Trans. CAS, Vol. 29, No. 4, April 1982.
31 - E. H. Wold and A. M. Despain, " A Pipeline and Parallel-Pipeline FFT
Processors for VLSI Implementation ", IEEE Trans. Comp., Vol. c-33,
No. 5, May 1984.
32- W. A. Perera," Architectures for Multiplierless Fast Fourier Transform Hardware 
Implementation in VLSI ", IEEE Trans. ASSP, Vol. 35, No. 12, Dec. 1987, 
pp. 1750-1760.
33- C. Zhang and D. Yun," Multidimensional Systolic Networks for Discrete Fourier 
Transforms " , Proc. ICCD, pp. 215-222,1984.
34- H. S. Hou," A Fast Recursive Algorithm for Computing the Discrete Cosine 
Transform ", IEEE Trans. ASSP, Vol. 35, pp. 1455-1461, Oct. 1987.
35- H. G. Yeh, " Systolic Implementation o f Kalman filtering ", IEEE Trans. ASSP,
Vol. 35, No. 9, pp. 1514-1517, Sep. 1987.
36- C. L. Nilrias, A. P. Chrysafis and A. N. Venetsanpoulos, " The LU 
Decomposition Theorem and its Implication to the Realization of2-D Digital 
Filters " , IEEE Trans. ASSP, Vol. 33, No. 3, June 1985.
170
37- M. J. Chen and K. Yao, " Linear Systolic Arrays for Least Square Estimation ", 
1988 IEEE Int. Conf. on Systolic Arrays, pp. 83-93.
38- J. L. Aravena and W. A. Porter," Triple Matrix Product Architectures for Fast 
Signal Processing ", 20th Asilomar Conf. on Signal Syst. and Comp., Non. 1986.
39- J. L. Aravena," Triple Matrix Product Algorithms and Architectures for Fast Signal 
Processing ", IEEE Trans. Circ. Syst., Vol. CAS-31, No. 1, Jan 1988.
40- A.L. Fisher and H.T. Kung, " Synchronizing large systolic arrays ", Proc. of the 
10th Annual Int. Symp. on Computer Architecture, 1983, pp. 54-58.
41- J. Backus," Can programming be liberated from the Von Neumann style? A 
functional style and its algebra o f programs", ACM, vol. 21, Aug. 1978, 
pp. 613-641.
42- L. Johnson and D. Cohen," A mathematical approach to modeling the flow of data 
and control in computational networks", in VLSI Systems and Computations, ed. 
H.T. Kung et al„ Computer Science Press, Oct. 1981, pp. 213-225.
43- V. Weiser and A. Davis, " A wavefront notational tool for VLSI array design ", in 
VLSI Systems and Computations, ed. H.T. Kung et al., Computer Science Press, 
Oct. 1981, pp. 226-234.
171
44- C.E. Leiserson, F.M. Rose and J.B. Saxe, " Optimizing synchronous circuitry by 
retiming ", Proc. 3rd Caltech Conf. on VLSI, ed. R. Bryant, Computer Science 
Press, 1983, pp. 87-116.
45- J.A.B. Fortes, K.S. Fu and B.W. Wah," Systematic approaches to the design o f 
algorithmically specified systolic arrays", Proc. ICCD, 1985, pp. 300-303.
46- P.R. Cappello and K. Steiglitz, " Unifying VLSI array designs with geometric 
transformations ", Proc. Conf. Parallel Processing, 1983, pp. 448-457.
47- M. Lam and J. Mostow," A transformational model o f VLSI systolic design", 
Proc. 1FIP 6th Int. Symp. Comput. Hardware Description Lang. Appl., 
Pittsburgh, PA, May 1983.
48- H.T. Kung and W.T. L in," An algebra for VLSI algorithm design ", Proc. Conf. 
Elliptic Problem Solvers, Monterey, CA, 1983.
49- J.M. Jover and T. Kailath, " Design framework for systolic type arrays ", Proc. 
ICASSP, 1984, pp. 851-854.
50- D.S. Schwarts and T.P. Barnwell E l," A graph theoretic technique for the 
generation o f systolic implementations for shift-invariant flow graphs ", Proc. of 
ICASSP 84, 1984, pp. 8.3.1-8.3.4.
172
51- I.V. Ramakrishnan, D.S. Fussell and A. Silberschaitz," Mapping homogeneous 
graphs on linear arrays", IEEE Trans, on Computers, vol. C-35, no. 3, March 
1986, pp. 189-209.
52- L. Guo-Jie and B. Wah," The design o f optimal systolic arrays ", IEEE Trans. 
Computers, vol. 34, no. 1, Jan. 1985, pp. 66-77.
53- H.D. Cheng, W.C. Lin and K.S. Fu, " Space-time domain expansion approach to 
VLSI and its application to hierarchical scene matching ", Proc. 8th Int. Conf. 
Pattern Recognition, Montreal, Canada, Aug. 1984.
54- D.I. Moldovan," On the analysis and synthesis o f VLSI algorithms ", IEEE 
Trans, on Computers, vol. C-31, no. 11,1982, pp. 1121-1126.
55- D.I. Moldovan," On the design o f algorithms for VLSI systolic arrays ", Proc. of 
the IEEE, vol. 71, no. 1, Jan. 1983, pp. 605-612.
56- D.I. Moldovan," ADVIS: A software package for the design o f systolic arrays", 
84 Int'l Conf. on Computer Design: VLSI in Computers, Oct. 1984, pp. 158-164.
57- J.A. Fortes," Algorithm transformations for parallel processing and VLSI 
architecture design ", PhD dissertation, University of Southern California, Los 
Angeles, CA, Dec. 1983.
173
58- P. Quinton,11 Automatic synthesis o f systolic arrays from uniform recurrent 
equations ", Proc. 11th Ann. Sym. Comp. Architecture, 1984.
59- H. R Barada," A comprehensive methodology for algorithm characterization, 
regularization and mapping into optimal VLSI arrays ", PhD dissertation, 
Louisiana State University, Baton Rouge, Louisiana, Auguest 1989.
60- M. Chen," A Design Methodology for Synthesizing Parallel Algorithms and 
Architectures ", Journal of Parallel and Distributed Computing,
pp. 268-273, 1985.
61- K. K. Chung and O. Wing," Mapping Strategy for Automatic Design o f Systolic 
Arrays ", 1988 IEEE Int. Conf. on Systolic Arrays, pp. 285-294.
62- M. Payer," Formal Derivation o f Systolic Arrays, A Case Study ", 1983 IEEE Int. 
Conf. on Systolic Arrays, pp. 331-340.
63- H.W. Nelis," Automatic design and partitioning o f systolicfwavefront arrays for 
VLSI", Circuit, Systems and Signal Processing, vol. 7, no. 2,1988.
64- S. Y. Kung," VLSI Array Processors ", Information and System Science Series, 
Prentice Hall, 1988.
174
65- K. Jainandunsing," Systematic Design o f Fixed Size VLSI Systolic Processor 
Arrays Applied to the Orthogonal Faddeeva Equations Solver ", ECCTD,
Sep. 1987
66- J. A. B. Fortes and D. I. Moldovan, " Parallelism detection and Algorithm 
Transformation Techniques Useful for VLSI Architecture Design ", J. Parallel 
Distribution Comp., Vol. 2, No. 3, pp. 277-301, 1985.
67- C. Guerra and R. Melhem," Synthesis o f Systolic Designs ", Proc. Int. Conf. 
Parallel Proc., pp. 765-772,1986.
68- W. A. Porter and J. L. Aravena" Array Resolution ", Circuits Systems and Signal 
processing, special issue VLSI, 1986.
69- L. Snyder, " Introduction to the configurable high parallel computer", IEEE 
Comp., pp.47-55, Jan. 1982.
70- M. Annaratone et al., " The warp computer; architecture, implementation and 
performance " , IEEE Trans. Comp. vol. C-36, pp. 1523-1537, Dec. 1987.
71- G. He and H. Henry," A versatile systolic array for matrix computations ", 72th 
Symp. Comp. Archt., pp. 315-321, June 1985.
72- A. Nash et a l.,"  A general purpose array processor based on the Faddeeva 
algorithm ", IEEE Comp., pp. 55-62, Jan. 1984.
175
73- M  Sami and R. Stefanelli," Reconfiguration Architectures for VLSI Processing 
Arrays ", Proc. of the IEEE, May 1986, pp. 712-722.
74- J. H. Kim and S, M. Reddy," On Easily Testable andReconfigurable Two 
Dimensional Systolic Arrays", Intl. Conf. on Parallel Process., Aug. 1988.
75- P. P. Sanjay and M. A. Bayoumi," A Reconfigurable VLSI Array for Reliability 
and Yield Enhancement ", 1988 IEEE Int. Conf. on Systolic Arrays, pp. 631-642.
76- R. Hughey and D. P. Lopresti," Architecture o f a Programmable Systolic Array ", 
1988 IEEE Int. Conf. on Systolic Arrays, pp. 41-49.
77- D.I. Moldovan and J.A. Fortes," Partitioning and mopping algorithms into fixed 
systolic arrays ", IEEE Trans, on Computers, C-35, no. 1, Jan 1986, pp. 1-12.
78- D. Heller," Partitioning big matrices for small systolic arrays ", in VLSI and 
Modem Signal Processing, pp. 185-199, ed. T. Kailath, Prentice Hall, NJ, 1985.
79- K. Jainandunsing," On optimal Partitioning Schemes for SystoliclWavefront Array 
Processors ", Proceedings ISCAS, San Jose, 1986.
80- J. J. Navarro, J. M. Llaberia and M. Valero," Partitioning; An Essential Step in 
Mapping Algorithms into Systolic Array Processors ", IEEE Computer 20(7), 
pp. 77-89, Jul. 1987.
176
81- J. H. Moreno and T. Lang," On Partitioning the Faddeeva Algorithm ", 1988 IEEE 
Int. Conf. on Systolic Arrays, pp. 125-135.
82- J. Nash, S. Hansen, and K. Przytula, " Systolic Partitioned and Banded linear 
Algebraic Computations", SPIE Real-Time Signal Processing IX, pp. 10-16,1986.
83- J. J. Navarro, J. M. Llaberia, and M. Valero," Computing Size Independent 
Matrix Problem on Systolic Array Processors ", 13th Int. Symp. on Comp.
Archt., pp. 271-278, Jun. 1986.
84- A. O. Barbir and J. L. Aravena," Fast systolic algorithm realizations", Third 
Parallel Processing Symp., Fullerton California, March 1989.
85- J. L. Aravena and A. O. Barbir," Time constrained operations o f systolic arrays", 
21st Southeastern Symp. on System Theory, Tallahassee, Florida, March 1989.
86- C. D. Johnson and D. W. Thornes, " The1 Paralysis by Analysis' Problem in 
Resource Allocation Control with Application to SDI " , Proc. 26th CDC,
Los Angeles, CA, Dec. 1987, pp. 1670-1697.
87- J. L. Aravena and W. A. Porter," Array Constrained Design o f Digital Filters", 
(under review ASSP Trans.).
88- J. L. Aravena and W. A. Porter;" Array Based Design o f Digital Filters", Proc. 
ComCon 88, Oct. 1988, pp. 1372-1379.
177
89- M.N. Swamy and K. Thulasiraman, Graphs, Networks and Algorithms, Wiley, 
New York, 1981.
90- A. V. Aho, J. E. Hocroft and J. D. Ullman, The design and analysis o f computer 
algorithms, Adison Wesly, MA 1974.
2
91- H. Amano et a l.," (sm) -II: A new version o f the sparse matrix solving 
machine ", J2th Symp. Comp. ArchL, pp. 100-107, June 1985.
92- K. Onaga and T. Takechi," On design o f rotary array communication and 
wavefront-driven algorithms for solving large scale band limited matrix equations ", 
/5th Annual Inter. Symp. Comp. Architecture, pp.308-315, June 1986.
93- J. L. Aravena and W. A. Porter," Multidimensional Aperture Control ", IEEE 
Trans. ASSP, Vol. 33, No. 1, Feb. 1985, pp. 185-194.
94- . J. L. Aravena and W. A. Porter, " Algorithms for Fast Multidimensional Signal
Extractions ", J. Math. Anal, and Appl., Vol. 107, No. 1, April 1985, 
pp. 300-315.
95- R. R. Mohler, Bilinear Control Processes. New York: Academic Press, 1973.
96- R. M. Hirschom," Control o f bilinear systems ", SIAM J. Contr., vol. 10, no. 2,
May 1972.
VITA
Abdulkader Omar Barbir was bom in Beirut, Lebanon on October 30,1959. He earned 
his Bachelor of Science and his Master of Science degrees in Electrical Engineering from 
Louisiana State University, in May of 1982 and December of 1983, respectively. He was a 
teaching assistant ( January 1984 to August 1984 ) in the department of Nuclear Science 
and ( August 1985 to December 1986, January 1988 to August 1989 ) in the department of 
Electrical and Computer Engineering at Louisiana State University. He was a research 
assistant ( January 1987 to December 1987) in the department of Electrical and Computer 
Engineering at Virginia Tech and from ( January 1986 to December 1986, January 1988 
to August 1989 ) in the department of Electrical and Computer Engineering at Louisiana 
State University.
He is currently with the department of Mathematics and Computer Science at Western 
Carolina University. He is a member of Alpha Theta Kappa , Eta Kappa Nu , IEEE , 
IEEE Computer Society and ACM.
178
