A new general purpose systolic array for matrix computations by Le, Hai Van Dinh
Portland State University 
PDXScholar 
Dissertations and Theses Dissertations and Theses 
1988 
A new general purpose systolic array for matrix 
computations 
Hai Van Dinh Le 
Portland State University 
Follow this and additional works at: https://pdxscholar.library.pdx.edu/open_access_etds 
 Part of the Electrical and Computer Engineering Commons 
Let us know how access to this document benefits you. 
Recommended Citation 
Le, Hai Van Dinh, "A new general purpose systolic array for matrix computations" (1988). Dissertations 
and Theses. Paper 3796. 
https://doi.org/10.15760/etd.5680 
This Thesis is brought to you for free and open access. It has been accepted for inclusion in Dissertations and 
Theses by an authorized administrator of PDXScholar. Please contact us if we can make this document more 
accessible: pdxscholar@pdx.edu. 
AN ABSTRACT OF THE THESIS OF Hai Van Dinh Le for the Master 
of Sciences in Electrical Engineering presented May 18, 1988. 
Title: A New General Purpose Systolic Array for Matrix 
Computations. 
Marek P~rkowski, Chairman 
Erasto Kashoro 
Mohammad #,ha 
It has been conservatively estimated that 75 percent 
of all scientific applications involve some form of matrix 
computations. In general, matrix computations are very 
expensive in term of processing time. For real time 
operation required by such applications as robotics, signal 
processing and computer graphics animation, the processing 
power of serial computers is simply inadequate. 
2 
In this thesis, we propose a new systolic architecture 
which is based on the Faddeev's algorithm. Because 
Faddeev's algorithm is inherently general purpose, our 
architecture is able to perform a wide class of matrix 
computations. And since the architecture is systolic based, 
it brings massive parallelism to all of its computations. 
As a result, many matrix operations including addition, 
multiplication, inversion, LU-decomposition, transpose, and 
solutions to linear systems of equations can now be 
performed extremely fast. In addition, our design 
introduces several concepts which are new to systolic 
architectures: 
- It can be re-configured during run time to 
perform different functions with the uses of 
various control signals propagating 
throughout the arrays. 
It allows for maximum overlaps of processing 
between consecutive computations, thereby 
increasing system throughput. 
There have been other architectures proposed for this 
problem. However, a thorough analysis performed in this 
thesis reveals that they suffer from serious drawbacks, 
design inefficiencies or even errors. Thus, they are 
impractical for actual implementation. On the other hand, 
the new architecture is free from all of these weaknesses 
3 
while offering many important advantages, some of which are 
~ 
listed as follo~f 
It is truly problem size independent, i.e. 
matrices which are arbitrarily large can be 
easily decomposed to be processed by a fixed 
size array. 
- It can solve sparse matrix problems 
efficiently without requiring system re-
configuration. 
- It provides the same level of performance as 
the known architectures using a smaller 
number of cells and arrays. 
- It is fully expansible, i.e. linear 
performance improvement can be achieved by 
simple addition of identical component 
arrays. 
- Because of its simplicity, it can be 
implemented inexpensively and with very 
little effort. 
We also describe in this thesis several extensions to 
Faddeev's algorithm which are ideally suited for problem 
size independent systolic architectures such as ours. These 
extensions~classified as horizontal, vertical, and two-
dimensional~not only increase a system throughput from two 
to four fold but also enhance the inherent programmability 
of Faddeev's algorithm. This allows our architecture to 
perform very complex matrix calculations. 
4 
An example of 
this enhanced programmability for complex matrix calculation 
is presented as well. 
A NEW GENERAL PURPOSE SYSTOLIC ARRAY 
FOR MATRIX COMPUTATIONS 
by 
HAI VAN DINH LE 
A thesis submitted in partial fulfillment of the 
requirements for the degree of 
MASTER OF SCIENCES 
in 
ELECTRICAL ENGINEERING 
Portland State University 
1988 
TO THE OFFICE OF GRADUATE STUDIES: 
The members of the Committee approve the thesis of Hai 
Van Dinh Le presented May 18, 1988. 






~ ...... '-<.l~.t:' ...... ~v .. , ,_i1a.r'r, Department of Electrical Engineering 
Bernard Ross, Vice Provost for Graduate Studies 
ACKNOWLEDGMENTS 
Faddeev's algorithm became infinitely clearer after 
Dr. Robert Broussard illustrated a simple example. For 
this, and also for the several office visits during which he 
patiently answered my technical questions, I am sincerely 
grateful. 
I also wish to give thanks to Dr. Roy Rathja for 
providing me with some important articles on the subject of 
parallel computer architectures. They have been quite 
valuable and are greatly appreciated. 
Special thanks are due to Dr. Marek Perkowski, my 
thesis advisor, who has suffered the indignation of many 
long hours going through several "final" versions of this 
thesis; his extensive detailed comments and suggestions have 
helped me fix innumerable errors and omissions. 
Finally, I would like to express my deepest gratitude 
to my wife and best friend, Tuyet Uong, for her devotion and 
unwavering support through all these years. Without her, my 
entire higher education would have been impossible. This 
thesis is lovingly dedicated to her. 
TABLE OF CONTENTS 
PAGE 
ACKN'OWLEDGMENTS • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • • iii 
LIST OF FIGURES . ........................................ vii 
CHAPTER 
I INTRODUCTION. . . . . . . . . . . . . . . • . . . . . . . . . . . . . . . . . . . . . . 1 
Ways and Obstacles in Speeding Up 
Digital Systems................................ 2 
The Systolic Architecture Concept.............. 6 
Systolic Architecture Design Criteria.......... 9 
Organization of This Thesis ...•................ 10 
II FADDEEV'S ALGORITHM AND MATRIX 
TRIANGULARIZATION SYSTOLIC ARRAYS ................. 13 
Faddeev's Algorithm .............•.............. 14 
Systolic Arrays for Matrix Triangularization ... 17 
Gaussian Elimination With 
Neighbor Pivoting........................... 19 
Orthogonal Triangularization ..•............. 20 
III SYSTOLIC IMPLEMENTATIONS OF FADDEEV'S ALGORITHM ... 23 
Nash's Implementation •.•....................... 23 
Chuang and He's Implementation ................. 30 
Input Decomposition ..........•.............. 35 
Feedback System for Parallel Decomposition .. 37 
Feedback System for Vertical Decomposition .. 42 
Feedback System for Hybrid Decomposition .... 45 
Sparsity in Matrices ........................ 46 




IV A NEW SYSTOLIC ARRAY ARCHITECTURE ....•.•.......... 54 
Architectural Description .•...••............... 54 
PEs' Description. . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 
Control Signals Interconnections .......•.... 59 
Control Interface With Host ................. 62 
Data Flow Description ......•................... 64 
Storage and Feedback of 
Modification Factors ...••...•..•............ 65 
Solving Size Independent Problems .•............ 66 
Input Decomposition and 
Vertical Feedback Path ...................... 67 
Controls and Horizontal Feedback Path ....... 72 
Multiple Arrays Configurations .............. 74 
Intermediate Results Storage ................ 79 
Processing of Sparse Matrices .................. 80 
Overlaps in Processing Between Problems ........ 83 
V EXTENSIONS TO FADDEEV'S ALGORITHM AND CONCLUSION .. 86 
Horizontal Extension to Faddeev's Algorithm .... 87 
Vertical Extension to Faddeev's Algorithm ...... 93 
Two-Dimensional Extension to 
Faddeev's Algorithm ............................ 97 
Concluding Remarks ............................. 104 
REFERENCES. . • . . • • • • • . . . • • • • • • . . . • . • • • . • • • • • • . . . . . . . . . • . • 10 6 
APPENDIX A EXAMPLES OF FADDEEV'S ALGORITHM ............. 109 
Using Ordinary Gaussian Elimination ...... 109 
Using Gaussian Elimination With 
Neighbor Pivoting .•.......•.............. 113 
Using Givens Rotations ................... 123 
vi 
PAGE 
APPENDIX B REAL TIME GRAPHICAL SIMULATION OF 
SYSTOLIC ARRAYS. • • • • . . . . • . . . . . . . . • • • . • . . . . . . 131 
APPENDIX C SAGS PROGRAM LISTING ...••.••...•.••.••...••. 170 
LIST OF FIGURES 
FIGURE PAGE 
1. Some Matrix Operations Possible With 
Faddeev' s Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 7 
2. Triangular Systolic Array for Matrix 
Triangularization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 
3. Microcode Specifications of Boundary Cell 
and Internal Cell for Gaussian Elimination 
With Neighbor Pivoting .......•................... 21 
4. Microcode Specifications of Boundary Cell and 
Internal Cell for Orthogonal Triangularization ... 22 
5. Nash's Systolic Implementation of Modified 
Faddeev's Algorithm .............................. 25 
6. Microcode Specifications of Boundary Cell and 
Internal Cell Used in Nash's Array During the 
First Phase, i.e. Givens Rotations ............... 26 
7. Microcode for Boundary Cell and Internal Cell 
Used in Nash's Array During the Second Phase, 
i.e. Gaussian Elimination ..•..................... 27 
8. Chuang and He's Systolic Implementation of 
Faddeev's Algorithm .............................. 31 
9. Microcode Specifications of Cells Used by 
Chuang and He's Array for Gaussian Elimination 
With Neighbor Pivoting ..•........................ 33 
10. Microcode Specifications of Cells Used by 
the Array for Ordinary Gaussian Elimination ...... 34 
11. Three Ways to Decompose the Input Data Flow. 
(a) Parallel Decomposition. (b) Vertical 
Decomposition. (c) Hybrid Decomposition ......... 36 
12. Systolic System With 26 Subarrays of Types 
T ands, Each of Width w ••••••••••••••.••..•.•••• 38 
viii 
FIGURE PAGE 
13. Feedback Systolic System With a Smaller Number 
of subarrays for Parallel Decomposition .......... 39 
14. Two-Dimensional Feedback System With One S 
and One T Subarrays. . . . • . . • . . • . . • • . . . . • . . . . . . . . . . 41 
15. Array System for Vertical Decomposition of 
l:l11>1lt r>a.t:Ci FlolN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 
16. Problem Size Independent Array System for 
Vertical and Hybrid Decomposition of Input 
Data Flow........................................ 44 
17. Recycling Shift Registers for the Temporary 
Storage of the X Values .....•........•........... 45 
18. Parallel Decomposition of a Sparse Matrix 
Problem With m = 6............................... 47 
19. Systolic System for the Processing of Sparse 
Problems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 9 
20. Processing Sequence Showing the Order in 
Which the Non-Zero Blocks of Figure 18 Are 
Fed Into the System of Figure 19 ........•..•..... 51 
21. Dual Mode Systolic Implementation of 
Faddeev's Algorithm. The Number of Cells 
Needed Is Smaller and I/O Bandwidth 
Requirement Is Reduced .............•............. 55 
22. Microprogram Specifications of the Circular 
and Square Cells for the Array's Dual 
Mode Operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 
23. Dual Mode Array Shown Only With the 
Interconnection Pattern for Control Signal Cl .... 60 
24. Dual Mode Array Shown Only With the 
Interconnection Pattern for Control Signal C2 .... 61 
25. Dual Mode Array Shown Only With the 
Interconnection Pattern for Control Signal C3 .... 62 
26. Array Showing Only the Interconnection Pattern 
of Control Signal C4. . . . . . . • . . . . . . . . . . . . . . . . . . . . . 63 
27. First Iteration in the Processing of a Problem 
Larger Than the Array Size ....•.•................ 68 
ix 
FIGURE PAGE 
28. Second Iteration of the Problem .................. 70 
29. Control/Timing Sequences of Input and Output 
Data Flow for Each Iteration. The Dash/Dotted 
Lines Represent Input Strips, While the Dotted 
Lines Represent the Output Strips .••....•..•..... 72 
30. L-tuple Arrays System Processing a Problem 
Larger Than the I/O Bandwidth w •••••••••••••••••• 75 
31. Control/Timing Sequences for Each Array .......... 77 
32. An L-tuple Arrays System With a Common Data 
Bus From Each Array to Host. The Vertical 
Feedback Path Has a FIFO Queue Br for 
Temporary Storage of Intermediate Results ........ 79 
33. Reduced System for Sparse Matrix Processing ...... 81 
34. Parallel Decomposition of x = 3 Horizontally 
Compatible Problems .............................. 89 
35. Parallel Decomposition of y = 3 Vertically 
Compatible Problems .•............................ 95 
36. Parallel Decomposition of x by y Compatible 
Problems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 
SEQUENCE OF SNAPSHOTS 
B.1. Simulation of Nash's Systolic Array Solving 
Ex amp 1 e (A . 4 ) • • . • . • . • . . . . • • . • . . • . . • . • . . • . . • . • • . . . 13 5 
B.2. Simulation of Chuang and He's Systolic 
Array Solving Example (A. 2) ...................... 142 
B.3. Simulation of an L-tuple Arrays System 
Solving Example (A.3), With L = 2 .......•........ 150 
CHAPTER I 
INTRODUCTION 
As a general class of problems, matrix computations 
are found to be very useful, if not essential, within a 
broad spectrum of scientific applications. However, they 
are generally expensive in terms of storage space and 
processing time. To be sure, numerous algorithms with 
substantially reduced storage requirement have been devised 
for specific matrix computations. Yet, it is with the 
recent abundance of low cost memory that storage demands of 
matrix computations in general cease to be an important 
issue. On the other hand, the need for greater throughput 
rate has become more acute as applications grew in power and 
complexity. Indeed, for real time operation required in 
such applications as robotics, signal processing and 
computer animation, the computing power of serial computers 
proved to be woefully inadequate. Before long, it was 
evident that the only way to meet the ever growing 
computational requirements of many applications is to build 
faster systems. 
2 
WAYS AND OBSTACLES IN SPEEDING UP DIGITAL SYSTEMS 
Essentially, there are two ways to build faster 
systems. One is to use fast components, the other is to use 
concurrency. 1 Since the technological trend clearly 
indicates that we are reaching the maximum components speed 
potential, any major gain in computational speed must come 
from the concurrent use of many processing elements. 1 • 2 • 3 • 4 
As it is, the architecture of conventional computers suffers 
from two inherent difficulties: 
(1) Long communication paths such as buses between 
CPU and its memory substantially slow down the 
transmission of information. Also, the system 
I/O bandwidth provides an absolute upper bound 
limit on the data rate, which acts as a 
bottleneck in limiting the system speed. 4 • 5 • 6 • 7 
{2) The single CPU sequentially fetches and executes 
instructions thereby does not fully exploit its 
hardware resources at all times, providing 
little or no concurrency for speeding up 
processing. 5 • 6 • 7 
Indeed, the above problems are widely acknowledged for 
quite some time. Nevertheless, they remain to be formidable 
obstacles which must be surmounted before a substantial 
speed increase is obtained. Several schemes were proposed 
to address one or both of them while maintaining the same 
degree of generality offered by the von 





- replicating CPU's processing units (such as 
adders, multipliers, ALUs), 
- and multiprocessor systems. 4 • 7 
Pipelining, as a form of parallelism, involves the 
application of assembly line techniques to improve the 
performance of an arithmetic or control unit. 7 • 8 • 9 
Theoretically, maximum utilization of available components 
can be achieved if the pipeline is kept full at all time. 
However, in actual operation this ideal condition is 
impossible to maintain and speed gains occur only in burst 
between pipeline flushes. Al though widely implemented on 
many high-speed systems today, pipelining does not fully 
exploit the parallelism inherent in many applications and 
only constitutes a minor architectural fine tuning of the 
basic von Neumann structure. 7 • 10 
Memory caching is used to reduce the cost of memory 
and alleviate the communication bottleneck at the expense of 
additional system complexities. A memory cache is a small 
but high-speed memory system that tries to capitalize on 
temporal locality, the theory of which basically states: if 
a particular instruction or piece of data is read from 
memory, then the probability of it being used again 
increases. 
instructions 
Thus, after the 
or data brought in 
4 
cache is filled with 
from slower memory, the 
number of subsequent reads by the CPU which can be performed 
at full speed to the cache increases before access to slower 
memory is required. The effectiveness of a cache memory is 
known as the "hit ratio." Given a certain number of 
instructions (or data) that must be fetched, the hit ratio 
is the number that can be accessed from the cache versus how 
many that must be accessed from slower memory. Generally, 
the cache employs the highest-performance 
technology~bipolar; 4 however with the performance of CMOS 
technology steadily bridging the gap and its cost declining, 
the trade-off seems less attractive. Furthermore, while 
caches seem to work well with single processor computers, 
they are difficult to incorporate into multiprocessor 
systems because of the cache coherence problem. Cache 
coherence relates to the integrity of data between various 
caches within a system. Suppose a two processor system is 
tightly coupled through a main memory but each processor has 
its own data cache. A different routine is running on each 
processor and the two tasks communicate through the shared 
memory. If, however, a shared address is present in both 
caches and the individual processors read and write that 
address, then each processor would not have the same piece 
of data in its respective cache. This results in neither 
processor seeing the changes caused by the other. 7 
5 
Of course there are schemes to remove this problem, 
but they invariably add further complexity to the system. 
Thus, while partially improving the performance of the von 
Neumann architecture, pipelining and caching create other 
problems of their own. 
At the other end, we have systems with replicated 
processing units or multiple processors which incorporate a 
very high degree of parallelism while striving to retain the 
same level of generality available in von Neumann 
architecture; however, run-time considerations such as tasks 
synchronization and memory contention incur rather severe 
system overhead. 7 • 11 • 12 Thus, full utilization of available 
hardware can never be realized. 8 • 12 
Simply stated, the price for generality in highly 
parallel structures is decreased speed, decreased efficiency 
of hardware utilization, and increased software 
requirements. 1 2 During the last decade, there have been 
many highly parallel general-purpose architectures proposed 
or implemented. In general, they required many man-years 
of efforts to design and, because of their complexity, were 
very costly to build. 
Tailored to meet specific application requirements or 
to off-load computations especially taxing to general-
purpose computers, 1 special-purpose systems provide a very 
high degree of parallelism with minimum system overhead and 
complexity. They are generally the fastest and most 
efficient in hardware utilization. 12 
6 
However, because of 
their limited applicability, their cost must be low enough 
to justify their selection over a general-purpose approach. 
THE SYSTOLIC ARCHITECTURE CONCEPT 
Because special-purpose systems are seldom produced in 
large quantities, their design cost is a lot higher 
comparing to the parts cost. 1 • 6 This is particularly true 
when special-purpose designs are implemented with VLSI 
technology. Even though VLSI offers a number of major 
benefits~low cost per component, high density, reliability 
and ease of fabrication, 2 • 5 • 6 effective use of the 
technology to achieve massive parallelism requires careful 
consideration. 
Briefly, a highly parallel VLSI structure should 
adhere to the following principles: 1 • 3 • 10 • 11 • 12 • 13 • 14 
(1) Simplicity and regularity: the design should 
consist of only a few simple types of modules 
which are replicated many times, thus reducing 
design complexity. A simple and regular 
structure is therefore highly cost-effective. 
In addition, such a structure can be easily 
expanded by increasing the number of basic 
modules. This, in turn, leads to linear speed 
improvements. 
( 2) Concurrency: The degree of concurrency in a 
system is largely determined by the underlying 
algorithm. Massive parallelism can be achieved 
if the algorithm is designed to support a high 
degree of pipelining and multiprocessing. 
(3) Communication: Control and communication become 
significant in a parallel computing structure, 
especially with VLSI where routing costs 
dominate the power, time and area required to 
implement 
underlying 
a computation. The design's 
algorithm should therefore employ 
only simple, regular control and communication. 
In a processor array, communications should 
occur only between neighboring processors. 
(4) I/O consideration: Since a special-purpose 
device is typically attached to a host, its 
computational rate should not exceed the host's 
available I/O bandwidth. Therefore, if multiple 
computations are performed per I/O access, 
orders of magnitude improvements on system 
throughput are possible. 
7 
To meet these requirements, Kung and Leiserson in 1977 
introduced the concept of systolic architecture. Originally 
proposed for VLSI implementation of some matrix 
operations, 15 a systolic system consists of an array of 
processing elements (PE's) called cells, each capable of 
8 
performing some simple operations. These cells communicate 
only to their nearest neighbors, and communication with the 
outside world-i. e. the host-occurs only at the boundary 
cells. 1 • 5 • 6 • 15 Data flow from the host through the array in 
a rhythmic fashion, and computations are synchronized by a 
global clock signal. Each data item once brought out from 
memory is used effectively at each cell while being moved 
from cell to cell along the array. 
Conceptually, computational tasks can be classified 
into two categories-compute-bound computations and I/O-
bound computations. In a computation, if the total number 
of operations is larger then the total number of input and 
output elements, then the computation is compute-bound, 
otherwise it is I/O-bound. While speeding up an I/O-bound 
computation must rely on an increase in memory bandwidth, 
the systolic architecture allows a speed-up of a compute-
bound computation without increasing the memory bandwidth 
requirement. 1 
Since cells in a systolic array are of only a few 
simple types, cost-effectiveness and ease of VLSI 
implementation are among the many advantages that systolic 
architecture offers. Others include simple and regular 
control and data flow, elimination of global broadcasting 
and modular expansibility. 1 
9 
SYSTOLIC ARCHITECTURE DESIGN CRITERIA 
Today, the systolic approach is increasingly being 
considered for computational intensive problems and there 
exist many systolic designs for a wide class of compute-
bound applications. In several of his papers, 1 • 6 Kung 
suggested a number of systolic design criteria which are 
briefly outlined below. 
(1) The design makes multiple use of each input data 
item. This property allows systolic systems to 
achieve high throughputs with modest I/O 
bandwidths for outside communication. 
(2) The design uses extensive concurrency. The 
underlying algorithm should use as many of the 
available cells as possible at any given time 
during a computation. Even higher concurrency 
is possible if another level of pipelining is 
introduced to operations within the cells 
themselves. 
(3) There are only a few types of simple cells. A 
large number of cells are required if a systolic 
design is to achieve any great performance 
gains. The cells must therefore be simple and 
of only a few types to curtail design and 
implementation costs. However, one should 
remember that there is always a trade-off 
between cell simplicity and flexibility. An 
exact estimate can only be arrived at on a case 
by case basis. 
( 4) Data and control flows are simple and regular. 
Pure systolic systems totally avoid long-
distance or irregular data communication wiring. 
This is the principal reason why a systolic 
array is adjustable to various performance 
goals. The only global communicat.ion (besides 
power and ground) is the system clock. 
ORGANIZATION OF THIS THESIS 
10 
Even though the systolic architecture offers many 
advantages, it is not without some drawbacks. One possible 
problem is that if a systolic array is too large, its global 
clock signal could be skewed to the point where two cells at 
its opposite ends could not be synchronized properly. 1 • 11 
Another issue is the degree of utility a systolic device can 
support. Proposed as a special-purpose architecture, one 
nonetheless wants a systolic array to be able to perform 
more than one type of computation. These are issues which 
cannot be resolved satisfactorily unless both architectural 
and algorithmic considerations are reviewed carefully. 
The rest of this thesis is divided into four chapters. 
In Chapter II, a brief introduction to Faddeev's algorithm 
is presented; because the main focus of this thesis is in 
its architectural mapping, a more thorough treatment of the 
11 
algorithm is referred to the original book listed in the 
REFERENCES section. Since matrix triangularization is an 
essential component of Faddeev's algorithm, descriptions of 
two systolic arrays for this matrix operation are also 
included. 
Chapter III contains detailed examinations of two 
systolic implementations of Faddeev's algorithm. Analysis 
of the designs performance and correctness of operation is 
presented. Also, their advantages and weaknesses are 
discussed in this chapter. 
In Chapter IV, a new systolic array implementation of 
Faddeev's algorithm is proposed. Again, a detailed 
description and a performance analysis of the design are 
offered. Necessary comparisons to the previous arrays 
concerning modularity, expansibility, versatility and ease 
of implementation will show it to be vastly superior. 
In Chapter V, three different extensions to Faddeev's 
algorithm are developed. It will be shown that these 
techniques are ideally suited to the new systolic array. 
This leads to a four fold increase in the array throughput 
when matrix operations are to be solved continuously. 
Lastly, concluding remarks are offered at the end of this 
chapter. 
Relevant materials that do not fall within the main 
focus of this thesis but are nonetheless important are 
included in the three appendices A, B and c. For the reader 
12 
who wish to verify how Faddeev's algorithm solves various 
matrix computations, Appendix A contains examples which 
illustrate different variants of the algorithm. If he 
wishes to further investigate the operation of all 
architectures put forth in this thesis, Appendix B contains 
sequences of snapshots which show these arrays solving the 
examples of Appendix A. Finally, Appendix C contains the 
Pascal source listing of SAGS, a Systolic Arrays Graphical 
simulator which produces those snapshots, and sample script 
files. 
CHAPTER II 
FADDEEV'S ALGORITHM AND MATRIX TRIANGULARIZATION 
SYSTOLIC ARRAYS 
One aspect of systolic arrays that is the focus of 
several recent research efforts is their lack of generality, 
i.e. an array designed for one algorithm generally cannot 
run another. An approach aimed at removing this drawback 
taken by Kung is the use of a programmable systolic 
chip. 16 • 17 While this allows different sequences of 
operations to be performed within the cells of a systolic 
array, it is only a partial solution to the problem since 
the interconnections between neighboring cells are still 
unalterable. To remove this inflexibility, Snyder proposed 
a programmable switch lattice structure that gives an array 
processor re-configurable interconnections between its 
PEs; 1 3 however, the added complexity of such a network is 
beyond the current integration technology for large array 
sizes. 
Another less drastic approach is to find algorithms 
and their array implementations which are general-purpose 
within a class of problems. This approach generally results 
in simpler processor and/or simpler interconnections, thus 
more array cells can be put into a single chip. 
14 
Consequently, the clock skew problem of large array sizes 
will be effectively reduced since the number of chips 
required would be smaller. 
FADDEEV'S ALGORITHM 
One general purpose algorithm, useful for a wide class 
of matrix operations and especially suited for systolic 
implementation, is the Faddeev's algorithm18 illustrated by 
the simple case of computing the value of ex + D, given 
AX= B, where A, B, c, and Dare known matrices of order n, 
and X is an unknown matrix. 
The problem can be formulated as 
a a . . . a b b . . . b 
1 1 1 2 ln 1 1 1 2 ln 
a a . . . a b b . . . b 
2 1 2 2 2n 2 1 22 2n . . . . . . . . . . . . 
a a . . . a b b . . . b 
n 1 n2 nn n 1 n2 nn 
( 2. 1) 
-c -c . . . -c d d . . . d 
1 1 1 2 ln 1 1 1 2 ln 
-c -c . . . -c d d . . . d 
2 1 2 2 2n 2 1 22 2n . . . . . . . . . . . . . . 
-c -c . . . -c I d d . . . d n 1 n2 nn n 1 n2 nn 
or, in abbreviated form 
* D (2.2) 
15 
If by some means a suitable linear combination of the 




where W specifies the appropriate linear combination such 
that only zeroes appear in the lower left hand quadrant, 
then the lower right hand quadrant will become matrix 
E = ex + D. This is because annihilating -c requires 
W = CA- 1 so that D + WB = D + CA- 1 B, and since AX= B, 
D + WB = D + ex. The elegance and simplicity of the 
algorithm is apparent when one notes that to carry it out, 
it is only necessary to annul the lower left hand quadrant 
by applying a suitable matrix triangularization procedure to 
16 
the left side of (2.2) while extending the operation to its 
right side. We will then have from ( 2 .1) 
a' k > a ck> . . . . a ck> b' k) b' k) . . . b' k) 
1 1 1 2 ln 1 1 1 2 ln 
0 a ck> . . . . a ck> b' k) b' k) . . . b' k) 
22 2n 2 1 2 2 2n 
0 0 
( k ) a ck> b' k) b' k) b' k) a . . . . . 
3 3 3n 3 1 3 2 3n 
. . . . . . . . . . . . . . . . . 
0 0 . . . 0 a' k > b' k) b' k) . . . b' k) 
nn n 1 n2 nn 
-
0 0 . . . . . 0 e e . . . e 
1 1 1 2 ln 
0 0 . . . . . 0 e e . . . e 
2 1 22 2n 
. . . . . . . • . . . . . . 
0 0 . . . . . 0 I e e . . . e 
n 1 n2 nn 
or, in short 
A' k) I B' k) 
0 I E 
where A' k > is an upper triangular matrix and B' k > is B 
modified k times by the procedure. Often used in solving 
linear systems, Gaussian elimination is one of the better 
known triangularization methods available to perform the 
Faddeev's algorithm. Since the usual backsubstitution is 
not needed here, considerable savings in computation and 
storage are obtained. 
With Faddeev's algorithm, a variety of matrix 
operations can be performed by selective entries in the four 
quadrants. For example, when D = O, C = I where I is the 
17 
identity matrix, and B is a column vector, E becomes X, the 
solution to the linear system AX = B. Some other matrix 
operations possible with Faddeev's algorithm are shown in 
Figure 1 below. The reader is referred to Appendix A for a 
detailed treatment of Gaussian elimination and the solutions 
to a sample linear systems using Faddeev's algorithm. 
* => CA- 1 B+D * => A-
1 B 
D 0 
* => CB * => A-1 0 0 
* => CB+D D Figure 1. Some matrix operations possible 
with Faddeev's Algorithm. 
SYSTOLIC ARRAYS FOR MATRIX TRIANGULARIZATION 
Since the underlying procedure to carry out Faddeev's 
algorithm is matrix triangularization, any systolic 
implementation of the algorithm should be based on a 
structure which can perform triangularization efficiently. 
Developed by Gentleman and Kung as a common platform for two 
different triangularization methods, the triangular 
systolic array of Figure 2 can execute both Gaussian 
elimination with neighbor pivoting or orthogonal 
triangularization. 19 • 20 The array consists of two types of 
cells: the boundary cells (represented by circles) and the 
Cycle 7 ----+ x .. 
Cycle 6 ----+ x.J XJ• 
Cycle 5 ----+ x.2 X33 X2• 
Cycle 4 ----+ x., X32 X23 X14 
Cycle 3 ----+ X31 X22 X13 
Cycle 2 ----+ Xz1 X1z 
Cycle 1 ----+ Xn 
J. J. J. J. o::o::o::o:: 





Fiqure 2. Triangular systolic array for matrix 
triangularization. 
18 
internal cells (represented by squares). These cells are 
locally interconnected into a triangular mesh. Each cell 
stores a microprogram, enabling it to interact with its 
neighbors in such a way that a triangularization procedure 
can be carried out. Changing the microprograms of the cells 
will allow the array to execute different procedures. 
In the following discussion, the term data row refers 
to a row of entries of matrix X, whereas the term array row 
means a row of cells of the array. The triangularization of 
matrix X by the array is as follow. Initially, all cells 
contain only zeroes. As each data row i enters the array 
via the top boundary, its entries are stored in the cells on 
the it h array row. Before the data row i reaches its 
19 
respective array row however, its entries are modified by 
cells of previous array rows such that the first i - 1 
entries are discarded~i.e. became zeroes. The modification 
of an incoming data row is initiated by a boundary cell. 
This cell generates modification factors, values resulting 
from computations performed on an incoming entry and the 
cell's own stored value. The modification factors are then 
sent rightward to meet other entries of the same data row in 
the internal cells. There, they are used to modify the 
entries which are subsequently outputed to the next array 
row. While cells of any given array row are updating a data 
row, they may also update their own currently stored values. 
Note that because of the critical timing required for 
the rightward data stream to reach internal cells at proper 
moments, the input data flow is fed into the array in a 
skewed order. After completion, modified x values left in 
cells constitute elements of a triangularized matrix and can 
then be readily read out, one from each cell. 
Gaussian Elimination With Neighbor Pivoting 
When Gaussian elimination procedure is performed using 
finite-precision arithmetic, as would be the case for 
electronic computing devices, a diagonal element that is 
small compared to the entries below it in the same column 
can lead to substantial roundoff error. Traditionally, 
pivoting strategies such as partial or total pivoting have 
been used to improve its numerical stability. 21 Because of 
20 
the global communication that may result from pivot 
selection, they are not quite suitable for systolic 
implementation. Thus, to maintain the same degree of 
stability for the triangularization process described above, 
Gentleman and Kung suggest the use of another pivoting 
strategy, called neighbor pivoting. This technique 
introduces a zero to a row by subtracting a multiple of an 
adjacent row from it, interchanging the rows when necessary 
to prevent the multiple from exceeding unity. 19 In 
Appendix A, examples of Faddeev's algorithm using Gaussian 
elimination with neighbor pivoting is shown. 
The triangular array of Figure 2 can perform Gaussian 
elimination with neighbor pivoting using the cells shown in 
Figure 3. As its microcode reveals, the boundary cell 
generates two modification factors: a multiplier M t , as 
OU 
well as a Boolean variable V t , which signals a row 
OU 
interchange when having value one. This occurs at every 
array cycle, the maximum length of time necessary for a cell 
to execute its microprogram once. 
Orthogonal Triangularization 
The orthogonal triangularization procedure involves 
the execution of a series of plane rotations (also known as 
Givens rotations) on the target matrix. They are applied 
initially to the first row and the second row, the first row 
and the third, the first row and the fourth, and so on to 
the last row. At this point, all rows except the first will 
BDUNDARY CELL 1 






Vout ~ I 
Mout ~ If X10 ;': 0 then -XIX'" 




INTERNAL CELL 1 
x., 
l 
M., ~ rFvil ~ Mout = M,n 
v., ~~~ vout = v .. 
l 
Xout 
vout ~ 0 
Mout ~ -X., IX 
If V.n then 
begin 
Xout ~ X + M., ,. X., 
x ~ x,. 
end 
else 
Xout ~ X1n + M10 "' X 
Fiaure 3. Microcode specifications of boundary 
cell and internal cell for Gaussian elimination 
with neighbor pivoting. 
21 
have zero entries on their first column. Next, the above 
process is repeated starting with the second row, then again 
with the third row, and so on until zeroes are introduced to 
all columns such that the resultant matrix becomes upper 
triangular, after which the triangularization procedure is 
completed. In Appendix A, the reader will find a more 
detailed description of Givens rotations along with examples 
of Faddeev's algorithm illustrating their uses. 
The systolic array of Figure 2 can perform orthogonal 
triangularization using the cells specified in Figure 4. 
While this method yields better numerical accuracy than that 
of the previous section, 22 notice the added complexity 
22 
necessary for boundary cells because of the square roots. 
Since all cells in the systolic array must operate at the 
same throughput rate, the boundary cells could form a 
bottleneck on the overall system performance. 20 





11-ITERl•JAL CELL I 
x,n 
1 
C"' ----+ rr=)(il ~ C out a C In 
s," ~~~ sout & sin 
1 
Xout 
If X., = 0 then 
beg;n 
else 
Cout ~ I 
Sout ~ 0 
end 
begin 
Cout f--- X 1/x2 + xz In 
sout ~ x,n /~~ 
x f--- /x2 + x2 In 
end 
xout f--- -s,n x + cln x,,, 
x f--- con x + son x,0 
Figure 4. Microcode specifications of boundary 
cell and internal cell for orthogonal 
triangularization. 
CHAPTER III 
SYSTOLIC IMPLEMENTATIONS OF FADDEEV'S ALGORITHM 
In this chapter, we will look at two systolic 
implementations of Faddeev's algorithm, originated from 
different authors. Their basic arrays are remarkably 
similar in most aspects such as interconnection topology, 
cells layout, I/O requirements and general algorithm 
mapping. This is not surprising since both are based on the 
same triangular array we've just examined in the previous 
chapter. However, they differ in the triangularization 
methods used to implement Faddeev's algorithm, which lead to 
dissimilar cells' control codes and numbers of pin-out. We 
can attribute this to the respective authors' design choices 
concerning the trade-offs between algorithm's stability and 
array's throughput rate. 
NASH'S IMPLEMENTATION 
To improve its numerical stability, Nash et. al. 23 • 24 
suggested a modification to Faddeev's algorithm by replacing 
the Gaussian elimination procedure used to triangularize the 
coefficient matrix A of (2.2) with orthogonal 
triangularization. 
24 
For clarity, it is useful to divide their algorithm 
into a two-phase procedure. In the first phase, A is 
triangularized by a series of Givens rotations 
(simultaneously applied to B): in the second phase, the 
diagonal elements of the resulting triangular matrix are 
used as pivoting elements in the Gaussian elimination 
procedure on c and D, where columns of C will be zeroed out 
and D will become the result. Note that for the Gaussian 
elimination procedure to work properly, it is necessary that 
these pivoting elements be non-zero, hence the requirement 
that A be full rank, i.e. at least one of its square 
submatrices of order n has a non-zero determinant. 
Nash's systolic implementation, shown in Figure 5, 
consists of a triangular array and its right extension, a 
square array. The triangular array, based on Kung's design 
in Figure 2 for orthogonal triangularization, performs 
Givens rotations on A (first phase) and ordinary Gaussian 
elimination on c (second phase). For higher efficiency in 
performing Givens rotations, cells' microcodes of Figure 4 
are slightly modified into those of Figure 6. Furthermore, 
the added processing of ordinary Gaussian elimination 
requires the extra codes of Figure 7. The square array 
simply extends the corresponding processings to B and D and 
thus consists only of square cells. 
The input data flow involves feeding A and B through 
the system from the top with cells executing the 
Second Ph0-se 
Go.ussio.n ellMino. tlon 
First Ph0-se 
Orthogono.l trio.ngulo.rlzo. "tion 
dw 
d~ d .. 
d.. dn dH 
du dR de dM c.. d., cl., cl,. b .. 
c., c,. d., cl., b., b,. 
C.. C~ c.. du b.. b.. bH 
C4 C~ C~ Cl< bu bK ba bH 
c.. c... c,. 0-.. b., b,. b,. 6 
c., c,. 0-., 0-.. b., b., 6 0 delo.y cell 
-7c., °' .. °'~ °' .. bu6o6/ 
°'"' °'~ °'~ J'" 6 DD a'.{ Squo_re 0-rr0-y 
°'"' °'"" °'"00000 
0-., 0-.. 600000 






999 x .. / 9 9 x,. x .. 
9 x .. x .. x •• 
X 41 Xx Xa XH 
x .. x .. x .. 
x .. x., 
Xu 
Figure 5. Nash's systolic implementation of 
modified Faddeev's algorithm. Note the use of 
delay cells to skew the data flows. 
25 
microprograms of Figure 6 on each incoming row. This 
corresponds to the first phase of the modified algorithm. 
Notice that the required skewing of the data flow is 
performed by a triangular array of delay cells (represented 
by rectangles) above the system. The second phase is 
accomplished by a similar flow of c and D, only this time 
the cells execute the microprograms of Figure 7 on the data 
26 
elements and the resulting matrix will appear row by row 
coming out from the bottom of the square array. These 
output rows are straightened back to normal by another 
triangular array of delay cells below the square array. 
With a matching I/O bandwidth, the system will compute 
CA- 1 B + D in 5n - 1 steps and solve a 1 inear system of n 
equations in 4n steps. 
BOUNDARY CELL ' 
Xin 
l 




If x., = 0 then 
begin 
Cout = 1 
Socrt = 0 
r = 0 
end 
els" b"gln ~ 
t = "'re + x:;, 
- r/t 
Cou1 - /t 
Sout = X1n 
r = t 
.,nd 
l 
c,n --79---? cout: c,n 
s,n ---?~---? soui; -s,n 
Xout = -s,n r + c,n x., 
r = c,n r + s,n x., 
l 
xout 
DELAY CELL : 
x,n 
l 
[CJ X,n = Xocrt 
l 
xout 
Figure 6. Microcode specifications of boundary 
cell and internal cell used in Nash's array 
during the first phase, i.e. Givens rotations. 





!Nl ERNAL CELL I 
x,n 
l 
M., ~ g ~ Mout = M,n * ~!!:2J~* 
l 
xout 





X1n - M1n r 
Figure 7. Microcode for boundary cell and 
internal cell used in Nash's array during the 
second phase, i.e. Gaussian elimination. 
27 
The input data flow can be contiguous, i.e. matrices A 
and B and then c and D can enter the array without any 
interruption in between. Data flows of separate problems to 
be solved by the array can also be fed continuously into the 
array. For this to be possible, additional control 
capabilities are necessary to switch the cells from one set 
of codes to another at the proper time. Slight modification 
of the microprograms will also be required. 
Although Nash's modified Faddeev's algorithm is 
mathematically sound, its systolic implementation, 
unfortunately, contains some serious deficiencies. For 
instance, it is possible for the array to produce erroneous 
results, as illustrated by the following example. Suppose 
28 
we have a linear system AX = B of order n = 3 where X is an 
unknown matrix, and one or more entries in column 1 of 
matrix A are zeroes, in this case, a 21 : 
A= [ 
1 2 3 ] 
0 4 7 
2 1 3 
B = [ n (3 .1) 
Since the determinant of A, ~(A) = 9 is non-zero, A is 
therefore full rank, thus guaranteeing that a solution to 
the system exists and that it is unique with x
1 
= 1. 33, 
x
2 
= -0.67 and x
3 
= 1.67. However, when A is fed into the 
array of Figure 5, because a
21 
= o, during the second step 
the boundary cell of row 1 column 1 will clear its r 
register, previously storing a 11 = 1. This effectively 
transforms A into another matrix, say E, whose entries are 
identical to A's except for e
11
, which is zero, and all 
further processings will be on the resulting linear system 
E = [ 
0 2 3 ] 
0 4 7 
2 1 3 
B = [ n (3.2) 
In this case, since ~ (E) = 4 is non-zero, E is also 
full rank and therefore the procedure is completed 
successfully, but with x 1 = 3, x 2 = 4 and x 3 = -1 which is 
the solution to (3.2) instead of (3.1). 
29 
The cause of the above error can be traced to a bug in 
the microprogram of the boundary cell. As Figure 6 reveals, 
this microprogram has the line of code 
r = 0 
which always clears the content of register r whenever 
xi n = 0. In fact, if at any time during processing the 
boundary cell of a row i receives a zero-valued xin from an 
internal cell of row i - 1, erroneous result will appear at 
the end of processing. Thus, to correct the problem, this 
line should be removed. 
For the purpose of verification, the reader is 
referred to Appendix A where correct solutions to examples 
( 3. 1) and ( 3. 2) are arrived at manually using Faddeev' s 
algorithm with Givens rotations. Furthermore, he is 
encouraged to examine the series of snapshots included in 
Appendix B which shows the graphics simulation of Nash's 
array computing (3 .1). These pictures illustrate clearly 
the sequence of events leading up to the erroneous results. 
Implementation errors aside, a drawback of Givens 
transform is the square root needed to compute the values of 
sine and cosine for each rotation. Execution time of this 
operation can easily be ten times that of a multiplication 
or division. Since timing is critical for proper 
synchronization of data flow in a systolic array, it is 
necessary to slow down the entire array correspondingly. 
Thus the circular cells represent a bottleneck in the 
30 
system. Of course a hardware implementation of the square 
root is possible, however, we have to bear in mind the cost 
of added cell's complexity. 
Another drawback of this implementation is the large 
pin counts for individual cells because of the need to 
transmit simultaneously the sine and cosine values to 
neighboring PEs. Not counting clock and control signals, 
the boundary cell will require one input and two output data 
buses and the 'internal cell will require three input and 
three output data buses. For n-bit operands, 3n and 6n I/O 
pins are needed for the boundary cell and the internal cell, 
respectively. This translates to a large chip area for each 
cell. Bus sharing or multiplexing schemes to reduce I/O 
lines are possible, but they would increase the processing 
time and consequently, reduce the throughput rate. 
CHUANG AND HE'S IMPLEMENTATION 
Another systolic 
algorithm, proposed by 
implementation 
Chuang and He, 25 
of Faddeev' s 
significantly 
improves upon the previous array. As shown in Figure 8, 
many similarities exist between the two arrays' design. To 
compute CA- 1 B + D from (2.2), both systems use a triangular 
array for the triangularization of A and the annulment of C, 
and a square array for extending the corresponding 
processing to B and D. The input data flow to both systems 
are similarly organized and skewed, and pipelined through 
31 
each system in a similar fashion. For the processing of the 
lower half of the input data flow (i.e. matrices c and D), 






c .. c,. 
c .. Cu 
Ca c.., 















n .. n .. n,. 
n,. nu nllt 1 
0-21 0.12 ! 
d.,. 
d., d"" 
d,. d,. d.,.. 
cl .. d,, d., d .. 
cl,. d •• d~ b ... 
cl,. d,. b., b,.. 
clu b .. b., b,.. 
b .. b,. b., b .. 
b,. b,. ba 1 
btl bit l 
b .. l 
1 
----7 °'11 l I 
l I v 
T 
6:: .j. ' CJ00:: ' OQ , , r!-,_. _L_/. ~ 
6::[]::[]:![]::L_J.; - -
/f=iJ::boOO 600::~ l. .~
o.rro.y 1 I "'! l. x .... 
.j. x.., x,.. 
• X,t x .. Xe< 
x .. x,, xt3 x .. 






Figure 8. Chuang and He's systolic 
implementation of Faddeev's algorithm. The 
triangularization method used here is Gaussian 
elimination with neighbor pivoting. 
32 
However, Chuang and He's system processes the upper 
half of the input data flow (i.e. matrices A and B) using 
Gaussian elimination with neighbor pivoting instead of the 
Givens transform. 19 Hence, while numerical accuracy is 
somewhat inferior, this implementation is less expensive in 
terms of processing time and hardware complexities. Because 
the square root operation is not used, the array avoids the 
bottleneck problem created by the boundary cells of the 
Nash's array. And since the rightward data flow essentially 
consists of only one operand, Mou t 1 the pin counts of 
boundary cell and internal cell are correspondingly reduced 
to 3n and 4n, respectively. 
Since it is obvious that different phases of 
processing are required for the upper half and the lower 
half of the data flow, two separate sets of microprograms 
for boundary cells and internal cells are needed, as shown 
in Figure 9 and 10. The first set, the pivoting functions, 
performs Gaussian elimination with neighbor pivoting on A 
and B, while the second set, the non-pivoting functions, 
performs regular Gaussian elimination on C and D and is 
essentially the same as the functions of Nash cells in 
Figure 7. 
As the data flow is pipelined through the array, each 
boundary cell stores an input data element and sends a 
multiplier M
0 
u t rightwards to modify the input data that 
enter the internal cells of the same row. Along with each 
BOUNDARY CELL • 
x,n 
l 
if.3Y.. ~ Hout 
~~ Vout 
INTERNAL CELL • 
xm 
If IX., I~ IX I then 
begin 
VIJ'Jt ~ I 
If X1n ~ o then 
Hout ~ -XIX., 
else Hout ~ 0 
x ~ x., 
end 
else 
V0 .rt ~ 0 
H0 ut ~ -X., IX 
If v., then 
begin 
l 
Hin ~fi'Xll~ Hout : M., 
Vin~~~ Vout -V., 
Xout ~ X + Min 1< X10 




xout Xout ~ X10 + M., • X 
Figure 9. Microcode specifications 
used by Chuang and He's array for 




M t it generates a one-bit boolean value V t to signal 
0 u 0 u 
whether pivoting is needed. Each internal cell stores a 
data value arriving from the top and passes downward all the 
following data after modification. M t and V t remain 
OU OU 
unchanged as they travel rightwards through the array. For 
an input column of length and width 2n data elements, the 
output will be a matrix of order n emerging from the bottom 
of square array. It can be seen that when the system 
matches the I/O bandwidth, Sn - 1 steps are required to 
obtain CA- 1 B + D and 4n steps are needed to solve a linear 








Min --7rr=;;=j1--7 Movt =Min 
* -7~--7 * 
Xovt ~ X,n + M,n ,. X 
l 
xou't 
-if TeMporo.rily unused 1-bit bus 
Figure 10. Microcode specifications 





Like in the Nash's implementation, the input data flow 
of this array can be continuous if additional control 
capabilities are used to individually switch each cell from 
pivoting to non-pivoting mode as required. As published, no 
technique was mentioned by the authors of both 
implementations to perform this switching; however, we can 
think of at least two different techniques to do this. One 
is to have the host or a dedicated controller generate the 
controls necessary for each individual cell, thus requiring 
a complex cell addressing scheme. Another is to tag control 
bits to input data elements which will then carry the 
control information with them throughout the array. This 
method assumes that the host, while generating the input 
data, will add the necessary control information to it. Its 
35 
down side is that it will force an enlargement of the I/O 
bandwidth between the host and the array. In the next 
chapter, it will be shown that a combination of the above 
mentioned techniques will be used in our design. Thus, 
while having the advantages of both, it will avoid some of 
their inefficiencies. 
Input Decomposition 
Often, problems in real-world applications are larger 
in size than the available I/O bandwidth between the host 
and the array. When this is the case, increasing the 
array's size or speed does not bring about an increase in 
throughput since the limiting factor is the I/O bandwidth 
itself. One solution is to decompose the problems into 
smaller sub-problems, which can then be stored in the host 
and later processed in the array one at a time. In general, 
the tasks of decomposition and post-processing are complex 
and time consuming: passing intermediate results back and 
forth between the host and the array reduces the throughput 
that the I/O bandwidth can support. Furthermore, the array 
throughput also suffers because of the pipeline flush 
brought about by the interrupted data flow. 
To avoid these problems, Chuang and He 
structuring the array as a feedback array system. 
propose 
The idea 
is that the system simulates the operation of an arbitrarily 
large array by using the small arrays over and over, with 
the output of the small arrays fed back to be processed with 
36 
other input data at the proper times. To match the input 
data flow with the I/O bandwidth, it is necessary that the 
data flow be decomposed. For an I/O width of w, it is 
suggested that the data flow be cut into strips of width w 
parallel to the direction of the data flow, or bands of 
width w vertical to the data flow. These strips or bands 
are further cut into blocks of length w. A problem of size 
2n x 2n where n is m times w will yield 2m x 2m blocks. 
Depending on the order in which these blocks are fed into 
the array, we have parallel, vertical or 
decomposition as shown in Figure 11. 
c 
0 











~ c c: 0 0 .µ . .µ . iii iii " ! .0 0 .. "- Q_ . [ 
l 1 





" 0 "!! m u '- ... .µ ... Jl 
'- >-.. .,. :i: 
> 
~ ~ ~ 
actual width actual width 
w w 
(o.) (b) (c) 
Figure 11. Three ways to decompose the input 
data flow. (a) Parallel decomposition. (b) 




In this figure, the series of vertical numbers 
represent the order of the steps in which the strips or 
bands are fed into an array. Note that in the parallel 
decomposition (Figure lla), the end of the first strip 
overlaps with the beginning of the second strip, i.e. the 
last data item of the first strip enters the array at the 
same time (step number 9) as the first data item of the 
second strip. The bands 
(Figure llb) are similarly 
of the vertical decomposition 
overlapped, as with the band 
segments and the strips of the hybrid decomposition 
(Figure llc). All this overlapping ensures that the input 
data flow to the array is continuous. 
Feedback Systems for Parallel Decomposition 
Suppose we want to compute CA- 1 B + D for matrices of 
size n using the full size array of Figure 12. Again the 
available I/O bandwidth is w wide. We can decompose the 
2n x 2n input data flow into 2m strips, each w wide as in 
Figure lla, numbered from V
1 
to V2 m. For m = 4, the full 
size array of Figure 12 can be thought as consisting of 26 
subarrays, with each subarray of type T or S and of size w. 
Under the given I/O constraint, feeding the strips one after 
another continuously into this array will not work since the 
rightward data stream generated by a T subarray from one 
strip will not meet the following strips at a proper time. 
38 

















S3 S4 Ss Ss Ss Ss 
S2 S3 S4 S4 S4 S4 
T S2 S3 S3 S3 S3 
y 
"" subo.rro.ys ""- T S2 S2 S2 S2 
l l l l 
E1 E2 E3 [4 
Fiaure 12. Systolic system with 26 subarrays of 
types T and S, each of width w. The available 
I/O bandwidth is also w. 
On the other hand, the feedback array system of 
Figure 13 will process the same data flow correctly under 
the same I/O constraint. This feedback array system 
simulates the large array of Figure 12 by using its 
component arrays over and over again as follows. Initially, 
as V
1 
is fed into the T array, it generates a horizontal 
data stream which is then stored into the memory buffer Bl. 

















respectively as they arrive. When the intermediate result 
from strip V
2 
comes out of S
2 
, it too goes into the T array 
to produce another stream of horizontal data which is then 
stored into buffer B2. Again, the content of B2 is fed back 
into arrays S2 , S3 and S4 to process the intermediate 
results of V3 , V4 , and V5 coming out of S3 , S4 and S5 
input do:to. fro" host 
Et ,Ee ,Ee ,E1 
MQl'lory BuffQrs; 
Bl• w x C2n) 
Bo w x C2n - w> 
B3o w x <2n - 2w) 
B4' w x (2n - 3w) 
Figure 13. Feedback systolic 
smaller number of subarrays 
decomposition. This system 
problems with m > 4. 




respectively, and so on. To properly synchronize the 
horizontal data streams, the buffers Bl, B2, B3 and B4 must 
be of length 2n, 2n - w, 2n - 2w and 2n - 3w respectively. 
Note that each successive buffer is shorter by w. This is 
because as a data strip Vi goes through a square array S, it 
is shortened by a w x w block of data, which remains inside 
S. Hence, the T array processing this shortened data strip 
40 
will generate a correspondingly shortened horizontal stream 
of modification factors. 
This feedback array system achieves maximum throughput 
using much less component arrays than the larger array in 
Figure 12. The number of steps for it to compute CA- 1 8 + D 
is 
( ( 2m) ( 2mw) + w - 1) + mw = (3. 3) 
(4m + 1)n + w - 1 = O(mn) 
where O(k) denotes order of k. 
Since this system requires m s arrays and m buffers, 
it is not quite independent of problem size. Because the s 
arrays are identical, eliminating all but one reduces the 
number of component arrays needed and, at the same time, 
yields a design that is problem size independent. Figure 14 
illustrates a one-T one-S feedback array system. The 
feedback scheme is now two-dimensional, with horizontal and 
vertical data streams. The input data flow is similarly fed 
into the system as in the previous system. However, because 
only one S array is available, each data strip V r where 
r = 2, 3, ••. , 2m will be processed by the same S array r - 1 
times. While intermediate results of strip V
2 
will go 
directly into the T array, an additional buffer B is needed 
s 
to store the intermediate results generated from strips V
3
, 
V4 , ••• , and V2 m. The feedback of these intermediate results 
to the S array is inserted in between adjacent strips thus 
41 
preventing data strips from V
3 
onward to be fed continuously 
into the system. 
The throughput of this system is of course lower. The 
number of steps necessary to complete CA- 1 B +Dis now 
m 
2mw + L (2m - k) (2m - k + l)w + 2w - 1 = 
k= 1 
7 5 
-(m 2n) + -(n) + 2w - 1 = O(m 2n) 
3 3 
V2 ) V3 ) V4 ) V5 ) V6 ) V7 ) V8 
vl 
S2 
E 4 ) E3 ) E2 ) El 
Buffer for 
lnterriecilo. te results 
r== I 1 ~I Bs: C2n - w) x w I 
MeMory Buffers 
J ,_________.., ,___ __ __,... Bl1 w x <2n) 
B2: w x (2n - w) 
B31 w x <2n - 2w) 
.___ _ B4: w x <2n - 3w) 
Figure 14. Two-dimensional feedback system with 
one s and one T subarrays. This system is 
problem size independent. 
(3.4) 
42 
Feedback Systems for Vertical Decomposition 
In Figure 15 below, Chuang and He illustrated how an 
array wider than available I/O bandwidth can solve a 
matching large problem when the input data flow is 
decomposed vertically like in Figure llb. Again suppose the 
I/O bus is w wide and the array is 2n = 2mw wide. 
Essentially the same system as that of Figure 12, this array 
system has in addition a 2m-way demultiplexer on the input 
side and an m-way multiplexer on the output side. The input 
data flow, consisting of 2m bands of 2m blocks each, is fed 
Host 
Figure 15. Array system for vertical 
decomposition of input data flow. With I/O 
bandwidth w, full utilization of available cells 
is not possible. 
43 
into the array one band after another continuously. The 
demultiplexer feeds the blocks of each band to the subarrays 
on the first row of the system one at a time from left to 
right. Since all the blocks are skewed, each overlapped 
with its left and right neighbors and the whole band is 
contiguous as it enters the system. 
For this array, the total number of steps to complete 
the process is 
(c2m) (2mw) + w - 1) + mw = 
(4m + l)n + w - 1 = O(mn) 
which is identical to equation ( 3. 3) of Figure 13. While 
the array of Figure 15 has many more subarrays, its 
processing speed is not higher because maximum usage of all 
cells is not realized due to the I/O bottleneck. 
Furthermore, this array is not problem size independent. 
Al though inefficient in terms of usage of available 
hardware, the array of Figure 15 serves as an example of how 
a vertically decomposed data flow should be processed. A 
more flexible system, shown in Figure 16, is problem size 
independent and delivers the same throughput using a smaller 
number of subarrays. In this system, the 2m-way 
demultiplexer of Figure 15 is reduced into a 2-way 
demultiplexer which is repeated at the input side of every 
row of subarrays. As the bands of the input data flow enter 
the first row of the system continuously, the first block of 




Bt I:=" MUX 
Figure 16. Problem size independent array 
system for vertical and hybrid decomposition of 
input data flow. Available I/O bandwidth is w. 
44 
into the s array on the same row. The rightward data stream 
generated by the T array is fed into the s array and 
recycled until all blocks of the same band are processed. 
Because these blocks form a contiguous data stream, no 
buffer is needed to store the M
0 
u t and V t OU values for 
recycling. On the other hand, X values stored in the s 
array cells need to be saved as shifting into the 
neighboring block begins since they will be used later in 
the processing of the next band of data. To simplify 
control and reduce memory access, they will be stored into 
45 
the recycling shift registers implemented next to each cell 










recycling eve1-y w steps 






p - I >. 
L 
P - I 











Figure 17. Recycling shift registers for the 
temporary storage of the X values. Implemented 
next to each cell, each buffer is p in length. 
outputs from the bottom of an S array, the x out 
values, will be processed in the same way by the T and S 
arrays on the next row. When the problem is larger than the 
system, i.e. 2n > 4w of Figure 16, the outputs of the last 
row's S array will be stored in buffer B to be recycled 
s 
back into the system for further processing. 
Feedback System for Hybrid Decomposition 
Due to the finite capacity p of the recycling shift 
registers of Figure 17, the size of problems that can be 
solved by the feedback system of Figure 16 is limited. A 
46 
way to circumvent this limitation is to use the hybrid 
decomposition of Figure llc. The input data flow in this 
case is divided into parallel strips of width pw. These 
strips are in turn divided into band segments of width w and 
length pw vertical to the direction of the data flow. 
Segment by segment, the strips enter the system of Figure 16 
one after another continuously as in parallel decomposition. 
Blocks of each segment are processed as in vertical 
decomposition and fill the recycling shift registers of the 
cells with new X values, to be used later with the next 
segment. The rightward stream of modification factors, 
generated by segments of the first strip, is saved to be re-
used on corresponding segments of the following strips, 
hence the need for the memory buff er Bt . 
Sparsity in Matrices 
Another important merit of Chuang and He's feedback 
array system is that, as they pointed out, it can skip 
blocks of zeroes in the input data flow, and thus greatly 
reduce the processing time. 
linear system 
As an example, consider the 
AX = B (3.5) 










and B = [ B1 , B2 , • I B ]T ID I n = mw, and each Aij or Bi 
with i = 1, 2, ... , m, 1 s j s i, is a w x w submatrix, 
or block. 
The data flow is decomposed parallely into w wide 
strips of w x w blocks as shown in Figure 18. The blank 
blocks are the zero submatrices and the -li,j blocks are the 
v1 v2 V 3 V4 vs V6 B (01~ ~. 1 ) 
A11 B1 
A21 A22 B2 
A31 A32 A33 B3 
A42 A43 AH B4 
As3 As4 Ass BS 







Fiaure 18. Parallel decomposition of a sparse 
matrix problem with m = 6. Note that matrix B 
in this case is the strip Vm+i· 
48 
diagonal submatrices of the -I matrix. Without loss of 
generality, B is assumed to be an n x w matrix. In this 
example, the bandwidth p of A is three blocks wide. 
To understand how sparse matrices can be exploited to 
yield better throughput, let us analyze what happens when 
the system from Figure 12 process the data flow of 
Figure 18. on its first row, as the 2m blocks of V1 are 
processed by the T array, 2m blocks of M t values are 
OU 
generated horizontally to modify data strips on the right. 
Since only p + 1 blocks of V
1 
are non-zero, only p + 1 
blocks of M t values are non-zero. This is because when 
OU 
incoming X. = o, the boundary cells invariably generate in 
M = O. out Furthermore, because the internal cells always 
generate XO u t = xi n when Min = 0 I as the data strip vm+ 1 
(containing B matrix) goes through array sm+i on the first 
row, only its corresponding p + 1 blocks are modified, with 
the first zero block below Bm becoming the result X
1
• On 
the other hand, strips V
2 
to Vm emerge from the S arrays of 
that row unmodified but minus their first blocks. This is 
because as they pass through these arrays, all zero entries 
of their leading blocks are retained in the cells' X 
registers, and thus X t = Xi • 
OU n 
The above process is repeated on succeeding rows of 
arrays until all results are computed. Since the S arrays 
of column i (i = 2, ... , m) are not needed to process strip 
Vi, they can be removed from the system and the strip's 
49 
leading blocks of zeroes can be skipped. Because they do 
not contribute to the modification of data strips on the 
right, the zero blocks above and below the diagonal band of 
-I can also be skipped. 
The architecture that most efficiently process sparse 
matrix problems is shown in Figure 19. This system receives 
the data flow of Figure 18 from the host, where all the zero 
blocks are eliminated except those of the strip Vm+ 
1
• As 
seen from Figure 19, the system uses only one s and one T 
arrays. The single T array is fed with A's non-zero blocks, 
one strip after another continuously. Its horizontal data 
flow, consisting of modification factors M t and V t' is 









B1 , , , , B <ll B m 2 3 
A 11 
~ .. ~ s 
~ 
\,/ 






Figure 19. Systolic system for the processing 
of sparse matrix problems. Note that this 
design requires an I/O bandwidth of 2w. 
50 
fed directly into the s array to modify Vm+l. Vm+l iterates 
through the s array p + 1 blocks at a time, each iteration 
is concurrent with a strip of A. During each iteration, the 
leading non-zero block remains in the S array where it is 
used to modify the next p - 1 non-zero blocks, and transform 
the last block (originally a zero block) into a block of 
results. The demultiplexer below the array S routes the 
modified p - 1 non-zero blocks to buffer B and outputs the s 
block of results to the host. As they emerge from B , the s 
modified p - 1 non-zero blocks are then combined with a new 
non-zero block and another zero block from Vm+ 1 to form 
input data for the next iteration. 
For instance, the first iteration sees the T array 
process blocks A11 , A21 , ~ 1 and -lm+l,l of strip V1 at the 







first zero block of strip Vm+l. This produces: 
- block B
1 
which remains in the s array, 
- blocks B~ 1 > and B; 1 > which are temporarily stored in 
buff er B , 
s 
- and the block of results X
1 
which is outputed. 
During the second iteration, the T array will process 
blocks ~ 2 , ~ 2 , A4 2 and -lm+ 2 • 2 of strip V2 , while the S 
array process blocks B' 1 > 2 , B' 1 > 3 , B4 and the second zero 
block of Vm+l. The entire sequence of processing is 




~~ (5) /µ,· (5) Mu -1"_ 
~ Au v6 B, (5) 0 M.,.~ 
' !l· (4) M~ -1~ 
x. (4) B~(4) M~ A~ 
l ~4>/ 0 
'"'-........ A~ Vs M" ... • 
4) B, (Q) M_. -1....,. 
' B~<3) M~ A_. 
tE~:; ~ (3) M • 
A~ 
w '"'-........ A •• V4 :::E M ... u 
1----i -1 ... 3.3 I- 3) Bs <0) M,., 
' _!3. (2) M., AS3 
X,<2> ~2) M:s:s A •• 
~. (2); o '"'-........ A:s:s V3 M.,.u 
B3 <2) B4 (Q) M42 -1__.. 
' _!!3 (1) M:JZ A42 
x, (1) B2 <D M22 A:JZ 
~3 (1) I 0 ~ A22 V2 M.,.u 
B2 <D B3 <0) M31 -1...u 
' B2 <0) M21 A31 output B1 <0) Mu A21 
fr OM ~ Au v1 s Input to 
s Input to 
T 
Figure 20. Processing sequence showing the 
order in which the non-zero blocks of Figure 18 
are fed into the system of Figure 19. 
51 
Thus, the total number of steps it would take the 
system to compute AX = B is 
(m(p + 1) 
p-1 
- l k + 3)w - 1 = ( 3. 7) 
k= 1 
1 
n (p + 1) - -(p - l)pw + 3w - 1 
2 
52 
Note that this throughput rate requires that the 
system process the data strips of A concurrently with the 
data strip of B. Consequently, the total I/O bandwidth 
needed must be 2w wide instead of w. Furthermore, if B has 
more than one strip, the system of Figure 19 must be 
modified. The reader should be aware that the formula (3.7) 
was derived by the author of this thesis after it was found 
that the one given in the original paper was erroneous. 
ASSESSMENT SUMMARY 
As we have examined both systolic implementations of 
Faddeev's algorithm, several points should be noted. First, 
the feedback system of Figure 13 as shown can not process 
problems in which (2.1) is larger than 2n x 2n, where 
n = mw; however, by adding another feedback path from the 
output of its component array S
2 
to the input of the top 
demultiplexer and using external memory for all Bi buffers, 
the system can be made independent of problem size. 
With cells specification of Figures 6 and 7, system 
configurations of Figures 13, 14 and 16 can perform 
Faddeev's algorithm using orthogonal triangularization. 
This means that Nash's implementation of Faddeev's algorithm 
can be configured to have feedback paths which will allow it 
to solve problems larger than the available bandwidth. 
Since the configurations of Figures 13, 14 and 16 
extensively multiplex data flows to and from their component 
53 
arrays, added control and hardware complexities are 
unavoidable. Furthermore, because the data flows must be 
skewed and overlapped, all multiplexers (and demultiplexers) 
used will need the ability to switch paths sequentially for 
each column of entries. This will require additional 
control for each multiplexer (or demultiplexer) which, in 
turn, adds to the complexity of the systems. 
Lastly, 
vertical or 
al though the feedback array systems for the 
the hybrid decompositions represent an 
interesting approach to solve the size independent problems, 
they require overly complex structures and controls while 
offering no real benefits or throughput improvement over 
their counterpart for parallel decomposition. These systems 
are thus impractical for actual implementation. 
CHAPTER IV 
A NEW SYSTOLIC ARRAY ARCHITECTURE 
In this chapter, we will introduce a new systolic 
implementation of Faddeev' s algorithm which, in its basic 
form, reduces the I/O bandwidth requirement by half and the 
number of cells needed by more than one third. Furthermore, 
it will eliminate some of the drawbacks that exist in both 
of the previously described arrays. 
ARCHITECTURAL DESCRIPTION 
Our design consist of a square 
cells are orthogonally connected 
Figure 21. Data bus interconnections 
array in which the 
as illustrated in 
between cells are 
indicated by arrows. Functionally, there are two types of 
cells. The first type consists of all the diagonal cells 
(denoted by circles) of the array, and the second type of 
all the non-diagonal cells (denoted by squares). 
Depending on the actual processing phase, the array 
functions in one of the two modes: the T (triangular) mode 
or the s (square) mode. Together, these two modes implement 

















Cl C2 C4 
d .. 
d43 d34 
d .. d .. cl .. 
0 0 0 -··------ d<I d12 d23 d,4 
o a o ·-·------ d3l d 22 d13 lo .. 
Q Q Q --------- dZI d,. b43 b,. 
o a o --·------ d11 lo42 lo33 lo24 
o a a --·------ b., b.. b22 b .. 
0 0 0 --·------ b3l b22 bl3 c .. 
0 0 0 --------- lo 21 lo 12 C 43 C,. 
0 0 1 -··------ b 11 c.. C., Ce• 
0 0 -··------ C41 C12 C23 C14 
0 0 ···------ C31 C22 C13 o. .. 
0 0 -··------ C21 C12 0.43 O.,. 
0 0 -------·- C11 0.42 0.33 0.2• 
0 --------- O.., O.aa 0.22 O.,. 
0 --·------ 0.31 0.22 0.13 l 
• I 
0 --·------ 0.21 0.12 : : 
: ! ! 
o.ll : : i 
l I I I 
~t---r- 'W t---r-· 


















~----- 'W -----· 
::t. er 
-+9-+Q{~]~ T I 
) 
-+D-+0-+0-{]-h u P'.l d 
Sl £: 
-0 0 
QI '-°' (j... LL 




! ! I i 
! ! "" x .... 
! l X43 X34 
I 
~ X 42 X33 X24 
) 
X41 X12 X23 X14 
X31 X22 X13 
Output results 















Figure 21. Dual mode 
Faddeev's algorithm. 
is smaller and I/O 
reduced. 
systolic implementation of 
The number of cells needed 
bandwidth requirement is 
55 
56 
When the array is in T mode, cells of rows i where 
i = 1, 2, .. , wand columns j where j ~ i, form a triangular 
sub-array which, based on Gentlemen and Kung's array of 
Figure 2, performs Gaussian elimination with neighbor 
pivoting on A, and ordinary Gaussian elimination on c. 
During this mode of operation, the circular and square cells 
essentially carry out the same functions specified by 
Figure 3 boundary and internal cells, respectively. 
When in s mode, the entire array is used to process B 
and D. In this mode, every cell of the array acts similarly 
to the internal cell of Figure 3, i.e. circular cells 
functionally become square cells. In order to switch the 
array from one mode to another, it is only necessary to 
change the program of the diagonal cells. This is 
accomplished with cells microprograms listed in Figure 22. 
By alternating between the two operational modes T and 
s, our array essentially simulates the system of Chuang and 
He (the one-T and one-S system in Figure 8) to solve (2.2) 
with a smaller number of cells and half the bandwidth 
requirement. Naturally, the input data flow will have to be 
slightly modified because of the differences in array's 
topology .. 
PEs' Description 
The circular and square cells, as shown in Figure 22, 
have identical I/O and control bandwidth: two n-bits data 
input ports, two n-bits data output ports, four one-bit 








If' C4 1n = 1 "then 
x t- o.o J 
If' Cl In = l "then 
begin 
If' I X 1n I ~ I X I and C2 In = 1 then 
begin 
C3out f- lJ 
:-1f X 1n "# 0.0 then 
I Mout t- - X/Xln 
\else 





C3 out t- 0 J 
M out f- - Xn / X i 
end 
else begin ,\/ ~ 





else X out f- X In + M 1n M! X J 
C3 out f- C3 In J 
end J 
Cl out t- Cl n J 
C2 out t- C2 n J 
C4 out f- C4 n J 








If C4 in = 1 then 
x t- 0.0 J 




Xout f- X•M .... •XnJ 
Xf-X 1nl 
end 
else x out f-X ... •Mn•X; 
Cl out f- Cl In i 
C2 out f- C2 In i 
C3 out f- C3 1n; 
C4 out f- C4 In i 
,.,,, ,:,~ l <-- fl.I. -,.~ J ,_......, J 
Fiaure 22. Microprogram specifications of 





control input ports and four one-bit control output ports, 
for a total bandwidth of 4n+8. In fact, this number is 
comparable to the actual pin count that Chuang and He's 
internal cell (in Figure 9) would need, since their cell 
does require extra control capabilities to work properly. 
58 
Although the choice of processors for our cells will 
be implementation dependent, the following observations 
nevertheless can be of support. 
Physically, one type of processor can be used to 
implement both circular and square cells because of the same 
I/O and control bandwidth requirement and similar general 
functionalities. 
Such a processor would have to be on a single chip for 
the array's chip count to be kept at a minimum. Another 
advantage is that functional blocks of the processor can 
work together without the time and pin-out penalty of off-
chip communication. 
Internally, the architecture of the processor should 
allow for a significant amount of parallelism, i.e CPU 
functions should be partitioned into units that can operate 
concurrently. To supply data efficiently to these uni ts, 
multiple internal data buses are essential. Additionally, a 
horizontal microinstruction set is mandatory to support such 
a structure; this in turn will dramatically shorten 
microprograms and will enhance performance. 
A large internal storage for microprograms and a 
microsequencer with good branching facility must be provided 
by the processor for adequate cell programmability. Also, 
provision must be made for the transmission of pipelined 
systolic control signals, which are crucial for run time 
operation of .the array. 
59 
And finally, the processor should have fast, on-chip 
arithmetic and logical capabilities, with a rich set of 
register files for flexibility of operation. 
Because of these atypical requirements, conventional 
microprocessors which are available commercially are not 
quite suitable as PEs in a systolic array. For now, 
dedicated systolic chips are scarce and the few that are 
being offered on the market lack some of the above features. 
However, this situation is expected to change soon as the 
use of systolic architecture will become more widespread. 
Control Signals Interconnections 
As shown in Figure 21, the circular cell relies on 
three external control signals Cl, C2, and C4 for internal 
computation and itself generates signal C3, all of which it 
broadcasts locally to its neighbors for correct operation of 
the entire array. The square cell uses only C3 and C4, and 
passes all control signals it receives to neighboring cells 
unchanged. Cl, C2, C3, and C4 are all one-bit boolean 
values whose functions and interconnection patterns are 
described below. 
Cl controls the behavior of diagonal cells and 
consequently selects the operation mode of the array. When 
Cl is true, the diagonal cells execute the portion of their 
code that enables them to function like Kung's boundary 
cells, thus changing the array into T mode. Otherwise, with 
60 
Cl false, diagonal cells function like square cells, and the 
array is in s mode. 
Because of the strict timing required, mode switching 
should occur as entries of the first row of B reach each 
cell, i.e. the switching sweeps across the array in skewed 
waves as the transition between c and B flows through the 
cells. This can be accomplished without the need to address 
separate control signals to each individual diagonal cell. 
In fact, Cl needs to be fed only to the top left diagonal 
cell of the array and, with cell interconnections of 
Figure 23, will be pipelined through the array to reach 
every diagonal cell. 
Cl 
t 
Cells used In 
!-bit T Mode c-r= =-===============;iv 
,~o~o o 01: 
', J, 
o,o~o o 




Cells used in 
S Mode 
Figure 23. Dual mode array shown only with the 
interconnection pattern for control signal Cl. 
61 
As the data flow changes from matrix A to matrix c, T 
mode processing in the array gradually switches from 
Gaussian elimination with pivoting to non-pivoting. This 
event is started with C2, whose value is true for pivoting 
allowed and false for pivoting not allowed. Again, C2 is 
fed only to the top left diagonal cell and propagated 











Figure 24. Dual mode array shown only with the 
interconnection pattern for control signal C2. 
Generated internally by diagonal cells when they are 
in T mode, C3 is the functional equivalent of M t of the 
OU 
boundary cell from Figure 3. It is thus used to direct 
square cells on the same row to pivot incoming data when 
true, or not to pivot when false. Figure 25 shows C3 
connections in the array. 
CT -7o-7o-7o-7o+;tCJ 
CT 
""' ! -70-70-70-70-7 i 
CT 
D 
~ -70-70-70-70-7 ~ 
~ ~ 
~ -70-70-70-70-7 
Figure 25. Dual mode array shown only with the 
interconnection pattern for control signal C3. 
62 
When switching between the T and S modes of operation, 
it is essential that the X registers in each and every cell 
of the array are cleared to zero before the new data 
elements arrive. If C4 is true, a cell will clear its X 
register prior to receiving X. from its northern neighbor. 
in 
The X register remains unchanged if C4 is false. C4 is 
distributed throughout the array by the interconnections 
illustrated in Figure 26. 
Control Interface With Host 
We have shown how external control signals are 
distributed throughout the array with only simple and 
regular interconnections. The need for complex individual 
cell addressing scheme is thus effectively eliminated while 
accurate timing at cell level is maintained. 
Typically, systolic arrays are attached to a general 
purpose host running UNIX, an operating system favored by 
63 
C4 t I-bit 
QrQfcjQ 
DODD 
..!- .,!, ..!- J, 
DD OD 
..!- .,!, .,!, J, 
DD DO 
..!- .,!, .,!, .,!, 
Figure 26. Array showing only the 
interconnection pattern of control signal C4. 
the scientific and engineering community. This is because 
UNIX provides a programing support environment that is 
crucial to the development of systolic application software. 
However, the real time response of such host is inadequate 
for the critical control timing of systolic arrays. This is 
due to the software overhead associated with various 
peripherals supported by the operating system. Thus, the 
computational power of a systolic array cannot be fully 
exploited unless effective interface with the host exists. 
In our case, a cost effective approach would be to 
generate and buffer all necessary control signals along with 
data prior to the initialization of a process; if buffer 
storage is sufficiently large, multiple problems can be 
solved by the array in burst before refill is necessary. 
For a small number of arrays, this approach is efficient and 
64 
rather simple to implement. However, it becomes less 
desirable as the number of arrays increases. 
A more efficient solution requires the use of a 
dedicated controller for array management. Advances in VLSI 
technology today have made the cost of fast and powerful 
conventional microprocessors very affordable. Acting as an 
intelligent interface between a slow host and fast arrays, 
such a device requires minimum supervision from the host 
while is able to control a large number of attached arrays. 
In any case, the sequence of control signals needed by 
the new array to solve (2.2) is simple and straightforward. 
The task of programing the host or the controller to 
generate it is trivial. In the next section, such a 
sequence will be specified with the corresponding input data 
flow. 
DATA FLOW DESCRIPTION 
Again suppose that A, B, C and D of (2.2) are n x n 
matrices and the available bandwidth is w = n • The input 
data flow, of width n and length 4n, will be continuous and 
consists of matrices A, c, B and D, in that order, skewed as 
shown in Figure 21. Note that the control signals necessary 
for each step are displayed alongside the data flow. 
Processing will be as follow. Initially, A enters the 
array followed by C; because C4 is true (for the duration of 
one cycle), all cells will clear their X register of values 
65 
left from any previous problem. With Cl and C2 both true, 
cells of the upper triangle begin performing Gaussian 
elimination (with neighbor pivoting) to triangularize A as 
its data elements are upon them. As Cl reaches each 
diagonal cell, the array gradually switches to T mode. 
When entry c
11 
of matrix C arrives at the top left 
cell, C2 becomes false which disables neighbor pivoting in 
the diagonal cells. Thus, only the ordinary Gaussian 
elimination is performed to annul c. Throughout this 
period, Cl remains true, hence the array remains in T mode. 
Next, as B reaches the array, C4 goes true again for 
the duration of one cycle (step) , long enough for the top 
left cell to store this value; the signal is then propagated 
to all cells and clears their X registers. At the same 
time, Cl becomes false and remains so until the last row of 
D is in the array. As Cl reaches each diagonal cell, it 
turns it into a square cell and thus gradually changes the 
array to S mode as the data elements of B are pipelined 
through the array. The results, shown in Figure 21, fully 
emerge from the bottom of the array after 6n - 1 steps for 
CA- 1 B + D and Sn steps for the solution to a linear system. 
Storage and Feedback of Modification Factors 
During the processing of matrices A and c, 
modification factors M
0
ut and pivoting control bits C3 are 
generated by diagonal cells based on incoming values X. 
1n 
They are then sent rightwards to the square cells on the 
66 
same row to modify adjacent X
1
n values. As it reaches the 
edge of the array, this rightward data stream is stored in 
B , a FIFO queue of size w x w shown in Figure 21. 
q 
This 
queue acts as a delay mechanism that will recirculate its 
contents to the left side of the array for the processing of 
B and D as they arrive at the array. 
To reduce demands on available bandwidth between the 
host and the array, B should not be implemented using host 
q 
conventional memory. Instead, the queue should be a 
dedicated buffer made up entirely of shift registers and run 
at the same clock rate as the array. This represents the 
most efficient way to implement the horizontal feedback 
path. 
SOLVING SIZE INDEPENDENT PROBLEMS 
~nother virtue of the array in Figure 21 is that it 
can readily handle problems of arbitrary size without 
requiring any architectural modification. Furthermore, the 
throughput can be improved proportionally by adding any 
number of arrays to an existing system. This gives the 
array a degree of flexibility that makes it truly useful in 
real life implementation: performance is adjustable 
according to cost constraint while versatility is preserved 
regardless of expansion of any size. 
For problems larger than array size, the input data 
flow shown in Figure 21 will be decomposed into smaller 
67 
strips which are processed continuously by the array, one 
after another. The intermediate results from each strip 
will then be fed back to the array for further processing. 
This vertical feedback and the horizontal feedback of the 
modification factors constitute two dimensional feedback 
paths for the array. 
Input Decomposition and Vertical Feedback Path 
With matrices of size n where n is m times the 
available bandwidth w, ( 2. 2) can be parallely decomposed 
irito 2m strips, each w in width and 2n in length as in 
Figure lla. Each strip in turn consists of 2m w x w blocks 
which are of the same size as the array. 
For w = 2, n = 4 and m = 2, Figure 27 shows an array 
with its input data flow decomposed parallely into four 
strips numbered from V1 to V4 • These strips are processed 
by the array one after another continuously. The procedure 




is being processed, a horizontal data stream consisting of 
values M t and signals C3 is generated and moved rightwards 
OU 
into B . Subsequently, the array is switched to s mode for q 




• In this 
mode, the contents of B is recirculated back to the array q 












V2 c .. I I 
r-1 c., c .. I 
y I I 
I c .. c., I 
V1 c.,. I c., c .. I I I 
c .. c .. I c .. o .... I 
I I 


















d,. d .. 
ci., d,. 








b., b,. v 
b1:1 
b .. b .. 
b., b., 
L ___ j 
bu 
c., CIZ I a. .. o., L ___ J I 
Ca a. ... I a. .. o,. 
I o., a. ... I 0.12 
o,.. o.,.. L ___ J 
o .. 0.12 
Cla I I <.-':!!-;:;;.. cu 
.I .I 
-0 >. 
}625:1 ' ' I => 
iii Cl 
Bq s.. .... s.. ..... Cl 
3\1 lK ..... cu 
I I 
- ..... 





4 I d"' .. 
"' r-1 d: d"' 34 y I d~ d"' 
v<n 
I .. <2M-D\I c1: I d~ d:!' 3 I 
r-1 d~ cl"' I d; b'.! I 
I 
.. 
I I v '+' I cl~ cl~ I 1:i: b"' ' 
vm 
"' \ 
c:! I d~ c1: I b~ 0 \ 2 I I ' \ cf, c;! I d::' b: I 0 0 ' 
c~ c~ 
I 
b~ b: I 0 \ I I \ 
c~ c~ I b: 0 L ___ J \ \ I \ c: a.:: I 0 0 \ 
I ' a.: o.~ I 0 
a.~ 0 L ___ J 1nterMecl10. te results 
0 0 
0 
Figure 27. First iteration in the processing of 
a problem larger than the array size. Note that 
the strips of intermediate results all have 
leading blocks of zeroes. 
68 
69 






generates an output strip v< i > 2 I v< i' 3 I v< i' 4 of length 
( 2m - 1) w = 6 that is preceded by a block of zeroes as it 
emerges from the array. In Figure 28, these intermediate 
results are stripped of their zero blocks and then fed back 
to the array where the above procedure is repeated. The 




, come out from the bottom of 
the array, each (2m - 2)w = 4 in length and likewise, is 

















































d:;' d (I) 12 I b~ 







































X23 X1• '\ 
X13 0 \, 
0 0 
X12 I 0 


































Figure 28. Second iteration of the problem. 
Intermediate results are stripped of their 




Figure 29 shows a mapping of input and output data 
flow of each iteration to array execution steps. Notice 
that input data flow of the second iteration is optimized, 
i.e. zero blocks that exist between output strips of the 
first iteration are eliminated. 
In general, a w x w array will solve a problem which 
is decomposed into 2m strips of length 2mw and width w, in m 
iterations. During the ith iteration, where 
i = 1, 2, ... , m, the array eliminates the strip Vi (in T 
mode) and reduces the length of each of the remaining strips 
by w ( in s mode) . This is because each remaining strip 
leaves behind one w x w block of data in the X registers as 
it is being processed by the array, and subsequently emerges 
with a w x w block of zeroes preceding it. These zero 
blocks can be skipped in the next iteration to shorten 
processing time without incurring any error. Final results 
after the mth iteration consists of m strips, each mw in 
length and w in width. 
The number of steps needed for the array of Figure 27 
to compute CA- 1 B +Dis: 
m 
(2w - 1) + I (2m - k + lfw = 
k=l 
7 3 1 
-(m2 n) + -(mn) + -(n) + 2w - 1 = O(m 2 n) 
3 2 6 
























































C3 LOOP 1 
.. . ' vm 
: ~--·-----d:) ~ 4 
:1-" d,.;/ 
l 1:: ::i: 
· ..Jd~.·······ro· v v4 
no~ 1o"; 
I ,. . 
. l~ ..... ····.o:;.·L 
,,b,, ~: 
: ~-lo,. .-· .. : v.<ll . .- .. . 3 
~ .. d ./ 
~~d. d·;~ 
~ ida d • i ~ 
'j"d . ·'·;·" v ·. • ..... ···l"o : _,., 3 
!"b"~ I... '"i 
I ~-· !lo,. ...... ·· 6""·1. 
.·•b •··. 
:i" ~,.: :"lo : (1) ~~-,.-- .. , Ve 
:(C• C ./ 
:1~: /~=! 
;·a.·~·· .. i 'V 
I o. .. :.....----- f 
iO.a O." ! 
i o.., o..J 
~-/~ 
f"' C.i 
!c• c i 
I •j 
!c• c.i 
jCn o.,. i 
l°'" o.. [ ........ ..V1 
t°'· o.. i 
;°'• o..l 
io. • ......... :-i 
'- ~· 










.. -----~. • (041 d'" i.: vm 
: id: .. ··"d=~ 
4 
··1·-d:··· d:;! "' .. ' ;a"'.·· "6;.;·;·· E 
.. ~b·:·· ~= ! ~ l 
: l!j .. --·., r : ra:. d: ;;' vm 
\!d~ ... ·····d;~ 
3 
la: d:; 
!d: b: ! 
ib: kl;) 
(5./~ re: c:i vm 
;c: c:V 2 
I "' ' c.. c:i
! c; a:d 
' ., . ;aa a_;} 
~ ........... · 
RESULTS 
.. -~-~-· ·><~ . E2 x .. / 
x .. x .. 
Xa Xw_ .. : 
x~ .. ·······ti" 
·o o 
o .... ·""x~ • E1 
.-··Xa x. ~ 
: x. x. 




Figure 29. Control/timing sequences of input 
and output data flow for each iteration. The 
dash/dotted lines represent input strips, while 
the dotted lines represent the output strips. 
72 
73 
Controls and Horizontal Feedback Path 
In Figure 29, values of Cl, C2, and C4 necessary for 
the above example are illustrated at each step. C3 is not 
shown since it is dependent on input data and generated on 
the fly by the diagonal cells. For each control signal, a 1 
represents the boolean value true and O represents false; 
when a signal remains unchanged from its previous value, a 
dash (-) entry is entered. The pattern is as follow: for 
each iteration, Cl is true during the first strip and false 
throughout the remaining strips. C2 is true only where 
pivoting is allowed, i.e. the portion of the first strip 
which contains data elements of matrix A, and false anywhere 
else. C4 clears the X registers of the array each time a 
new strip arrives, therefore it is true at the first step of 
each strip and false elsewhere. 
In general, an input strip with N blocks of vertical 
data will generate a corresponding N blocks of horizontal 
modification factors pairs (M t and C3); thus, the storage 
OU 
of the horizontal data stream should be N blocks long so 
that timings for horizontal feedback are accurate. Because 
the array itself acts as a w x w block of storage, for each 
ith iteration, the FIFO queue B should be (2m - i)w long. 
q 
With m = 2 and w = 2, Figures 27 and 28 
corresponding length of B for each iteration. 
q 
show the 
The buffer B should have the addressing capability 
q 
such that its length can vary in uni ts of blocks. This 
74 
permits the array to solve problems of arbitrary size, as 
long as B maximum length is adequate for the largest of 
q 
them. The control for the addressing can be generated by 
the host or the dedicated controller. 
Multiple Arrays Configurations 
Even though both have throughput time O(m 2 n), the 
system of Figure 13 is slightly faster when compared to the 
array from Figure 27. Given a problem, the former will 
solve it with 
7 3 1 
-(m2n) + -(mn) + -(n) + 2w - 1 
3 2 6 
7 5 
- -(m2 n) - -(n) + 2w + 1 = 
3 3 
3 
-(m - l)n 
2 
steps less than the latter. This stems from its use of two 
subarrays, where some overlaps in processing are possible 
when the S array is working on a strip while the T array 
processes intermediate results from the previous strip. 
Likewise, by using multiple arrays, the system of 
Figure 30 gives better throughput than the single array of 
Figure 27 under the same I/O constraint. This is because 
each subarray effectively replaces one iteration, with 
partial results from one subarray immediately processed by 
the next, thereby maximizing concurrency while eliminating 
the corresponding iteration. Such a system will be called 

















V1 c.., I I 
c., c .. I 
I 
C31 C22 I 
c .. c,. I I 
CR o. .. I 
I o.., 0.32 I 
V3 ,-, cl., 
I I 
d31 y I 
c .. I d., I 
c., c,. I du 
I b., C3, c •• I 
c., c,. I b31 I 
c,, a. .. I b .. 
I bu a.., o.,. I 
V4 
r-1 
I I y I 
cl.., I 
I 












d., d •• 
cl., d,. 
cl,, b •• 
b., b,. 
b., b •• 
b., b .. 
b,, 
b22 t_ ___ J 
b,. 
0.33 a. .. L ___ J 
o.., o. .. 
0.13 
0.31 a. .. L ___ J 






x,. x .. 












E 2 I x.. Ii\ 
x., x,. 













' ' ' ' 
0 0 
0 0 



































Fiqure 30. L-tuple arrays system processing a 
problem larger than the I/O bandwidth w. Again 
w = 2, n = 4 and m = 2. With L = 2 arrays, the 
problem is solved in one iteration. 
75 
76 
system. In Figure 31, control and timing sequences of 
Figure 30 subarrays are illustrated. Because the input 
strips v1 1 > of the second array are interspersed by blocks 
of zeroes which cannot be removed, buffer B2 is required to 
q 
have the same length as Blq, instead of being one block 
shorter. 
In general, a problem requiring m iterations on a 
single array will need only k = m / L iterations on a system 
of L-tuple arrays, assuming that m is an exact multiple of 
L. After each ith iteration, the length of partial results 
will be (2m - iL) 2 w. Hence, the system will compute 
CA- 1 B + D of such a problem in 
mtL 2 
(L + l)w - 1 + 2 (2m - (k - 1) L) w = 
k= 1 
7 3 1 
-(kmn) + -(mn) + -(nL) + (L + l}w - 1 = O(kmn) 
3 2 6 
( 4 .1) 
steps. The first part of (4 .1) represents the number of 
steps taken for input data of the last iteration to traverse 
the system, and the summation term gives the number of steps 
to feed input data of all iterations into the system. Final 
results in this case always emerge from the bottom of the 
last array of the system. 


















































s - 0 
4 
3 
2 - - 0 
1 I 1 1 
ARRAY l Cl C2 C3 
.. ·· 
: . -· . : v<o -
: fJ./ 1C) ;......- 4 -
! td: ~::, -
: ;d.. d !; 
:._jd ..... ···1:i~·i·· v 
!b~.. ... .. :.......- 4 
! ~~i ,bi'> ..... ··1o .. 
.. -~i:; .. r·, -
;!"' b.J: (!)-: ~-,.--~ ; .... Y3 
i ffi" d. i: -
: .d d . ; -
u:;/~:!' v 




i1~r- ~~'~ · = 
:a., o.., ;..--- 2 
!O:a 0. i 
:a .. ~·· 
~·a./ .. .- . . C•i ;cs c.; 
,c. c ~ 
jc. o. • i V 
!o. ·~ l 
I 0..' 
!Oa O.ee i 
!a . 






.... ···· ·, E 
:.,.......- 2 
: .------~: 
~ (6: d; i: (1) 
: .d ...... -;;., i v ... 
:! "!'···· ..... r--
lcl: cl~ i 
!d: b:l 
ib: b;J. 
Uo;_,.,,.---o ·, E1 
.. ·r o ~ 
: 0 ..... ~: : ra: d; J·· (1) 
: "d'" _...:.;,.,I V3 
.I a .. ·· \,1-~ 
"-j-ai cl:I'; 
. d"' b"' ! 
l1o: bj} 
~-/~ 
o ___ _...-c!\ 
r'f!G) Q) I 
,-.. c .. i vm 
!C: c:~ 2 
(c: c: ! 
! c: a.:~ 




.. ········ ·. i:-,. x.,,.: ~ 
x.. x* .. ~ 
x.. Xa4 
X" Xw.: 
·-~-~-- ... -g .. 
0 0 
0 0 
_o_.· .··· ... x~ L.--E1 
x., x., . 
Xn Xu 
: 






Figure 31. Control/timing sequences for each 
array. Note that both arrays 1 and 2 process 
their respective input strips concurrently. 
77 
78 
Thus, when m = L (as with the example used in 
Figure 30), CA- 1 B + D is computed in a single pass with 
total processing time equal to 
(4m + l)n + w - 1 = O(mn) 
which is identical to the performances of the systems from 
Figure 13 and 16. However, note that the system of 
Figure 30 is totally independent of problem's size and the 
number of cells used is smaller since the T arrays are 
eliminated. 
When m is not an exact multiple of L, that is when 
mmodL * O, the number of iterations required to complete the 
problem is k = rm;D, with the kth iteration employing only 
the first mmodL subarrays of the system. The total 
processing time will be 
r mt .V 
2 
(m L + l)w - 1 + I (2m - (k - 1) L) w mod 
k= 1 
Again, the summation term represents the time 
necessary to feed input data of k iterations into the 
system. However, since only the first mm 
0 
d L subarrays of 
the system are used during the kth iteration, final results 
will emerge from the bottom of the mmodLth subarray, instead 
of the last subarray. Therefore, the first term of the 
throughput equation reflects the shorter path through which 
data has to traverse during the kth iteration. Figure 32 
froM host 
V1 V2 ''' V2M 
Blq 
2MW X W 
B2 q 
2MW X W 
BL q 
2MW X W 









• l\J ,:::,, 
Figure 32. An L-tuple arrays system with a 
common data bus from each array to host. The 
vertical feedback path has a FIFO queue B for 
' ' r temporary storage of intermediate results. 
79 
shows a multiple arrays system which provides a common data 
bus that delivers final results from any one of its 
subarrays to the host. 
Intermediate Results Storage 
Until now it was assumed that the intermediate 
results, generated in between iterations by all of the 
systems discussed in this chapter, are handled by the host 
and that the blocks of zeroes can be stripped in the host. 
However, the resulting back and forth of data between host 
and system places heavy demands on valuable I/O resources. 
80 
A more efficient approach, used in the system of Figure 32, 
is to route this vertical feedback into the FIFO queue B . 
r 
Similar in concept to the use of B for the horizontal 
q 
feedback, this queue acts as a buffer storage in which 
intermediate results emerging from the bottom of the system 
are delayed from being fed back to its top until inputs of 
the previous iteration are fully processed. An added 
benefit is that, during processing, the queue can be used to 
eliminate zero blocks generated by temporarily halting the 
pipeline for some corresponding durations. 
B should be (2m - L) 2 w - Lw long, i.e. long enough to 
r 
accommodate partial results of the first iteration of the 
largest problem likely to be solved by the system, minus the 
combined length of all subarrays. And since each iteration 




also be given the addressing capability which allows its 
length to be altered by an external control. This ensures 
that data enters the array continuously for maximum 
throughput. 
PROCESSING OF SPARSE MATRICES 
Another feature which further enhances the versatility 
of our array is that it can compute problems involving 
sparse matrices efficiently by skipping blocks of zeroes, 
similar to the system from Figure 19. Furthermore, because 
the design functions in both triangle and square mode, only 
81 
one array is needed for problems of such type. While a 
multi-array system like that in Figure 32 is fully capable 
of processing sparse matrices efficiently, the procedure 
involves only the first array; thus, in Figure 33, it was 
reduced to a single array system for the sake of clarity. 
In the following discussion, the example (3.1) will be used, 
with p = 3 and the input data flow decomposed parallely like 
in Figure 18. Because only one array is needed, the 
continuous stream of input data alternates between non-zero 
blocks of strips V1 , V2 , ••• , vm which are processed by the 
array in T mode, and the corresponding blocks of strip Vm+i' 










An (1) (1) 
I I I B2 B3 I I I 
Blq 
3w x w 
Br 
2w x w 
X2 I II X1 
Reduced system for sparse matrix 
82 
Initially, non-zero blocks ~ 1 , ~ 1 , ••• , AP 1 and block 
-lm+i,i of strip V1 are fed into the array. They in turn 
generate corresponding blocks of M t and C3 which move 
OU 
rightward into buffer Bl . Of length pw, Bl is long enough 
q q 
to provide the required delay so that its contents can be 
used by the array (in S mode) to modify subsequent blocks 
B1 , B2 , ••• , BP and the first zero block below Bm. 
Thereafter, B1 is left stored in the array, whereas B2 , ••• , 
BP emerge from the array as B~ 1 ' , ••• , B~ 1 ', to be stored in 
queue B . Thus, the capacity of B should be (p - l)w to 
r r 
hold these modified B blocks. The zero block, after 
modification, becomes the first block of result X
1 
and is 
sent to the host. 
From V2 to Vm, the computation proceeds similarly with 
blocks Ai, ... , p+ i- 1 • i and -lm+ i. i of strip Vi generating 
their own Mout and C3 values to modify Bi~~~~ ,p+i- 2 , Bp+ i- 1 
and zero block Bm+i" The modified block Bii- 1 > is then left 
in the array; blocks B~ i - 1 > . , 
l.+l, ... ,p+l.-2 8 p+ i - 1 become blocks 
B~i> . 
l.+l, ... ,p+l.-1 which are then stored in B r for the 
succeeding strip Vi+ 1 , and the modified Bm+ i emerges from 
the array to become the result Xi. 
The throughput time of this system is 
(m(p + 1) 
p-1 
- l k)2w + w - 1 = 
k=l 
2n(p + 1) - pw(p - 1) + w - 1 
83 
which nearly doubles the throughput time of the system from 
Figure 19. This is to be expected since the single array 
from Figure 3 3 system is doing the work of two. However, 
such a comparison would be misleading because it does not 
take into account the fact that, for the two subarrays T and 
S of Figure 19 to work concurrently, the total I/O bandwidth 
of that system would have to be 2w. Or to put it in another 
way, with a total I/O bandwidth of w, these two subarrays 
will each have only a bandwidth of w /2. Consequently, a 
problem will have to be decomposed into twice as many input 
data strips with width that are only half as wide. This 
effectively doubles the throughput time of the system such 
that it is actually comparable to that of Figure 33. 
OVERLAPS IN PROCESSING BETWEEN PROBLEMS 
In the simplest term, a systolic architecture can be 
thought of as a pipeline architecture in which each row of 
cells of subarrays in the system represents a stage in the 
pipeline. A pipeline reaches its peak performance when it 
outputs a usable piece of data for each of its cycles. This 
peak performance is attained only after the pipeline is 
completely filled with data, a process termed pipeline fill. 
To maintain its peak performance, the pipeline must be fed 
continuously. 
Similarly, a systolic system can reach its maximum 
throughput rate only after it is completely filled with 
84 
data. This maximum throughput rate is defined as the rate 




sets to problems emerge 
times elapse between 
from the 
any two 
consecutive sets. Note that these elapsed times between 
solution sets may be of different lengths since the sizes of 
the problems themselves can vary. To maintain this maximum 
throughput rate, the input data flow must be continuous, 
i.e. problems to be solved must be fed into the system 
without any empty gap in between them. An empty gap in the 
data flow will result in a corresponding length of time 
during which cells are idle, and solutions to problems will 
be that much farther apart. A gap which exceeds the total 
length of the system will cause the system to completely 
empty itself of data, resulting in what is commonly termed a 
pipeline flush. A pipeline flush is expensive because it 
takes a finite amount of time to refill a system. 
To put in another way, the maximum throughput rate of 
a systolic system is achievable and, more important, 
sustainable only if processing overlaps between problems are 
fully exploited. Say that two matrix problems, PP and PN, 
are to be solved in that order by a system of L subarrays. 
For an I/O bandwidth w, PP is decomposed into mp data 
strips. A processing overlap between PP and PN occurs when 
data of the last iteration of PP and data of the first 
iteration of PN are processed by the system at the same 
time. Maximizing this processing overlap can shave off 
substantial amount of computing time from PN. 
85 
It can be 
seen that the time saved, in number of steps, is calculated 
as the number of subarrays through which data of the last 
iteration of PP must travel, times the size w of these 
subarrays, plus the skew factor w - 1 of the data flow. 
Thus, when mp is an exact multiple of L, the total number of 
cycles necessary to solve PN is reduced by 
(L + l)w - 1 
When mp is not an exact multiple of L, the last 
iteration of PP involved only mp mod L subarrays of the 
system. Therefore, PN is solved with 
(mp modL + l)w - 1 
less cycles. Lastly, if PP is a sparse matrix as described 
in the previous section, the number of cycles reduced from 
the computation of PN will always be 
2w - 1 
This is because sparse matrices are processed only by 
the first array of the system. 
CHAPTER V 
EXTENSIONS TO FADDEEV'S ALGORITHM AND CONCLUSION 
In the previous chapter, the reader has seen the ease 
with which the new systolic array uses massive parallelism 
to solve many types of matrix problems via Faddeev's 
algorithm. The actual size of the array, and therefore its 
throughput, is shown to be restricted only by the available 
bandwidth between the host and the array. Even this 
restriction is effectively circumvented when a number of 
such arrays are combined into a system to give a desired 
level of performance. Such a multiple arrays system reach 
its maximum throughput rate when its pipeline is completely 
filled with data. By ensuring that the input data flow is 
continous, this maximum throughput rate is maintained at all 
times. It would seem then, algorithmically speaking, that 
nothing further can be done to induce more parallelism into 
matrix computations. 
However, that last observation is simply not true. We 
have found that, by extending Faddeev's algorithm, the 
maximum throughput rate of a system can be nearly 
quadrupled. Furthermore, such a tremendous improvement in 
system throughput requires absolutely no architectural 
modification to the system. 
87 
HORIZONTAL EXTENSION TO FADDEEV'S ALGORITHM 
Before illustrating how we extend Faddeev's algorithm, 
let us introduce the concept of compatibility between matrix 
problems. Suppose we have matrices A, B and D of order n, 
upon which we wish to perform the operations A- 1 , A- 1 B and 
A- 1 + D. From Figure 2, we can solve these matrix problems 
with Faddeev's algorithm by formulating them as 
* = A- 1 ~ = A-1B _: I : = 
( 5. 1) 
A- 1 +D 
0 
( 1 ) ( 2 ) ( 3 ) 
where I is the identity matrix. These constructs reveals 
that they all have identical left halves, i.e. they consist 
of the same matrix A in their top left quadrant and the same 
matrix -I in their bottom left quadrant. When this is the 
case, we say that the problems are horizontally compatible. 
Obviously, solving x horizontally compatible problems 
involves repeating the calculations for the same left side x 
number of times. In the case of (5.1) where x = 3, solving 
(1), (2) and (3) requires repeating the process of 
triangularizing A and annulling -I three times. If by some 
means the redundant iterations of this process are 
eliminated, nearly half of the calculations necessary to 
solve (2) and (3) of (5.1) can be skipped. This would yield 
a large savings in computing time. 
88 
To accomplish this, we extend Faddeev's algorithm 
horizontally to the right so that (5.1) is reformulated as 
A I I I B I I (5.2) 
-I 
( 1 ) ( 2 ) ( 3 ) 
Grouping (1), (2) and (3) together as in (5.2) allows 
us to triangularize A and annul -I only once, and reuse the 
multipliers generated from that several times on the right. 
The results will appear as 
Ack> I' k > Bek> I' k > 
0 A- 1 A-
1 B A- 1 +D 
( 1 ) ( 2 ) ( 3 ) 
It is easy to see that the horizontal extension to 
Faddeev's algorithm maps particularly well to a system using 
our systolic array design: it requires absolutely no 
architectural nor algorithmic modification, either at the 
system level, subarray level or cell level. When the 
available I/O bandwidth is w, (5.2) is parallely decomposed 
into (x + l)m input strips, each 2mw in length, as shown in 
Figure 34. 
VJ / ...-Ci;~ 
--- d.3 d:i. 
. .-···'ci~ i d33 d2.' A 
.. • ii ci [.,._,, I 
f d., ci~: d23 ~: I 
.· _ ..6 .. ··.,i d31 d~i d;a/i ! : 
Results /a· o H d ,.J U-.-e::o O ?1 1 . " 21 U,z,, /' ' 
(o. f ter Modifico:tion) .-······"6'\ ! 0 o ii d /0 ii 1 /'/'O i: 1 .. " . , >" . . _,, , I I 
~ /a· 0 : ! 0 0 U,i) 0 ~ _,,0 0 : I 
)-(b•i! o o ~ O/b:.!!}/f Ji.a;;;;~) 
; o o jj o o/';! 'b.3 b.>f .. ·>/' _,, /' 1 .·······. ; ; ; <' ; ; b .-{_ ' ' 1 .... /' / / /' .. ....-·o \j o o :: o_,,/'IO.i!i: ~ 02•::, . ..-·> /' 
/'/' I ...... · ii !! .( ;;-<_ b i ······-:;,/ /'/' 
/' 1/0 o :: o 0Af c.1 b.>zn oZJ ?<f /' /
/'/' . ..... 1l ! ! / '; i; .··. / //' 
/ .......... :::1 ·\1l 0 0 ii 0,,,/ 1 :: b/l/b2~;,.,b_,_3_// /'/' /1 .. ··· '1' ;; '- ;;.-{_ b ! .... ..,,,_/' /'/ 
/' /' /Q OLO OH'O 011to21 ~:/' /' / //' ...... ii jlj .-e::/i i /,/' j j ... -;/ // 
/' ........ ··o \ 1:-1 O ii! 0 /' 0 ii 1 _,, 0 : : .. bu_ .... ·/ ,,, one do. to. strip ,,- .• / ii !I! L/' :;/,/ j ·······"// // 
/ 0 0 jlj 0 0 !'r 0 0 < 0 0 ,; /' / . l '' /',, '/' /' 
i O -1 iii 0 o. ... !I 0//1 ll.. 0 ........ ;> /',,, Eli Mina. ted o. fter ' 'I' ' ' /, ' ·........ / /' 
i 0 0 !1! 0.43 o.;J,t 0 0 j /',,, L __ --~ second ltera tlon 
j-1 a.~:' ;".:d'~2•i t .... ~_ .... ;·;> / 
: a.., o.~~ 0.23 o.,1; /' /' 
i 0.31 Qj i 0.13 ...... ;-:./ 
! : \ ...... :;_;-> EllMino:ted o. ft er 
; O.zi O.iz., / 
i a. .... ....- L----~ first itero.tion 
\ 11_ ••. ...-
·· ..... ·· 
Figure 34. Parallel decomposition 
horizontally compatible problems. 
example, n = 4, w = 2 and m = 2. 




As before, the L-subarrays system of Figure 32 will 
process this input data flow in k iterations, where the 
value of k depends on m and L. When m is an exact multiple 
of L, we have k = m/L and the system will compute x 
horizontally compatible problems in 
mtL 
(L + 1) w - 1 + I [ (x + 1) m - (k - 1) LJ [2m - (k - 1) LJ w 
k= 1 
(5.3) 
cycles. In the above equation, the first product term of 
the summation represents the number of input strips for each 
iteration, while the second term indicates the strips 
length. The solution to the first problem will come out 
after 
cmtL>-1 
(L + l)w - 1 + I [ex + l)m - (k - l)L] [2m - (k - l)L]w 
k= 1 
2 + (m + L) w 
cycles, with the second line of the equation indicating that 
only part of the kth iteration is needed. Afterward, 
solutions to subsequent (x - 1) problems are outputed one 
for every (m + L)n cycles. In the special case when m = L, 
we have k = 1 and the system will solve the first problem in 
(4m + l)n + w - 1 
cycles. As to subsequent problems, the system will complete 
one every 2mn cycles. The difference between the two 
91 
throughput equations of the first problem is due to the fact 
that the input data flow for x horizontally compatible 
problems consist of (x - 1) m more strips than that of a 
single problem. This means that during each iteration, the 
system has that many more strips to process. Thus when 
k > 1, the previous iterations will delay the output of 
results whereas with k = 1, those delays are non-existent. 
When m is not an exact multiple of L, the number of 
iterations required for the system to process (5.2) is 
k = rm;fl, with the kth iteration involving only the first 
mmodL subarrays of the system. The total throughput will be 
rm, H 




with solution to the first problem coming out after 
LID/~ 
(m L + l)w - 1 + \ [(x + l)m - (k - l)L] [2m - (k - l)L]w 
mod L 
k= 1 
2 + (m + m L) w mod 
cycles. Again, the second line of the above equation 
indicates that only part of the last iteration is needed by 
the system to compute the first problem. Afterward, 
solutions to subsequent x - 1 problems will emerge one for 
every (m + mmodL)n cycles. 
------, 
92 
Since the input data flow of x horizontally compatible 
problems consists of only (x + l)m strips, versus the 2xm 
strips required if they are not compatible, large saving in 
storage space can be gained on the host side. On the other 
hand, the length of the FIFO buff er B 
r 
should be 
( (x + l)m - L) (2m - L)w - Lw since the intermediate results 
after the first iteration have many more strips. Because 
the length of each strip is still 2mw, the capacity of the 
buffers B should remain unchanged. 
q 
To get an idea of how much the system throughput can 
be improved when horizontal extension is applied, suppose 
that we have a system of L = 4 subarrays, with each array of 
size w = 32. On this system, we wish to perform x = 50 
operations with matrices of order n = 128. If these 
operations are not compatible, solving them one at a time 
without processing overlaps will take a total of 110, 350 
steps. With processing overlaps, this number is reduced to 
102,559. However, if the operations are horizontally 
compatible, they can be processed by the system in 52, 383 




nearly by a factor of two. Of course, this number can vary 
depending on x. As x gets larger, the improvement factor 
gets closer to two. 
93 
VERTICAL EXTENSION TO FADDEEV'S ALGORITHM 
Even when a group of matrix problems are not 
horizontally compatible, they may exhibit another type of 
compatibility which can also be exploited to give an 
equivalent speedup in system throughput. To expand on this, 
let's suppose that we have y = 3 matrix operations to 
perform, namely CB, B + D and EB + D where B, c, D and E are 
of order n. Like before, we can express these problems as 
* = CB 0 * = B+D D * = EB+D D 
(5.5) 
( 1 ) ( 2 ) ( 3 ) 
Because the left side of problems (1), (2) and (3) of 
(5.5) are not the same, they are not horizontally 
compatible. However, it can be observed that they all have 
the identity matrix I in their top left quadrant and matrix 
B in their top right quadrant. To put it differently, these 
problems all have identical top half. When this is the 
case, we say that the problems are vertically compatible. 
94 
To avoid repeating the same calculations on the 
identical top sides of vertically compatible problems, we 
extend Faddeev's vertically such that (5.5) becomes 
-~ ( 1 ) 
(5.6) 
-* ( 2 ) -E D ( 3 ) 
When y vertically compatible problems are grouped 
together as in ( 5. 6) , the common top side needs to be 
processed only once. This means that after the top left 
quadrant is triangularized and the top right quadrant is 
modified with the generated multipliers, they can be used 
repeatedly to annul the left side of succeeding stages and 
transform their right side into solutions. 
In the case of ( 5. 6) , solving it involves only the 
annulment -c, - I and - E. This is because the identity 
matrix I in the top left quadrant is, by its nature, already 
triangularized; as a consequence, matrix B in the top row 
will remain unmodified. Annulling -c, -I and -E while 
extending the operations to the right will give 




( 1 ) 
( 2 ) 
( 3 ) 
95 
which shows the solutions to (1), (2) and (3) in the right 
quadrants. 
As with horizontal extension, systems using our array 
design can handle vertical extension to Faddeev's algorithm 
without any modification. Shown in Figure 35, the input 
\.I ..... ·:_t··. 
W .... ····· U44' 
< > ./ ; 
/1 .····.. i d43 d34: 
/ ··~1 ·.~I . 
/ I ............ U4i;: U33 d24: 
// I .. ·· " , 
,,.,. 1( ci41 d32! j d23 d14! 
// .. ···· .. : :: : 
/ ,,, ..... ······~e ~,i: d31 d22i: d13 d44: 
/ /I .. ·· 'I' :: : 
,,/ .,,,./ l/~43 -e34l1l cl21 d12!! d4:i d34: 
// ... ········ .. I' q: : : ----'--: --- Results 
,,.,. ............. -e ~i:-.e 33 -e e1:1: ci It d 12] : d33 d 21] 
/ iii iii ci ~1 i ; "' "' ; >e 11 -e :ie>-e 23 -e H:: 11 un:: ue:i u11: . ·~ ·t .. . 
(o. fter Modlfico. ti on) 
i-e31 -ee.J1!-el3 -1 i1! d31 d22ii dl3 0 . 
j-e21 -eie'lj 0 0 ill d21 d 12 ·· 0 0 ~ 
;-e 11 0 :1: -1 0 ;I; d 11 0 ; ; 0 0 j ~ ' 'I' 'I' ,. ' 1 
0 0 !1! 0 0 i1! 0 0 ii 0 0 [,../ 1 one do. ta strip 
l 'I' : ' /: I 
0 -1 !1i 0 -C44l1l 0 0 !! 0 ... /b44! I 
0 Q !i1-C43 -C34!:· 0 0 .,,,.~b43 bpj1 : 
-1 -C~l;-C33 -C24H: 0 /b42i: b_p/b24: I I 
l '~ ' '' 'I I 
-C41 -c~1i-C23 -C14i: b41 bp!f 623 b,~} I I 
l .. ~ .. . ) 
-Ca1 -C~ j-cl3 1 ! ! bj,/ KJ 22!!_ b13 ...... ....- I 
;I; ;; 6 ;·· ......... ···· _,/" ) 
-C21 -C12il; 0 0 q-· 21 bia-: .,,,. ,,.,. 'I' .....;, .. / / 
-C
11 
0 ll! 1 /(j 1!. b 11 ......... ...- // // :: / :· ... · / / 




....... ·--' // // Elif"lino. teci o. fter 




! ·· .......... ··~// seconu i ero. ion 
..' / 
0 
1 ..... / 
·· ...... ·· ····· ( EliMino. ted o. fter 
---> 
first itero. tion 
Figure 35. Parallel decomposition of 
vertically compatible problems. Again 
w = 2 and m = 2. 
y = 3 
n = 4, 
96 
data flow of y vertically compatible problems consists of 2m 
strips, where each strip is (y + l)m blocks long. The 
L-subarrays system of Figure 32 will process this data flow 
in k iterations. When m is an exact multiple of L, k = m/L 
and the process will be completed in 
mtL 
(L + l)w - 1 + 2 [2m - (k - l)L][(y + l)m - (k - l)L]w 
k= 1 
(5.7) 
cycles. When m is not an exact multiple of L, k = rm/fl and 
the throughput is computed as 
r mt il 




In throughput equations ( 5. 7) and ( 5. 8) , the first 
product term within the summation represents the number of 
input strips for each iteration. The length of each strip, 
on the other hand, is indicated by the second product term. 
Even so, note that (5.7) and (5.8) are identical to (5.3) 
and (5.4), respectively, save for the variables x and y. 
After the kth iteration, the set of y solutions 
emerges in m output strips. As shown in Figure 35, an 
output strip consists of y segments, each of width w and 
length mw. Each segment i = 1, 2, ... , y is part of the 
solution to the ith problem. Because a solution is divided 
into m segments with each segment part of an output strip, 
97 
the solutions will not be completely out until the last 
strip has emerged. Thus, the number of steps needed for the 
first solution to come out is computed by subtracting 
(y - l)mw from (5.7) or (5.8). Each following solutions 
takes another mw steps. 
Again, storage space needed on the host side is 
greatly reduced since the input data flow of y vertically 
compatible problems is only 2(y + l)m2 w long, as opposed to 
4ym 2 w were they not compatible. However, the length of the 
FIFO buffer B should be ( (y + l)m - l)w to accommodate 
q 
longer strips of modification factors. In addition, the 
length of B should be (2m - L) ((x + l)m - L)w - Lw to 
r 
adequately hold intermediate results with longer strips. 
TWO-DIMENSIONAL EXTENSION TO FADDEEV'S ALGORITHM 
While using either one of the previously described 
extensions yields substantial reduction in computing time, 
still greater improvement in throughput is possible when 
both techniques are combined into a two-dimensional 
extension to Faddeev's algorithm. To illustrate, consider 
the matrix operations AB, AE + F, B + D and E + G. As 
before, A, B, D, E, F and G are all matrices of order n. 
Formulating the operations as follow: 
98 
*-AB *=AE+F 
(1) (2) (5.9) 
* = Bt-D * = E+G 
(3) (4) 
reveals that ( 1) and ( 2) are horizontally compatible, as 
with (3) and (4). Furthermore, (5.9) also shows that (1) 
and (3) are vertically compatible, as with (2) and (4). 
Thus, using horizontal extension, (5.9) becomes 
I I B I E 
-A 
( 1 ) ( 2 ) (5.10) 
I I B I E 
-I 
( 3 ) ( 4 ) 
Since both constructs of ( 5. 10) have identical top 
halves, vertical extension can also be used to further 
obtain 
I B E 
-A 0 F ( 1 ) and ( 2 ) ( 5 .11) 
-I D G ( 3 ) and ( 4 ) 
This results in a two-dimensional extension to 
Faddeev's algorithm. Annulling -A and -I in (5.11) and 
99 
extending the operations to its right prompt the solutions 
to (1), (2), (3) and (4) to appear as 
I B E 
0 AB AE+F ( 1 ) a n d ( 2 ) 
0 B+D E+G C 3 ) a n d C 4 ) 
As (5.11) reveals, the two-dimensional extension to 
Faddeev's algorithm allows a compatible matrix problem to 
share three of its quadrants with others, instead of two. 
This translates into the elimination of a larger number of 
calculations per problem. 
The input data flow of (5.11) for the L-subarrays 
system is shown in Figure 36. When the number of problems 
is x across by y long, the input data flow is decomposed 
into (x + l)m parallel strips, each (y + l)mw in length. If 
m is an exact multiple of L, the total number of steps for 
the L-subarrays system of Figure 32 to process this data 
flow is 
mtL 
(L + l)w - 1 + I [ex + l)m - (k - l)LJ 
k= 1 




(o. f ter Modifico. tion) . . . 
~ 
....... 942: : 933 924: 
.···· (g41 932! l 923 914: 
one do. to. strip "' .--~~j I Q31 922 j Qf13 ff44j 
.···· : l.A43 lil34; : Qzi 912: : 43 34: 
//I ... ·········· .. ci42': : d33 d24! I gll f ~ if 33 f 241 
/// . (d41 d32i : d23 dH! j f 41 f ~ j f 23 f l~--1 
/ ........ ·-_, : : : : : : : /,. / : I 
Z 
/I .. · .. ······· -1 \ \ d 31 d 22j : d 13 0 j : f 31 f ~ i f y/ e 44j I 
/ / ... : : : : : : : /, : l 
/ / / .····.. { 0 0 j j d 21 d 12: j 0 0 j ! f zi f y( e 43 e ~ 1 I 
/ / 0 ··1 0 .. d 0 ··o 0 .. f /,. .. /.I / .. · : : : : : : : : : : / : / . ....... : r : : 11 : : : : y,,, e 42j : e 3)' e 24: 1 1 ... , , , , = = , '......:: , , ,,, = I 
(O o j j o o j j o o j : o o,,, H e 41 e ~>re 23 e 14: 1 1 
: : : : : ' ' -<- : : ,,, ,,, : : .. ' )1 I : 0 -1 : : 0 -o. 44: : 0 0 : : 0 ,,, KJ 44: : e 3Y e 221 : e 13 .......... ,) 
: : : : : : t-r:' : j,. / : :... ........ / / / 
: 0 0 : ~0.43 -0.34: : 0 0 ,;,.-: K..l43 bw., e21 e12: ···· ,,, / 
: :: :: /:: //jj > //// 
t 1 -0. 4e: ~o. 33 -0. e4: j 0 / 'b 42: j b ~ b 24: j e 11 .. ······. / / 
: : : :I L,/ : : / : ·,_ ........ ,,, ,,, ,,, ,,, 
~0.41 -0.3e! ~O.e3 -0.14:'fb41 b~'fb23 b14: · ..... · / / 
: : : : : / : : _; // // 
~O. 31 -0. ee! ~O. 13 1 : ! b av /b 2e: : b 13 .... ··· ·· / / 
: : : : : / : .... .... .... · / / / / 
~O.e1 -0.12j j 0 0 ) 1"be1 b12j ....... / / 
: : : /;( : : j // // 









) ,,,,,, - second itero. ti on 
: : : ...... / 
Io o I :.... ..... ;·;,,,,,, 
EliMino. ted o. fter 
:. 1 ,..: L,,, --> first itero. tion 
·· .... ·· 
< > 
'W 
Figure 36. Parallel decomposition of x by y 
compatible problems. x = 2 is the number of 
horizontally compatible problems, and y = 2 is 
the number of vertically compatible problems. 
As before, n = 4, w = 2 and m = 2. 
100 
101 
If m is not an exact multiple of L, then the number of 
steps needed is computed as 
r mt fl 
(m L + l)w - 1 + \ [ex + l)m - (k - l)L] 
mod L 
k- 1 
[(y + l)m - (k - l)L]w (5.13) 
Subtracting [ (x - 1) (ym + L) + (y - 1) ]mw from (5.12) 
or [(x - l)(ym + mmodL) + (y - l)]mw from (5.13) will, in 
both cases, give the number of steps elapsed before the 
solution to the first problem is completely out. The 
interval between solutions to problems on the same column is 
mw steps. Between problems on the same row, this interval 
is computed as (ym + L)mw when mmodL = O, or (ym + mmodL)mw 
when mm 
0 
d L '¢ 0 • 
Because of the increases in number of strips and in 
their length, the capacity of buffers B and B should be q r 
expanded as previously indicated. 
To see how much of an improvement over single 
dimension extensions this technique is capable of, let us 
again assume that we have a system of L = 4 subarrays, with 
each array of size w = 32. With this system, 10000 
operations are to be performed on a number of matrices of 
order n = 128. Solving the problems one at a time without 
processing overlaps will take a total of 22,070,000 steps. 
Maximizing processing overlaps will reduce this number to 
20,480,159. If single dimension extensions can be used, the 
102 
problems can be solved in 10,241,183 steps. The improvement 




However, if compatibilities between these problems are 
exploited such that the two-dimensional extension can be 
used with x = 100 and y = 100, the total throughput will be 




almost doubling the speedup figure achieved with single 
dimension extension. As was noted before, the improvement 
factor grows closer to four as x and y get larger. 
Another advantage of the two-dimensional extension is 
that it further enhances the inherent programmability of 
Faddeev's algorithm. For example, should it be necessary to 
compute U, where 
U = (AE + F) (E + G)- 1 (B + D) +AB, (5.12) 
(5.11) can be rearrange to become 
I E B 
-I G D (5.13) 
-A F 0 
103 
Solving (5.13), that is annulling -I and -A while 
extending the operations to the right will give 
I E B 
0 E+G B+D (5.14) 
0 AE+F AB 
Observe that within the box of (5.14), the necessary 
components of (5.12) are already correctly positioned such 
that repeating the Faddeev's procedure on them will produce 
the final result 
I E B 
0 (E+G) ck> (B+D) ck> (5.15) 
0 0 u 
In short, to compute U from (5.13), one only needs to 
triangularize the augmented matrix formed from I, E, -I and 
G, then annul the augmented matrix formed from -A and F 
while extending both operations to the rightmost column of 
(5.13). Using the L-subarrays system, U is computed from 
the input data flow of (5.13) in 2k iterations. The first k 
iterations are needed to compute the matrices in the box of 
( 5 .14) . This intermediate results is immediately fed back 




By now, it is clearly obvious that the symbiosis of 
Faddeev's algorithm and the new systolic array system 
described in Chapter IV has given rise to a very powerful 
and versatile tool. The algorithm itself provides a 
considerable generality of operation which should allow the 
system to have a large range of application in the 
scientific and industrial fields. In return, the system has 
brought massive parallelism to the multitude of matrix 
operations capable by the algorithm. Furthermore, the 
system's enormous potential for parallelism can now be fully 
exploited to yield very high throughput with the Faddeev's 
algorithm extensions described in Chapter V. 
As compared to other designs from Chapter III, this 
system does not suffer any of their drawbacks while 
providing many practical advantages, some of which can be 
summarized as follow: 
- Either in single or multiple arrays form, 
the system is totally independent of problem 
size and will solve sparse matrix problems 
efficiently without any reconfiguration. 
- The system provides identical performance 
using a smaller number of cells or arrays. 
Indeed, given an equal number of arrays, its 
performance will be superior. When taken 
into account the fact that its design is 
ideally suited for the extensions made to 
Faddeev's algorithm, its throughput 
potential far outdistances any other system 
previously considered. 
- From a 
system 
user point of 
is exceedingly 
view, operating the 
simple: the input 
data flow is fed only to the top array and 
system controls consist of a few signals to 
each array top left cell. 
- The design of the system is truly modular, 
with simple and regular interconnections 
between cells and between modules. Hence it 
is very amenable to expansion: adding extra 
blocks of shift registers will allow it to 
handle correspondingly larger problems, 
while increasing the number of arrays will 
yield higher throughput. 
- Since all modules are square blocks w x w in 
size, it is topologically more economical 
and efficient in terms of PC board area. 
105 
In conclusion, the system's most important advantage 
is that while its design is simple enough for implementation 
to be an easy task, it is abundantly powerful and versatile 
to make that task worthwhile. Therefore, it is this 




1. H. T. Kung, "Why Systolic Architectures?" IEEE Computer 
Magazine, Vol. 15, No. 1, January 1982, pp. 37-46. 
2. Dan I. Moldovan, "On the Design of Algorithms for VLSI 
Systolic Arrays," Proc. of the IEEE, Vol. 71, No. 1, 
January 1983, pp. 113-120. 
3. Kai Hwang and Fay~ A. Briggs, Computer Architecture and 
Parallel Processing, McGraw-Hill, New York, 1984, pp. 
768-774. 
4. Charles L. Seitz and Juri Matisoo, "Engineering Limits 
on Computer Performance," Physics Today, Vol. 37, 
No. 5, May 1984, pp. 38-45. 
5. c. A. Mead and L. A. Conway, Introduction to VLSI 
Systems, Addison-Wesley, Reading, MA, 1980, pp.263-292. 
6. H. T. Kung, "Notes on VLSI Computation," in Parallel 
Processing Systems, ed. by David J. Evans, Cambridge 
University Press, Cambridge, MA, 1982, pp.339-356. 
7. Ronald Collett, "CPU Architecture, Part I: Problems And 
Limitations of Von Neumann Computers," Digital Design, 
Vol. 14, No. 11, November 1984, pp. 90-95. 
8. Wolfgang Handler, "Innovative Computer 
Architecture~How to Increase Parallelism but Not 
Complexity," in Parallel Processing Systems, ed. by 
David J. Evans, Cambridge University Press, Cambridge, 
MA, 1982, pp.23-32. 
9. R. W. Hockney and c. R. Jesshope, Parallel Computers, 
Adam Hilger, Ltd., Bristol, 1981, pp. 1-51. 
10. P. M. Dew, "VLSI Architectures for Problems in 
Numerical Computation," in Supercomputers and Parallel 
Computation, ed. by D. J. Paddon, Oxford University 
Press, New York, 1984, pp. 2-21. 
11. s. Y. Kung, "VLSI Array Processors, " IEEE ASSP 
Magazine, Vol. 2, No. 3, July 1985, pp. 4-22. 
107 
12. Leonard s. Haynes et al., "A Survey of Highly Parallel 
Computing," IEEE Computer Magazine, Vol. 15, No. 1, 
January 1982, pp. 9-24. 
13. Lawrence Snyder, "Introduction to the Configurable, 
Highly Parallel Computer," IEEE Comouter Magazine, 
Vol. 15, No. 1, January 1982, pp. 47-56. 
14. Douglas G. Fairbairn, "VLSI: A New Frontier for Systems 
Designers," IEEE Computer Magazine, Vol. 15, No. 1, 
January 1982, pp. 87-96. 
15. H. T. Kung and c. E. Leiserson, "Systolic Arrays (for 
VLSI) , " Sparse Matrix Proc. 197 8, Society for 
Industrial and Applied Mathematics, 1979, pp. 256-282. 
16. A. L. Fisher et al., "Design of the PSC: A Programmable 
Systolic Chip," in Proc. of the Third Cal tech 
Conference on Very Large Scale Integration, ed. by R. 
Bryant, Computer Science Press, Rockville, MD, March 
1983, pp. 287-302. 
17. A. L. Fisher et al .c, "The Architecture of a 
Programmable Systolic Chip," Journal of VLSI and 
Comouter Systems, Vol. 1, No. 2, Computer Science 
Press, Rockville, MD, 1984, pp. 153-169. 
18. D. K. Faddeev and V. N. Faddeeva, Computational Methods 
of Linear Algebra, W. H. Freeman and Company, 1963, pp. 
150-158. 
19. W. W. Gentleman and H. T. Kung, "Matrix 
20. 
Triangularization by Systolic Arrays," Proc. SPIE-The 
International Society of Optical Engineering, vol. 298, 
1981, pp. 19-26. 
H. T. Kung, "Systolic Array for 




21. Richard L. Burden et al, Numerical Analysis, PWS 
Publishers, Boston, MA, 1981, pp. 289-294. 
2 2 . W. M. Gentleman, "Error Analysis of QR Decompositions 
by Givens Transformations," Linear Algebra and Its 
Application, American Elsevier Publishing Company, New 
York, 1975, pp. 189-197. 
23. J. Greg Nash, "A Systolic/Cellular Computer 
Architecture for Linear Algebraic Operations," Proc. 
1985 IEEE International Conference on Robotics and 
Automation, March 1985, pp. 779-784. 
108 
24. J. G. Nash and s. Hansen, "Modified Faddeev Algorithm 
for Matrix Manipulation," Proc. SPIE, Vol. 495, August 
1984, pp. 39-46. 
25. Henry Y. H. Chuang and Guo He, "A Versatile Systolic 
Array For Matrix Computations," The International 
Symposium on Computer Architecture, 1985, pp. 315-322. 
/ 
APPENDIX A 
EXAMPLES OF FADDEEV'S ALGORITHM 
In the following, we will solve sample matrix problems 
using Faddeev's algorithm and its variants. The unmodified 
Faddeev's procedure, involving only ordinary Gaussian 
elimination, is illustrated with the first example. Its 
variant form using Gaussian elimination with neighbor 
pivoting is illustrated in the next two examples. Taken 
from chapter III, examples (3.1) and (3.2) are solved using 
the Faddeev's procedure combined with Givens rotations. 
All calculations in the examples are carried out using 
nine decimal places precision; however, because this thesis' 
line formatting allows only a finite number of characters, 
results are shown rounded off to two decimal places. 
Using Ordinary Gaussian Elimination 
Suppose we want to compute CA- 1 B + D, where A, B, c 
and D are matrices of order n = 3 and 
[ 
2 -1 3 ] 
A = -1 0 2 
4 -4 5 
B = [ 
1 2 4 ] 
3 1 -3 
1 7 9 
[
-1 2 3 ] [ 0 4 - 6 ] 
C = O 7 -4 D = -2 1 O 
1 -5 0 7 3 2 
110 
With Faddeev's algorithm, 
expressed as 
this problem can be 
2 -1 3 
-1 0 2 
1 2 4 
3 1 -3 
* 
4 -4 5 1 7 9 
= 1 -2 -3 0 4 -6 
0 -7 4 -2 1 0 
-1 5 0 7 3 2 
(A.1) 
where, by means of matrix triangularization, all entries 
below the diagonal elements of A are zeroed out such that A 
is triangularized and c is annulled. After completion, the 
results should appear in the place of D. 
Matrix triangularization procedures are often used, 
among other things, to solve linear systems. In solving a 
linear system, three operations are permitted on its rows: 
1) Entries of row Ri can be multiplied by any non-
zero constant "II.. and the resulting row used in 
place of R i. 
("11..R i) ' (R i) 
This operation will be denoted 
2) Entries of row Rj can be multiplied by any 
constant "II.., added to row Ri, and the resulting 
row used in place of Ri. This operation will be 
denoted (Ri + "11..Rj)' (Ri). 
3) Rows Ri and Rj can be transposed in order. This 
operation will be denoted (Ri) ~ (Rj). 
When used within Faddeev's algorithm, the third 
operation has a restriction which states that i and j cannot 
111 
be larger than the order n of the matrices, i.e. transposing 
the order of the two said rows is not allowed if either one 
or both rows belong to the lower half of (A.1). 
Furthermore, although the entries in the affected rows are 
expected to change after any of these three operations, for 
ease of notation we will again denote the entry in the ith 
row and the jth column of matrix X (X here represents A, B, 
C or D of (A.1)) by xij" With this in mind, we can apply 
Gaussian elimination procedure to (A.1) by sequentially, for 
i = 1, 2, •.. , n-1, perform the operation 
(Rj - (aji/aii)Ri) -+ (Rj) on the upper half of (A.1) with 
j = i+l, i+2, ••• , n, and the operation 
(Rk - (-ck-n,i/aii)Ri) -+ (Rk) on the lower half of (A.1) 
with k = n+l, n+2, ... , 2n, provided that aii * O. When 
8 ii - O, a search is made for the first non-zero element aji 
where j = i+l, i+2, ••• , n and the operation (Ri) .... (Rj) is 
performed so that the procedure can continue. 
112 
Thus, by performing the operations (R2 + .5R 1 ) ~ (R2 ), 







) on (A.1), row R
1 
is effectively used to 
zero out all entries below a
11 
to give: 
2 -1 3 1 2 4 
0 -.5 3.5 3.5 2 -1 
0 -2 -1 -1 3 1 
0 -1.5 -4.5 -.5 3 -8 
0 -7 4 -2 1 0 
0 4.5 1.5 7.5 4 4 
In this system, Rz is used to eliminate entries below 
8 22 by the operations (R3 - 4R2 ) ~ (R3 ) ' (R4 
(Rs - 14R2 ) ~ (Rs ) and (R6 + 9R2 ) ~ (R6 ) • 
system is then 
2 -1 3 
0 -.5 3.5 
0 0 -15 
0 0 -15 
0 0 -45 
0 0 33 
1 2 4 
3.5 2 -1 
-15 -5 5 
-11 -3 -5 
-51 -27 14 
39 22 -5 
- 3R2 ) ~ (R4 ) ' 
The resulting 
113 
Finally, with the operations (R4 - R3 ) .... (R4 ) ' 
(R
5 
- 3R3 ) .... (R 5 ) and (R 6 - 2. 2R3 ) .... (R 6 ) , we obtain the 
system 
2 -1 3 1 2 4 
0 -.5 3.5 3.5 2 -1 
0 0 -15 -15 -5 5 
0 0 0 4 2 -10 
0 0 0 -6 -12 -1 
0 0 0 6 11 6 
which shows the result CA- 1 B + D in its lower right hand 
quadrant. 
Using Gaussian Elimination With Neighbor Pivoting 
We have indicated earlier that obtaining a zero for a 
diagonal element aii during the Gaussian elimination 
procedure necessitated a row interchange of the form 
(R i) ++ (R j) where i < j < n was the smallest integer with 
8 ji '¢ O. Actually, it is often desirable to perform row 
interchanges (or pivoting) involving the diagonal elements 
even when they are not zero. This is because when the 
calculations are performed using finite-digit arithmetic, as 
would be the case for calculators or computer-generated 
solutions, a diagonal element that is small compared to the 
entries below it in the same column can lead to substantial 
roundoff error. 
Referred to as neighbor pivoting, the two adjacent 
rows R i and R j where i < j < n are interchanged whenever 
114 
I a iii < I a j ii , immediately before an operation of the form 
(Rj - (aji/aii)Ri) ~ (Rj) can be performed on them. To 
illustrate this, let us consider the problem of computing 
CA- 1 B + D with matrices of order n = 3 
[-1 5 -3 ] [-2 -7 6 ] 
A= 3 4 1 B = 1 3 1 
6 7 -2 5 9 4 
[ 1 -2 4 ] [ 2 1 -5 ] c = 3 4 -1 D = 2 4 6 
-5 3 2 -3 2 9 
Like before, the problem is expressed as 
-1 5 -3 -2 -7 6 
3 4 1 1 3 1 
*6 
7 -2 5 9 4 
= -1 2 -4 2 1 -5 
(A. 2) 
-3 -4 1 2 4 6 
5 -3 -2 -3 2 9 
Since (A. 2) shows that I 8 11 I < I 8 2 1 I pivoting is 




• Thus, performing 
the operation (Rl) ++ (R2 ) gives us 
3 4 1 1 3 1 
-1 5 -3 -2 -7 6 
6 7 -2 5 9 4 
-1 2 -4 2 1 -5 
-3 -4 1 2 4 6 
5 -3 -2 -3 2 9 
where, after the operation (R
2 
+ .33R 1 ) ~ (R 2 ), we have 
3 4 1 
0 6.33 -2.67 










1 3 1 
-1.67 -6 6.33 








Note that how neighbor pivoting has just been carried 
out by the two previous steps. Once again, the above system 







I < I a
3 1 






6 7 -2 













5 9 4 




























) , becomes 
6 7 -2 
0 6.33 -2.67 
0 .5 2 
0 3.17 -4.33 
0 -.5 0 
0 -8.83 -.33 
9 4 5 
-1.67 -6 6.33 
-1.5 -1.5 -1 
2.83 2.5 -4.33 
4.5 8.5 8 
-7.17 -5.5 5.67 
and 
The procedure is carried out further with the 
elimination of entries below a
22 













) , (Rs + . 08R
2
) -+ (Rs ) 
and (R
6 
+ 1.39R2 ) -+ (R 6 ). We thus have 
6 7 -2 













-1.37 -1.03 -1.5 
3.67 5.5 -7.5 
4.37 8.03 8.5 
-9.49 -13.87 14.5 
117 






















) , the solution to CA- 1 B + D appears in 
the lower right hand quadrant of 
6 7 -2 5 9 4 
0 6.33 -2.67 -1.67 -6 6.33 
0 0 2.21 -1. 37 -1. 03 -1.5 
0 0 0 1.81 4.11 -9.54 
0 0 0 4.24 7.93 8.36 
0 0 0 -12 -15.75 11.75 
The following is another example of Faddeev's 
algorithm with neighbor pivoting. Given matrices A, B, c 
and D of order n = 4, with 
[ 2 -1 3 0 ] [ -8 3 0 3 l 4 -2 7 0 -20 5 1 6 
A= -3 -4 1 5 B = -2 -9 7 8 
6 -6 8 0 4 7 4 2 
[ 1 -1 2 -1 ] [ 1 3 -5 7 ] 2 -2 3 -3 0 -4 1 7 
c = 1 1 1 0 D = 2 1 3 0 I 
1 -1 4 3 1 -3 -1 9 
118 
we want to compute CA- 1 B + D. Formulating the problem as 
follow 
2 -1 3 0 -8 3 0 3 
4 -2 7 0 -20 5 1 6 
-3 -4 1 5 -2 -9 7 8 * 6-6 8 0 4 7 4 2 
= -1 1 -2 
(A. 3) 
1 1 3 -5 7 
-2 2 -3 3 0 -4 1 7 
-1 -1 -1 0 2 1 3 0 
-1 1 -4 -3 1 -3 -1 9 
reveals that, because la 11 I < I az 1 I ' pivoting of rows R 1 and 
Rz is necessary. Thus, the operation (Rl ) ++ (Rz ) produces 
the system 
4 -2 7 0 -20 5 1 6 
2 -1 3 0 -8 3 0 3 
-3 -4 1 5 -2 -9 7 8 
6 -6 8 0 4 7 4 2 
-1 1 -2 1 1 3 -5 7 
-2 2 -3 3 0 -4 1 7 
-1 -1 -1 0 2 1 3 0 
-1 1 -4 -3 1 -3 -1 9 
119 








4 -2 7 0 -20 5 1 6 
0 0 -.5 0 2 .5 -.5 0 
0 -5.5 6.25 5 -17 -5.25 7.75 12.5 
6 -6 8 0 4 7 4 2 
-1 1 -2 1 1 3 -5 7 
-2 2 -3 3 0 -4 1 7 
-1 -1 -1 0 2 1 3 0 
-1 1 -4 -3 1 -3 -1 9 
Before we can proceed any further in eliminating 
entries in the first column, because I ai 1 I < I a 4 1 I ' we have 
to perform the operation (R
1
) ++ (R4 ) : 
6 -6 8 0 4 7 4 2 
0 0 -.5 0 2 .5 -.5 0 
0 -5.5 6.25 5 -17 -5.25 7.75 12.5 
4 -2 7 0 -20 5 1 6 
-1 1 -2 1 1 3 -5 7 
-2 2 -3 3 0 -4 1 7 
-1 -1 -1 0 2 1 3 0 
-1 1 -4 -3 1 -3 -1 9 
120 
Now, all remaining entries in the first column can be 
eliminated with (R4 - .67R 1 ) -+ (R4 ) ' (Rs + .17R 1 ) -+ (Rs ) ' 
(Rs + .33R 1 ) -+ (Rs ) ' (R7 + .17R 1 ) -+ (R7 ) and 
(Rs + .17R1 ) -+ (R 8 ), to give 
6 -6 8 0 4 7 4 2 
0 0 -.5 0 2 .5 -.5 0 
0 -5.5 6.25 5 -17 -5.25 7.75 12.5 
0 2 1. 67 0 -22.67 .33 -1.67 4.67 
0 0 -.67 1 1. 67 4.17 -4.33 7.33 
0 0 -.33 3 1. 33 -1. 67 2.33 7.67 
0 -2 .33 0 2.67 2.17 3.67 .33 
0 0 -2.67 -3 1. 67 -1. 83 -.33 9.33 
Prior to zero out entries in the second column, 
because a 22 = o, the operation (R 2 ) ++ (R3 ) is used to obtain 
6 -6 8 0 4 7 4 2 
0 -5.5 6.25 5 -17 -5.25 7.75 12.5 
0 0 -.5 0 2 .5 -.5 0 
0 2 1. 67 0 -22.67 .33 -1.67 4.67 
0 0 -.67 1 1. 67 4.17 -4.33 7.33 
0 0 -.33 3 1.33 -1.67 2.33 7.67 
0 -2 .33 0 2.67 2.17 3.67 .33 
0 0 -2.67 -3 1. 67 -1. 83 -.33 9.33 
121 
Applying (R4 + . 36R2 ) -+ (R4 ) and (R7 - • 36R 2 ) -+ (R7 ) 
to the above system, we are left with 
6 -6 8 0 4 7 4 2 
0 -5.5 6.25 5 -17 -5.25 7.75 12.5 
0 0 -.5 0 2 .5 -.5 0 
0 0 3.94 1.82 -28. 85 -1. 58 1.15 9.21 
0 0 -.67 1 1. 67 4.17 -4.33 7.33 
0 0 -.33 3 1. 33 -1. 67 2.33 7.67 
0 0 -1.94 1.82 8.85 4.08 .85 -4.21 
0 0 -2.67 -3 1. 67 -1. 83 -.33 9.33 
which requires pivoting of rows R 3 and R4 • Therefore, after 
(R3 ) ++ (R 4 ), we have 
6 -6 8 0 4 7 4 2 
0 -5.5 6.25 5 -17 -5.25 7.75 12.5 
0 0 3.94 1.82 -28. 85 -1. 58 1.15 9.21 
0 0 -.5 0 2 .5 -.5 0 
0 0 -.67 1 1.67 4.17 -4.33 7.33 
0 0 -.33 3 1.33 -1.67 2.33 7.67 
0 0 -1.94 1.82 8.85 4.08 .85 -4.21 
0 0 -2.67 -3 1. 67 -1. 83 -.33 9.33 
122 
where we can proceed to eliminate all entries below a
33 
with 














(Ra + .68R3 ) -+(Ra>· The resulting system will be 
6 -6 8 




















0 -1. 77 
4 7 4 2 
-17 -5.25 7.75 12.5 
-28.85 -1.58 1.15 9.21 
-1.66 .3 -.35 1.17 
-3.22 3.9 -4.14 8.89 







Finally, annulling the lower left hand quadrant 
completely with the operations (Rs - 5.67R4 ) -+(Rs), 
(R6 - 13. 67R4 ) .... (R6) I (R 7 + 4R 4 ) -+ (R 7 ) and 
(Ra + 7.67R 4 ) -+(Ra) will give us the solution in the lower 
right hand quadrant of 
6 -6 8 









0 3.94 1.82 













4 7 4 2 
-17 -5.25 7.75 12.5 
-28. 85 -1. 58 1.15 
-.35 
9.21 





2.2 -2.13 2.27 
-5.9 7.27 -7.53 
4.5 0 5 
-.6 -2.27 24.53 
123 
Using Givens Rotations 
A Givens transformation rotating the two row vectors 
Ri and Rj 
0 • • • 0 8 ii 




8 ik • • • 8 in 
• ajk • • • ajn 
of a given matrix A of order n replaces them with two new 
vectors 
0 . . . 0 a :u a i, i+ i . . . a.fk . . . a.fn 
0 . . . 0 0 a j, i+ i . . . aJk . . . a Jn 
such that, with k = i+l, i +2 I • • • I n, their entries are 
8 Li = 0 ij 
a'.k = cosa . . a .k + sin<X .. a .k 
i 1) i 1) J (A.E.1) 
aJk = -sin<Xijaik + cos<Xijajk 
where 
<Xij = ~ 2 8 ii 2 + aji 
8 ii 
cos <X ij = 
<Xij 
sin<Xij 




cos <X ij + . 2 sin <Xij = 1. 
The transformation obviously leaves unchanged zeroes 
appearing in corresponding entries of both vectors. Thus a 
matrix of order n can be triangularized by applying a 
succession of Givens rotations to its rows Ri and Ri+i' Ri 
124 
and Ri+ 2 , ••• , Ri and Ri+n for i = 1, 2, .•• , n-1 such that 
zeroes are introduced into every columns below the diagonal 
elements. 
When combined with Faddeev's algorithm, Givens 
rotations are used on the rows above the horizontal line to 
triangularize A and ordinary Gaussian elimination is used on 
rows below the horizontal line to annul c. The procedure 
involved can be illustrated much easier with an example. 
Let us find the solutions of the linear system (3.1) 







, and its equations are represented here in matrix form as 
A= [ 
1 2 3 ] 
0 4 7 
2 1 3 
B = [ n 
The solutions' column vector X can then be expressed 
as X = A- 1 B or, by expanding it to become X = IA- 1 B + o 
-where I is the identity matrix and O is a zero vector 
I = [ 
1 0 0 ] 
0 1 0 
0 0 1 
0 = [ n 
! 
125 
allows us to formulate the problem as 
1 2 3 5 
0 4 7 9 
*2 1 3 7 = -1 (A. 4) 0 0 0 
0 -1 0 0 
0 0 -1 0 
Since a
2 1 
= O in (A. 4), we can skip row R
2 
and, by 




using the equations of 
(A.E.1) with 
a1, 3 = ~ 2 al, 1 2 + a3, 1 = ~ 1 + 4 = 2.24 
al , 1 1 
cosa1 , 3 
=--=--= .45 
a1, 3 2.24 
a3, 1 2 
sina1 , 3 
=--=--= .89, 
a1 , 3 2.24 
subsequently get the following system 
2.24 1.79 4.02 I 8.5 
0 4 7 9 














Gaussian elimination is now used to continue 
procedure below the horizontal line. Performing 
operation (R4 + • 45R 1 ) .... (R 4 ) , we have 
2.24 1.79 4.02 8.5 
0 4 7 9 
0 -1.34 -1.34 -1.34 
0 .8 1.8 3.8 
0 -1 0 0 
0 0 -1 0 
Once again, we rotate rows R
2 
and R3 with 
()2 ' 3 = J 2 a 2 , 2 2 + a3,2 = J 16 + 1.8 = 4.22 
cosa2 , 3 = 
sina2 ,3 = 
to obtain 
a 2 ,2 --
()2 ' 3 
4 
= -- = .95 
4.22 
-1. 34 a 3 , 2 ---- = = -.32 
()2 ' 3 4.22 
2.24 1.79 4.02 



















which we can further modify by applying the operations 
(R4 - .19R2 ) -+ (R 4 ) and (R 5 + . 24R2 ) -+ (R 5 ), giving us 
2.24 1.79 4.02 8.5 
0 4.22 7.06 8.96 
0 0 0.95 1.59 
0 0 .46 2.1 
0 0 1.67 2.12 
0 0 -1 0 
Since A is now fully triangularized, performing the 
operations (R4 - .48R 3 ) ... (R4 ) I (Rs - 1. 75R 3 ) -+ (Rs ) and 
(Rs + 1. 05R 3 ) -+ (Rs ) to completely annul the lower left hand 
quadrant of the above system yields X = A- 1 B in the lower 
right hand quadrant of 
2.24 1.79 4.02 


















For the purpose of comparison, we will also present 
here the solutions to example (3.2) of chapter III. Later 
on in appendix B, this example will be used for the graphics 
simulation of Nash's array to show that it produces 
erroneous results as mentioned in chapter III. 
128 
Example (3.2) gives us a linear system which is 
expressed in matrix form as 
A=[nn B=[n. 
Solving this linear system with Faddeev's algorithm 
requires us to formulate it as 
0 2 3 5 
0 4 7 9 
*2 1 3 7 = -1 (A. 5) 0 0 0 
0 -1 0 0 
0 0 -1 0 
Because 8 1, 1 = 0 and 8 2. 1 = o, we can make things a 
lot easier by interchanging rows R 1 and R 3 of (A. 5) with the 
operation (R1 ) ++ (R3 ) ' to give 
2 1 3 7 
0 4 7 9 
0 2 3 5 
-1 0 0 0 
0 -1 0 0 
0 0 -1 0 
129 
Performing the operation (R 4 + .5R 1 ) ~ (R 4 ) reduces 
all entries below a 1 , i to zeroes, and the above system 
becomes 
2 1 3 7 
0 4 7 9 
0 2 3 5 
0 .5 1.5 3.5 
0 -1 0 0 










COSlX 2 ' 3 
=---=---= .89 




sina2 • 3 
=---=---= .45 
lX2 • 3 4.47 
will completely triangularize A to give 
2 1 3 7 
0 4.47 7.6 10.29 
0 0 -.45 .45 
0 .5 1.5 3.5 
0 -1 0 0 
0 0 -1 0 
in which all entries in the second column of the lower left 
hand quadrant can be eliminated with the operations 
130 
(R 4 - .11R2 ) -+ (R4 ) and (R 5 + . 22R 2 ) -+ (R 5 ) • This produces 
the system 
2 1 3 7 
0 4.47 7.6 10.29 
0 0 -.45 .45 
0 0 .65 2.35 
0 0 1.7 2.3 
0 0 -1 0 
Finally, the procedure is completed with the 
operations (R4 + 1. 45R 3 ) -+ (R4 ) ' (R5 + 3.8R3 ) -+ (R5 ) and 
(Rs - 2.24R 3 ) -+ (Rs), to yield 
2 1 3 7 
0 4.47 7.6 10.29 
0 0 -.45 .45 
0 0 0 3 
0 0 0 4 
0 0 0 -1 
which shows the solutions to the linear system in its lower 
right hand quadrant. 
APPENDIX B 
REAL TIME GRAPHICAL SIMULATION 
OF SYSTOLIC ARRAYS 
Simulation techniques play an important role in the 
verification of a design's correctness of operation and 
debugging. Because serial computers are by nature 
sequential machines, their software simulators are often 
little more than conventional language interpreters. 
For systolic arrays, this is simply inadequate. To 
verify whether a given algorithm is correctly mapped into a 
corresponding array architecture, a system designer must be 
able to observe, at all times, the movement of every piece 
of data as they traverse through the array, as well as the 
results from operations performed on each of them by any of 
the cells. Furthermore, for debugging purposes, he must be 
able to look into the registers of every cell at any one 
time, and see the values of all control signals present in 
that cell. In short, he must have the most detailed view of 
the entire system, which may consists of many arrays and 
many cells, at all times. 
To meet the above requirements, a new breed of 
simulator-a systolic arrays simulator-was developed and 
built to aid a hardware or software designer in the task of 
~ 
132 
designing and debugging systolic systems. For reasons which 
will become clear later, it was deemed essential that this 
simulator should be graphics based, hence its name Systolic 
Arrays Graphical Simulator, or SAGS in short. 
From the very beginning, SAGS was designed to simulate 
systolic systems of any configurations. These 
configurations are specified to SAGS by way of script files. 
A script file contains vital informations about a system 
such as its number of arrays, their types and sizes, the way 
they are linked together and the microprograms each cell 
will use. A script file also specifies when and where input 
data and control signals should be fed into~and output data 
taken from~a system. SAGS allows for systems with multiple 
input, control and output data streams. Each input or 
control stream is stored into ASCII files prior to being 
accessed by SAGS. Similarly, outputs of SAGS are written 
into ASCII files. 
During run time simulation, SAGS executes all steps of 
a problem one after another without pause, showing results 
of each step on the screen. This is called multi-step mode 
of execution; it can be stopped and restarted at any time. 
Alternatively, SAGS can single-step through the problem, 
allowing a more detailed inspection of the results. 
Switching between these two modes can be accomplished easily 
at any time. 
133 
Visually, SAGS allows all arrays of a system to be 
seen on a monitor screen, as long as each array has a 
reasonable number of cells. Because the real estate of a 
monitor screen is limited, arrays can be overlapped such 
that one in the background can be brought into the 
foreground for observation at any time. In addition, 
individual arrays can be interactively positioned anywhere 
on the screen to closely match the system schematic. SAGS 
allows an array to have two different views: a real view, 
with the array and its cells appearing smaller and therefore 
containing less information, and a full view, where the 
cells show all their registers content. The view of an 
array can be specified in the script file, or changed during 
run time. All visual changes made to a system configuration 
during run time can be recorded back to the script file for 
reuse. A status bar on top of the screen displays 
additional informations such as the current step number, the 
total execution time and the array being selected. 
In this author's experience, SAGS has been quite 
useful in verifying and debugging the designs presented in 
this thesis. Indeed, it is while using SAGS to simulate 
Nash's implementation of Faddeev's algorithm that the bug in 
its boundary cell microprogram was discovered and 
identified. For the reader's convenience, SAGS source code 
is listed in Appendix c. 
134 
In the following, three series of snapshots illustrate 
the simulations of three different systolic designs. Each 
snapshot is a screen output of SAGS for one execution step. 
All problems used in these simulations are examples taken 
from Appendix A. 
The first series of snapshots B.1 shows the simulation 
of Nash's system (from Figure 5) as it solves example (A.4). 
It can be seen that this implementation of Faddeev's 
algorithm produces erroneous results. 
The second series of snapshots B.2 shows the 
simulation of Chuang and He's system (from Figure 8) using 
example (A.2). 
In the last series B.3, the L-tuple arrays system of 
Figure 30 is simulated, with L = 2. 
here solving example (A.3). 
This system is shown 
-IUDf: MW•tGfWIQIJU W!Jttil$MJN_M_a:Lii5i DI! -ii ~ j v uo; 1 o.uoj 
1 o i ! o I Delay Cells for 
r::R~. 1 1 Input Data Flow 
Status Bar 
I o.oo 1Ho, , .uol o.oo: Skewing 
I 0 0 . I 0 0 I 
r---+--+--- ~. ----1-..... 
1 i I 11 i I I 
, 0.001 0.001 0.001 ! o.oon ll.001 0.001 
ionolo 1 /o!o!o; r 
r ~ I I I Y ----:... o oo · o .ool o oo! I o .ool o 001 o oo! 
I I I l • ' 
C4 1 0~0 11 ~.olv 1 ---A 11 I I Triangular __....-- o oo o ool I o ooi o ool o ool-- Square Array in 
. II I I I 
Q I fl nnl n nnl n nnl 
Anay rn Roal Mod~i ~ Real Mode 
Ta Bits · ··1 • · ··1 ·. ·1 · ··1 
g u l'l'I'' 
I I · o.oo" o.oo o.ooJ 
o I o o ....._ Delay Cells 
.. i ... 1 
U.UIJ. IJ .IJVj 
o I o ! 
---#-
I ! 
I o .OOJ 
I o I 
L--J 
Snapshot B .1.1. simulation of Nash's systolic 
array solving example (A.4). 
--------. . . . . - . , ~ . , , . " ~. . ' : --------
a13--"-l.OOI I O.OOj 
! 0 I I n I . 1,_;_,_i 
I j ! ! 
al2 --r-2.001 o.oo: I 0.001 o oo: 




--...1.00! 0.001 0.00 
I 0 ! 0 
~ ... r .. 1,..,..,.,,.,M ..... ,.,AI 
' .... \;\' ... ' \,>.VI.' u I II .\JI,/ ... i 
c ---i-1.00 I 0 .00 I 0 .00 I 
I • •• ! o "" I o •o I I _.,......-r· ,. I ,.. • ! I 
s ! o.oo Yo.oo I o.oo 1 1 
7' •oood •• ,.1r .. , ..... ,, 
X / ! 1.00 o.oo 1 I 
out I • no I • •o I I I . .. I . .. I I 
1 o.oo v.oo t f 
I ~ .1{1 {!: i I 
Triangular ~ i uo i I 
I o .oo 1 I 
Array in Full Mode I • •. I I -- ~
I t I . 
o.ool o ool o.t:-7 Tag Bits 
0 I 0 I 0 1 I I 
; c. ~c c M ~. tC c I 
o.oo I o.oo • o.oo I 
• o I I I •. o I o .oo I o .oo 1 
o.oo ! o.oo I o.oo I 
o.oc q o.oo cl o.oc oi 
o.oo I o.oo I o.oo 1 
o .oo I o oo I • •• 1-- Square Array I . I "_,.. I • 
1uo I o.oo I o.oo I in Full Mode 
fJ.{l:{I {l i i:i.i:i{l ., {t_~ ~I 
Ii .uo I o .oo : Ii .oo I 
o.oo I o.oo I o oo ! 
UC : UC : H~ l 
• ••' o 'Cl • •. ,I '. •·· 1 .• I • _..,' 
o I o I o 1 
"' ... ,,~ ... ,..,..! 
.. "I , ": 
I 0 ~ (I I ---
0 ~' ~ ' 
Snapshot B.1.2 
135 
_...,,., ••••s.1•HJM• •e'd"te-iu ____ ''•d=· • M· -
a 23 ~ ' ooi i o ooi I 0 ! I 0 I 
a22 
~ ool J ool I o ool 0.001 
• I 0 I I " " I .:......J....,; I " r,; I 
a21 I H I I I I I I 
--+-o.ooll 2.001 o.ool I o.ool o.oo 0.001 
iololoilolo cl 
a 11 --+ 1. OC 0 ; 0. OC 0 : 0 . GU V : : Ci . 00 Ci ; 0. Gt. 0 ~ (..OU ~ 1 
o.oo I 1.00 I o.oo ! I o.oo I o.oo I o.oo 
I.CO : ~.00 I 000 : I 0.00 : 0.00 : 0.00 I 
o.oo tt o.oo ~ o.oo J I o.oo j o.oo • o.oo / 
" M " i n ""' n I I n ,..., n ij ,.. ,..,n n ! n "" n ! • " ' i ' ..... ' I I •. " • I . "" . I '." . I 
1 1.00 I 1.00 I I e.oo I o.oo H o.oo / 
1 nnn len• ilnon lnon lnnn I 
1:···1··"1~."·a···ff·--1 
~~ u.ilO il.Oil UO I 
Snapshot B.1.3 
~f(l.~oo 1 Q.t1{1ttH!t.t':!tt1l 
I LUO 1 / 1.00 f u.oil I O.ilil ! 
! 0.00 I I 0.00 I 0.00 I 0.00 
~ I G.GO I G.00 I 0.00 
o.ool o.oo o.ooj 




I 0 .oo, 
o I 
. ~~ ....... ~ ......... ~. 
a2l clears~o.oo o; 
all from t I oo I 
~egister r l o.oo I 
instead of 1 o.oo 1 




I o I o I 
1 3.ooi r-H' o.ooi
H I 
1.001 '.001 I o .001 o .001 
0 0 : I 0 I 0 : 
" I 11 ' I I 2.00~ •.ool 3.ool I o.ool o.oo o.ool 







o; o.oo rtl J o.oo o; o.oo ~; 
I 1.00 ! I o.oo I o.oo 
I A II I I •. oo I I o.oo I o.oo 
f o.oo I / o.oo f o.oo • 
•f'OTo'Ol I' o.oo •I o.oo .1 
I 1.00 I 1.00 W o.oo 
1 0 .Ou 
I e.oo : 1 uo I o.oo 
~I°"" I u I 
{I. {l{I {I ! I ti. (II) !) ~ I) .1)1) t;i • 
LUO I I i.UO ' i.oo I 
I o .oo I i o .oo I o .oo I 
0.GU 0 
0.00 
0.00 . .. 











o I o I -
ti .VO! 
i 0 ....__.., 




















0 0 0 
00"0 oo·o oo·• 
00"0 I 00"0 






























I oo· o 
Io I oo·o 
o I o 




00"0 I oo·o 















I 00'0 I 00· 1 
Io . oo·o Ne . oo·o 
9·1·a :+oqsd-eus 
I 






0 00"0 o oo·• 
oo·c oo·i-oo·o 
oo·o oo·r ot·r 
00·1 oo·o os·o-
o oo·( 0 00"1 I oo·; 
r I r 
oo·o f oo·o I 
~·1·a :+oqsd-eus 
IOo:oJ 
I oo·o I 
I 00· 1 I I I 0 co·c I 
11 
oo·o I oo·o 






0 M"I) l)l)"(t 
I I oo·o I oo·• I oo·o 
11 00' I I {11)·9 I OO'I 
11 i I 
11 oo·o I 00' I 
Io 
oo·o 

































I I I I I 













.. 00" l 








°'" l I o·· 
os·1 00-1· t oo·o 
00·1 00·1 I 00"1 
I 
11 
os·o-I oo·o I oo·o 













oo·o oo·o I 
oo·o oo·o I 
I 
I 0 
00·1 00"1 I 










00·1 00·1 I 





oo·o oo·r I 
I 
I 00·1 I oo·o I 
Io oo·o Io 09"0 Io I I 
l00·
0
o I I 














I . 00"' oo·o 
11 00·1 






oo·o I oo·r H"0-1 
oo·o 
Io .... I 00" l ti" I 
oo·c-os·o i 
00"1 I 00"1 I oo-o I os·o-I 





































0 00"0 ! I 




00·1 00·1 00·1 
l 
I os·o-I oe·o I oo·o oo·o on i1 oo·o i 1 0 OO"l 11 00"( 11 
l I I l ! ! I I 
WO iiO"ii Ion 11••.-o vu·~ 
I l I I I 











I , ... Ot"O ! oo·o I 
00·1 I OO"I I oo·o oo·o I 
00·1 11 oo·• I 
l ! 
iiii"O I 










0 I 0 
100·0 Ion 
I oo·o I oo·o I oo·o I iOo:Ol 
I 
00'0 
I oo·o I oo·o 11 OO"I I 00·1 
I 00·1 
Io 
00·1 I 00"0 I Io oc·e 
0 oo·o e~·c I I c ffC 
I I 
oo·o I oo·o I n-o· I I s,·o I oo·o I 
cc·o I co·1 
I 
s.-c-I I Sl"c· I H"C· I I I 11 I 
I ot"I I oo·o I ..-o 11 wo I n·o-1 





00·1 I 00·1 I o~·' 
I I I 
I 00·1 00·1 os· o-I oo·o I oo·o I oo·o 




00·1 ,, (U)'t I 0 I I I I 11 l I I I I I 
loo·o I···. Ion I Ion-vu·u l•o·o I I I 
I 
l 
oo·o oo· o , ... 0 oo· o 
i 




i 0 I 
;.,., .• j 





1)0' c C1J· C oo· \ 
~ 00· 1 00· 1 00" l I I 00 l I 
oo· o Si" l 03" ! I i \Z" 't-I 
i ..... ij_ •••• i .... I II iH I 
iUUUU~ 
~!(tl)"(t (tl:t"(I. I 00"0 ~ i M.CI 
I _ .. I ... i ... I ' ... I I I n u-I :n u-i :J• 11• I 
I .. ·-1 ifl1-I 
I II" 0 I n·o-oo· o I I oo·o I oo·o I 
i. cc· 3 I. co· o , .... I 1. C'' ~ 11 ~\. ~ I I' i' .;, v• f I 
i oo· o -~ oo· o oo· o I i oo· o I oo·o I 
I 00· 1 I C0' 1 00" l I i 00" l I 00· 1 I 
I j I i I 
• I oo· o I oo· o 00" 0 I I oo· o 
II 
oo·o I 




11 I I 
100· 0 oo·o 100·0 I 100·0 oo·o 100·0 





100·0 oo· o loo-0 oo·o 
I I 
i l ! 0 








' jc~· G 
I 0 I 0 
...----. 
I o I 
loo·. I I' • i 
l oe-o co·o en ! I co·o ! 1 oo-o I 00·1 I 00·1 1 I 00-1 1 
I on M oo· o I sn I 1 os· r I 
lo wo 1o wo 11 io·o-1 11 ff~ 1 
~jon-I oo-ol 
I :i\"0-t in-I i\O· I l i\ o-I S>o-I 
II·I·a ~oqsd-eus 
I u-o I 11"0 ~ n-o-I I oo·o I oo·o I 
i. "·• ff. .... i 1 .••. 0.1 I. ,., .•. I. .... , 1 
~,~... ;J ·~··l 
I oo·o I oo·o ! oo·o I l!o--oo-·-o~l-o-o-·o-i,.1-o-o-·o--. 
: on I 00"1 I 001 I I 00"! I 00! I 001 I os·o-I oo·o I oo·o I I oo·o I oo·o I oo·o 
I 1 oo · o I 1 oo · o ! 1 oo L I . ! 1 on I 1 on Ii on 
: I I I I I l 1 I I I I ;oo·o too·o 100·0 I ;oo·o 100·0 
I I I l~l~+---;L~~ l ., I • I 
loo-'o loo •o : 





100· 0 i 
Jl'f't:D•P.iWl&l 
--u• t•••tti••·~w· . ·•'ltttiel~"*-.-- ,,, .•. 
I o .001 i 0 .001 
I 0 I I 0 I 
1---i 
o .ool o .oo! I I I I o.oo 0.001 
0 I ~ ! I 1 f ! I 
~ I i I I f I 
o.ooH o.oo 0.001 ! o.ool o.ool 0.001 
' 1 I 1 I 1 11 1 I 1 I 1 I I 11 I I 
UO l : lOv 1 ; > 00 1 : : i. GO q UO i; 0. 00 i! 
0 .00 ! 0.00 i 0 .00 I I 0 .00 I o.oo I o.oo I 
I LOO ! UC I 1.00 i I 1.0C f 100 I :.oo I I 
~ 0.00 I 0.00 I I 0.00 I 0.00 I 0.00 I 
: ,_,7 i; 7.60 1 i 10.:' :; 0.00 I: 0.00 1: 
I 0.00 I 0.00 I I 0.00 I 0.00 1 ·0.22 I 
: ·o. •! 1-us \ 1 -o •5 1-us I ·o. •5 l 
I 0.00 I uo 11 0 00 I 0.00 I 0.0.i I 
f o. •5 1 I I ·o •! ! I o. oo 1 i o. oo 1 I 
I o .oo f I ·2.H I ; .io I u; f 
I 1.00 I ' 1.00 • 1.00 i ~::: ~ :-1.00 I o.oo 
I . 
x ___,..•.col o.oo 
2 I I t 1 
(erroneous) ,....._........ 
? .eel o ooi 
1 I o I ______. 











0.001 o.oof I o.oo 
i I ,c 1 o,---'c 
I I Ii I o.ool o.oo 0.001 j o.ool o.oo 









Z.OG ii l.00 1 I 3. 00 1 I I '. 00 1 I 0. 0-0 I I 0. 00 1 I 
o.oo I o.oo o.oo I o.oo o.oo o.oo I 
I I I I LOC I I.CO I 1.CO I I.CC I l.00 I 1 . cc i 
o.oo ~ o.oo I o.oo I I o.oo I o.oo o.oo 1 
·, ,,7 i; 7.~0 1) 10.:, 1, 0.00 1, 0.00 11 
I 0.00 i 0.00 I 0.00 • 0.00 I 0.00 I 
'·0.IS 1 ·0.I! i 1 ·0.H ,-G.IS ·O.IS I 
~I 0.00 I I 0.01) I 0.00 I 0.00 I 
\~ ,·0.1~ 1, 0.00 11 uo 1\ 
t 0.00 I I li.Otl t ·2.2' I s.~o j 
I 1.00 I I 1.00 I 1.00 I 1.00 
~luc t 0.00 ; uo 
l .. ~J J I 
3 i I I I 1 ! 




(erroneous) . . . 
' 1 oo: o oo/ 
Snapshot B.1.14 
I I I 




lllW!!!ffi!!!'. --...-.:nirJr!r;rJ, .w.w.1.10iiwi:t:u:w ... llirnT'lw. =~-- -- -
I II 
•. OOJ I 0 .001 
Q I I • I 
~I --l--l__JI . 
I 0 .oof •. 001 I I I I 0.001 
x. I 0 0 1 I o c vDelay 
in--- I I I II I 
Cells 
.• --i-o.oo o.oof o.0011 o ool I 
0 
I I 1 · 1 o.ool I ,c_c,c o o 
X "'--~ I I I 
I c_-oc~ o.ocll ucl I c.MI o.col 
Tag Bit ~! o o I 1 o I o 
Tag Bits 
------rj I T Array in o.ool I o.oo~ o.oo o.oo Real Mode L.!_j I o 0 0 -- S Array in G 
1 








Snapshot B.2.1. Simulation of 









I I I r 
1 uo1 0.001 1 o.ool o.oo 
I o I o I I o o I 
I J I I I J I 
T B. I I " I I ag its 1-1.001 o.oo 0.001 I 0.001 o.oo o.ooj 
o I o o I I o I o o Tag Bi ts 
He's 
X I o.oo t I o.oo t o.oo o 
11
1 o.oo o 
1
1 o .. oo or o.oo •? 
ou 1 rray in 
v t ~-GO I 0.00 0.00 0.00 0.00 0.00 I s A . 
M .oo o.oo o.oo o.oo o.oo o.oo --
out ;..o.oo 1 o.oo o.oo I 1 o.oo f o.oo o.oo Full Mode 
-Mout -- I o.oo o o.oo ol I o.oo ol o.oo ol o.oo ol 
I 'I I I 1.00 o.oo I o.co o.oo o.oo I 
I o.oo o.oo I I o.oo I o.oo I o.oo I 
T Array -------I o.oo o.co 11 o.oo ! o.oo I o.oo !---- X 
in Full Mode o.oo o I I o.oo o I o.oo o 1 •.of=f=o V - II out I.CO 0.00 0.00 0.0 
o.oo o.oo I o.oo o.oo Mout 
~ ! 0.00 I 0.00 I u I xout 
---..---. 
I o c o I 
I o.oo~ool 
0 .001 t.001 












o.ro oo·o oo·o 
oo·o 00·1 oo·o 
oo·o 00"1 I 00'1 
Io oo·o oo·o II 00·1 I 
I GO'G I oo·o I oo·o GO'O I oo·o 
I 
I oo·• I oo·o I ot'O oo·o I OO't 
I co·o I co·o 
L 
00·1 00·1 I 00·1 
I I I 0 OO't 





co·o 00'0 oo·o ffO· 
I 
oo·o oo·o oo·o oo·o oo· o [['0 
oo·o 
Io 
00'0 00'0 00·1 OO'l 00'1 I 
I• oo·o oo·o oo·o I oo·o oo·s 0 OO'i I 
jo..°o 
0 I 0 




I Ion OO'l 
I 0 I 
!oo o , !on-! 
•Guwjjj;iii+ffiiiifEJ.•· f~iCifiii·i•iiW a &Hiiiijjil 
c·c:·s: ~oqsd-eus 
o o I o 
100· o oo· o 100-o I 
L__J 
on ; .wo on I~ 
:::: I :::: :::; 11 :::: I 
o oo·o Io oo·o 1 oo·o 11 o oo·o I oG·o I oo·o oo·o 11 oo·o I' on I 
oo·o I co·o oo·o I en en 
oo·o I oo·o oo·o I I oo·e oo·o I o oo·o Io oo·o o oo·o I 1 oo·o Io oo·o I 
oc·o I oo·o oo·o I , ...... -oo-·-o...;;..1-o-o-·o...;.._o_o_·o~I 
oo·o I oo·o oo·o l 1 00·1 
1 
oo·o i oo·o I I oo·o I oo·o oo·o I I oo·o I on I 00·1 I 
~loo·o l1 on lo on-! ;==~:::;:::::~;.;_:,_J 
0 I 0 I 0 
t'lo·o oo·o •oo·o loo-o oo· s 00'( 
I I I .___. 
I 0 0 I I 0 0 I 
,(!{I. 0 loo· o I Jon-Ion I 
I I 
I • i • i l,'fl" i(I,, t I 




I o 0 I 0 
100·1 00·1 oo·o 
I 
o.J"i I ilO"il I w"il I I 
I oo·o I oo·o I oo·o 00"1 
l1 
00·1 I 00·1 
I 00·1 00·1 I 01·1 p oo·o 
II oo·o 00·1 
oo·o oo·o oo·o oo·o oo·o 
oo·o oo·o oo·o oo·o oo·o 
00·1 OO"I I co·1 OO"T 00·1 
l 11·0 1 oo·o Ii Ol"O l 00·1 f 1 [["9 





oo·o Et"O os·o-Ll"O 
oo·o 00·1 00"1 00·1 00·1 oo·o 
Oil"O •1 fO"O •1 ... ,. l 00·1 oo· L I OO"t 
0 0 l 
oo·o 00" l· loo-I 
I 
00·1-00·1 







I o 0 I 0 
oo·o oo·o 100-0 








oo·o I oo·o 
II 







oo·o oo·o Io 00·1 
oo·o oo·o I oo·o 00"0 oo·o 
oo·o 
I 
oo·o I oo·o oo·o 




oo·o 11 oo·o I (f"! I on I I I 
I 00·1 I oo·o I oo· o 11 oo·o I [["O I OS"O-I 
I oo·o I oo·o I •o· 1 I I WI I on I 00·1 I 
I• o;;·o I• •••• : I ~--· I l1 ••«-I 1 vr.· • I. vv·;, I 1 • I 
0 I 0 I 0 I 
oo·e oo·o loo·;-I 00· 1 00" l loo·1-
I A 11 I I 
I 
I I'--' I I 0 0 0 
100· 0 100 I· : :00· 1-oo·; 
! o I I I 
inn .,. l{')(I· •· I 









oo·o oo·o oo·o 
I r.v·o 
I 
OV"O I w"O 
oo·• I 




oo·r I oo·r I cG·l 00·1 
Ir 
I 11 .... oo·o II oo·o I Io oo·o 
oo·o oo·o I oo·o n·i os·o 
oo·o oo·o I oo·o ..... , os·o- oo·r I on I co-r 11 oo·e oo·e I 
I 1 l)"l· I. l oo·o 
" 
oo·o l)"J-l £1"9 
oo·o I 00"!· os·r· I ff\-' CS"O-I u·o I 
OS"0· 11 ll"O I I I oo·o I (("0 I IS"t I E.-0·1 
on I WI I WI I I oo·o I oo·o I oo·o I 
,;o·; •r UU'i II vu·s I I 0 UV"i-li vu·l Ii uv·; I I 
lo..°1 
0 I l I I I OO"i ,, .. , oo·c· oe·o 
0 l 
w~ on 100-i· oo·o 
I r I I 
fotrs·. !{19'{1 . ijjjjj_...,.Hi,.._....Q"•=-+rn;;;;;wiv-ooua;:;w;;-?Aft;W;jjjl 
L·z·a ~oqsdeus 
oo·o 




Oil"il oo·o OO"il nm oo·o oo·o I oo·o I oo·• I 
on I on I on 11 co-I I 
r oo·o 1 r oo·o 1 r oo-o I Io oo·o I 
:::: I :::: I :::: '1 i :::: I :::~ 11 
on I on I on i I on ! oo·o 1 ~11 ffl·1• ff' I 
I on I on i ~~:'.· i I on ll"t I on· I 
oo o I oo·o I tc·o 11 os·o-1 ll"O I os·o I 
on 00-1 I w1 11 on • oo·o I oo·o I 
w o I 1 ;,o· L· ~ 1 w 1 I 1 i Oi.-r· f 6 W" l l 1 ;,o·; \ :==::::;:::::::;:=~-'-' 
I I I I 
I 0 0 I 0 II I I I I I I 
... , 00'( 1··· s I loo·•· oo· •· oo· s I I 
I 0 0 I I I I 
100· 1 00· 1 I 100· 1 oo· £· I I 
I • ' 
I 
9v1 


















I 1 I oo·o I n·i 
(0" I· ff( 1n--
IO"O-os·o-•••• 
oo·o oo·o oo·o 
oo· t-f o O"I• I 
os·\ ffO-
:!:: I os·o 11 u·o· ~~:~ . wo oo·o Io UV • I u oo·; 110 on-lo 
I I 
loo·s-
I I I j 1 I I I 
00· 1 IOO"E· joo·o oo·o 
I I 
I I I 
Ot"J I oo·o oo-o 





oo·' f 1 Oi"i 
I I I oo·o I 
et:t.JT -p.zrtpwn•wuw mw ... ,, __ ?FT .,.,__ 
G·i·a ::t.oqsd"eus 
p:i 
m oo·o oo·o 
I 
o I o 
oo· o oo· o 100· o 
l :::: 'I ::~: 11 :::: I ::: l I 00-1 00·1 co·1 00·1 I 
l 1 oo·o 1 oo·o I 1 oo·o I o n·z ~--.0-.-, .... ;--~-.-•. ~, 
I oo·o I o.i·o I wo-J os·o-I 10·0 I 
I on I 00·1 I oo·o I 00·1 I oo·o I l1 oo·o 11 w'·IO ffl·I o ffi-11 ff' J 
ffJ os·1· fft oo·o 11 En· I 
..... ,. 1 ffO os·o· I ll"O J wo I • 
I 





I 0 I I I ::=.=;1;::=1:::;;;:=1~--....J 
loo·• 00-1 loo 1 I oo·z· 100"0 loo·o I I II I 
1-1 -, +-1 -, _,..r __,. 1- 1
1 
-. -+-1 -. ~r__, 
:oo·s· 00·1 I 100·0 l•o·o I 
I I I I I 





oo·o j 1n I u·i. oo·o 
9["1 I 01 ·o 00·1 I oo·o oo·o 
OS"I• I [O"l• i • l["I· 
I 
os· ,. , (O"J I , .. ,. os·o-10·0 u·1 
I oo·o 
I 
oo·o I on j 























































on 1 Ii" 1 lvi7i-l 
I 00"0 I )£"! I 1 Ol"I I 
00·1 I 00·1 ! oo·o 11 oo·o I j, oo·o 11 €0"1· o L€"1· I 11 n·i I ~..;..;......;;.....~.....;;.~~~ 
w~ ~·s a·• w~ wo 
1
1 







oo·o j oo·o I oo·o I oo·o oo·o 1 
I 00. ·-1' o I • [["9 •• , l9'1• 
~,--!-t-.• -.~j--c-s-·s-..:il---,1-.-,_~l I oo·o I oo·o j oo:o i 
I ll"o I os·o I u·o-I I oo·o oo·o I 00·01 I oo·o M"O oo·o I oo·o oo·o I oo·o 








o 100·9 oo·; oo·o oo·o 
11 
I I I I I I I I 
lw1 ~o·" I 
:··· 0 
1)0' 0 I 
I I I 
i l I ' I 
!,,,. (I I Inn· n I 
;;,:;;mr!J'fr:mxrITlllizl:lr=ll.a:uo:·~·~?~~-1pa_ :...niii~ 
~~!!~~~}!!~·JP·'~·ll~~-~t~ll~~ 
1 o .~~, I t.l. co; 
I o I o I 
I---' 
. I ... 1 I .. 1 I v .uv u .uu1 I 0 .uul O.Ovl 
i 0 0 I ! I I I I 
I I 111 I 
I 0.001 0.001 0.001 I 0.001 o.ov 0.001 
! I I I I ! I I ' I I 
I uo 1; 1.0• ol·uo ol I s.oo ol !.oo •I uo •I 
I Ii.vu t o.uo I o.oo I I o.ou I o.uo I o.oo I 
I :·:: I :·:: ; :·: ! o.oo I o.oo -o.n I 
~ o.oo o.oo s.,7 
• ,_]) i J ·t.,7 0 
0.00 • o.oo 








o.oo 1 o.oo 0.00 
o.oo I u' O.H 
13 .~7 I.SO 
o 1-1.03 o 1-uo • I 0.00 0.00 
0.10 t.1' 




I I ' 0 I 
I 0.0•1 
~ 





~ I 6.olul o .ool 
I o 
I 
·~001 0.001 0.00 
~.oo 1
1
1 7 .ot o -1.00 o 11 s.oo o I !.oo o ,.oo 
o.oo •.oo o.oo o.oo I o.oo •.• o 
o.oo o.oo o.oo I o.oo I o.oo 1.00 
0.00 0.00 0.00 0.00 0.00 0.00 
Snapshot B.2.14 
,.)) I 1-t.'7 0 -t.'7 0 I"'··· 0, ,.]3 • 
o.oo I 0.00 I o.oo I 0.00 o.oo 
1.00 t.00 o.oo o.oo 1.U 
0.00 0.00 I o.co I 0.00 I 1'.SO I 
X31 
X22 
o I -1.03 o I -1.so 1 
I c.co I o.oo 










91 · z ·s: -=l.Oqsdt:ms 
I l I 
Sl' 11 OO't 00'0 
Otl'O 
I 
o.i·o ' oo·o 




I oo·o co·o 00'0 t OS'l• 0 CO'l• 0 l('I· I n·i 
oo·o 00'0 oo·o 00'0 oo·o 
Ot't 00·1 00'0 00·1 OO't 
oo·o oo·o oo·o oo·o 00'0 
0 [['' o oo·,. 0 l9'1· 0 l,.,. • [['' 
00' 0 oo·o oo·o 
I oo·o 00'0 00'0 oo·o oo·o oo·o 
I .. .. 
oo·o 00' 0 oc·o 
oo·o 1 oo·o oo·• 
oo·o 1 00'0 
Io 
00'0 
o ,;o·i· o UV' L oo·; 
0 














O'I 00'0 I 00'0 I oo·o I 
oo·o 00'0 I CO'O I I oo·o I I I 
I• OS'I· I 0 co·1· Io ff I· I 11 U'I 
I co·o I co·o 1 oo·o 1 oo·o 00'0 
I 
I 




I oo·o oo·• 
oo·o I oo· o oo·o 
I oo·o 1 oo·o 1 I 
Io [['' Io oo·,. Io ffl· ( I I l'' i· I [['' 
I 
oo·o oo·c I oo·o oo·o oc·o 00'0 
oo·o oo·o I 00'0 I I 00' 0 oo·o OO't 




vu·~ I 0 U'1'1i I 0 uu·s 





0 I 0 
I !woo loo·oo 
I 0 
00' 0 loo·o loo· t 
i 
I 0 0 i 0 I 
loo·o oo· o 
I 
oo·o 00' l\ I 
I I 
' 0 j 0 
:{I~ (I ' ! l,, ~· ' 
~i,1riTrn=;<.im>11&mmm;rirna::-~1Jln~~;;;;;§rmn:=""=!r.iillW3 
6v1 
IJi1EaiU» Ll&.811tlI«JJlL ·lt\!1.t.fi?t*tf?W.M3JUM lf.ll!IWfmi.I 
~-
I . I •.• i 
xin Delay Cell 
Buffer Blq 
I •. ool .... , 
Cl, C2, C3, C4 . . 
0 '-~-.,..---f-:--r-~Mout 
o.ool o.ool o.ool o.ool o.oo 
o I o I o o UOj C2 C3 C4 .~'' 
0.00 
0 
~ ~,, 2 Arrays i --, 
R 1 Mode A o.ool I e.oo 






o.ool 0.001 0.001 o.oo 
0 I 0 
I I . o.ool o.oo o.tol o.oo 
0 I 0 I 0 I 0 ! I 
o.ool o.ool 0.001 •.oo 







Snapshot B.3.1. Simulation of an L-tuple arrays 
system solving example (A.3), with L = 2. Note 
that n = 4, w = 2 and therefore m = 2. 
~--,~·-·~~ 
Cl, C2, C3, C4 
Cl, C2, C3, 
Mout 
xoi.lt 







Cl, C2, C3, C4 
r: 
o.oo i I o.~ol o.ool o.oo o.oof o.oo 
o .oo I o o o o o 0 ···I 0 ~ --+-~·~-·~·-41~--~ 
 1-••• 
0 .00 0 I I . I I 
<
i:::: 0 00 o.oo o.ool o.oo o.oo o.oo o.oo · 11 0 .1 0 1.1. 0 t o .oo I I I • I I • I • I I 
0.00 
8.00 0 
~i :::o::.oo=~' ~~ 





I I o.oo 
I o.oo 




I o.oo I 
I o.oo I 
of7.Go01 
I o.oo I 
I 
1 e.oc I 
I o.oo 





0.00 0.001 0.00 0.001 0.00 0.00 
0 0 0 0 0 0 
• I 
I o.oo o.col o .oo I o/oj 0 .001 0 .co 












. I o oo·o oo·o 







o I • OG"O oo·o 
I 


































oo·o 00"0 oo·o oe·o oo·o oo·o oo·o O)"O-I 
oo·' I oo·u I 





I oo·o I oo·o 
I WO I 
lololo o o oil I 
I I 00 .. 1 j I oo·o I oo·o 
loo·o oo·o oo·o oo·o oo·o • 
1--~+-~~!.._~+-~~~-+-~"'" ~lo~o~o·~c...i..!o~o~o·~o~ 
I o o I o o o o I I oo· o I oo· o joo·o oo·o ,
1oo·o oo·o oo·o oo·o 11 oo·o I on 
I oo·o I oo·t 
I o 
oo·o oo· o 100· o oo·o oo·o 
oo·o 
o oo·o o oo·o 
I I 
o I I 





; on I I on I 
I oo·~ I 
I
I 0 wo I 
oo·o I 
I 
wo I OO'SI I 
1: oo·: I 







I o o o o 
loo·o oo·o 100·0 oo·o foo·o 
o o I o o 
oo·o oo·o oo·o oo·o oo·o 
I o I o o 
100·0 oo·o 100·0 oo·o 100·0 
0 
oo·• oo·o loo-0 
, ntre1;;a;; i ; 
I 0 ft 
IGG'O oG·Q •oo:o I 
! 0 0 I 0 
ioo·o oo·o loo·o 
I 
I o I o 




GO' 0 •oo·o 
0 I 0 
cc·o co·o 
0 0 





oo·o oo·o I 
oo·o I on I 












Io on I Io 
os·s-
s.:· o I 
oo·• I 





























I oo·o • 
I oo·o I 
I 00'0 I 
Io 00'0 Io 
I oo·o ! 
I oo·o I 
I I 
I oo·o 
Io oe·o 0 








r uu·u I II ,,,,.(1 .. I .... i 
I V> •• i 
I oo·, ~ 
~:on ~1 
I • 1 ii 
100-.. Joo·! 
I I 
' . ' 
I " ' 
100 ,. I 
...---. 















uu·u I oo·o 
1 it·u I 





o o I o 






























00· 1· I 





















~· . . . ~ . ...., ~ . . ' ... .. .. . 
0 I 0 0 I 0 
00'0 00'0 •oo·o GO' 0 •oo·o I 
lo/c 
0 I o 0 I 0 
:}0"0 lcc·c cc·o .... 0 , .. 
0 0 0 0 
I ' loo·o 10•·· oo·o oo· o 100·0 
c 0 I ... •. 
' l···· f ou· u l•i o- 1uu u uu· u 













oo· o co· o I 
I 
oo·o oo·o I 
oo·o Io oo·o I I 
oo·o 00'0 I 
co·o cc·c I 
I 
00'0 00'0 I 
oo·o 0 oo·o I I 
I 00'0 oc·c 
I 
I 00'0 .~,. 0 
I ... , M·o 
f' 
Iii' i-~ u v;;,-u 
01ri oo·o 
I 
.wo· I ffO I i 
00'' I oo·t 
~I oc· s· Ii oc· 3 
I o i I 
100·1 oo· ?· I 
I I 
i 0 1c·o·; ! 
:___j 
::ui;;;;iill!U .• Wi'.Oiiilll:r:~FJ?!i'·~fll:·pp;gwmiil!iriEila '*-;;+;•<Zttilil-
-
0 0 0 o I o oo·o oo·o oo·o oo·o GO"O 
! 
• 0 0 o I 0 oo·o oo·o loo·o oo·o 
Ion I 
0 l 
' ' I • oo·e oo·t oo·o oo·o 19£"0 
i 
l 













ffO I ; 
oo·o 






I oo·o I 
I I 
I oo·o 




oo·o I os·s-o 
onl 
I Ll"O I I 









































I • I 0 I • I • I 
lco·c •oc:c le;· c ...... loc:c 11 oo·o I oo·o 
Io I' 0 0 0 I 0 0 
oo·o oo·o 
1
00·0 co·o Ion oo·o 
I 
• 10··· oo·o 100·0 oo·o 100·0 19£" 0 
l 







! 0 oo·o ! 0 
I cc·o 
I I I oo· o 
• I 




11 ffO I 
I I I 
I oo·o I 
I. CG. 9· l; I• 
I v 
100·1-00·1-
I I i 
I . I 




















• • I loo·o •••• I 
I I ! 
I 
oo·o I oo·o 
I 




Ion oo·o Ion oo·o Ion oo·o I oo·o I .... Io oo·o Io oo·o 
I./. f c..°o 
I I 
0 0 0 
• 1 I oo·o oo· • oo·o oo·e oo·o co·o 
I 
oo·o oo·o 
oo·o .... 0 09·0 oo·o 
I 
oo·o 
I oo·c I I 




0 0 0 0 
!s.:·o li'O-.:1·0 ffO ll'O ll'O 
I 
oo·o I oo·• I 
oo·o I oo·' I 
l iVi'S· ll oo·o 1 on-1 oo·o 
I oo·o I on-I I OO'l I ... , I 








olo •1• o oo·o loc·o oo·o ioo·o oo·o 
! 
os·o· s.:-o ffo· .:1·0 ££'0 .:1·0 






I oo·o I oo·o 
I o~rc 
' oo·o I I I I I oo·o I oe·o 
I 1. oo·o 
I' 
cc·o I, 
I I oo·o I oo·o 
I I oo·o I oo·o 
I I I 
I oo·o I oo·t 
Io oo·o I• oo·o 
I oo·o 
I oo·o I 
i "·o-I •••• I I oo·o I oo·o 






I oo·o oo·' 
Ii oo· J· I l co· t 
I 0 (I I 
100·0 oo· t I 
I 
I I 
' 1<10· (I , __ , 
__ .. ..r._ ..... a 
9 0 0 0 
!on oo·o Ion oo·o loo·o 
lo/o 
0 0 0 
oo·o oo·o oo·o 00"0 
• o I o 0 0 ,[. 0 oo·o joo·o S(" o-oo·o 
i 
0 o I o 0 ' 
!ff' ffO jll"O ffl o.i·o i 
I 0 0 
'en co·o oo·o 
I 0 0 
00"0 oo·o oo·o oo·o oo·o 
• 
0 
loo·o 100·. '["0 oo·o 19!" o-
' 
c L.o. 







I oo·o oo·o 
I oo·o oo·o 
oo·o I oo·o oo·o 
Io oo·c oo·c 
0 oo·o oo·o 
oo·o oo·o oo·o 
oo·o oo·o 
o oo·o o oo·o 
oo·o os·o-
I oo·o I oo·o 
I I on I ' l 
oo·o I oo·, 
I· 
uu· v sn 1 
oo·s 







I oo·• oo·, I 










lei Ion I 
I I 
I 0 0 I 
loo·o oo·o I 
I 





11 oo·o oo·o 
I I 0 oo·o • 0 oo·o 
I p 
I oo·o I oo·e 
I 
oo·o I oo·o 
I 
I oo·o •••• Io 




I oo·o I 
I oo·o I 




r (1(1:'(.'l I I I us· v-1 





1oo·s 00·1 I 
I ! I I ~ 









0 0 0 0 0 
ion oo·o •oo·o oo· o 100·0 
I I 0 0 0 0 I 0 
Ion oo·o foo-c 00"0 •oo·o 
0 0 0 ' 
, 
oo·o 9["0• oo·o oo· o oo·o 
0 0 ' 
, • 
.:1·0 .:ro awo o;·o· ;,·o 
i 
- .. '"§HMW~''WW 
oo· o oo·o lo/o oo·o loo·
0
o 
loo·~ G I 0 0 I 0 oo·o fon oo·c Ion 
I 0 loo·o 100· 0 oo·o u·o- oo· o 
0 0 0 ' lffo IL1"i1 ILi"O ilo"O l•n-
At1rt11D'!f S}~i~·W·I·•· 
LSI 
91°£ 0 8: ~oqsd-eus 
lo! 
Joo o J 
I 0 0 I 
loo· o oo·o I 
I 
I oo·o g oo·o I 
I co·o I I 
0 
Io 
I oc·c I 
oo·o oo·o • oo·o I oo·o Io ce·c I I I I 
• 11 00·1 I 00"0 I oo·o 11 oc·o I !n I 
I oo·' I oo·u I 
~l oo·o ~I H"E I 
n·t L9"0· 
o/. I 
Jt"O oo·o I 
oo·• oo·o I 
• oo·; 1 o ;n I 
~
' lt"O- I ~~:: ; :~:~-I 
I oo·o I oo·o I 
lo oo·o lo on I 
00"( 00"1•1 
I 









I o o I 
loo·o oo·o i 
! I ! 
I oo·o I I 00"0 I 
I oo·o I 
Io oo·o 1. 
I c 
I oo·o I 
11 oo·c • I I oo·o I 
Io oo·o t-1 
I oo·o I 
I I 
I oo·o ' I. oo· ! I OU'i N. I' 1 • I oo·o 
11 
ffO· l 11 
I oo·, I 
Is c:ro Io 
I o 0 i 
100·1 OO"t· 1 
' ' ' I • 
1VO ( 
'--· 













I ..-; I 









~ I SI 
" oo·• oo·o 
oo·o oo·o 
oo· o oo· o 
ot·o oo·o oo·o oo·o oo·o oo·o oo·' oo· o 
ffO oo·o 
t o I ' ' 










' ' ' • 
0 
oo·• oo·o oo·o oo·t n·o oo· o 
' ' . ' 0 
0 
oo·• o;· o-I•« o fft· lll·o ffol i i 
I 
oo·E I ln-1 




oo· o . 
OO'S si-; I 
oo·o t'-·'--I 




Io oo·o Go·; I 
I 
'DJ" -Wfi;pr;;;;; .. fffj.iiWi!ji5jjii.,, ...... 
ocrc 
o olo olo' 
loo·o oo·o loo·o oo·o loo·o oo·o 
I o 
,,1·0-oo·o loo·o oo· o 100·0 u·o 
0 






















I OO'I ffO· I I OO'o I oo·o I 
I I oo·o I w• I 
'ioonl"'~~1 I oo·! (('O . 
11 ••· • ~ ii• l I oo·o I oo·o I 
I ••.•.• N.0 ... ,.... ·· I 
~
I o u I 
100·0 oo·•-1 
I I 
' • I 








, I , 04·· oo·o oo·o oo· o oo· o 
0 • l ' I • oo·o oo·o oo·o 
ffO I" ·o 
' ' • 
0 0 
00·1 oo·o KO oo·o oo·o 
\ 
' 
0 0 0 
:iCO ffO-ffl •••• Ll"O 




















































\I t I 
oo·o oo·o I 
i 
oo·o I oo·e o~·c 
lo/o 
0 
lo/o f oo·'o lo/o 
11 ts· s-I 
oo·o oo·o 11 oo· o I oo·o 
I I e ! 0 
I 0 0 I 0 I 
' loo·e eo·o loo-a I 00"0 El"O '1"0 
loo· o I ' 
0 
oo·o oo·o "". ••••• oo·o 
' 
I 





SJ"( I oo·o 
sn • ffO I 
oo·o I 00·1 
t !I"! ! 16"E 
I in-; I 
I 91·0-1 
I I l)l)"(t 
Io I. l}ij''i u 
oo· £· 
Ll 'U I 
09'0 I 








I '' I 





I ii" j 
oo·o I 
uv· u I 
oo· l ! 
oo· ~-I 
-~u~aiiiiiiiliii&&Unm;;_."t_..~,.w.•1•.. i%ftiJifii2ii ____ , 
ft 
!oo-'o !co:o co·o 
loo·~ ' 1,/0 U-0 
• 
0 0 
9(·0 oo· o oo·o 
0 0 0 
,,.0 ££°0 1£1°0 
I 




' Ion oo·o lrn 
' ' loo·• 
w I C 
tc·o loo·o 
I I 
! ' c ! 0 
ltn-,;1·0 1£i·u i i 
091 
zz · £ ·s: ~oqsdtms 
1.1 loo·o I 
I 1 




oo·o I oo·o 
,:.9· /. 
I oo· o 
00'0 I 00'0 
ffO Io 00' 0 
0 0 0 






' £1°0 oo·o 






ff• ,;1 ·o 
0 oo·o ! oo·o 
00'0 I 00'0 
I 
8,-0 
00'0 I 00'1 
t U'I 1, u· E 
00'0 00'? 




I. uv· Lt 1 vs·o , oo·o I I 
s~·s-I l'''t~ 
sn I .ffO· 




' oo·o I 
• 
o~rs Is oo·• I 
• 100·' 00·1 
lo/£ I L_J 
TZ • £ ·s: ~oqsd1ms 
I • • I 











I eG' 0 
I I. oo·' I. 
IL 
vu· u 
r os·, I I 
····-' ! ' 00'' I 3G. ~ I. I• 
o I 
foo·,;. oo· • 1 
I I I 
I .. I 
I • I 
100· .( i 

















0 I 0 I 0 • loo·o ffS· lfftl· 00'1 1'S' L oo· o 




0 o I o ' I , ' oo·e u·o· 100·0 00' 0 100·0 000 
0 0 l 
' • ' Ill' i Lt·o U(,-V O:l'i· :IL'O L:I' O· 
I 
vz • £ ·s: ~oqsdtms 
I • • I 
loo·o oo·o I 
I I 
I 
00' 0 I 00'0 
I I I OG"O I oo·o 
I 11 oo·o I 00' l 11. E?·o I, ll'll'l"I\ I 
,. H. I 
I I 
00' 0 I ,,.,.I 
co·o I tT'l'I I 






85'1· ff I 
!E'O I 00'0 
I· 
oo·~ I oo·o I 
ii··· I. UU' LI~ LI' I (('; I 
ll'li I ii"O I I I 
I 00'0 I 00'0 I 










oo·o I oo·o I 
• 
I 00'0 oo·o 
loo·'o ' 1,/s. 
I I I 
000 ff El 
~·· 
o·i. I, 
00'0 I 00' 0 
I !?'O Io 00'0 I I 
I 
' 
0 I 0 0 I 0 
£1'0 .!t·o l!O"C ffC l!s·o 
I o 0 I 0 
oo·• 00'0 !!E' 0· 00' 0 
0 0 I o 
' l£i'. LI 'O jff 0 iiO'O lus· o-






I I oo·o I 
I I oe·o I 
I I I 
















I I 00'! I 
I' 




oo· o I 
I I 
i i 





00' l I 
oo·; I I 
sa · szi 
lE'O I 
I 00' I 












' oo·o Ion 
3 3 
• I oo·o oo·o 100-0 oo·o l'C"O 
' ' . Ion os· o-f s,-o ffo-1.:1·0 
[fff __ .....,.,..,_...,,.11!'*'"' 
I 
o o I o o 















0 15· 1-0 
I I ""'·I oo·o I 
I oo·o I 
I Io 
ff:i· I· 1 I ll"r, 
I I ,, .. l 
I oo·o I 
I c oc·' I G 

























3£"0· I M"O I 
••. LI I 




., ..... jjjj 
Sl"£"S: -;+oqsd-eus 
...----.., t I
oo· o I 
I I 
I ••. •• L.'o i ! I I 
roo:o1~ 
I oo-o I oo·o I 11 I 
I I oo·' I oo·' ! 
, .... ,,,.,_, 
' I 
oo-o I I 
11 
WC· I 
...__...__.___.___.__...____. I ;::: ,. ... , 
'S'J· I 0 





I ' jtt"t-oo·o oo·o oo·o •••. 0 
0 
' • ' iffo lo;· O· :ii" u j;;· O· ov·.i 
!["0 







I. it• i· u 
I 14'l-







vu ' Io 
100· 1 00· 1 
; ~; I 





~~· ~1 i 
-~li!Q!._ __ ...l!.!1!f.'@l:D'!"'fIIl:!'Wcl·Fhi•ai&illmtfl'lJW===~~m 
£91 
0 




oo·o oo·o ffO £1'0 
100·• 
' 0 0 
:il'I £;'0• i:t'O ii"O l<l'O 
~~~~~~~~ 
... : ..... 
I I I I I 
·· a z · £ • s: :+oqsdtms 
I on-i oo·n~ 
I ff(I~ cc·\ I 0 
ffs-' 1 eo·o I oo· o I 
Io O!"O Io !n-1 




I u·o I 
I oo·o I oo·o I 
Io ~n-lo snr-1 
ffl-; cc·o 
oo·o I oo·o 
oo·o I 00" l '["0· 
si·;-1, illl"O 
oo·o os·o-
oo·o I us-o-I 
oo· t I ... , I OO"i 
13 oo·: I I 
I t) 









lo/o I I I 





0 t 0 
lc/c ' Ion ltn Ion Ion lcc:c 11 oo·o oo·o I I I 0 !!'1-I 
I o!·o 
0 0 l I 
' 
0 I 01·1-SI'S· I 
tn oo·o oo·o oe·c ln·o !n I io·o ffO I 
oo·o 00'0 J 
0 95·1-0 59· 11 I 
ao·• ffl I 
I !!"O· oo·o 
I 
I I , \ I o 0 I 
loo·o oo·o WO 100· 0 Ot'O 100·. ,[ .. 100· 0 . .. -' 
I• 
Sl' S· ~ ti uu "., 
' • ' 
0-1· oo·o 
Ion-Sl"O lffo-41·0 €["0 41·0 I £1·• I WO I 
I oo·o I oo· t 
lo oo· c: 'l OG' V 
I o 
jOO"[ oo· I 




ioo:o oo·o oo·o 
l 
joo·o ffO ffO 
• l'C"O OO't 100·0 
0 
li:T ·u ffO •'1' 0 





' ioo·o I oo·o (!'0 
100· 0 9[" 0 oo· o 
I ... '. 
0 c 












0£" £ ·g ~oqsd'eus 
lo! 
lonii 












11 oo·o I 










I oo·o 11 oo·, I 





I tl o I 
100· i oo· s-I 
I I 
I 0 I rj 

















j I I 0 0 I 
,o,·s-1on1i 
OS'• 0''0(~ 
oo·• t3" l. I 
I 
oo·o 00'0 I 
0('0 93"1-
oo· o I 
U'O oo·o I 
I 
00'0 oo·o I 15'1-0 s~·•i-
oo·o 00"0 
oo·o I oo·o 





Ota' O Sl'l 
I -··· I 
11 Oi'ii-1 >L U I 
oo·, I 
3 GO'S 
'· I o o I 
100·1 oo·• 
I ! l (i I 
JOO ;: i 
, __ : 




Z£. £·a -=l-OQSdt:ms 
/./o I I" I 
I o o I 
oo·o oo·o I 
I 
oo·o I 
WO I I 
'olo loo 
1
00·0 ffS· 1,,·n-oo·• ""' oo·o 
lo ojo 010 o 






u-o I I 
oo·, I 

























o I o 'l'O H'ft . !". 
0 0 
1.:1 ·o l.:n I I 
o I o l3"U 00'\ 
l 
o I o 
ffO lffO I 
1 £ • £ ·a -=l-OQsd-eus 
~J 
'····· , .... , 
lo-t--;-1 
1,··o-loo~o I Z17x --I I ! 
oo·o I oo·o I 
on I oo·o I o I I 




O O!"O O 9!"1· 
oo·o oo· o I 
oo·c cc·c I 
I 




oo·o I sn 
oo·o I ""0 
00"! \ oo·' 
' 
Vi' ti • Si' i 
""' 
ff+-I 
, .... r··1sn I I ffu-1 .... I I I ... I I I I 00·9 I oo·o I 
Is vv • ~ 0 
I 0 
100·' 













' '~.-o en oo·o ~·o ffO 
' ' \ ' 0 loo·o o;·o-ls«o l.:;-0-l.:1·0 Ei"O 
?i5i •p.u;w•E£Hi53i;;;;; 
I 0 o I o 0 I 0 0 
l~n-l3"H1~a·~ LS" l 100·0 CQ' Q 
0 o I o 0 0 ' 
u·o ffo 1u·o oo·o oo·o oo·o 
,,.,_ oo·o loo·o oo·o lo/o ![" 0 
\ 
























oo· o I 
S("O-I 
oo·e I 


























ff~ Io SI" I 
tt•' I ~t·? 
I I 
I· 
oo·o I oo·o 00"0 oo·o 















0 .00 o.oo] 
0 
0.00 7, 2.0C 0 
7.00 0.00 
o.oo I 0.17 
•. 00 9.33 
I 7.75 o; lt.5~ 
! .;_.;.; 1 ii.uu 
I o.oo I -o.u 
I -o.33 l-u1 
I. SS 0 ,.21 
0.00 0.00 
I o.u I o.oa 






0. 17 0.33 •. 17 
0 0 0 
0.00 o.ool o.i, 
0 0 ! , 
0.17 o.n, o.oo 
0 ' 7 
i-G.35 ol 1.17 
I o.oo i o.oo 
~-S.t7 
! 
1 c .eel c.oof c .ec 




-0.,7 0.75 -0.50 






I c.cc, 7 .,7 uc, 
o I o • i 
I x --+=2~131 0.00 








I o.oo '~ I LOO I 7.00 
, -:::: ! uo r;-;;;;-;; 0. 00 
1 •.oo 71
1!.50 ol 
I 7 .00 0.00 o.oo I o.oo 
I 7.7S I 3.33 I 
I I.IS 0 I 9.21 •I 
0.17 
0 0 
-o .)6 0.00 
0 0 
o.~~ 0.17 -0.,7 t.751 
0 0 ' 
, 
0 .00 0.3' 0.00 t.00 
0 , ' ' 
0.00 , 0.00 I ~-~-~-~------.-~ 
I 
o.u 1' 0.'9 0.08 0.17 0.13 0.00 0.00 0.00 
0.'5 O.Jt 0 0 i 
'1 ·US 0 '1 1.17 0 ' I I 0.00 0.00 I -S.(7 o.oo, 0.00 0.00 0.00 7.,7 
I l u •11 u o u 















I • 0 l ff£ oo·o o I o loo:o 
00'0 •oo:o ffS· lfftl· on I 
I I Io 
oo·o oo·o 
.!I"f 0 S!"O· 
loo·~ ' 
0 
o I o tl"O ll'O 80"8 ffO : 
I 
0 
0 ' 0 
,,["~ 0 
06"0 00'0 9£"0· 100· 
0 0 0 0 
' £1"0 [["0 £1"0 £1"0 oo·o 




' loo-o oo·o lffo £1"0 so·o I 
I 
9 • I 0 0 I 0 100· 0 ,c·o 100·0 00·1 IKO· 
' 
0 . 0 0 
ld.U~ I ff ii 1..-0 .:1 ·o Ill"• 
• I oo·o oo·• u·o 
I oo·o oo·o oo·o 00·1 















I 0 I 















' I I I. Wl I. 
IL 
iiv·v 
r oo·o I Oi'O· : 
I 
I oo·, I 
/ s oo· o : . I o o 100·0 oo·o 
I ~ I I 





















~. I I I 
I 
oo·o I oo·~ I 
I
I oo·o 1 oo-o 
1 
oo·o I oo· o 
, I o 
ioo·o oo·o lffs-fftl oo·• "°' I 0 ff I I 0 st" O· I 
' o I o o I o o I I oo· o j s1·1 I l~n ffO l~n u-e 
1wo on I I oc-o I oc·o 1 ~-~-----~-_._ _ __,_~11 oo·o I oo·, 
0 H"( I, WO 







I oo·' I oo· ~ I 100·0 oo·o 91·0-oo·o oo·o oo·o :i •Hi°il 1 • uu·o I 
;--c~--c-0--0--1----.g.--,--1--."""" ~I oo·o 1 oo·o 1 
l..-• 1!1·u l;i-o ut·o fos·o-s.:·o ,; o-1 ffli 1 
~-~-----~-....... ----'-~ I oo·, I oo·o I 
i 3 00 0 I 0 OGO I 
APPENDIX C 
SAGS PROGRAM LISTING 
SAGS was developed on an IBM Personal Computer, 
running the MS/DOS operating system. It was written in 
Turbo Pascal, a dialect of the Pascal programing language as 
described by Wirth and Jensen in Pascal User Manual and 
Report. The source code of SAGS is listed in this appendix 
along with a sample script file. This script file 
represents the simulation that produces the third series of 
snapshots in Appendix B. Input and control files are also 
included. The source code of SAGS and many sample script 
files are also available in ASCII format on floppy disks. 
To produce an executable copy of SAGS, two software 
n 
packages are needed: a copy of the DOS-based Turbo Pascal 
compiler (version 3. O) and a copy of the Turbo Graphix 
ToolboxT" (version 1. 07), both available commercially from 
Borland International, Inc. Also, since SAGS is graphics 
based, a video card with bit-mapped graphics capabilities is 
needed to run the program. The included source code is 
written for the EGA standard; however, simple changes can be 
made to the program so that it will run on other PC graphics 
standards. Entry points for these modifications are fully 
documented in the source code to ease that task. 
171 
Because computer graphics and simulations are 
floating-point intensive applications, the use of a numeric 
coprocessor is highly recommended. For SAGS to take 
advantage of the numeric coprocessor, it must be compiled 





* * * SAGS is a Systolic Array Graphical Simulator program for a recon- * 
* figurable set of arrays of processors. 1he Max nuniber of arrays is 15. * 
* This is because of the limitation of Turbo Graphix, not of SAGS. * 




( $1 c: \bin\ tbl \ tbgraphx\ typedef. sys} 
( $1 c: \bin\ tbl \ tbgraphx\graphix. sys} 




































(include the graphics system code} 
(include others of SAGS modules} 
begin 
PromptUser; 
if not Read.Script then 
begin 
close(ScriptFile); 



















if (Ch=#27) and 
keypressed then 
read(Kbd,Ch); 
with CurrntPtrA do 
case Ch of 
#13 : MultiStepsExec; 
#27 
#32 : SingleStepExec 
(IOPtr); 
#59 : ChangeColor(-1); 
#60: ChangeColor(l); 
#61 : HardCopy(False,l); 
#62 : WriteScriptFile; 










{gets script file name} 
{reads in the script file and build} 
{system's internal structures} 
{init. the graphix system and screen} 
{sets aspect ratio for true circle} 
{defines the shared world} 
{defines the shared world} 
{establishes system default} 
{drawing color} 
{don't error when window edge hit} 
{read the keystroke} 
{one more char ?} 
{RETURN ? multi steps execution} 
{until a key (any key) is pressed} 
{ESC ? waits for end of current loop} 
{SPACE ? single step execution} 
173 
{Fl? changes to last drawing color .. } 
{F2? changes to next drawing color .. } 
{F3 ? prints the screen image} 
{F4 ? writes updated script file} 
{up arrow ?} 
{then moves current window and .. } 
{stores it with new position} 
{left arrow ?} 










#73 : SwitchWindow 
(CurrntPtr,O); 
#81 : SwitchWindow 
(CurrntPtr,l); 











while IOPtr<>NIL do 
with IOPtr" do 
begin 
end. 
case IO of 









{right arrow ?} 




{for any other keys .. } 
{screams at 1000 Hertz} 
{for 3 tenths of a second} 
{then shuts up} 
{ESC char exits program} 
{sets foreground color to black} 
{before exits} 
174 
{gracefully shuts down graphix system} 
{and the IO system by .. } 
{closing any active input file,} 
{and flush internal disk buffers .. } 
{of any output files and closes them} 
175 
( * 
* * * This is the header file of SAGS. It contains all global definitions and * 
* declarations of constants, types and variables. All of SAGS data struc- * 
* tures are explained here. Be sure to include this file at compile time. * 
* * *AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA***) 
const 
TimeUnit = 0.000001; 
Defl tColor = 13; 
TextSize = 8; 
FirstWorld = l; 
StatusWorld = 2; 
MaxArraySize = 5; 
MaxSequence = 5; 
MaxFileName = 64; 
MaxStr = 10; 
MaxWord = 12; 
MaxLine = 45; 
MaxError = 16; 
MaxBox = 4; 
MaxRegs = 4; 
MaxCodes = 12; 
MaxTxtCoord = 5; 
MaxBus = 2; 
CharSizeX = 4; 
CharSizeY = 6; 
Digits = 6; 
Deciml = 2; 
Gap= l; 
WrldCoord.XY = 1000.0; 
StatWorldX = 79.0; 
StatWorldY = 12.0; 
MaxRadRatio = 0.023; 




I SYSTEMSPECS I ' 
'INFILES', 







{execution time for each step} 
{default drawing color value} 
{max no. of char displayed in PE} 
(world shared by all windows} 
(world used by status box} 
(max no. of PEs/array side allowable} 
(max no. of procedures in a script} 
(max no. of filenames} 
{max no. types of script statements} 
(max length of statement} 
(max length of error message} 
{max no. error message} 
(max no. boxes in status window} 
(max no. registers of one type in PE} 
(max no. of PE executable codes} 
(max no. displayable text lines in PE} 
(max no. of pair I/O bus on each side} 
(character size in pixel} 
(character size in pixel} 
(no. of digits of value displayed} 
(no. of decimal places} 
(size of gap between PEs in PIXEL.s} 
(default world coord.} 
(world coords. for status window} 
(l.15*2/100 ratio of twice the radius) 
(of the largest circle that will fit} 
{inside a lOOxlOO PIXEL.s window} 
(list of valid script statements} 
Error List 
type 
array (list of all possible error messages} 
[ 1. . MaxError] 
of string[MaxLine] 
( ' ! Bad statement ! ! ' , 
' ! Array size too large ! ! ' , 
' ! Delimiter 11 • 11 not found ! ! ', 
' ! Bad delimiter or delimiter not 
' ! IO file name too long ! ! ' , 
' ! Non-existing array ! ! ' , 
' ! Bad statement in context ! ! ' , 
' ! Statement out of sequence!!', 
found I I' .. , 
' ! Arrays are allowed to have only 4 sides !!', 
' ! Bad type of array ! ! ' , 
' ! Array number should be within 1 to 16 ! ! ' , 
' ! Triangular arrays only have 3 sides ! ! ' , 
' ! Input file not found!!', 
' ! Invalid bus specification ! ! ' , 
' ! Unknown display mode ! ! ' , 
' ! Unknown PE code ! ! ' ) ; 
(pointer to reals} 
(pointer to link info between arrays} 
(pointer to IO buffers} 
{pointer to array processors} 
{file name storage} 
{storage for text for screen output} 
176 
RealPtrtype = Areal; 
LinkPtrType = ALinkType; 
IOPtrtype = AIOtype; 
ArrayPtrtype = ASysArraytype; 
FileName = string 
[MaxFileNarne]; 
Textype = string 
[TextSize]; 
SrcDstType = array 
[l.. 2] 
{pair indicating sides of src. & dest.} 
{arrays for dataflow info} 
of integer; 
Pointype = record 
X,Y 
end; 
Pointstype = record 
X,Y 
end; 
DisplayMode = (F\J.11,Arrays,Buffer); 






{pair of coord. for a point} 
: real; 
{All of PE's texts coords.} 
array 
[ 1 .. MaxTxtCoord] 
of real; 
{mode of display of an array} 
{stores default coord. for each PE's} 
{text in an array, essentially acts} 





























{storage for all PEs' text coord.} 
{of an array} 
array 
[l .. MaxArraySize, 
1. .MaxArraySize] 
of Pointstype; 
{internal PE representation, with} 
{all necessary registers} 
real; 
array 
[ 1. . MaxRegs , 
1. .MaxBus] 
of real; 
{pointers to X_out registers in} 
{neighboring PE cells} 
array 
[ 1 .. MaxRegs , 
1. .MaxBus] 
of RealPtrtype; 
{Regs_Txt[l] is for TAG} 
{Regs_Txt[2] is for X,} 
{ ........ [3] ........ Vout} 
{ ........ [ 4] ........ Mout} 
{ ........ [SJ ........ Xout} 
array 
[ 1. . MaxTxtCoord] 
of Textype; 
{control codes registers} 
: integer; 
{for display purpose only} 
{holds PE's execution code number} 
: byte; 
{coord. of box and texts within box} 
real; 
string [ 15] ; 
177 








IOflag = (INPUT,OUTPUI'); 






















{storage for dataflow info to} 
{other arrays} 
{from which side of src. array to} 







{IO link to and from host, that is} 
{to and from external data files} 
: FileNarne; 
{links with which array} 
{to which of its side} 
{and which bus} 













[l .. MaxArraySize] 
of real 
array 
[l .. MaxArraySize] 
of RealPtrtype 
{square array of PE's} 
{upper triangular array of PEs with} 
{diagonal line from top left corner} 
{to lower right corner} 
{lower triangular array of PEs with} 
{diagonal line from top left corner} 
{to lower right corner} 
{upper triangular array of PEs with} 
{diagonal line from top right corner} 
{to lower left corner} 
{lower triangular array of PEs with} 
{diagonal line from top right corner} 
{to lower left corner} 
{storage for each box in status band} 
178 















ErrorType = set of 1. .MaxError; 
Foreground , 
ArraySize : integer; 
Zero : real; 
ZeroPtr : RealPtrtype; 
PEtxtArray : array 
[DisplayMode] 
of PEtextype; 
ScriptName : FileName; 
ScriptFile : text; 
IOPtr : IOPtrtype; 
LinkPtr : LinkPtrType; 
FixedPtr 
CurrntPtr 
StatPtr : ArrayPtrtype; 
ErrorSet : ErrorType; 
Ch : char; 
{storage for systolic array's data} 






LoY : integer; 
Boxes : array 
[1. .Max.Box] 
of StatusBox; 
Steps : integer; 
Ti.mes : real 
) ; 
( DPmode : DisplayMode; 
) ; 
PE : array 
[l .. MaxArraySize, 
l .. MaxArraySize] 
of PEtype 
{current drawing color} 
{size of bandwidth of dataflow} 
{value for PE's grounded input} 
{stores all display mode's templates} 
{always points to top of linked list} 
{always points to top of linked list} 
{always points to top of linked list} 
{always points to the current array} 
{always points to status window} 
{stores error type values} 





* This procedure initializes all global variables needed for drawing * 
* an array. Depending on the specified array size, it will find a * 
* suitable window size and world coordinates for the array. It also * 
* computes an array of coordinates for PEs' text. * 
* This procedure is very machine-dependent, i.e. graphics card specific, * 










Temp4 : real; 
begin 
with PEtxtArray[Full] do 
begin 
Mode : = Full; 
Lines := MaxTxtCoord; 


















for Xcnt:=l to ArraySize do 
for Ycnt:=l to ArraySize do 
with TextCoord[Xcnt,Ycnt] do 
begin 
for 1:=2 to Lines 
do begin 
X[I]:=Templ*Ycnt-Temp2; 
{value 22 is for EGA; 2 is for CGA} 
{computes window dimensions for} 
(a particular array size} 
(compute value of gap in w.c.) 
(compute value of PE size in w.c.) 
(compute round PE's radius in w.c.) 











with PEtxtArray[Arrays] do 
begin 
Mode :=Arrays; 
Lines := 2; 

















for Xcnt:=l to ArraySize do 
for Ycnt:=l to ArraySize do 










with PEtxtArray[Buffer] do 
begin 
Mode := Buffer; 
Lines := 2; 
PEsize := CharSizeX*Digits+lO; 
WDSizeX:=trunc(((PEsize+Gap) 
*ArraySize+l)/8+1); 
{text coord. for TAG bit} 
{computes window dimensions for} 
{arrays displays} 
{compute value of gap in w.c.} 
{compute value of PE size in w.c.} 
{compute round PE's radius in w.c.} 
181 
{compute text coord. for array of PEs} 
{text coord. for TAG bit} 
















for Xcnt:=l to ArraySize do 
for Ycnt:=l to ArraySize do 



















Ti.mes :=O. 0; 
for Xcnt:=l to 4 do 















{compute value of gap in w.c.} 
{compute value of PE size in w.c.} 
{compute round PE's radius in w.c.} 
{compute text coord. for array of PEs} 
{text coord. for TAG bit} 
{STATIJS panel window ntunber is 16} 
{for IBM CGA : HiX=O, HiY=O} 









Boxes[3].Txt:='ARRAY # :'; 
end; 






{get address of ground value} 
184 
( ~*** 




(var Cell : PEtype); 
var I,J : integer; 
begin 




for I:=l to MaxRegs do 









for 1:=2 to MaxTxtCoord do 
Regs_Txt[I]:=' 0.00'; 
end; 
{with this cell, Thou shall} 
{initialize all of its registers} 
{on all buses to zero .. } 
{ .. and all of its texts storage to} 






* * * This procedure initializes a newly allocated square array specified in * 
* the script. It is called by the procedure : * 




(Ptr : ArrayPtrType); 
var I,X,Y : integer; 
begin 
with Ptr" do 
end; 
for X:=l to ArraySize do 
for Y:=l to ArraySize do 
begin 
InitializeCell(PE[X,Y]); 
with PE[X,Y] do 
for I : = 1 to Max.Bus do 
begin 
end; 
if X=l then 
In_Regs[l,I]:=ZeroPtr 
else In_Regs[l,I] := 
Addr(PE[X-1,Y]. 
Last_Out[3, I]); 
if Y=ArraySize then 
In_Regs[2,I]:=ZeroPtr 
else In_Regs[2,I] := 
Addr(PE[X,Y+l]. 
Last_Out[ 4, I]); 
if X=ArraySize then 
In_Regs[3,I]:=ZeroPtr 
else In Regs[3,I]:= 
Addr(PE[X+l,Y]. 
Last_Out[l, I]); 






{for each PEs in square array} 
{ .. init. all of its registers on all} 
{buses and all of its texts storages} 
{with all buses, link PEs together} 
{as follow .. } 
{if PE is on north border of array .. } 
{its north input is grounded for now} 
{else its north input is from its} 
{north neighbor} 
{and so on for the east border .. } 
{except this time we have the south) 
{border and .. } 
{the west border to take care of.} 
186 
( *'>.'-k 
* * * This procedure initializes a newly allocated type 1 triangle array * 
* specified by the script. It is called by the procedure : * 
* - GetSystemSpecs. * 
* * AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA***************) 
procedure InitializeTrianglel 
(Ptr : ArrayPtrType); 
var I,X,Y : integer; 
begin 
with Ptr" do 
end; 
for X:=l to ArraySize do 
for Y:=X to ArraySize do 
begin 
InitializeCell(PE[X,Y]); 
with PE[X,Y] do 
for I :=l to MaxBus do 
begin 
if X=l then 
In_Regs[l,I]:=ZeroPtr 
else In_Regs[l,I] := 
Addr(PE[X-1,Y]. 
Last_Out[3, I]); 
if Y=ArraySize then 
In_Regs[2,I]:=ZeroPtr 
else In_Regs[2,I] := 
Addr(PE[X,Y+l]. 
Last_Out[4, I]); 















{for each PE in triangle array .. } 
{ .. init. all of its registers on all} 
{buses and all of its texts storages} 
{then for all existing buses .. } 
{if PE is on north border of array .. } 
{its north input is grounded for now} 
{else its north input is from its} 
{north neighbor} 
{east border is grounded if PE's on} 
{east boundary, .. } 
{else it's connected to the east} 
{neighbor} 
{south and west inputs are grounded} 
{if PE's on the diagonal boundary .. } 
{else they are connected to the south} 
{and west neighbors} 
187 
(***'~~******-lrkti~******'~~***'~~******'~~******'~~***tt 
* * * This procedure initializes a newly allocated type 2 triangle array * 
* specified by the script. It is called by the procedure : * 
* - GetSystemSpecs. * 
* * AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"********) 
procedure InitializeTriangle2 
(Ptr : ArrayPtrType); 
var I,X,Y : integer; 
begin 
with Ptr" do 
end; 
for X:-1 to ArraySize do 
for Y:=l to X do 
begin 
InitializeCell(PE[X,Y]); 
with PE[X,Y] do 
for I:=l to MaxBus do 
begin 
end; 










Addr(PE[X, Y+ l] . 
Last_Out[4,I]); 
end; 
if X=ArraySize then 
In_Regs[3,I]:=ZeroPtr 
else In_Regs[3,I] := 
Addr(PE[X+l,Y]. 
Last_Out[l,I]); 






{for each PE in triangle array .. } 
{ .. init. all of its registers on all} 
(buses and all of its texts storages} 
{then, for all existing buses .. } 
{if PE's on the diagonal boundary} 
{then its north and east inputs} 
{are grotlllded for now .. } 
{else they are connected to PE's} 
{north and east neighbors} 
{south input is grounded if PE's} 
{at the bottom of array .. } 
(else it's connected to PE's south} 
{neighbor} 
{west input is grounded if PE's} 
{at the west boundary .. } 
{else it's connected to the west cell} 
188 
( *"k* 
* * * This procedure initializes a newly allocated type 3 triangle array * 
* specified by the script. It is called by the procedure : * 




(Ptr : ArrayPtrType); 
var I,J,X,Y : integer; 
begin 
with Ptr" do 
end; 
for X:=l to ArraySize do 
begin 
I:=ArraySize+l-X; 
for Y:=l to I do 
begin 
InitializeCell(PE[X,Y]); 
with PE[X,Y] do 
for J:=l to MaxBus do 
begin 
if X=l then 
In_Regs[l,J] :=ZeroPtr 
else In_Regs[l,J] := 
Addr(PE[X-1,Y]. 
l.ast_Out[3 ,J]); 





















{for each PE in this triangular array} 
{ .. init. all of its registers on all} 
{buses and all of its texts storages} 
{then, for all existing buses .. } 
{if PE's on the north border} 
{then ground its north input .. } 
{else connect the north input} 
{to the northern neighbor} 
{if PE's on the diagonal boundary} 
{then its east input and .. } 
{its south input is grounded for now} 
{else .. } 
{its east input is from its} 
{east neighbor and .. } 
{its south input is from its south} 
{neighbor} 
{west input is grounded if PE's on} 
{the west boundary .. } 




* * * This procedure initializes a newly allocated type 4 triangle array * 
* specified by the script. It is called by the procedure : * 




(Ptr : ArrayPtrType); 
var I,J,X,Y : integer; 
begin 
with Ptr" do 
end; 
for X:=l to ArraySize do 
begin 
I:=ArraySize+l-X; 
for Y:=I to ArraySize do 
begin 
InitializeCell(PE[X,Y]); 
with PE[X,Y] do 
for J:-1 to MaxBus do 
begin 


























{for each PE in this triangular array} 
{ .. init. all of its registers on all} 
{buses and all of its texts storages} 
{then, for all existing buses .. } 
{if PE's on the diagonal boundary} 
{then its north input and} 
{its west input is grounded for now} 
{else .. } 
{its north input is from its .. } 
{north neighbor and .. } 
{its west input is from its west .. } 
{neighbor} 
{if PE's on the east border then} 
{ground its east input .. } 
{else connect it to the eastern} 
{neighboring PE. } 
{if PE's on the south border then} 
{ground its south input .. } 




* * * This procedure writes text inside each PE of an array according to * 
* values of the PE's registers. It's smart enough to know the display mode * 
* of the array and write texts accordingly. * 
* * AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"******) 
procedure WritePEtxt 
(X,Y : integer; 
Ptr : ArrayPtrtype); 
var I integer; 
begin 
with Ptr" do 
end; 
with PEtxtArray[DPmode} do 





{depending on array's display mode .. } 





* This procedure define a window, give it a world coordinate system, * 
* and then depending on array's display mode, will draw a square systolic * 
* array inside the window. This window will overlap on top of all * 




(WorldNum : integer; 
var 






with Ptr"' do 
end; 










for X:=l to ArraySize do 
for Y:=l to ArraySize do 
end; 
begin 














{define window where drawing} 
{will take place} 
{select world for array window} 
{select the window} 
{give it a (black) background .. } 
{else it won't overlap others} 
{if PE's boundary type then draw} 
{it as a circle. Else .. } 
{ .. draw PE as a square} 
192 
(******'~~'***'*********"'"**"''***'*********"'"'**''***'*********'***"'Hrlril-**>l'**I<******* 
* * * This procedure define a window, give it a world coordinate system, * 
* and then depending on array's display mode, will draw a type 1 triangu- * 
* lar systolic array inside the window. This window will overlap on top of * 




(WorldNurn : integer; 
var 






with Ptr" do 
end; 










for X:=l to ArraySize do 
for Y:=X to ArraySize do 
end; 
begin 














{define window where drawing} 
{will take place} 
{select world for array window} 
{select the window} 
{give it a (black) background .. } 
{else it won't overlap others} 
{if PE's boundary type and display} 
{mode is Arrays then draw it as a} 
{circle.} 




* This procedure define a window, give it a world coordinate system, * 
* and then depending on array's display mode, will draw a type 2 triangu- * 
* lar systolic array inside the window. This window will overlap on top of * 




(WorldNun : integer; 
var 






with Ptr" do 
end; 










for X:=l to ArraySize do 
for Y:=l to X do 
end; 
begin 














{define window where drawing} 
{will take place} 
{select world for array window} 
{select the window} 
{give it a (black) backgrotnld .. } 
{else it won't overlap others} 
{if PE's boundary type and display} 
{mode is Arrays then draw it as a} 
{circle.} 
{ .. Else draw PE as a square} 
194 
(***'~rk"k****'~rk-k***lri'"*****'**ld~***"~rk"k****''"**'I'******"'"***'******"'"****** 
* * * 1his procedure define a window, give it a world coordinate system, * 
* and then depending on array's display mode, will draw a type 3 triangu- * 
* lar systolic array inside the window. 1his window will overlap on top of * 












with Ptr" do 
end; 










for X:=l to ArraySize do 
begin 
I:=ArraySize-X+l; 
for Y:=l to I do 
begin 
















{define window where drawing} 
{will take place} 
{select world for array window} 
{select the window} 
{give it a (black) background .. } 
{else it won't overlap others} 
{if PE's boundary type and display} 
{mode is Arrays then draw it as} 
{a circle.} 
{ .. Else draw PE as a square} 
195 
( * 
* * * This procedure define a window, give it a world coordinate system, * 
* and then depending on array's display mode, will draw a type 4 triangu- * 
* lar systolic array inside the window. This window will overlap on top of * 












with Ptr" do 
end; 










for X:=l to ArraySize do 
begin 
I:=ArraySize-X+l; 
for Y:=I to ArraySize do 
begin 
















{define window where drawing} 
{will take place} 
{select world for array window} 
{select the window} 
{give it a (black) background .. } 
{else it won't overlap others} 
{if PE's boundary type and display} 
{mode is Arrays then draw it as} 
{a circle.} 
{ .. Else draw PE as a square} 
196 
('*"*'**************************************************************************: 
* * This procedure will draw the status window at the default location and * 
* writes initial text within its boxes. It is used only once. * 
* * AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAk*) 
procedure DrwStatusWindow 
(WorldNtun : integer; 
Ptr : ArrayPtrtype); 
var I integer; 
begin 










for I:-1 to 3 do 
with Boxes[!] do 
begin 












for I:-1 to 3 do 
end; 




{define window where drawing} 
{will take place} 
{select world for array window} 
{select the window} 




* * * This procedure draws up the configuration of arrays read in from script * 
* file. It also stores all configured windows in the window stack for * 
* later updating. Depending on the type of the array, it will call these * 
* procedures : * 
* - DrwSquare () , * 
* - DrwTrianglel (), * 
* - DrwTriangle2 (), * 
* - DrwTriangle3 (), * 
* - DrwTriangle4 () * 
* to properly draw the array itself. * 
* * AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA***********) 
procedure DrawSystem 
(Ptr: ArrayPtrType); 
var TempPtr : ArrayPtrType; 
begin 
SelectScreen(2); 















while TempPtr<>Ptr do 
begin 
if TempPtr<>StatPtr then 
with TempPtrA do 
begin 







(points to current array} 
(moving array pointer} 
(at RAM screen .. } 
{draw the current array, depending} 
{on which type it is} 
(and stores it in window stack} 
(then clears RAM.} 
(starts with a non-current array} 
{and as with all non-current array .. } 
(except the status window} 

























{stores their image into window stack} 
{then draw status window} 
{don't forget to stores it} 
{and copy'm all to displayed screen. } 
{now selects displayed screen .. } 
{restores current window to its} 
{current position,} 
{selects it} 
{then shows that it's current.} 
199 
( *** 









if Foregrotmd=O then 
if Direction<O then 
Foreground:=l5 





(negative for previous color,} 
(positive for next.} 
(computes next or previous color} 




* * * This procedure redraws the current array in the next display mode. This * 
* will allow a user to look at all registers of PEs in the array at the * 
* same time for easy debuging. Depending on the type of array it will call * 
* these procedures : * 
* - DrwSquare () , * 
* - DrwTrianglel () , * 
* - DrwTriangle2 () , * 
* - DrwTriangle3 (), * 
* - DrwTriangle4 () * 




(Ptr : ArrayPtrType); 
var TempPtr : ArrayPtrType; 
begin 






else with Ptr" do 
begin 
case DPrnode of 
Full : DPrnode:=Arrays; 
Arrays : DPrnode:=Buffer; 



















{points to current array} 
{moving array pointer} 
{if this is the status panel then .. } 
{screams at 1000 Hertz} 
{for 3 tenths of a second} 
{then shuts up} 
{erase old window from window stack} 
{select RAM screen .. } 
{wipes it clean and .. } 
{draw the current array, depending} 
{on which type it is} 
{and stores it in window stack} 
{then clears RAM.} 
end; 
TempPtr:-Next; 
while TempPtr<>Ptr do 
begin 












{starts with a non-current array) 
{and as with all non-current array .. ) 
{except the status window} 
{restores all windows to their) 
{current position,) 
{then restore status window to its) 
{current position) 
{and copy RAM to displayed screen.) 
{now selects displayed screen .. ) 
{restores current window to its) 
{current position,) 




* * * This procedure makes either the previous or the next window current for * 
* any operation, for example moving a window. The current window can * 





(var Ptr : ArrayPtrType; 






























while TempPtr<>Ptr do 
with TempPtrA do 
begin 









{points to current array} 
{0 for previous, 1 for next} 
{remember text and number of} 
{current window} 
{if backward, makes previous window} 
{current, else next window.} 
{Shows window isn't current anymore} 
{and stores it in window stack} 
{now, selects the RAM screen .. } 
{clears it, then .. } 
{updates the AJ]J{AY # box of the} 
{status window by .. } 
{erasing the old status text} 
(and write in status text of} 
(current window} 
(and as with all non-current windows .. } 
(except the status window} 
(brings them back to RAM screen at} 
(their current position.} 
(now draw status window if it's} 
(not the current one.} 
(copy to displayed screen .. } 





(restores current window to} 
(its current position,} 
(selects it} 




* * * '!his procedure writes back out the (possibly updated) script file to * 
* disk. Since its logic is fairly straightforward, no connnents within its * 




















while SysPtr<>StatPtr do 





Intl, ' ' , lnt2 , ' ' , 
HiX,' ',HiY,' , '); 
write(ScriptFile,'Pecodes :'); 
for X:=l to ArraySize do 
begin 
for Y:=l to ArraySize do 
write(ScriptFile, 
' ',PE[X,Y].Code:2); 







if Next<>StatPtr then 
writeln(ScriptFile,' ; ') 





while IOPntr". IO= INPUT do 
with IOPntr" do 
begin 
write(ScriptFile,Name,' ', 
ArNum,' ',Side,' ', 
Bus,' ',IOStart); 
if NextIO".IO=INPlJf then 
writeln(ScriptFile,' ,') 
else writeln(ScriptFile,' .'); 
IOPntr:=NextIO; 
end; 
writeln(ScriptFile, 'OUfFILES : '); 
while IOPntr<>NIL do 
with IOPntr" do 
begin 
write(ScriptFile,Name,' ', 
ArNum,' ',Side,' ', 
Bus,' ',IOStart); 
if NextIO<>NIL then 
writeln(ScriptFile,' ,') 






while LnkPtrl<>NIL do 
with LnkPtrl" do 
begin 
writeln(ScriptFile,ArNums[l]); 
while (LnkPtr2<>NIL) and 
(LnkPtr2".ArNums[l]= 
ArNums [ 1]) do 
begin 












LnkPtr2".Sides[2] ,' ', 
LnkPtr2".I.nkStart,' ', 
LnkPtr2".I.nkStop,' '); 

















* * * This procedure gets script file name specified by user on the conunand * 
* line or failing that it will prompt user for it. * 
* * AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA****kAAAAAA**********) 
procedure PromptUser; 
var OK : boolean; 
begin 
ClrScr; 
if ParamCount=O then 
begin 











if not OK then 
begin 
{clears out display} 
{if no parameter on command line} 
{prompts user and reads in script} 
{file name} 
{IO loop check here} 
{check IO result for error} 
writeln(' !! File not found!!'); {let user knows if error} 
writeln('** Script filename?'); {prompts user again} 





{until no more IO error} 
208 
( ******************* 
* * * This function returns the first (non-space) char of the next word on * 




(var FileVar : text ) 
char 
var TempChr char; 
begin 







{skips all spaces and tabs to first} 
{non-blank char. Return@ if char} 
{is EOln .. } 
{else return it} 
209 
( *********************** 
* * * This function reads statements out of the script file and returns * 
* their type to the calling block. It will set an appropriate error value * 





var I : integer; 
TempStr : string[Max.Word] ; 
begin 
I:=O; 





until Eoln(ScriptFile) or 
(!=Max.Word) or 
(TempStr[I]=' ') or 
(TempStr[I]=' :'); 
if (TempStr[I]=' ') and 
not SeekEoln(ScriptFile) then 
read(ScriptFile,TempStr[I]); 




while (I<=MaxStr) and 
(TempStr<>StringList[I]) 
do I:=I+l; 
if I<=MaxStr then 
begin 
StatementType:=I; 
if SeekEoln(ScriptFile) then 
readln(ScriptFile); 
end 
{skips all spaces, tabs} 
{and blank lines} 
{then reads in the statement} 
{then finds the delimiter.} 
{if found, see what type of} 
{statement it is} 
{and return its type} 












* * * This procedure build and initializes an array specified in the script * 
* file. It does minimal syntax error checking on the script. * 
* It is called by : * 
* - procedure ReadScript(). * 
* * 
procedure GetSystemSpecs 
(var Ptr: ArrayPtrType); 







with Ptr" do 
begin 
read(ScriptFile,Number, 
ArType , Mode, HiX, Hi Y) ; 
if not (Number in 














if not (DPmode in 
[Fu.11,Arrays,Buffer]) then 
begin 











(retrieve values for the array} 
(error if array number is >= 16} 
(stores the type of array or .. } 
(error when type is unknown} 
(stores the array's display mode .. } 
(or error when type is unknown} 
(look for delimiter} 
(and if not found, gives error} 
**) 
end; 
if StatementType=6 then 
for X:-1 to ArraySize do 




if not (PE[X,Y].Code in 



























(Next, look for array's PEs codes} 
(layout and read it in} 
(error if something is wrong} 
(error if something is wrong} 
(depending on type of array, call} 
(propper procedure to initialize} 
(its PEs} 
(convert the array number to its} 
(string equivalent for status panel} 





* * * This procedure sets the start and stop index values to traverse the side * 
* of an array depending on the array type and which side of the array. * 
* * 
procedure SideTraversal 
( Side : integer; 
ArType : TypeOfArray; 
var Xl,X2 , 
Yl,Y2 : integer); 
begin 
case Side of 

































(depending on which side and type} 
(of array is involved, prepares} 
(the start and stop index values} 































































* * * This procedure, given the info contains in a IOType record, will link * 
* a buffer of an IO file to a side of an array at the proper time. If the * 
* IO file is of type INPUT, all In_Regs of PEs' on the proper side of the * 
* array will contain the addresses of the buffer's individual registers. * 
* If the IO file is of type OUTPUT, the reverse is true. * 
* This procedure is called by : * 
* - procedure MultiStepsExec(). * 





( Ptr : IOPtrType; 
Step : integer); 
I 
Xl,X2,X3 , 
Yl,Y2,Y3 : integer; 
begin 
while Ptr<>NIL do 
with Ptr" do 
begin 
end; 





Xl , X2 , Y1, Y2) ; 
X3:=Xl; Y3:=Yl; I:=l; 
repeat 







Addr(PE[X3, Y3] . 
Last_Out[Side,Bus]); 
end; 
if Xl<X2 then X3:=X3+1 
else if Xl>X2 then X3:=X3-l; 
if Yl<Y2 then Y3:=Y3+1; 
I:=I+l; 




{if it's time to link IO to array .. } 
{marks that IO channel is now active.} 
{depending on which side and type} 
{of array is involved, prepares the} 
{start and stop index values} 
{actual linking is done here while} 
{traversing the side of array} 
{PE's input registers gets addresses} 
{of IO channel's input buffers} 
{IO channel output buffer gets} 
{addresses of PE's output registers} 
( 
216 
* * * This procedure build and initializes the IO system of the configuration * 
* from the script file. It does lots of error checking on the script. * 
* It is called by : * 
* - procedure ReadScript(). * 
* Also, it called : * 
























until (Name[I]=' ') or 
(I=MaxFileName); 








if not (IOresult=O) 
then ErrorSet:= 







if not (ArNum in 
[l .. MaxWindowsGlb-1]) 
{garbage can for expediency} 
{does this 'til " 11 or error is met .. } 
{set IO type} 
(IO channel is not active yet} 
{gets the address of systems arrays} 
{get all blanks in between data} 
(retrieves IO filename to storage} 
(if bad name, sets error alarm} 
(sets the length of the name string} 
(IO file preprocessing starts here} 
{if input file, open for reading .. } 
(then checks IO result for error} 
(and set error if there are any.} 
{if output file, open for writing.} 
{get all blanks in between data} 
{then get all remaining data.} 




and (ArPtr" .Next<>CurrntPtr) 
do ArPtr:=ArPtr".Next; 
if ArPtr".Number<>ArNum then 
ErrorSet:=ErrorSet+[6]; 
if not (Side in [l. .MaxR.egs]) 
then ErrorSet:-ErrorSet+[9]; 
if not (Bus in [ 1 .. MaxBus] ) 
then ErrorSet:-ErrorSet+[l4]; 
case Flag of 
INPUT: 
for I:-1 to MaxArraySize do 
InRegs [I] :=(). 0; 
OUTPUT: 




If not (TempChr in[',','.']) 
then ErrorSet:=ErrorSet+[4] 
else readln(ScriptFile); 







until (ErrorSet<>[]) or 
(TempChr='. '); 
end; 
{search for the specified array .. } 
{does array exist ?} 
{valid side ?} 
{valid bus ? } 
{init. all IO buffer's registers} 
{where is delimiter ?} 
{gets storage space for next} 





* This procedure, given the info contains in a list of LinkType records, * 
* will link a side of a source array to a side of an destination array, * 
* or it will cut off the link by pointing input registers to value zero * 
* The link is achieved by having all In_Regs of PEs' on the proper side of * 
* the destination array store addresses of Out_Regs of all PE's on the * 
* proper side of the source array. * 
* This procedure is called by : * 
* - procedure MultiStepsExec(). * 




(Link : LinkPtrType; 
Step : integer); 






while Link<>NIL do 
with Link" do 
begin 
if lnkStart=Step then 
begin 










for I:=l to MaxBus do 
ArPtrs[l]". 





for I:=l to 2 do 
begin 
if X1 [I ]<X2 [I] then 
X3 [I] : =X3 [I]+ 1 
else if Xl[I]>X2[I] then 
X3 [I ] : =X3 [ I ] -1 ; 
{start at begining of Link list} 
{and until the end of list .. } 
{do all things below.} 
{if the moment of truth arrives} 
{then .. } 
{ .. for both source and destination,} 
{depending on which side and type} 
{of array is involved, prepares} 
{the start and stop index values} 
{to traverse the side of array} 
{repeats doing the following .. } 
{for all buses, points Input registers} 
{of destination array to the Output} 
{registers of the source array} 
{increments side traversal index} 
{values for both source and} 
{destination array} 
end; 
if Yl[I]<Y2[I] then 
Y3 [I] : =Y3 [I]+ 1 ; 
end; 
until (X3 [ 1] =X2 [ 1]) and 
(Y3[l]=Y2[1]) ; 
end 















if Xl[l]<X2[1] then 
X3 [l] :=X3[1]+1 
else if Xl[l]>X2[1] then 
X3 [ 1 ] : =X3 [ 1] -1 ; 
if Yl[l]<Y2[1] then 
Y3[1] :=Y3[1]+1; 





{until array's side is fully} 
{traversed} 
{when it's time to cut the link} 
{for the destination array, prepares} 
{the start and stop index values} 
{to traverse its side} 
{repeats doing the following .. } 
219 
{for all buses, points Input registers} 
{of destination array to the} 
{value zero} 
{increments side traversal index} 
{values for destination array} 





* This procedure gets info of the data flow from array to array, including* 
* feedback paths, according to a script file. It does some error checking * 
* on the script file and on the way user specified cormective path between * 
* arrays. 
* The procedure is called by 
* - procedure ReadScript(). 
* It calls : 









(var Link : LinkPtrType; 
var 
Ptr : ArrayPtrType); 
TempChr : char; 
New Link 
























with NewLink"' do 
begin 
Sides[l]:=StatementType-6; 
if not (Sides[l] in 
[1. .MaxRegs]) then 
ErrorSet:=ErrorSet+[7]; 
(does all this until '.' encountered} 
(initializes pointers} 
(gets the destination array .. } 
(is it valid ?} 
(if not, sets error and says goodbye} 
(then does all this until ';' is met} 
(gets input side of destination array} 















if not (Sides[2] 
in [l. .MaxR.egs]) then 
ErrorSet:=ErrorSet+[9]; 




If not (TernpChr in 























until (TempChr in[';','.']); 
until (TempChr='. ') or 
Eof(ScriptFile); 
end; 
{Now, gets source array, its output} 
{side and start and stop values} 
{validates source array here .. } 
{and the output side here} 
{leaves if any errors} 
{then seeks out delimiter. If none} 
{found, sets error} 
{create new link storages and .. } 
{and points to it} 
{stops getting input direction for} 
{destination array} 





* * * This function sets up the SAGS internals according to a script file * 
* specified by the user. It will generate error messages and returns * 
* a FALSE boolean value if any error or inconsistency is encountered in * 
* the file. Graphics errors such as drawing a window out of screen range * 






















if ActionType=Sequencer then 
begin 





















{clears error register and init.) 
{sequence counter) 
{continues the system setup sequence) 
{until error occurs) 
{gets the step ntunber and if it's) 
(in sequence then proceeds) 
(reads in array size of system) 
(if array size too large, sets error) 
{look for the delimiter and) 
{ if not found, error) 
{creates arrays system here) 








tmtil (ErrorSet<>[]) or 
(TempGhr='. '); 























if ErrorSet<>[] then 
begin 
for Sequencer:=l to Max.Error do 
begin 










{look for delimiter) 
{and if not fotmd, gives error) 
{if no error,) 
{then create a circular) 
{doubly linked list which included) 
{the status window.) 
{creates IO system here) 
{starts IO linked list) 
{then reads in IO specs .. ) 
{and reads in some more then .. ) 
{get description of the flow of data) 
{setup sequence ends here.) 
{looks through error list actunUlated} 
(thus far and displays appropriate} 
(error message} 





* * * 'lllis procedure represents the execution code of a shift down register * 
* array. 'lllis array moves data in the North to South direction. * 
* _ R is X _Reg * 
* _ Xin is ln_Regs[l,l]A * 
* _ TAGin is ln_Regs[l,2]A * 
* _ Xout is Out_Regs [ 3, l] * 




(var PE : PEtype); 
begin 








{put value in here for display} 
225 
( *** 
* * * This procedure represents the execution code of a shift left register * 
* array. This array moves data in the East to West direction. * 
* _ R is X_Reg * 
* _ Xin is In_Regs [2, l]" * 
* _ TAGin is ln_Regs(2,2)A * 
* _ Xout is Out_Regs [ 4, l] * 




(var PE: PEtype); 
begin 




Out Regs[4,2]:=In Regs[2,2]A; - -








* * * 1his procedure represents the execution code of a shift up register * 
* array. 1his array moves data in the South to North direction. * 
* _ R is X _Reg * 
* _ Xin is In_Regs[3,l]A * 
* - TAGin is In_Regs[3,2]A * 
* _ Xout is Out_Regs[l, l] * 
* TAGout is Out_Regs[l,2] * 
* * AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA*******) 
procedure S2Ncode 
(var PE : PEtype); 
begin 












* * * This procedure represents the execution code of a shift up register * 
* array. This array moves data in the West to East direction. * 
* _ R is X _Reg * 
* _ Xin is In_Regs[4,l]" * 
* - TAGin is In_Regs[4,2]A * 
* _ Xout is Out_Regs[2, l] * 




(var PE: PEtype); 
begin 












* * * lhis procedure represents the execution code of HE's systolic array * 
* for botmdary cell. * 
* X is X_Reg * 
Xin ic In_Regs[l,l]" * --
TAG in ic In_Regs[l,2]A * --
Vout is c Out_Regs[2,l] * 
Mout ic Out_Regs[2,2] * --
-Mout is c Out_Regs[3,l] * 
* * AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAkkAAAAAAAAAA*******"kAAAAAAAAAA*******) 
procedure HEcodel 
(var PE: PEtype); 
begin 























{get Xin and .. } 
(pivoting TAG bit values} 
(if pivoting is allowed and Xin is} 
(greater in magnitude than X, then .. } 
(tell the East neighboring cell to} 
(pivot and send it a modifying value} 
(else, no pivoting .. } 




* * * 'lllis procedure represents the execution code of HE's systolic array * 































(var PE: PEtype); 
begin 





TAG:=Trunc(Out_Regs[2,l]); {get TAG bit for display} 














* * * '!his procedure represents the execution code of NASH's systolic array * 









Cout or Xout is 
* X_Reg * In_Regs[l,2]A * In_Regs[l,l]A * Out_Regs[2,l] * 
* ·~) :AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 
Sout is Out_Regs[2,2] * 
* 
procedure NASHcodel 
(var PE : PEtype); 
var T : real; 
begin 




if TAG=O then 

















{'!his line will give us incorrect} 




* * * 'lbis procedure represents the execution code of NASH's systolic array * 










































(var PE: PEtype); 
begin 





if TAG=O then 
begin 
Out_Regs[3,l]:= 















* * * This procedure represents the execution code of my systolic array * 



















































(var PE: PEtype); 
begin 
with PE do 
begin 
Cl24 :- Trunc(In_Regs[l,2]"); 
C3 := Trunc(In_Regs[4,l]"); 
if Odd(Cl24) then 
X_Reg :-= 0.0 
if Cl24>7 then 
begin 
if (abs(In Regs[l,l]")>= 
abs (X _Reg) ) and 
(Cl24>11) then 
begin 
if Cl24 in [12,13] then 
Out_Regs[2,l]:=Cl24+2 
else Out_Regs[2,l]:=Cl24; 















{stores Cl, C2, C3, and C4} 
{if C4 is 1 then clear X} 
{if Cl is 1 then Triangle mode} 
{if 3Xin3 r 3X3 and C2 is 1 then} 
{pivoting is needed and allowed} 
{set C3 to 1..} 
{else pivoting is not allowed} 





If C3 in [2,3,6,7] then 
begin 
Out_Regs[3, l] :=X_Reg 
+In_Regs[4,2]" 














(var PE: PEtype); 
begin 
with PE do 
begin 
Cl24 := Trunc(In_Regs[l,2]"); 
C3 := Trunc(In Regs[4,l]"); 
if Odd(Cl24) then 
X_Reg := 0.0 ; 






if Cl24 in [12,13] then 
Out_Regs[2,l]:=Cl24-6 
else Out_Regs[2,l]:=Cl24-8; 







{display that cell is triangle mode} 
{else Cl is in Square mode.} 
{if C3 is 1 then .. } 
{else if C3 is 0 then .. } 
{pass on C3.} 
{Pass on Min.} 
{display that cell in square mode} 
{In any case, pass on Cl, C2, C4.} 
{stores Cl, C2, C3, and C4} 
{if C4 is 1 then clear X} 
{if Cl is 1 then Triangle mode} 
{if 3Xin3 r 3X3 and C2 is 1 then} 
{pivoting is needed and allowed} 
{set C3 to 1..} 
{else pivoting is not allowed} 
233 
end; 









-In_Regs [l, l]" 
/X_Reg; 

















{set C3 to 0 .. } 
{display that cell is triangle mode} 
{else Cl is in Square mode.} 
{if C3 is 1 then .. } 
{else if C3 is 0 then .. } 
{pass on C3.} 
{Pass on Min.} 
{display that cell in square mode} 




* * * This procedure represents the execution code of my systolic array * 




























(var PE : PEtype); 
begin 
with PE do 
begin 
end; 
C3 := Trunc(In_Regs[4,l]"); 
TAG:=C3; 
if Odd(C3) then 
X_Reg :- 0.0 








else Out Regs[3,l]:= 









{stores Cl, C2, C3, C4.} 
{display control code} 
{if C4 is 1 then clear X} 
{if C3 is 1 then .. } 
{else if C3 is 0 then .. } 
{pass on Cl, C2, C3, C4.} 












(AAAAAAAAAAAAAAAAAAAAAAAAAAAAA********AAAAAAAAAAAAAAAAAAAAA * ********~* 
* 
* 'lllis procedure represents the execution code of my systolic array * 
* design for square cells. 
* 
* x is X_Reg -
* Xin is In_ Regs [ 1 , 1] " -
* Cl24in is In_ Regs [l, 2]" -
* C3in is In Regs [ 4, 1 ]" -
* Min is In_ Regs [ 4, 2] " -
* Xout is Out_Regs[3,l] -
* Cl24out is Out_Regs[3,2] -
* C3out is Out_Regs[2,l] -
* Mout is Out_Regs[2,2] -
* 
procedure LEcode4 
(var PE: PEtype); 
begin 
with PE do 
begin 
end; 
Cl24 := Trunc(In_Regs[l,2]"); 
C3 := Trunc(In_Regs[4,l]"); 
TAG:=C3; 
if Odd(Cl24) then 
X_Reg := 0.0 ; 




























{stores Cl, C2, C3, and C4} 
{display control code} 
{if C4 is 1 then clear X} 
{if C3 is 1 then .. } 
{else if C3 is 0 then .. } 
{Pass on Min.} 
{In any case, pass on Cl, C2, C4.} 
237 
( *** 
* * * 'Ibis procedure updates the image of an array in its window to reflect * 
* the state of the computation at a particular step. Depending on the * 




(Ptr : ArrayPtrType); 
var X,Y,I integer; 
begin 





with PEtxtArray[DPmode] do 
for X:-1 to ArraySize do 
for Y:-1 to ArraySize do 
with PE[X,Y] do 
if Code<>O then 





{brings out the proper window,} 
{selects it, and .. } 
{erase the old texts .. } 
{depending on array's display mode.} 
{within every PE of the array .. } 
{if the PE has a valid code then .. } 
{erases all displayable registers} 
{values} 
end; 
for X:=l to ArraySize do (Then, with every PE of the array .. } 
for Y:=l to ArraySize do 
with PE[X,Y] do if Code<>O then (if it has a valid code .. } 
begin 
Str(X_Reg:6:2, (updates its text storages of X,} 
Regs_Txt[2]); 
Str(Out_Regs[2,1]:6:2, (of Vout,} 
Regs_Txt[3]); 
Str(Out_Regs[2,2] :6:2, (of Mout,} 
Regs_Txt[4]); 
Str(Out_Regs[3,1]:6:2, (of Xout,} 
Regs_Txt[S]); 




with PEtxtArray[DPmode] do 
for X:=l to ArraySize do 
for Y:=l to ArraySize do 
with PE[X,Y] do 
if Code<>O then 







(At last, write in the new texts .. } 
(depending on array's display mode.} 
(within every PE of the array .. } 
(if the PE has a valid code then .. } 
(rewrites all new registers} 
(values} 




* * * This procedure simulates a single step of execution of the systolic * 
* system of arrays. It first links all necessary IO channels for the * 
* current step to the system of arrays, then it feeds data into input * 
* buffers, gets data from arrays into output files, then for each PE, it * 
* executes its microcodes until the entire system of arrays is traversed. * 
* At last it will move the result of each PE's micro-execution into its * 
* suitable output register and updates the graphics image of each array * 
* and the status panel. It really can do that much works in so short a * 
* time span. * 
* This procedure is called by : * 
* - main block. * 
* This procedure calls : 
* - procedure LinkIOFlow(). 
* - procedure LinkDataFlow(). 
* - procedure UpdateArray(). 
* - procedures HEcodel(), HEcode2(). 
* - procedures NASHcodel(), NASHcode2(). 
* - procedures lEcodel(), LEcode2() 

























while IOPntr<>NIL do 
with IOPntr" do 
begin 
if Active then 
case IO of 
INPUT: 




for I:=l to ArraySize 
do InR.egs[I]:=O.O; 
end 
{update status panel's registers} 
{increments time .. } 
{and step counters} 
{establishes all necessary links and} 
{IO channels for this step} 
{starts at begining of IO linked list} 
{for each I/O channel .. } 
{if channel is still active, then} 
{depending on the type of IO channel} 
{for input channel .. } 
{if all data in file are read} 
{then closes input file,} 
{marks input channel as inactive} 
{and grounds input buffers. } 
else begin 


















while SysPtr<>StatPtr do 
with SysPtr" do 
begin 
for X:=l to ArraySize do 
for Y:=l to ArraySize do 


















while SysPtr<>StatPtr do 
with SysPtr" do 
begin 
for X:=l to ArraySize do 
for Y:=l to ArraySize do 
with PE[X,Y] do 
if Code<>O then 
for l:=l to MaxRegs do 
for J :=l to MaxBus do 
{else reads in data on line .. } 
{and go to next line} 
{for output channel .. } 
{write data to file} 
{then goes to next IO channel} 
240 
{start with the 1st array in system .. } 
{as with all arrays except STATUS .. } 
{with every single PE of array .. } 
{depending on its individual code .. } 
{do nothing, or .. } 
{executes the proper PE's microcode} 
{then go to the next array} 
{THEN moves the flow of data} 
{of each array except the STATUS} 
{by updating each PE' s Last_Out} 
{buffers on all sides and bus .. } 

































while SysPtr<>CurrntPtr do 















{then of course, go to next array} 
{updates graphics image of system 
{starts here. First, invert current} 
{window to normal .. } 
{ .. then stores it.} 
{Now, on the RAM screen,} 
(clears it .. } 
(then, if current window is not .. } 
(the status window, updates it and} 
(clears RAM screen again} 
(then with the status panel,} 
(restores it to the RAM screen .. } 
(and starts updating the panel .. } 
(by erasing the old status text} 
(and write in new status. text} 
(and stores the new panel.} 
{then updates all other windows} 
(to the RAM screen .. } 
(except the status panel .. } 
(which, if it's not the current .. } 
(window, restores it last to the .. } 
(RAM screen} 
241 
(Now, dump contents of RAM screen to} 
(the MAIN screen, and select it .. } 
(and restore the current window to it} 
(then select current window and .. } 
(hilite it. } 
242 
(AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA****"AAAAAA***** 
* * * This procedure simulates a single step of execution of the systolic * 
* system of arrays. It first links all necessary IO channels for the * 
* current step to the system of arrays, then it feeds data into input * 
* buffers, gets data from arrays into output files, then for each PE, it * 
* executes its microcodes until the entire system of arrays is traversed. * 
* At last it will move the result of each PE's micro-execution into its * 
* suitable output register and updates the graphics image of each array * 
* and the status panel. It will keeps executing until a key is hit on the * 
* keyboard. * 
* This procedure is called by : * 
* - main block. * 
* This procedure calls : 
* - procedure LinkIOFlow(). 
* - procedure LinkDataFlow(). 
* - procedure UpdateArray(). 
* - procedures HEcodel(), HEcode2(). 
* - procedures NASHcodel(), NASHcode2(). 
* - procedures LEcodel(), LEcode2() 





































with StatPtrA do 
begin 
Times:=Times+TimeUnit; 
(first, if current window is not} 
(the status panel then stores it} 
(as a non-current window and} 
(then make the status panel current} 
(by inverting it.} 
(else stores the status panel as} 
(current} 
(Then select RAM screen} 
(REPEAT all following until a key is} 
(pressed .. } 
(update status panel's registers} 





while IOPntr<>NIL do 
with IOPntr" do 
begin 
if Active then 
case IO of 
INPUT: 


























while SysPtr<>StatPtr do 
with SysPtr" do 
begin 
for X:-1 to ArraySize do 
for Y:=l to ArraySize do 











{and step counters} 
{establishes all necessary links and} 
{IO channels for this step} 
243 
{starts at begining of IO linked list} 
{for each I/O channel .. } 
{if channel is still active, then} 
{depending on the type of IO charm.el} 
{for input charm.el .. } 
{if all data in file are read} 
{then closes input file,} 
{and marks input charm.el as inactive} 
{and grounds input buffers.} 
{else reads in data on line .. } 
{and go to next line} 
{for output channel .. } 
{write data to file .. } 
{and go to next line} 
{then go to next IO channel} 
{start with the 1st array in system .. } 
{as with all arrays except STATUS .. } 
{with every single PE of array .. } 
{depending on its individual code .. } 
{do nothing, or .. } 








while SysPtr<>StatPtr do 
with SysPtr"' do 
begin 
for X:-1 to ArraySize do 
for Y:-1 to ArraySize do 
with PE[X,Y] do 
if Code<>O then 
for I:=l to MaxR.egs do 
for J:=l to MaxBus do 
La.st_Out[I,J] := 
































while SysPtr<>CurrntPtr do 
begin 
if SysPtr<>StatPtr then 
(then go to the next array} 
(THEN moves the flow of data} 
(of each array except the STATUS} 
(by updating each PE' s La.st_Out} 
(buffers on all sides and bus .. } 
(if its code is not 0} 
(then of course, go to next array} 
(start with the first array .. } 
(clears the RAM screen .. } 
(select the array's world .. } 
(for all windows that are not status} 
(panel or current, updates them to} 
(reflect the new values in each} 
(PE's registers.} 
(Then updates the status panel..} 
(and copy RAM to displayed screen.} 
(end of REPEAT} 
(clears stdin of recent key pressed} 
(Now that multiple step execution .. } 
(is stop, clears the RAM screen to .. } 
(start re-displaying all system in} 
(the same order before execution .. } 






if StatPtr<>CurrntPtr then 


















{current, non-status windows first .. } 
(then restore status panel .. } 
(then updates displayed screen .. } 
(and at last, select displayed screen} 
(to restore current window} 
(and invert it if it's not already} 
(invert, meaning if the current} 
(window is not the status panel} 
246 
**************************"'*'"'*'"'*'~~hhlhhl~~~~***************************""kk 
* * * This is the script file for the simulation of Nash's array solving * 
* example (A.4). It allows SAGS to produces the sequence of snapshots B.l * 
* with the data and control files below. * 






1 1 1 21 129 ' 
Pecodes : 7 8 
0 7 
0 0 
2 0 2 37 129 ' 
Pecodes : 8 8 
8 8 
8 8 
3 4 2 21 29 ' 
Pecodes : 0 0 
0 1 
1 1 
4 4 2 37 29 ' 
Pecodes : 0 0 
0 1 
1 1 
5 3 2 37 229 




triang34 3 1 1 1 
trtag3 3 1 2 1 , 
square34 4 1 1 4 
sqtag3 4 1 2 4 . 
OUTFILES 


















Northlnput : 3 3 1 1 
2 
Northlnput : 4 3 1 1 
Westlnput : 1 2 1 1 ; 
5 
Northlnput : 2 3 1 1 . 
247 
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA~AAAAAAAAAAAAAAAAAAAAAAAAAAA~*** 
* * * Infile triang34. * 
* Contains the input data flow to be fed into the triangular array of the * 
* system. * 




















* * * Infile square34. * 
* Contains the input data flow to be fed into the square array of the * 






















* * * Infile trtag3. * 
* Contains the control signals necessary for the triangular array of the * 
* system. * 








































* Infile sqtag3. * 
* Contains the control signals necessary for the square array of the * 
* ~st~. * 







































* * * This is the script file for the sinru.lation of Chuang and He's array * 
* solving example (A.2). It allows SAGS to produces the sequence of * 
* snapshots B.2 with the data and control files below. * 






1 1 1 22 121 
Pecodes : 5 6 6 
0 5 6 
0 0 5 
2 0 2 38 121 
Pecodes : 6 6 6 
6 6 6 
6 6 6 
3 4 2 22 21 , 
Pecodes : 0 0 1 
0 1 1 
1 1 1 
4 4 2 38 21 , 
Pecodes : 0 0 1 
0 1 1 
1 1 1 
5 3 2 38 221 
Pecodes : 1 1 1 
1 1 0 
1 0 0 
INFILES : 
triang32 3 1 1 1 
trtag3 3 1 2 1 , 
square32 4 1 1 4 , 
sqtag3 4 1 2 4 . 
OUfFILES 
result 5 3 1 14 . 
SETUP : 
1 
Northlnput : 3 3 1 1 
2 
Northlnput : 4 3 1 1 
Westlnput : 1 2 1 1 ; 
5 
Northlnput : 2 3 1 1 . 
250 
******~~rlrl'rlri'rlrilnhlnh"*""*""*"********************:-k-A:-k-Arlrlrlri~lnhlnhl.-*1k*.**************** 
* * * Infile triang32. * 
* Contains the input data flow to be fed into the T array of the system. * 




















* * * Infile square32. * 





















* * * Infile trtag3. * 









































* * * Infile sqtag3. * 
* Contains the control signals necessary for the S array of the system. * 







































* This is the script file for the simulation of a double arrays system of * 
* our own design. This system is shown in the sequence of snapshots B.3 * 
* solving example (A.3). It uses the data and control files below. * 
* Re100ve all comments before using them with SAGS. * 




1 0 1 19 106 ' 
Pecodes : 9 11 
12 10 
2 0 2 30 106 . 
Pecodes : 4 4 
4 4 
3 0 2 40 106 
Pecodes : 4 4 
4 4 
4 0 2 so 106 ' 
Pecodes : 4 4 
4 4 
s 0 1 19 172 ' 
Pecodes : 9 11 
12 10 
6 0 2 30 172 ' 
Pecodes : 4 4 
4 4 
7 0 2 40 172 ' 
Pecodes : 4 4 
4 4 
8 0 2 so 172 ' 
Pecodes : 4 4 
4 4 
9 4 2 19 37 ' 
Pecodes : 0 1 
1 1 
10 3 2 19 242 . 
Pecodes : 1 1 
1 0 
INFILES 
data241 9 1 1 1 ' 
controll.24 9 1 2 1 
control2.24 S 1 2 14 
OlITFILES : 
result 10 3 1 28 . 
SETUP : 
1 
Westlnput : 4 2 1 1 , 
Northlnput : 9 3 1 1 ; 
2 
Westlnput : 1 2 1 1 
3 
Westlnput : 2 2 1 1 
4 
Westlnput : 3 2 1 1 
5 
Westlnput : 8 2 1 1 , 
Northlnput : 1 3 14 14 
6 
Westlnput : 5 2 1 1 
7 
Westlnput : 6 2 1 1 
8 
Westlnput : 7 2 1 1 
10 




* * * Infile data241. * 
* Contains the input data flow to be fed into the first array of the * 
* system. * 



































































* * * Infile controll.24. * 
* Contains the control signals necessary for the first array of the system. * 



































* * * lnfile control2.24. * 
* Contains the control signals necessary for the second array of the * 
* system. * 
* * 
****************************************************************************** 
13 0 
12 0 
8 0 
8 0 
8 0 
8 0 
8 0 
8 0 
1 0 
0 0 
0 0 
0 0 
0 0 
0 0 
0 0 
0 0 
1 0 
0 0 
0 0 
0 0 
0 0 
0 0 
