Commutative 1-D systolic arrays for linear transformation applications by Gunara, Ray Antonio
Lehigh University
Lehigh Preserve
Theses and Dissertations
1990
Part of the Electrical and Computer Engineering Commons
This Thesis is brought to you for free and open access by Lehigh Preserve. It has been accepted for inclusion in Theses and Dissertations by an
authorized administrator of Lehigh Preserve. For more information, please contact preserve@lehigh.edu.
Recommended Citation
Gunara, Ray Antonio, "Commutative 1-D systolic arrays for linear transformation applications" (1990). Theses and Dissertations. 5266.
https://preserve.lehigh.edu/etd/5266
COMMUTATIVE 1-D SYSTOLIC ARRAYS
FOR 
LINEAR TRANSFORMATION APPLICATIONS 
by 
Ray Antonio Gunara 
A Thesis 
Presented to the Graduate Committee 
of Lehigh University 
in candidacy for the degree of 
Master of Science in Electrical Engineering
Lehigh University
Bethlehem, Pennsylvania
1989
This thesis is accepted in partial fulfillment of the requirements for
the degree of Master of Science in Electrical Engineering.
September 18, 1989
(date)
Professor in Charge
Chairman of Department
for my brothers Roy and Dean 
ACKNOWLEDGEMENT 
I would like to express my deepest appreciation to those individuals who, in one way or another, have contributed a great deal to the successful completion of this thesis. In particular, I would like to thank my dear advisor, Prof. Meghanad D. Wagh, for his endless support and encouragement throughout the entire research. His ideas and vast experience in the field have helped me tremendously in gaining more insight into the subject of this thesis.
I would also like to thank my family for all their love and support, particularly my parents, who have been a source of motivation for me to achieve my highest goals.
Last, but not least, many thanks to all my friends here at Lehigh who made my stay here a memorable one.
Table of Contents 
ABSTRACT 
1. Introduction 
1.1 The Systolic Array Architectures 
1.2 Main Objectives 
1.3 Organization of the thesis
2. 1-D Systolic Array 
2.1 The Structure of the 1-D Systolic Array
2.2 1-D Systolic Array for Matrix and Vector Multiplication 
2.2.1 Some Existing Architectures 
2.2.2 Proposed Architecture 
2.3 Geometrical Model for The Data Flow Representation 
2.4 Design of Systolic Algorithms 
2.4.1 Matrix and Vector Product Algorithms 
2.4.2 Toeplitz Matrix and Vector Product Algorithms 
2.4.3 Cyclic Convolution Matrix and Vector Product Algorithms 
2.4.4 m-diagonal Matrix and Vector Product Algorithms 
3. Design of Commutative Algorithms
3.1 On Commutativity of the Systolic Algorithms 
3.2 Matrix and Vector Product Algorithms 
3.3 Toeplitz Matrix and Vector Algorithms 
3.4 Cyclic Matrix and Vector Product Algorithms
3.5 m-diagonal Matrix and Vector Product Algorithms 
4. Architectural Considerations 
4.1 Sequencing Architecture 
4.2 Systolic Array Simulator 
5. Conclusion 
5.1 Summary of Results 
5.2 Future Extensions 
REFERENCES 
VITA 
ACKNOWLEDGEMENT 
List of Figures
Figure 2-1: Linear Systolic Array for Convolution with Global Data Broadcast
Figure 2-2: Linear Systolic Array for an N odd linear phase FIR digital filter with Non-Global Data Broadcast
Figure 2-3: Linear Systolic Array for Matrix - Vector Multiplication
Figure 2-4: Linear Systolic Array for Matrix - Vector Multiplication and Addition
Figure 2-5: Proposed Linear Systolic Array for Matrix-Vector Multiplication
Figure 2-6: Sample data flow model for the Linear Systolic Array
Figure 3-1: Matrix and Vector multiplication for N = 4
Figure 3-2: Matrix and Vector multiplication for N = 5
Figure 3-3: Toeplitz Matrix and Vector multiplication for N = 4
Figure 3-4: Toeplitz Matrix and Vector multiplication for N = 5
Figure 3-5: Cyclic Matrix and Vector multiplication for N = 5
Figure 3-6: Cyclic Matrix and Vector multiplication for N = 4
Figure 3-7: m-diagonal Matrix and Vector multiplication for N = 4 and m = 5
Figure 3-8: m-diagonal Matrix and Vector multiplication for N = 4 and m = 3
Figure 3-9: m-diagonal Matrix and Vector multiplication for N = 5 and m = 7
Figure 3-10: m-diagonal Matrix and Vector multiplication for N = 5 and m = 3
Figure 4-1: Sequencing Architecture for Cyclic Matrix and Vector multiplication for N = 4
Figure 4-2: Sequencing Architecture for Cyclic Matrix and Vector multiplication for N = 5
Figure 4-3: Sequencing Architecture for a 5-diagonal Matrix and Vector multiplication for N = 4
Figure 4-4: Sequencing Architecture for a 3-diagonal Matrix and Vector multiplication for N = 4
List of Tables
Table 4-1: Characterization of the Input Sequences in case of Commutative Systolic Algorithms
Table 5-1: Comparison of Algorithms Time Complexity
ABSTRACT
Systolic arrays can be used efficiently to perform linear transformations through sliding inner products of two vectors. Most linear systolic arrays use reserved data paths for moving these vectors through the arrays. This thesis investigates intermixing of the elements of the two sequences to improve the time complexity of systolic computations. This method exploits the commutativity of the operations performed in the processing elements of the array. Four separate applications have been studied in this thesis: matrix-vector multiplication, Toeplitz product, cyclic convolution and banded Toeplitz matrix products. The commutative systolic algorithms are proposed and proved in each case and the hardware implementation is studied. It is found that this method improves the time complexity by up to 50% with a minimal addition to the hardware.
Chapter 1
Introduction
1.1 The Systolic Array Architectures
The ever-increasing demand for real-time signal processing applications has forced researchers to explore a variety of architectures and algorithms that provide the necessary speed and throughput. Parallel processing on general multiprocessor networks accomplishes this end but often results in expensive hardware.

An alternative to a general-purpose machine is a specialized architecture designed for the specific application. If the architecture satisfies certain basic constraints such as repeatability and modularity, and if the communication between modules is regular and minimal, then it is feasible to implement it as a Very Large Scale Integrated (VLSI) circuit [1]. A class of architectures called systolic architectures is known to possess these desirable properties. They are therefore the best candidates for Application Specific Integrated Circuits (ASICs) providing the high throughput desirable in most signal processing applications [2].

Systolic arrays are multiprocessor networks that are laid out in some regular pattern. This thesis concentrates on one-dimensional arrays consisting of processors linked in a chain, but there also exist two-dimensional systolic arrays connected in rectangular or hexagonal grid patterns. Each processor of a systolic array is a relatively simple computing element dedicated to performing a single task. Because the processors have prior knowledge of the task to be carried out, the instruction fetch cycles encountered in a Von Neumann architecture are eliminated, which helps speed up execution. The processing elements send data and results only to their immediate neighbors. This helps reduce communication delays and, more importantly, keeps the chip area spent on communication to a minimum. Thus relatively more chip area is available for computing. Moreover, the lack of broadcasts helps keep the transistor sizes within reasonable limits.

Systolic arrays have been used in recent years in a variety of applications such as IIR and FIR filters, error-correcting encoders and decoders, sorting and searching, pattern matching, and even discrete Fourier transforms [3] [4] [5] [6] [7]. Results have also been reported in the area of designing systolic arrays for new applications and optimizing existing systolic designs.
1.2 Main Objectives 
Almost all digital signal processing computations can be classified as linear transformations of signal vectors. A linear transformation can be expressed as the product of a matrix and the signal vector. For example, cyclic convolution is the product of a cyclic matrix and a vector, digital filters can be interpreted as matrix-vector multiplications, and the implementation of the optimum linear Wiener filter can be viewed as the product of a Toeplitz matrix and the signal [8]. When linear transformation problems are solved on systolic architectures, two sequences made up of the signal and the matrix components are marched through the architecture, and concurrently the sliding inner products of the sequences, corresponding to the components of the resulting vector, are accumulated in individual processors. In all the systolic arrays available to date, each of the two sequences is made up exclusively of components of one type. Thus one of the sequences, say the X sequence, is made up of the signal components and the other, say Y, of the elements of the matrix.
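
The statement that cyclic convolution is the product of a cyclic matrix and a vector can be checked with a small sketch. The function names and the sample data below are illustrative assumptions, not taken from the thesis; the convolution is computed directly from its definition and then compared with the matrix-vector form.

```python
def cyclic_convolution(h, x):
    """Cyclic convolution from its definition: y[(i + j) mod N] += h[i] * x[j]."""
    n = len(x)
    y = [0] * n
    for i in range(n):
        for j in range(n):
            y[(i + j) % n] += h[i] * x[j]
    return y


def cyclic_matrix(h):
    """Cyclic (circulant) matrix whose entry (i, j) is h[(i - j) mod N]."""
    n = len(h)
    return [[h[(i - j) % n] for j in range(n)] for i in range(n)]


def mat_vec(m, x):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


h = [1, 2, 3, 4]  # hypothetical filter coefficients
x = [5, 6, 7, 8]  # hypothetical signal samples
assert mat_vec(cyclic_matrix(h), x) == cyclic_convolution(h, x)
```

The indexing convention h[(i - j) mod N] is one common choice; the algorithms later in the thesis use their own index arrangement for the cyclic matrix.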
This thesis experiments with intermixing the two sequences, so that each sequence contains a few X components and a few Y components. Since the operations performed within each processing element, multiplication and accumulation, are commutative with respect to the operands, this switching of X and Y elements should provide the same result as before. This additional flexibility allows one to improve the time complexity of the array. The architecture that exploits the commutativity of the operations is referred to here as the Commutative Systolic Array. This thesis reports on four applications of commutative systolic arrays: matrix-vector multiplication, cyclic convolution, Toeplitz product, and banded Toeplitz matrix product. In each case, it was found that the commutative arrays performed better than the noncommutative arrays, with the improvement sometimes reaching 50%. The hardware required to implement these arrays is not greatly increased over the noncommutative case.
1.3 Organization of the thesis 
This thesis consists of five major parts. The first part, which contains this section, presents the general idea of the thesis. A brief background of systolic arrays is given, including short discussions of their applications, their advantages, and their current development, particularly in the field of signal processing. In addition, a new idea for systolic array algorithms is introduced.

The second chapter explores 1-D systolic array architectures in detail. Existing architectures are reviewed in the first section, while the second section examines specific architectures for solving matrix by vector problems. The architecture used throughout this thesis is also introduced here, together with its data flow model representation. The last section of the chapter presents a number of algorithms for solving various matrix by vector problems.

The third chapter is the most important part of the thesis. Here, a new method of generating more efficient systolic array algorithms is presented. A step-by-step approach to designing these so-called Commutative Systolic Algorithms is described in detail. Algorithms based on the proposed method are given for matrix by vector problems such as matrix and vector multiplication, Toeplitz product, cyclic convolution, and banded Toeplitz product. Proofs and the verification process for the newly proposed algorithms are also provided.

Chapter 4 closely inspects the various input sequences formulated in Chapter 3. By examining the patterns within these sequences, a guideline for building the necessary sequencing circuits is constructed. The simulation software used in the analysis and design of the systolic algorithms during the making of this thesis is also briefly reviewed.

Finally, Chapter 5 concludes the research by discussing the advantages and disadvantages of the proposed algorithms. A table is provided for comparing the time complexities of the various algorithms that use the new method. Possible improvements and modifications to the current algorithms and their feasibility in other types of architectures are also investigated.
Chapter 2
1-D Systolic Array
2.1 The Structure of the 1-D Systolic Array
In a one-dimensional systolic array, all the processing elements are laid out in a linear fashion, which is why it is often referred to as a linear systolic array. It can also be viewed as a string of interconnected identical processing elements in which data is passed along from one end of the array to the other. Based on the manner in which the input data is entered into the array, systolic arrays can be categorized into two major classes: systolic arrays with global data broadcasting and those with non-global data broadcasting [9]. In systolic arrays with global data broadcasting, copies of the same data are broadcast to several processing elements of the array simultaneously. An example of this type of systolic array is shown in FIG. 2-1 [9]. The second class of systolic array is the one with non-global data broadcasting. Here, data is entered into a single processing element only, from which it can be passed on to the other processors in the array. An example of this type of array is shown in FIG. 2-2 [10]. Trade-offs between these two models are discussed in the following sections of this chapter.
2.2 1-D Systolic Array for Matrix and Vector Multiplication
A popular application of the linear systolic array is the matrix and vector multiplication operation, which is the main topic of this thesis. First, some existing architectures designed for this purpose are reviewed in the following subsection. The next subsection covers in detail the architecture that we use throughout this thesis.

[Figure 2-1: Linear Systolic Array for Convolution with Global Data Broadcast. Each processing element holds a weight w and computes Yout <-- Yin + w.Xin.]
[Figure 2-2: Linear Systolic Array for an N odd linear phase FIR digital filter with Non-Global Data Broadcast. Each processing element computes Xout(n) = Xin(n-1), Yout(n) = Yin(n-3), Zout(n) = Zin(n-2) + a[Xin(n-2) + Yin(n-2)].]
2.2.1 Some Existing Architectures 
Many systolic array architectures have already been designed for matrix and vector multiplication. They differ from one another in the way the matrix and vector elements are entered into the array and in their computational complexity. Some designs also require that the input matrix be broken up into submatrices before the computation. Two examples of these array architectures are shown in FIG. 2-3 [2] and FIG. 2-4 [11]. The first architecture is quite simple in the sense that it does not need any complex arrangement of the matrix elements to create the necessary array input data sequence. Each processor's task is to output a new value of y to the top left line, obtained by adding the y from the top right line to the product of a from the top line and x from the bottom left line, i.e. y = y + a.x. The algorithm also specifies that only odd-numbered processors are activated on odd-numbered clock cycles, and only even-numbered processors are activated on even-numbered clock cycles. For an N x N matrix whose band of diagonal elements has width w, all N components of y can be computed after 2N + w clock cycles.

The second architecture is somewhat similar to the first in terms of its I/O bandwidth and interprocessor communication. It is designed to solve the equation Y = AX + B, where A is an N by M matrix and X and B are vectors of dimensions M and N respectively. The generation of the data sequence for this architecture is governed by algorithms employed exclusively to partition the matrix elements into sequences appropriate for array input. Since this architecture is shown only as an example, details of these algorithms are not explained here.
2.2.2 Proposed Architecture 
The architecture employed throughout this thesis is the one shown in FIG. 2-5. It was initially proposed by Kung [9] for convolution and pattern matching. In this thesis, we use it not merely for convolution but also for solving Toeplitz systems, signal filtering, and ordinary matrix-vector multiplication. In this architecture, the two data sets are fed synchronously; the matrix and the vector elements are entered at opposite sides, starting from the leftmost and the rightmost processor respectively. At every cycle, each set of data is propagated to the next processor according to the data flow. The number of processing elements in the array corresponds to the size, i.e. the number of entries, of the resulting vector that is to be computed. Each processing element performs a simple multiply-add-and-store operation: it multiplies the two input data, adds the result to the current content of the buffer, and then replaces the old content with the new value. After all cycles are completed, the final result is contained in the processing element registers, each of which corresponds to one entry (row) of the resulting vector. For its processors, this design can make efficient use of the many available multiplier-accumulator hardware components.

[Figure 2-3: Linear Systolic Array for Matrix - Vector Multiplication]
[Figure 2-4: Linear Systolic Array for Matrix - Vector Multiplication and Addition. Prearranged elements of matrix A enter the array; each node computes NW = NE + SW.N and SE = SW.]
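
The multiply-add-and-store operation of a single processing element can be sketched as follows. The class name, the method name, and the use of `None` to mark an empty input slot are assumptions made for illustration; the thesis itself specifies only that the register accumulates the product and that both inputs are passed on unchanged.

```python
class ProcessingElement:
    """One cell of the linear array: a register plus a multiply-accumulate."""

    def __init__(self):
        # The register starts at 0 before any computation.
        self.acc = 0

    def step(self, x_in, y_in):
        """Accumulate x_in * y_in when both inputs are present, then forward them.

        `None` stands for an empty slot (an assumption of this sketch).
        """
        if x_in is not None and y_in is not None:
            self.acc += x_in * y_in
        # Both inputs leave the cell unchanged: Xout = Xin, Yout = Yin.
        return x_in, y_in


pe = ProcessingElement()
pe.step(2, 3)      # accumulates 6
pe.step(None, 5)   # only one input present, nothing is accumulated
pe.step(4, 5)      # accumulates 20
assert pe.acc == 26
```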
As briefly mentioned earlier, this architecture is of the type with non-global data broadcasting. It is selected over the other existing architectures mainly because of its efficiency in solving matrix and vector multiplication problems. It is also selected because of its structural simplicity, in the sense that it does not require a specialized bus for providing input data to the array, and it does not require any complex wiring in an actual implementation even when the number of processors is large.
[Figure 2-5: Proposed Linear Systolic Array for Matrix-Vector Multiplication. Processing elements PE(0) to PE(n) are connected in a chain; X0, X1, X2, X3, ... enter from the left and Y0, Y1, Y2, Y3, ..., Yn enter from the right. Each processing element computes Xout = Xin, Yout = Yin, PEout = PEout + Xin.Yin.]
2.3 Geometrical Model for The Data Flow Representation 
In this section, a model for representing the flow of data inside the proposed systolic array is introduced. This model is helpful not only for verifying results obtained from applying different input data sequences to the array but, more importantly, for determining the optimality of an algorithm. As a first example, consider FIG. 2-6. In this model, the array is pictured as rows of squares, where each row represents the same systolic array at a different stage of the execution. A square represents a processing element, and a shaded square represents a processing element where a product of the data set is formed. For ease of reference, the four processing elements in FIG. 2-6 are numbered from PE0 to PE3. The current contents of each processor are shown above the corresponding square. Execution starts at clock 0, when X0 enters PE0 from the left and, simultaneously, Y0 enters PE3 from the right. At the next clock cycle (clock 1), X0 moves right to PE1 and Y0 moves left to PE2, while a new pair of data, X1 and Y1, enters PE0 from the left and PE3 from the right respectively. At clock 2, X1 arrives at PE1 at the same moment as Y0 does. As a result, the register of PE1 accumulates the product of the two elements (footnote 1). A similar situation occurs at PE2 in this cycle, where the product of X0 and Y1 is accumulated. Finally, at the end of the computation (in this case at clock 4), the following products have been accumulated: X1.Y0 at PE1 and X0.Y1 + X1.Y2 at PE2, while the remaining processing elements (i.e. PE0 and PE3) are empty; in other words, they accumulate no products. The final values stored inside PE0 to PE3 correspond to entries 0 to 3 of the resulting vector respectively.

[Figure 2-6: Sample data flow model for the Linear Systolic Array. Rows correspond to clocks 0 to 4 and columns to PE0 to PE3.]
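
The walk-through above can be reproduced with a short trace. This is a sketch under stated assumptions: the function name is hypothetical, and the entry clocks (X0 and X1 at clocks 0 and 1 on the left; Y0, Y1 and Y2 at clocks 0, 1 and 2 on the right) are inferred from the prose rather than read from the figure. Each clock, the X stream shifts one cell to the right and the Y stream one cell to the left, and a product is recorded wherever both are present.

```python
def trace_products(n, left, right, cycles):
    """List, for each PE, the (x, y) pairs whose product it accumulates.

    left / right map an entry clock to the label entering at PE0 / PE(n-1).
    """
    xs = [None] * n  # X element currently held in each PE
    ys = [None] * n  # Y element currently held in each PE
    products = [[] for _ in range(n)]
    for clock in range(cycles):
        xs = [left.get(clock)] + xs[:-1]    # X stream moves one PE to the right
        ys = ys[1:] + [right.get(clock)]    # Y stream moves one PE to the left
        for pe in range(n):
            if xs[pe] is not None and ys[pe] is not None:
                products[pe].append((xs[pe], ys[pe]))
    return products


left = {0: "X0", 1: "X1"}
right = {0: "Y0", 1: "Y1", 2: "Y2"}
result = trace_products(4, left, right, cycles=5)
assert result == [[], [("X1", "Y0")], [("X0", "Y1"), ("X1", "Y2")], []]
```

Under these assumptions the trace agrees with the text: PE1 holds X1.Y0, PE2 holds X0.Y1 + X1.Y2, and PE0 and PE3 stay empty. Pairs such as X0 and Y0 pass each other between two cells and therefore never form a product.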
1 Prior to any computation, all processing element registers must be initialized to 0.
As mentioned earlier, the optimality of an algorithm can be approximated using the data flow diagram. This is done by examining the distribution of the shaded areas across the entire diagram. An algorithm can be considered optimum when there is a dense distribution of shaded areas, implying that many products are formed per unit time and thus yielding a high utilization factor of the systolic array. On the other hand, a scattered or sparse distribution of shaded areas indicates the opposite, namely a low utilization of the systolic array. In addition, the distribution of the shaded areas can aid in improving the algorithm, by finding shaded areas that can be merged together to increase the density of the diagram.
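
One way to put a number on the density of shaded squares is sketched below. The thesis uses the utilization factor qualitatively; the definition here (shaded squares divided by the total number of squares in the diagram) and the function name are assumptions of this sketch. It relies on the data flow described above: an X element entering at clock cx sits in PE p at clock cx + p, and a Y element entering at clock cy sits in PE p at clock cy + (N - 1 - p), so the two meet in PE p = (cy - cx + N - 1) / 2 when that value is an integer inside the array.

```python
def utilization(n, left_clocks, right_clocks, cycles):
    """Fraction of (PE, clock) squares in which a product is formed."""
    shaded = 0
    for cx in left_clocks:
        for cy in right_clocks:
            twice_pe = cy - cx + n - 1
            if twice_pe % 2 != 0:
                continue  # the two elements cross between cells and never meet
            pe = twice_pe // 2
            # Count the meeting only if it happens inside the array and in time.
            if 0 <= pe < n and cx + pe < cycles:
                shaded += 1
    return shaded / (n * cycles)


# Entry clocks inferred from the FIG. 2-6 walk-through: 3 shaded squares
# out of 4 PEs x 5 clocks.
u = utilization(4, left_clocks=[0, 1], right_clocks=[0, 1, 2], cycles=5)
assert abs(u - 0.15) < 1e-12
```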
2.4 Design of Systolic Algorithms
Having introduced the model for the data flow representation, we can now start designing some systolic algorithms for the matrix and vector product. In designing these algorithms, the following parameters are considered: velocity of the data flow, spatial distribution of the data, and periods of computation [12]. The resulting algorithms deal mostly with rearranging the elements of the input matrix and vector into appropriate data sequences, so that the product vector can be computed in the least number of clock cycles. Systolic algorithms for different cases of matrix and vector multiplication problems are provided below. These algorithms are not described in great detail; they are shown merely for comparison with the proposed algorithms explained in the next chapter. For a better understanding of the procedures followed in deriving them, one can refer to Chapter 3.
2.4.1 Matrix and Vector Product Algorithms 
This section presents an algorithm for calculating the matrix and vector product by specifying which matrix and vector elements should be entered at which clock cycles. Details are provided below.
Algorithm A: Systolic Multiplication of matrix $M = [X_{(i+jN)}]$ and vector $[Y_j]$ with $0 \le i < N$, $0 \le j < N$.
1. There are N processing elements connected in the linear systolic array.
2. Each processing element in the array accumulates products of data elements from the Left and the Right array inputs.
3. The following input sequences are used:
a. At clocks $2s + 2N\lfloor k/2 \rfloor + (k+1) \bmod 2$, Left Input $\leftarrow X_{kN-1-s}$ with $0 \le s < N$, $0 \le k < N$.
b. At clocks $t + N - 1 + 2kN$, Right Input $\leftarrow Y_{2k+t}$ with $0 \le k < \lceil N/2 \rceil$, $0 \le t < 2$.
4. At the end of $N^2$ cycles, the i-th processing element of the array will hold the i-th component of the product vector, for even N. For odd N, it takes $N^2 + N - 1$ cycles.
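
As a reference for what the array is expected to compute, the indexing M = [X(i+jN)] can be unpacked in a few lines. This is a direct (non-systolic) sketch with hypothetical function names: entry (i, j) of the matrix is element i + jN of the sequence X, so the matrix is stored column by column, and component i of the product is the sum over j of X(i+jN) times Yj.

```python
def matrix_from_sequence(x, n):
    """Build the N x N matrix whose entry (i, j) is x[i + j * n]."""
    return [[x[i + j * n] for j in range(n)] for i in range(n)]


def direct_product(x, y):
    """Component i of the product vector: sum over j of x[i + j * n] * y[j]."""
    n = len(y)
    return [sum(x[i + j * n] * y[j] for j in range(n)) for i in range(n)]


x = [1, 2, 3, 4]  # hypothetical matrix sequence for N = 2
y = [10, 100]     # hypothetical vector
assert matrix_from_sequence(x, 2) == [[1, 3], [2, 4]]
assert direct_product(x, y) == [310, 420]
```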
2.4.2 Toeplitz Matrix and Vector Product Algorithms 
Algorithm B: Systolic Multiplication of Toeplitz matrix $M = [X_{(N-1-i+j)}]$ and vector $[Y_j]$ with $0 \le i < N$, $0 \le j < N$.
1. There are N processing elements connected in the linear systolic array.
2. Each processing element in the array accumulates products of data elements from the Left and the Right array input, provided that those two inputs are of different types.
3. Following input sequences are used;
a. At clocks $2s$, Left Input $\leftarrow X_s$ with $0 \le s < N$.
b. For N even, at clocks $2s + N - 1$, or for N odd, at clocks $2s + N + 1$, Left Input $\leftarrow X_{N+s}$ with $0 \le s < N - 1$.
c. At clocks $2t + N - 1$, Right Input $\leftarrow Y_t$ with $0 \le t < N$.
d. For N even, at clocks $2t$, or for N odd, at clocks $2t + 1$, Right Input $\leftarrow Y_{t+1}$ with $0 \le t < N - 1$.
4. At the end of $(3N - 2)$ cycles, the i-th processing element of the array will hold the i-th component of the product vector.
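
The even-N case of Algorithm B can be exercised with a small simulation. This is a sketch, not the thesis's simulator: the function names are hypothetical, empty slots are modeled with `None`, and the timing follows the data flow model of Section 2.3 (left inputs enter PE0 and move right, right inputs enter the last PE and move left, one cell per clock). Only the even-N schedule is covered here.

```python
def run_array(n, left, right, cycles):
    """Simulate the linear array; left/right map entry clocks to values."""
    acc = [0] * n
    xs = [None] * n
    ys = [None] * n
    for clock in range(cycles):
        xs = [left.get(clock)] + xs[:-1]    # left stream shifts right
        ys = ys[1:] + [right.get(clock)]    # right stream shifts left
        for pe in range(n):
            if xs[pe] is not None and ys[pe] is not None:
                acc[pe] += xs[pe] * ys[pe]
    return acc


def algorithm_b_even(x, y):
    """Input schedule of Algorithm B for even N (x has 2N - 1 elements)."""
    n = len(y)
    assert n % 2 == 0 and len(x) == 2 * n - 1
    left = {2 * s: x[s] for s in range(n)}
    left.update({2 * s + n - 1: x[n + s] for s in range(n - 1)})
    right = {2 * t + n - 1: y[t] for t in range(n)}
    right.update({2 * t: y[t + 1] for t in range(n - 1)})
    return run_array(n, left, right, cycles=3 * n - 2)


def toeplitz_product(x, y):
    """Direct product with M[i][j] = x[n - 1 - i + j]."""
    n = len(y)
    return [sum(x[n - 1 - i + j] * y[j] for j in range(n)) for i in range(n)]


x = [1, 2, 3, 4, 5, 6, 7]  # hypothetical Toeplitz sequence, N = 4
y = [1, 2, 3, 4]           # hypothetical vector
assert algorithm_b_even(x, y) == toeplitz_product(x, y)
```

In this model the schedule finishes within the stated 3N - 2 cycles: elements entered on even clocks only ever meet elements entered on odd clocks, which is what keeps unwanted products from forming.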
2.4.3 Cyclic Convolution Matrix and Vector Product Algorithms 
Algorithm C: Non-Commutative Systolic Multiplication of Cyclic matrix $M = [X_{(N-1-i+j) \bmod N}]$ and vector $[Y_j]$ with $0 \le i < N$, $0 \le j < N$.
1. There are N processing elements connected in the linear systolic array.
2. Each processing element in the array accumulates products of data elements from the Left and the Right array input, provided that those two inputs are of different types.
3. Following input sequences are used;
a. At clocks $2s$, Left Input $\leftarrow X_{s \bmod N}$ with $0 \le s < N$.
b. For N even, at clocks $2s + N - 1$, or for N odd, at clocks $2s + N + 1$, Left Input $\leftarrow X_{(N+s) \bmod N}$ with $0 \le s < N - 1$.
c. At clocks $2t + N - 1$, Right Input $\leftarrow Y_t$ with $0 \le t < N$.
d. For N even, at clocks $2t$, or for N odd, at clocks $2t + 1$, Right Input $\leftarrow Y_{t+1}$ with $0 \le t < N - 1$.
4. At the end of $(3N - 2)$ cycles, the i-th processing element of the array will hold the i-th component of the product vector.
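
Algorithm C differs from Algorithm B only in that every X index is reduced modulo N. A brief sketch, with hypothetical function names, shows why: the cyclic matrix is the Toeplitz matrix built from the periodically extended sequence in which X(N+s) = X(s).

```python
def cyclic_matrix(x):
    """Cyclic matrix with entry (i, j) equal to x[(n - 1 - i + j) mod n]."""
    n = len(x)
    return [[x[(n - 1 - i + j) % n] for j in range(n)] for i in range(n)]


def toeplitz_matrix(x_ext, n):
    """Toeplitz matrix with entry (i, j) equal to x_ext[n - 1 - i + j]."""
    return [[x_ext[n - 1 - i + j] for j in range(n)] for i in range(n)]


x = [1, 2, 3, 4]     # hypothetical sequence, N = 4
x_ext = x + x[:-1]   # periodic extension to 2N - 1 elements
assert cyclic_matrix(x) == toeplitz_matrix(x_ext, len(x))
```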
2.4.4 m-diagonal Matrix and Vector Product Algorithms 
Algorithm D1: Systolic Multiplication of m-diagonal matrix $M = [X_{(\frac{m-1}{2} - i + j)}]$ and vector $[Y_j]$ with $0 \le i < N$, $0 \le j < N$, and $N \le m < 2N$.
1. There are N processing elements connected in the linear systolic array.
2. Each processing element in the array accumulates products of data elements from the Left and the Right array input, provided that those two inputs are of different types.
3. Following input sequences are used;
a. At clocks $2s$, Left Input $\leftarrow X_s$ with $0 \le s < \frac{m+1}{2}$.
b. At clocks $2m - 2 + 2s$, Left Input $\leftarrow X_{s+\frac{m+1}{2}}$ with $0 \le s < \frac{m-1}{2}$.
c. For N even, at clocks $2t$, and for N odd, at clocks $2t + 1$, Right Input $\leftarrow Y_{t+1}$ with $0 \le t < N - 1$.
d. For N even, at clocks $2t + 1$, or for N odd, at clocks $2t + 2$, Right Input $\leftarrow Y_t$ with $0 \le t < N$.
4. At the end of $(N + m - 1)$ cycles, the i-th processing element of the array will hold the i-th component of the product vector.
Algorithm D2: Systolic Multiplication of m-diagonal matrix $M = [X_{(\frac{m-1}{2} - i + j)}]$ and vector $[Y_j]$ with $0 \le i < N$, $0 \le j < N$, and $0 \le m < N$.
1. There are N processing elements connected in the linear systolic array.
2. Each processing element in the array accumulates products of data elements from the Left and the Right array input, provided that those two inputs are of different types.
3. The following input sequences are used:
   a. At clocks $2t$, Right Input $\leftarrow Y_t$ with $0 \le t < \lceil N/2 \rceil$.
   b. At clocks $2\lceil N/2 \rceil + 1 + 2t$, Right Input $\leftarrow Y_{t + \lceil N/2 \rceil}$ with $0 \le t < \lfloor N/2 \rfloor$.
   c. For N even, at clocks $2s$, and for N odd, at clocks $2s + 1$, Left Input $\leftarrow X_s$ with $0 \le s < m$.
   d. For N even, at clocks $2s + 1$, or for N odd, at clocks $2s + 2$, Left Input $\leftarrow X_s$ with $0 \le s < m$.
4. At the end of $(N + m)$ cycles, the i-th processing element of the array will hold the i-th component of the product vector.
Chapter 3
Design of Commutative Algorithms

3.1 On Commutativity of the Systolic Algorithms
Chapter 2 presented a variety of well-explored algorithms for computing the matrix and vector product by sliding the inner products. With regard to efficiency, however, these algorithms are still far from optimal. By observing the output pattern of the processing elements in the data flow graphs, one can easily see that in most cases only one or two products are formed at every clock cycle. This results in a low utilization factor of the array. Since a given linear transformation problem involves a fixed number of multiply-accumulate operations (irrespective of the algorithm used), a low number of operations per clock cycle implies a greater delay for the algorithm.

Improving the optimality of the algorithms (i.e. minimizing the delay) is the main subject of this thesis. We do this by exploiting the commutativity of the elementary operation taking place in each processing element. In the algorithms available in the literature, the data elements and the problem constants (the matrix elements) enter the systolic array from two different directions. However, since both multiplication and accumulation are commutative operations, it makes no difference if the two operands are switched in place. X-matrix and Y-vector elements can therefore be entered from both sides of the array rather than from just one predetermined direction. This implies that any input data sequence to the array can contain a combination of X-matrix and Y-vector elements.
In order to avoid forming a product of two elements of the same type at a processor (i.e. X-elements with X's or Y-elements with Y's), we propose appending an extra bit to each data element entering the array. This extra bit functions as a tag that identifies the data type or the data origin. Since the array takes only two types of data elements (X and Y), a single tag bit serves the purpose. With this simple method, the processing element needs only a slight modification, which will most likely involve an XOR gate to check the tag bits. Only if the tag bits of the two operands available to a processing element differ is a product formed and accumulated; otherwise the operands are ignored. The addition of the tag bit also implies one additional data bit of communication between the modules. But since communication in systolic arrays is regular and nearest-neighbor, this is not expected to cause problems.
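The tag check described above can be sketched in a few lines of Python. This is a minimal behavioral model, not the hardware design of the thesis; the names `Tagged` and `ProcessingElement`, and the encoding of the tag bit (0 for X, 1 for Y), are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tagged:
    value: int
    tag: int  # assumed encoding: 0 = X (matrix) element, 1 = Y (vector) element


class ProcessingElement:
    """Accumulates left*right only when the two tag bits differ."""

    def __init__(self) -> None:
        self.acc = 0

    def step(self, left: Optional[Tagged], right: Optional[Tagged]) -> None:
        # An input of None models an idle clock on that side.
        if left is None or right is None:
            return
        # XOR of the tag bits is 1 exactly when one operand is X and the other Y.
        if left.tag ^ right.tag:
            self.acc += left.value * right.value


pe = ProcessingElement()
pe.step(Tagged(3, 0), Tagged(4, 1))  # X meets Y: product is accumulated
pe.step(Tagged(5, 0), Tagged(6, 0))  # X meets X: ignored
pe.step(Tagged(2, 1), Tagged(7, 0))  # Y meets X: order does not matter
print(pe.acc)
```

Because multiplication is commutative, the element does not need to know which side carried the X operand; only the XOR of the tags matters.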
3.2 Matrix and Vector Product Algorithms 
This section gives the mathematical description of all the data sequences that must be entered into the array to compute a matrix-vector product. Specifically, the input data sequences state which elements should be entered from which direction at which clock cycle. For ease of reference, throughout this section we use the following ordering of the given matrix M. The i-th element of the product vector M.Y that is to be calculated is given by

$$\sum_{j=0}^{N-1} M(i,j)\,Y_j = \sum_{j=0}^{N-1} X_{i+jN}\,Y_j .$$

All the matrix elements are considered independent. The standard (non-commutative) systolic algorithm for this problem is known to require $N^2$ clock cycles. With the commutative algorithms below, this delay can be reduced significantly.
Algorithm 1a: Commutative Systolic Multiplication of matrix $M = [X_{(i+jN)}]$ and vector $[Y_j]$ with $0 \le i < N$, $0 \le j < N$ and N even.
1. There are N processing elements connected in the linear systolic array.
2. Each processing element in the array accumulates products of data elements from the Left and the Right array input, provided that those two inputs are of different types.
3. The following input sequences are used:
   a. At clocks $s(N+1) + (N-1)$, Left Input $\leftarrow Y_s$ with $0 \le s < N/2$.
   b. At clocks $2s + r(N+1)$, Left Input $\leftarrow X_{N^2-1-Nr-s}$ with $0 \le r < N/2$ and $0 \le s < N$.
   c. At clocks $t(N+1) + (N-1)$, Right Input $\leftarrow Y_{N-1-t}$ with $0 \le t < N/2$.
   d. At clocks $2t + r(N+1)$, Right Input $\leftarrow X_{Nr+t}$ with $0 \le r < N/2$ and $0 \le t < N$.
4. At the end of $(N^2 + 3N - 4)/2$ cycles, the i-th processing element of the array will hold the i-th component of the product vector.
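As an informal check of the input schedule of Algorithm 1a, the following Python sketch generates the events of steps 3a to 3d and accumulates a product whenever an X and a Y element occupy the same processor. It assumes the timing model used in the proof of this section: an element entering on the left at clock c is in processor l at clock c + l, and an element entering on the right at clock c is in processor l at clock c + (N-1) - l. The function names `schedule_1a` and `simulate` are illustrative and do not come from the thesis.

```python
def schedule_1a(n):
    """Input events (clock, kind, index) of Algorithm 1a for even n."""
    assert n % 2 == 0
    left, right = [], []
    for s in range(n // 2):                       # step 3a
        left.append((s * (n + 1) + (n - 1), "Y", s))
    for r in range(n // 2):                       # step 3b
        for s in range(n):
            left.append((2 * s + r * (n + 1), "X", n * n - 1 - n * r - s))
    for t in range(n // 2):                       # step 3c
        right.append((t * (n + 1) + (n - 1), "Y", n - 1 - t))
    for r in range(n // 2):                       # step 3d
        for t in range(n):
            right.append((2 * t + r * (n + 1), "X", n * r + t))
    return left, right


def simulate(n, left, right, x, y):
    """Return (accumulators, cycle count) under the assumed timing model."""
    acc = [0] * n
    cycles = 0
    for cl, kind_l, idx_l in left:
        for cr, kind_r, idx_r in right:
            twice_l = (n - 1) - (cl - cr)         # meeting condition: 2l
            if kind_l == kind_r or twice_l % 2 or not 0 <= twice_l < 2 * n:
                continue                          # same type, or no meeting
            l = twice_l // 2
            xi, yi = (idx_l, idx_r) if kind_l == "X" else (idx_r, idx_l)
            acc[l] += x[xi] * y[yi]
            cycles = max(cycles, cl + l + 1)      # clock of this meeting, plus one
    return acc, cycles


n = 4
x = list(range(1, n * n + 1))                     # X_0 .. X_15
y = [1, -2, 3, -4]
result, cycles = simulate(n, *schedule_1a(n), x, y)
print(result, cycles)
```

The pairwise loop is a shortcut for stepping the array clock by clock; it finds the same meetings because each pair of elements can share a processor at most once.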
A typical execution of this algorithm is shown in FIG. 3-1. The figure displays the complete data flow in the array at every clock cycle for the multiplication of a 4 X 4 matrix and a 4 X 1 vector. A mathematical proof of the algorithm is presented below. The proof shows that, by following the above algorithm, the correct products of X and Y elements are formed at the right location and at the right time.
[Figure 3-1: Matrix and Vector multiplication for N = 4. Data flow in the four-processor array at clocks 0 through 11, for the matrix X with rows (X0 X4 X8 X12), (X1 X5 X9 X13), (X2 X6 X10 X14), (X3 X7 X11 X15) and the vector Y = (Y0, Y1, Y2, Y3).]
It is not difficult to predict which X component will meet which Y component, and when and where such a meeting will occur. Suppose the array has N processors. If a matrix element $X_i$ enters the left side of the array at time s, then $X_i$ will be in processor l at time $s + l$. Similarly, if a vector element $Y_j$ enters the right side of the array at time t, then $Y_j$ will be in processor l at time $t + (N-1) - l$. Therefore, for a product of $X_i$ and $Y_j$ to be formed in processor l, the necessary condition is

$$s + l = t + (N-1) - l, \quad \text{or} \quad s - t = (N-1) - 2l. \tag{2.1}$$

Further, since $0 \le l < N$, a meeting of X and Y elements released at times s and t respectively is possible only if

$$-(N-1) \le s - t < N. \tag{2.2}$$

The same conditions hold for any pair of elements entering from the left at time s and from the right at time t, whichever of the two is the X element. Expressions (2.1) and (2.2) are referenced throughout this chapter for the sake of consistency and simplicity.
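Conditions (2.1) and (2.2) translate directly into a small helper. The sketch below is illustrative; the function name `meeting_processor` is an assumption, and the sample entry clocks are taken from the N = 4 input sequences of Algorithm 1a.

```python
def meeting_processor(s, t, n):
    """Processor index l where an element entering the left side at clock s
    meets an element entering the right side at clock t, or None if the two
    never share a processor. Solves s - t = (n - 1) - 2l for l."""
    twice_l = (n - 1) - (s - t)
    if twice_l % 2 != 0:          # the elements pass between two processors
        return None
    l = twice_l // 2
    return l if 0 <= l < n else None


# Y0 enters the left at clock 3; X0 and X1 enter the right at clocks 0 and 2.
print(meeting_processor(3, 0, 4))
print(meeting_processor(3, 2, 4))
# Two elements entering at the same clock never meet when n is even.
print(meeting_processor(0, 0, 4))
```

Returning `None` for an odd value of (N-1) - (s-t) captures the parity argument used later in the Toeplitz proofs.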
Algorithm-1a can now be proved by showing that the l-th processor indeed accumulates the result

$$\sum_{j=0}^{N-1} X_{l+jN}\,Y_j. \tag{2.3}$$

The following two cases of X and Y meetings may be considered.

Case 1: (X elements released from right)
Consider $Y_s$ input from the left at clocks $s(N+1)+(N-1)$ (step 3a) and $X_{Nr+t}$ input from the right at clocks $2t+r(N+1)$ (step 3d), with $0 \le s, r < N/2$ and $0 \le t < N$. In this case, eqn. (2.2) gives the following necessary condition for a meeting of X and Y elements:

$$-(N-1) \le [\,s(N+1)+(N-1)\,] - [\,2t+r(N+1)\,] \le (N-1),$$

or,

$$2t - 2(N-1) \le (s-r)(N+1) \le 2t. \tag{2.4}$$

Also from (2.1), one gets

$$[\,s(N+1)+(N-1)\,] - [\,2t+r(N+1)\,] = (N-1) - 2l,$$

or,

$$(s-r)(N+1) = 2(t-l). \tag{2.5}$$

Equation (2.5) gives $(s-r) \bmod 2 = 0$, because N is even and N+1 is therefore odd. One can now prove that this implies s = r. If this were not true, the smallest positive value of s - r would be 2, and the right-hand side of (2.4) would give $2(N+1) \le 2t$. This is impossible, because $t < N$, or $2t < 2N$. Similarly, if s - r is negative, its largest value is -2, and the left-hand side of inequality (2.4) would give $2t - 2(N-1) \le -2(N+1)$, or $2t \le -4$, which is again impossible, because $0 \le t$. Thus s - r must be 0 and consequently, from (2.5), t = l. We therefore find that $Y_s$ meets $X_{Nr+t}$ only if r = s and t = l. So one concludes that $Y_s$ indeed meets $X_{l+Ns}$ in processor l for $0 \le s < N/2$, as demanded by eqn. (2.3).
Case 2: (X elements released from left)
Consider $X_{N^2-1-rN-s}$ and $Y_{N-1-t}$ entered into the array at times $2s+r(N+1)$ and $t(N+1)+(N-1)$, from the left and right respectively, with $0 \le r, t < N/2$ and $0 \le s < N$. These are inputs 3b and 3c of Algorithm-1a. In this case, eqn. (2.2) gives

$$-(N-1) \le [\,2s+r(N+1)\,] - [\,t(N+1)+(N-1)\,] \le (N-1),$$

or,

$$-2s \le (r-t)(N+1) \le 2(N-1) - 2s. \tag{2.6}$$

Also from (2.1), one gets

$$[\,2s+r(N+1)\,] - [\,t(N+1)+(N-1)\,] = (N-1) - 2l,$$

or,

$$(r-t)(N+1) = 2(N-1) - 2(l+s). \tag{2.7}$$

As in the earlier case, since N is even, eqn. (2.7) implies that $(r-t) \bmod 2 = 0$. One can now also show that r - t = 0. If (r - t) > 0, then $2 \le (r-t)$ and the right-hand side of inequality (2.6) gives $2(N+1) \le 2(N-1) - 2s$, or $2s \le -4$, which is impossible since $0 \le s$. Similarly, if (r - t) < 0, then $(r-t) \le -2$ and the left-hand side of (2.6) gives $-2s \le -2(N+1)$, or $N+1 \le s$. This is again impossible, because $s \le N-1$. Thus one concludes that r - t = 0 and, from (2.7), $l = N-1-s$. Consequently, if N - 1 - t is denoted by p, then $l + pN = l + N^2 - N - Nr = N^2 - 1 - Nr - s$. Thus $Y_p$ meets $X_{l+pN} = X_{N^2-1-Nr-s}$ in processor l, as required by eqn. (2.3). Further, from the range of t, $0 \le t < N/2$, one finds that only those $Y_p$'s take part in these meetings for which $N/2 \le p < N$.
One can now see that both cases above generate products satisfying (2.3). Also, Case 1 computes these products for $0 \le j < N/2$ and Case 2 for $N/2 \le j < N$, where j is the index of Y. Thus the two cases cover the complete range of j in eqn. (2.3) without any overlap. One can therefore conclude that Algorithm-1a does indeed give the desired matrix and vector product.

Whereas Algorithm-1a is useful for even N, the following algorithm is provided for odd N.
Algorithm 1b: Commutative Systolic Multiplication of matrix $M = [X_{(i+jN)}]$ and vector $[Y_j]$ with $0 \le i < N$, $0 \le j < N$, and N odd.
1. There are N processing elements connected in the linear systolic array.
2. Each processing element in the array accumulates products of data elements from the Left and the Right array input, provided that those two inputs are of different types.
3. The following input sequences are used:
   Let $x = 2\lfloor \lceil N/2 \rceil / 2 \rfloor$,
   a. At clocks $(2s+1) + rN + 2\lfloor r/2 \rfloor$, Left Input $\leftarrow X_{N^2-1-rN-s}$ with $0 \le r < x$ and $0 \le s < N$.
   b. At clocks $(N-1) + sN + 2\lceil s/2 \rceil$, Left Input $\leftarrow Y_s$ with $0 \le s < (N-x)$.
   c. At clocks $2t + rN + 2\lceil r/2 \rceil$, Right Input $\leftarrow X_{rN+t}$ with $0 \le r < (N-x)$ and $0 \le t < N$.
   d. At clocks $N + tN + 2\lfloor t/2 \rfloor$, Right Input $\leftarrow Y_{N-1-t}$ with $0 \le t < x$.
4. At the end of $(N^2 + 4N - 3)/2$ cycles, the i-th processing element of the array will hold the i-th component of the product vector.
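The odd-N schedule can be checked in the same informal way. The Python sketch below builds the events of steps 3a to 3d of Algorithm 1b and pairs them under the assumed timing model (a left element entering at clock c is in processor l at clock c + l, a right element at clock c + (N-1) - l). The names `schedule_1b` and `simulate` are illustrative only; the floor and ceiling terms are written with integer division.

```python
def schedule_1b(n):
    """Input events (clock, kind, index) of Algorithm 1b for odd n."""
    assert n % 2 == 1
    x_blocks = 2 * (((n + 1) // 2) // 2)          # x = 2*floor(ceil(N/2)/2)
    left, right = [], []
    for r in range(x_blocks):                     # step 3a
        for s in range(n):
            left.append((2 * s + 1 + r * n + 2 * (r // 2), "X",
                         n * n - 1 - r * n - s))
    for s in range(n - x_blocks):                 # step 3b, ceil(s/2) = (s+1)//2
        left.append(((n - 1) + s * n + 2 * ((s + 1) // 2), "Y", s))
    for r in range(n - x_blocks):                 # step 3c
        for t in range(n):
            right.append((2 * t + r * n + 2 * ((r + 1) // 2), "X", r * n + t))
    for t in range(x_blocks):                     # step 3d
        right.append((n + t * n + 2 * (t // 2), "Y", n - 1 - t))
    return left, right


def simulate(n, left, right, x, y):
    """Return (accumulators, cycle count) under the assumed timing model."""
    acc = [0] * n
    cycles = 0
    for cl, kind_l, idx_l in left:
        for cr, kind_r, idx_r in right:
            twice_l = (n - 1) - (cl - cr)
            if kind_l == kind_r or twice_l % 2 or not 0 <= twice_l < 2 * n:
                continue
            l = twice_l // 2
            xi, yi = (idx_l, idx_r) if kind_l == "X" else (idx_r, idx_l)
            acc[l] += x[xi] * y[yi]
            cycles = max(cycles, cl + l + 1)
    return acc, cycles


n = 5
x = list(range(1, n * n + 1))
y = [2, -1, 0, 3, 1]
print(simulate(n, *schedule_1b(n), x, y))
```

For N = 5 the left input order produced by `schedule_1b` is X24, X23, Y0, X22, X19, ..., which agrees with the left column of FIG. 3-2.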
A sample execution of this algorithm showing the data flow in the 
array for every clock cycle for a 5 by 5 matrix and vector multiplication is shown 
in FIG. 3-2. 
[Figure 3-2: Matrix and Vector multiplication for N = 5. Data flow in the five-processor array at clocks 0 through 20, for the matrix X with rows (X0 X5 X10 X15 X20), (X1 X6 X11 X16 X21), (X2 X7 X12 X17 X22), (X3 X8 X13 X18 X23), (X4 X9 X14 X19 X24) and the vector Y = (Y0, Y1, Y2, Y3, Y4).]

The validity of the proposed algorithm can be proved with an approach similar to that taken in the proof of Algorithm-1a. The two cases of X and Y element meetings are as follows:

Case 1: (X-elements released from left)
Here one has $X_{N^2-1-rN-s}$ input from the left at clocks $(2s+1) + rN + 2\lfloor r/2 \rfloor$ and $Y_{N-1-t}$ input from the right at clocks $N(t+1) + 2\lfloor t/2 \rfloor$, with $0 \le r, t < x$ and $0 \le s < N$. This is consistent with schemes 3a and 3d of the algorithm. In this case, applying eqn. (2.2) gives the following necessary condition for X and Y element meetings:

$$-(N-1) \le [\,(2s+1) + rN + 2\lfloor r/2 \rfloor\,] - [\,N(t+1) + 2\lfloor t/2 \rfloor\,] \le N-1,$$

or,

$$-2s \le N(r-t) + 2(\lfloor r/2 \rfloor - \lfloor t/2 \rfloor) \le 2(N-1-s). \tag{2.8}$$

Also from eqn. (2.1) one obtains

$$[\,(2s+1) + rN + 2\lfloor r/2 \rfloor\,] - [\,N(t+1) + 2\lfloor t/2 \rfloor\,] = (N-1) - 2l,$$

or,

$$N(r-t) + 2(\lfloor r/2 \rfloor - \lfloor t/2 \rfloor) = 2(N-1-l-s). \tag{2.9}$$

Equation (2.9) gives $(r-t) \bmod 2 = 0$, because N is odd. One can now prove that r = t. If this were not true, the smallest positive value of r - t would be 2, and the right-hand side of eqn. (2.8) would give $2N + 2 \le 2(N-1-s)$, or $s \le -2$. This is impossible, since it is known that $s \ge 0$. Similarly, if (r - t) < 0, then r - t is at most -2, and the left-hand side of eqn. (2.8) gives $-2s \le -2N - 2$, or $s \ge N+1$. This is also impossible, since $s < N$. Thus one concludes that r - t = 0 and, from (2.9), we have $l = N-1-s$. Consequently, if we let $p = N-1-t$, then we have $l + pN = l + N^2 - N - Nr = N^2 - 1 - Nr - s$. Thus $Y_p$ meets $X_{l+pN} = X_{N^2-1-Nr-s}$ in processor l, as required by (2.3). Also, by considering the range of t, $0 \le t < x$, one finds that only those $Y_p$'s take part in these meetings for which $N - 1 - x < p < N$.
Case 2: (X-elements released from right)
Let us consider the $Y_s$ sequence input from the left at clocks $(N-1) + sN + 2\lceil s/2 \rceil$ and $X_{rN+t}$ input from the right at clocks $2t + rN + 2\lceil r/2 \rceil$. This is consistent with schemes 3b and 3c of the algorithm. Now, from eqn. (2.2), we have the following condition for the meetings of X and Y components:

$$-(N-1) \le [\,(N-1) + sN + 2\lceil s/2 \rceil\,] - [\,2t + rN + 2\lceil r/2 \rceil\,] \le N-1,$$

or,

$$2t - 2(N-1) \le N(s-r) + 2(\lceil s/2 \rceil - \lceil r/2 \rceil) \le 2t. \tag{2.10}$$

From eqn. (2.1) we also have

$$[\,(N-1) + sN + 2\lceil s/2 \rceil\,] - [\,2t + rN + 2\lceil r/2 \rceil\,] = (N-1) - 2l,$$

or,

$$N(s-r) + 2(\lceil s/2 \rceil - \lceil r/2 \rceil) = 2(t-l). \tag{2.11}$$

Since N is odd, eqn. (2.11) implies that $(s-r) \bmod 2 = 0$. One can prove that this also implies s = r. If this were not so, the smallest positive value of s - r would be 2, and the right-hand side of (2.10) would give $2N + 2 \le 2t$, or $t \ge N+1$. This is impossible, because $t < N$. Similarly, if s - r is negative, then s - r is at most -2. With this assumption, the left-hand side of (2.10) gives $2t - 2(N-1) \le -2N - 2$, or $t \le -2$. This is again impossible, since $t \ge 0$. Thus s - r must equal 0 and consequently, from (2.11), t = l. Knowing this, we find that $Y_s$ meets $X_{Nr+t}$ only if r = s and t = l. Thus it can be concluded that $Y_s$ indeed meets $X_{l+Ns}$ in processor l for $0 \le s < N - x$, as demanded by eqn. (2.3).
Both cases of X and Y elements meeting to form a product in eqn. (2.3) have been successfully proved. The two cases also cover the necessary range specified in eqn. (2.3): in Case 1, products with $N - x \le j < N$ are computed, while in Case 2, products with $0 \le j < N - x$ are computed. Thus one can conclude that Algorithm-1b correctly generates the matrix and vector product for odd N.
3.3 Toeplitz Matrix and Vector Algorithms
Multiplication of a Toeplitz matrix and a vector is a fairly common operation in digital signal processing. A typical scenario where such a problem arises is the use of Wiener optimum linear filtering to separate a signal from a noisy background. The Wiener filter is a Toeplitz matrix, and it refines the signal vector through its product. It is not inconceivable to have a dedicated systolic array for such an application. Commutative systolic algorithms for Toeplitz matrix and vector multiplication are discussed in this section. These algorithms take advantage of the commutativity presented in the preceding section. The procedure and the format of the algorithms for generating the input data sequences of the array are similar to those of the full matrix product. The elements of the Toeplitz matrix M are labeled as follows:

$$M(i,j) = X_{N-1-i+j}, \qquad 0 \le i, j < N.$$

Please note that we do not assume that M is symmetric. Thus there are 2N-1 distinct X elements present in matrix M.

The following algorithm deals with an N X N Toeplitz matrix where N is even.

Algorithm 2a: Commutative Systolic Multiplication of Toeplitz matrix $M = [X_{(N-1-i+j)}]$ and vector $[Y_j]$ with $0 \le i < N$, $0 \le j < N$ and N even.
1. There are N processing elements connected in the linear systolic array.
2. Each processing element in the array accumulates products of data elements from the Left and the Right array input, provided that those two inputs are of different types.
3. The following input sequences are used:
   Let m = N/2,
   a. At clocks $2s + 1$, Left Input $\leftarrow X_{m+s}$ with $0 \le s < N$.
   b. At clocks $2s$, Left Input $\leftarrow X_s$ with $0 \le s < m$.
   c. At clocks $2s + N$, Left Input $\leftarrow Y_{N-1-s}$ with $0 \le s < m - 1$.
   d. At clocks $2t$, Right Input $\leftarrow Y_t$ with $0 \le t < N$.
   e. At clocks $2t + 1$, Right Input $\leftarrow X_{2(N-1)-t}$ with $0 \le t < m - 1$.
   f. At clocks $2t + N - 1$, Right Input $\leftarrow Y_t$ with $0 \le t < m$.
4. At the end of $(2.5N - 1)$ cycles, the i-th processing element of the array will hold the i-th component of the product vector.
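The six input sequences of Algorithm 2a can be exercised with a short Python sketch. It is an informal check rather than part of the thesis: the names `schedule_2a` and `simulate` are illustrative, and the sketch assumes the usual timing model of this chapter (a left element entering at clock c is in processor l at clock c + l, a right element at clock c + (N-1) - l).

```python
def schedule_2a(n):
    """Input events (clock, kind, index) of Algorithm 2a for even n."""
    assert n % 2 == 0
    m = n // 2
    left = [(2 * s + 1, "X", m + s) for s in range(n)]                 # 3a
    left += [(2 * s, "X", s) for s in range(m)]                        # 3b
    left += [(2 * s + n, "Y", n - 1 - s) for s in range(m - 1)]        # 3c
    right = [(2 * t, "Y", t) for t in range(n)]                        # 3d
    right += [(2 * t + 1, "X", 2 * (n - 1) - t) for t in range(m - 1)] # 3e
    right += [(2 * t + n - 1, "Y", t) for t in range(m)]               # 3f
    return left, right


def simulate(n, left, right, x, y):
    """Return (accumulators, cycle count) under the assumed timing model."""
    acc = [0] * n
    cycles = 0
    for cl, kind_l, idx_l in left:
        for cr, kind_r, idx_r in right:
            twice_l = (n - 1) - (cl - cr)
            if kind_l == kind_r or twice_l % 2 or not 0 <= twice_l < 2 * n:
                continue
            l = twice_l // 2
            xi, yi = (idx_l, idx_r) if kind_l == "X" else (idx_r, idx_l)
            acc[l] += x[xi] * y[yi]
            cycles = max(cycles, cl + l + 1)
    return acc, cycles


n = 4
x = [1, 2, 3, 4, 5, 6, 7]            # X_0 .. X_{2N-2}
y = [1, -1, 2, -2]
print(simulate(n, *schedule_2a(n), x, y))
```

Note that some Y elements are entered twice (steps 3d and 3f, or 3c and 3d); the tag check keeps the two copies from multiplying each other.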
A typical data flow in the array at every clock cycle for the multiplication of a 4 X 4 Toeplitz matrix and a 4 X 1 vector is shown in FIG. 3-3. The proof of the algorithm is given in the following paragraphs. It shows that if the data input sequences described in step 3 of the algorithm are followed, then each necessary X and Y product is formed at the right processor at the right time.

[Figure 3-3: Toeplitz Matrix and Vector multiplication for N = 4. Data flow in the four-processor array at clocks 0 through 8, for the matrix X with rows (X3 X4 X5 X6), (X2 X3 X4 X5), (X1 X2 X3 X4), (X0 X1 X2 X3) and the vector Y = (Y0, Y1, Y2, Y3).]

One proves the validity of Algorithm-2a by establishing the necessary condition for a Toeplitz matrix and vector product, i.e. by showing that any processor l of the array accumulates

$$\sum_{t=0}^{N-1} X_{N-1-l+t}\,Y_t. \tag{3.1}$$

Since N is even in this case, the only pairs of elements that need to be considered are those in which one element is released at an odd clock cycle and the other at an even clock cycle. This is so because, when the number of processing elements in the array is even, an element entering the array from one side at an odd clock cycle never meets an element entering from the other side at an odd clock cycle as well. The same statement is true for even clock cycles. Thus the difference between the entry times of any two elements entering the array from different directions has to be an odd number in order for a product of the two elements to be formed. This also follows from eqn. (2.1), since $s - t = (N-1) - 2l$ is odd when N is even.
We now consider the following three cases of X and Y meetings.

Case 1. (X elements released from left at odd clocks)
Consider the following set of inputs: at clock $2s + 1$, Left Input $\leftarrow X_{m+s}$ with $0 \le s < N$, and at clock $2t$, Right Input $\leftarrow Y_t$ with $0 \le t < N$. This is consistent with data orders 3a and 3d specified in Algorithm-2a. Using (2.1), one obtains

$$(2s+1) - 2t = N - 1 - 2l, \quad \text{so that} \quad m + s = N - 1 - l + t.$$

This implies that $X_{m+s} = X_{N-1-l+t}$ meets $Y_t$ in processor l, as required by eqn. (3.1). However, using the limits on s, one can see that only data points that correspond to $\max\{l-m+1, 0\} \le t \le \min\{m+l, N-1\}$ participate in these products. Also note that $Y_t$ meets only $X_{N-1-l+t}$ (if anything) and nothing else in processor l.

Case 2. (X elements released from left at even clocks)
Consider the following set of inputs: at clocks $2s$, Left Input $\leftarrow X_s$ with $0 \le s < m$, and at clock $2t + N - 1$, Right Input $\leftarrow Y_t$ with $0 \le t < m$. This is consistent with data orders 3b and 3f stated in Algorithm-2a. Again, use of eqn. (2.1) gives

$$2s - (2t + N - 1) = N - 1 - 2l, \quad \text{so that} \quad s = N - 1 - l + t.$$

Thus once more one sees that $X_{N-1-l+t}$ meets $Y_t$ in processor l. Further, using the range of s gives the following values of t that are involved in the products: $0 \le t \le l - m$.
Case 3. (Y elements released from left)
Consider the inputs corresponding to steps 3c and 3e of the algorithm: at clocks $2s + N$, Left Input $\leftarrow Y_{N-1-s}$ with $0 \le s < m - 1$, and at clock $2t + 1$, Right Input $\leftarrow X_{2(N-1)-t}$ with $0 \le t < m - 1$. Use of eqn. (2.1) here gives

$$(2s + N) - (2t + 1) = N - 1 - 2l, \quad \text{so that} \quad 2(N-1) - t = N - 1 - l + (N - 1 - s).$$

Thus $Y_p$ meets $X_{2(N-1)-t} = X_{N-1-l+p}$ (where $p = N - 1 - s$), as required by eqn. (3.1). The range of s now provides the range of p (the index of $Y_p$) as $m + l + 1 \le p \le N - 1$.

From the data input sequences specified in the algorithm, one may verify that these three cases are the only possibilities of X's meeting Y's to form products. Further, the ranges of t in the three cases [2] completely cover the range 0 to N-1 without any overlap. Therefore one can conclude that Algorithm-2a is valid.
Having presented the Algorithm-2a for N even, the following algo-
rithm is presented for N odd. 
Algorithm 2b: Commutative Systolic Multiplication of Toeplitz matrix $M = [X_{(N-1-i+j)}]$ and vector $[Y_j]$ with $0 \le i < N$, $0 \le j < N$ and N odd.
1. There are N processing elements connected in the linear systolic array.
2. Each processing element in the array accumulates products of data elements from the Left and the Right array input, provided that those two inputs are of different types.
3. The following input sequences are used:
   Let m = (N-1)/2,
   a. At clocks $2s$, Left Input $\leftarrow X_{m+s}$ with $0 \le s < N$.
   b. At clocks $2s + 1$, Left Input $\leftarrow X_s$ with $0 \le s < m$.
   c. At clocks $2s + N$, Left Input $\leftarrow Y_{N-1-s}$ with $0 \le s < m$.
   d. At clocks $2t$, Right Input $\leftarrow Y_t$ with $0 \le t < N$.
   e. At clocks $2t + 1$, Right Input $\leftarrow X_{2(N-1)-t}$ with $0 \le t < m$.
   f. At clocks $2t + N$, Right Input $\leftarrow Y_t$ with $0 \le t < m$.
4. At the end of $2.5N - 1.5$ cycles, the i-th processing element of the array will hold the i-th component of the product vector.

[2] One should use p instead of t in Case 3, since p is the index of Y.
A sample data flow of this algorithm for the multiplication of a 5 X 5 Toeplitz matrix and a 5 X 1 vector is provided in FIG. 3-4. A proof of the validity of Algorithm-2b for the Toeplitz matrix product is given in the following paragraphs.

[Figure 3-4: Toeplitz Matrix and Vector multiplication for N = 5. Data flow in the five-processor array at clocks 0 through 10, for the matrix X with rows (X4 X5 X6 X7 X8), (X3 X4 X5 X6 X7), (X2 X3 X4 X5 X6), (X1 X2 X3 X4 X5), (X0 X1 X2 X3 X4) and the vector Y = (Y0, Y1, Y2, Y3, Y4).]

As in the proof of Algorithm-2a, we use the necessary condition (3.1) for the construction of a Toeplitz matrix and vector product. In contrast to the previous proof, however, since N is odd, the only pairs of X-matrix and Y-vector elements that need to be considered are those released at clock cycles of the same parity (both odd or both even). This is so because, when the number of processing elements is odd, an element entering the array from one side at an odd clock cycle will not form a product with an element entering from the other side at an even clock cycle, and vice versa. Therefore, verifying the products formed by X-matrix and Y-vector elements that are both released at odd cycles or both released at even cycles is adequate for our proof.

Case 1. (X elements released from left at even clocks)
Consider the following inputs: at clocks $2s$, Left Input $\leftarrow X_{m+s}$ with $0 \le s < N$, and at clocks $2t$, Right Input $\leftarrow Y_t$ with $0 \le t < N$. These inputs are steps 3a and 3d of Algorithm-2b. Again using eqn. (2.1), one obtains

$$2s - 2t = N - 1 - 2l, \quad \text{so that} \quad m + s = N - 1 - l + t.$$

It can be seen clearly that $X_{m+s} = X_{N-1-l+t}$ meets $Y_t$ in processor l, as required by the Toeplitz product condition in eqn. (3.1). By considering the range of s, the range of t is determined to be $\max\{l-m, 0\} \le t \le \min\{N-1, m+l\}$.
Case 2. (X elements released from left at odd clocks)
This case uses the input data sequences from steps 3b and 3f: at clock $2s + 1$, Left Input $\leftarrow X_s$ with $0 \le s < m$, and at clock $2t + N$, Right Input $\leftarrow Y_t$ with $0 \le t < m$. Using eqn. (2.1), we have

$$(2s + 1) - (2t + N) = N - 1 - 2l, \quad \text{so that} \quad s = N - 1 - l + t.$$

Here too, $X_s = X_{N-1-l+t}$ meets $Y_t$ in processor l, as required by the Toeplitz product condition in eqn. (3.1). Based upon the range of s, the valid range of t here is determined to be $0 \le t \le l - m - 1$.

Case 3. (X elements released from right at odd clocks)
Consider the following input sequences: at clocks $2s + N$, Left Input $\leftarrow Y_{N-1-s}$ with $0 \le s < m$, and at clocks $2t + 1$, Right Input $\leftarrow X_{2(N-1)-t}$ with $0 \le t < m$. Both sequences are consistent with steps 3c and 3e of the algorithm. Now, by applying eqn. (2.1), we get

$$(2s + N) - (2t + 1) = N - 1 - 2l, \quad \text{so that} \quad 2(N-1) - t = N - 1 - l + (N - 1 - s).$$

Here $Y_p$ meets $X_{2(N-1)-t} = X_{N-1-l+p}$ and therefore satisfies the Toeplitz condition in eqn. (3.1). The range of s provides the necessary range for p (where $p = N-1-s$), which is $m + l + 1 \le p \le N - 1$.

Since all possible cases where X's can meet Y's have been covered and successfully proved, and the ranges of t in the three cases cover the range 0 to N-1 without any overlap, it is concluded that Algorithm-2b is valid. [3]
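The no-overlap claim for the three cases of the Algorithm-2b proof can be checked numerically. The Python sketch below is only an illustration; the function name `case_ranges_2b` is an assumption. It lists, for a given processor l, the Y indices handled by Case 1, Case 2 and Case 3 according to the ranges derived in the proof, and checks that together they form 0 to N-1 exactly once.

```python
def case_ranges_2b(n, l):
    """Y-index ranges of Cases 1, 2 and 3 in processor l, for odd n."""
    assert n % 2 == 1 and 0 <= l < n
    m = (n - 1) // 2
    case1 = range(max(l - m, 0), min(n - 1, m + l) + 1)  # max{l-m,0} <= t <= min{N-1,m+l}
    case2 = range(0, l - m)                              # 0 <= t <= l-m-1
    case3 = range(m + l + 1, n)                          # m+l+1 <= p <= N-1
    return case1, case2, case3


n = 5
for l in range(n):
    c1, c2, c3 = case_ranges_2b(n, l)
    covered = sorted(list(c1) + list(c2) + list(c3))
    print(l, list(c2), list(c1), list(c3), covered == list(range(n)))
```

An empty range simply means that the corresponding case contributes nothing to that processor; for the middle processor l = m, Case 1 alone supplies all N products.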
3.4 Cyclic Matrix and Vector Product Algorithms
In this section, the algorithms for the Cyclic matrix product are not presented in great detail, because their construction closely parallels that of the Toeplitz matrix algorithms. The steps taken to produce the left and right data sequences for this type of matrix are almost identical in every respect to those in the Toeplitz case. The only minor difference is the addition of a mod N operation on the index of the X-matrix elements. This is done because the index of the X-matrix elements of a Cyclic matrix of size N ranges from 0 to (N-1). The algorithms are described below:
Algorithm 3a: Commutative Systolic Multiplication of Cyclic matrix $M = [X_{(N-1-i+j) \bmod N}]$ and vector $[Y_j]$ with $0 \le i < N$, $0 \le j < N$ and N even.
1. There are N processing elements connected in the linear systolic array.
2. Each processing element in the array accumulates products of data elements from the Left and the Right array input, provided that those two inputs are of different types.
3. The following input sequences are used (with m = N/2, as in Algorithm 2a):
   a. At clocks $2s + 1$, Left Input $\leftarrow X_{(m+s) \bmod N}$ with $0 \le s < N$.
   b. At clocks $2s$, Left Input $\leftarrow X_{s \bmod N}$ with $0 \le s < m$.
   c. At clocks $2s + N$, Left Input $\leftarrow Y_{N-1-s}$ with $0 \le s < m - 1$.
   d. At clocks $2t$, Right Input $\leftarrow Y_t$ with $0 \le t < N$.
   e. At clocks $2t + 1$, Right Input $\leftarrow X_{(2(N-1)-t) \bmod N}$ with $0 \le t < m - 1$.
   f. At clocks $2t + N - 1$, Right Input $\leftarrow Y_t$ with $0 \le t < m$.
4. At the end of $2.5N - 1$ cycles, the i-th processing element of the array will hold the i-th component of the product vector.

Algorithm 3b: Commutative Systolic Multiplication of Cyclic matrix $M = [X_{(N-1-i+j) \bmod N}]$ and vector $[Y_j]$ with $0 \le i < N$, $0 \le j < N$ and N odd.
1. There are N processing elements connected in the linear systolic array.
2. Each processing element in the array accumulates products of data elements from the Left and the Right array input, provided that those two inputs are of different types.
3. The following data sequences are used (with m = (N-1)/2, as in Algorithm 2b):
   a. At clocks $2s$, Left Input $\leftarrow X_{(m+s) \bmod N}$ with $0 \le s < N$.
   b. At clocks $2s + 1$, Left Input $\leftarrow X_{s \bmod N}$ with $0 \le s < m$.
   c. At clocks $2s + N$, Left Input $\leftarrow Y_{N-1-s}$ with $0 \le s < m$.
   d. At clocks $2t$, Right Input $\leftarrow Y_t$ with $0 \le t < N$.
   e. At clocks $2t + 1$, Right Input $\leftarrow X_{(2(N-1)-t) \bmod N}$ with $0 \le t < m$.
   f. At clocks $2t + N$, Right Input $\leftarrow Y_t$ with $0 \le t < m$.
4. At the end of $2.5N - 1.5$ cycles, the i-th processing element of the array will hold the i-th component of the product vector.

[3] In Case 3, one should use p instead of t.
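Both cyclic schedules can be checked with one short Python sketch. It is an informal illustration: the names `cyclic_schedule` and `simulate` are assumptions, m is taken as N/2 for even N and (N-1)/2 for odd N as in the Toeplitz algorithms, and the timing model is the one used throughout this chapter (a left element entering at clock c is in processor l at clock c + l, a right element at clock c + (N-1) - l).

```python
def cyclic_schedule(n):
    """Input events (clock, kind, index) of Algorithm 3a (n even) or 3b (n odd)."""
    if n % 2 == 0:
        m = n // 2
        left = [(2 * s + 1, "X", (m + s) % n) for s in range(n)]                 # 3a
        left += [(2 * s, "X", s % n) for s in range(m)]                          # 3b
        left += [(2 * s + n, "Y", n - 1 - s) for s in range(m - 1)]              # 3c
        right = [(2 * t, "Y", t) for t in range(n)]                              # 3d
        right += [(2 * t + 1, "X", (2 * (n - 1) - t) % n) for t in range(m - 1)] # 3e
        right += [(2 * t + n - 1, "Y", t) for t in range(m)]                     # 3f
    else:
        m = (n - 1) // 2
        left = [(2 * s, "X", (m + s) % n) for s in range(n)]                     # 3a
        left += [(2 * s + 1, "X", s % n) for s in range(m)]                      # 3b
        left += [(2 * s + n, "Y", n - 1 - s) for s in range(m)]                  # 3c
        right = [(2 * t, "Y", t) for t in range(n)]                              # 3d
        right += [(2 * t + 1, "X", (2 * (n - 1) - t) % n) for t in range(m)]     # 3e
        right += [(2 * t + n, "Y", t) for t in range(m)]                         # 3f
    return left, right


def simulate(n, left, right, x, y):
    """Return (accumulators, cycle count) under the assumed timing model."""
    acc = [0] * n
    cycles = 0
    for cl, kind_l, idx_l in left:
        for cr, kind_r, idx_r in right:
            twice_l = (n - 1) - (cl - cr)
            if kind_l == kind_r or twice_l % 2 or not 0 <= twice_l < 2 * n:
                continue
            l = twice_l // 2
            xi, yi = (idx_l, idx_r) if kind_l == "X" else (idx_r, idx_l)
            acc[l] += x[xi] * y[yi]
            cycles = max(cycles, cl + l + 1)
    return acc, cycles


for n in (4, 5):
    x = list(range(1, n + 1))        # X_0 .. X_{N-1}
    y = list(range(n, 0, -1))
    print(n, simulate(n, *cyclic_schedule(n), x, y))
```

The schedule is the Toeplitz one with every X index reduced mod N, which is why the proofs of Algorithm-2a and Algorithm-2b carry over.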
39 
. - . --· J .. 
- ,, 
\ 
\ 
· To illustrate the data flow for typical executions of the proposed al-
gorithm for 5 X 5 and 4 X 4 Cyclic matrices, some sample figures are provided in 
FIG. 3-5 and FIG. 3-6. Proofs for both Algorithm-3a and Algorithm-3b, as men-
tioned earlier can also be referenced to those of Algorithm-2a and Algorithm-2b 
repectively. 
3.5 m-diagonal Matrix and Vector Product Algorithms 
An m-diagonal matrix is a matrix whose elements lie only along the
middle m diagonal lines, leaving the other locations of the matrix
unoccupied. In this discussion, we assume that the value of m is always odd in
order to evenly cover the middle diagonal areas of the matrix. In addition,
it is also assumed that all the elements that lie along the same diagonal are
identical.
Below, the input data sequence of the array will be evaluated in
terms of which matrix and vector elements should be entered and at what clock
cycle. The procedure followed for the presentation of the algorithms is similar to that
of the previous section. The ordering of the m-diagonal matrix elements used
throughout this section is as follows,
$$M(i,j) = X_{\frac{m-1}{2} - i + j}, \qquad 0 \le i, j < N$$
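The element ordering above can be made concrete with a small helper. This is a minimal sketch; the function names are hypothetical, and an entry of `None` marks a location that falls outside the m occupied diagonals.

```python
def m_diagonal_index_matrix(n, m):
    """Index k of X stored at M(i, j) = X_{(m-1)/2 - i + j}, or None if unoccupied."""
    half = (m - 1) // 2
    rows = []
    for i in range(n):
        row = []
        for j in range(n):
            k = half - i + j
            row.append(k if 0 <= k < m else None)   # outside the m diagonals
        rows.append(row)
    return rows


def m_diagonal_product(x, y):
    """Reference product of the m-diagonal matrix built from x with the vector y."""
    index = m_diagonal_index_matrix(len(y), len(x))
    return [sum(x[k] * y[j] for j, k in enumerate(row) if k is not None)
            for row in index]
```

For N = 4 and m = 3, the first row of the index matrix is `[1, 2, None, None]`, that is, only X1 and X2 appear in row 0.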
Algorithm 4a : Commutative Systolic Multiplication of m-diagonal matrix
$M = [X_{\frac{m-1}{2}-i+j}]$ and vector $[Y_j]$ with $0 \le i < N$, $0 \le j < N$, and
$N \le m < 2N$ and N even.
Figure 3-5: Cyclic Matrix and Vector multiplication for N = 5
Figure 3-6: Cyclic Matrix and Vector multiplication for N = 4
1. There are N processing elements connected in the linear systolic
array.
2. Each processing element in the array accumulates products of
data elements from the Left and the Right array input, provided
that those two inputs are of different types.
3. Following data sequences are used:
a. At clocks $2s + (m - N)$, Left Input $\leftarrow Y_{(N-1)-s}$ with $0 \le s < N/2$.
b. At clocks $2s$, Left Input $\leftarrow X_s$ with $0 \le s < \lfloor \frac{m-N}{2} \rfloor + N$.
c. At clocks $2t + (m - N)$, Right Input $\leftarrow Y_t$ with $0 \le t < N/2$.
d. At clocks $2t$, Right Input $\leftarrow X_{(m-1)-t}$ with $0 \le t < \lfloor \frac{m-N}{2} \rfloor + N$.
4. At the end of $m+N-2$ cycles, the i-th processing element of the array
will hold the i-th component of the product vector.
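The schedule of Algorithm 4a can be exercised with a short script. The sketch below is an illustration, not the thesis software: it assumes that an element released at clock c enters its end of the array at clock c and advances one processing element per clock, so that elements released at clocks $c_L$ (left) and $c_R$ (right) meet in processing element l when $c_L - c_R = N - 1 - 2l$. All function names are hypothetical.

```python
def algorithm_4a_streams(n, m):
    """Input streams of Algorithm 4a as {clock: (kind, index)} dictionaries."""
    span = (m - n) // 2 + n                          # floor((m - N) / 2) + N
    left, right = {}, {}
    for s in range(n // 2):
        left[2 * s + (m - n)] = ("Y", n - 1 - s)     # step 3a
    for s in range(span):
        left[2 * s] = ("X", s)                       # step 3b
    for t in range(n // 2):
        right[2 * t + (m - n)] = ("Y", t)            # step 3c
    for t in range(span):
        right[2 * t] = ("X", m - 1 - t)              # step 3d
    return left, right


def check_algorithm_4a(n, m):
    """Return (all rows correct?, number of clock cycles until the last product)."""
    left, right = algorithm_4a_streams(n, m)
    half = (m - 1) // 2
    seen = [[] for _ in range(n)]
    last_clock = 0
    for c_left, (kind_l, idx_l) in left.items():
        for c_right, (kind_r, idx_r) in right.items():
            twice_pe = (n - 1) - (c_left - c_right)  # assumed meeting rule
            if kind_l == kind_r or twice_pe % 2 or not 0 <= twice_pe < 2 * n:
                continue
            pe = twice_pe // 2
            seen[pe].append((idx_l, idx_r) if kind_l == "X" else (idx_r, idx_l))
            last_clock = max(last_clock, c_left + pe)
    want = [[(half - i + j, j) for j in range(n) if 0 <= half - i + j < m]
            for i in range(n)]
    ok = all(sorted(s) == sorted(w) for s, w in zip(seen, want))
    return ok, last_clock + 1                        # clocks are numbered from 0
```

For N = 4 and m = 5 the check reports a complete product after 7 cycles, which agrees with the $m+N-2$ figure in step 4.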
A typical data flow in the array for every clock cycle for a 4 × 4
5-diagonal matrix and a 4 × 1 vector multiplication is shown in FIG. 3-7. The
mathematical proof for this algorithm is provided below. Note that all the steps
taken in constructing the proof are similar to those of Algorithm-2x's. First, the
necessary condition for constructing an N × N size m-diagonal matrix product is
that a processor l (row l) of the array must contain
$$\sum_{t=0}^{N-1} X_{\frac{m-1}{2}-l+t} \cdot Y_t \tag{4.1}$$
Before we begin, we should know that the value of N used in this algorithm is
even, which means that the only X-matrix and Y-vector elements that should be
considered are the ones released at odd and even clock cycles respectively, and
vice versa. The reasons why this argument is true are described in the proofs of
Algorithm-2a. Please also note that when $\frac{m-1}{2}-l+t \ge m$ or $\frac{m-1}{2}-l+t < 0$,
no product will be formed at the processing element. Now, let us evaluate the
following cases :
Figure 3-7: m-diagonal Matrix and Vector multiplication for N = 4 and m = 5

Case 1. (X elements released from left at even clocks)
At clocks $2s$, Left Input $\leftarrow X_s$ with $0 \le s < \lfloor \frac{m-N}{2} \rfloor + N$, and at clocks
$2t + (m - N)$, Right Input $\leftarrow Y_t$ with $0 \le t < N/2$. Both conditions above are
consistent with steps 3b and 3c of Algorithm-4a. Now using eqn. (2.1), we have :
$$2s - (2t + (m - N)) = N - 1 - 2l,$$
$$s = \frac{m-1}{2} - l + t.$$
Here, it is shown that the X elements indeed meet the correct Y elements
at the desired processor. The necessary range of t, derived from the range
of s and the proof, is $l - \frac{m-1}{2} \le t \le \frac{N}{2} - 1 + l$.
Case 2. (X elements released from right at even clocks)
At clocks $2s + (m - N)$, Left Input $\leftarrow Y_{(N-1)-s}$ with $0 \le s < N/2$, and at clocks
$2t$, Right Input $\leftarrow X_{(m-1)-t}$ with $0 \le t < \lfloor \frac{m-N}{2} \rfloor + N$. The above conditions are
consistent with steps 3a and 3d of Algorithm-4a. Now, by applying eqn. (2.1), we
obtain :
$$(2s + (m - N)) - 2t = N - 1 - 2l,$$
$$(m-1) - t = \frac{m}{2} + (N - 1 - s) - l - \frac{1}{2} = \frac{m-1}{2} - l + p.$$
It is shown that $Y_p$ meets $X_{(m-1)-t} = X_{\frac{m-1}{2}-l+p}$, where $p = N - 1 - s$,
therefore satisfying eqn. (4.1). Valid range of p for this case is $\frac{N}{2} \le p < \frac{m+1}{2} + l$.
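The relation between release clocks and the meeting processor that both cases rely on can be sketched as follows. This is a reconstruction under an explicit assumption (an element released at clock c enters its end of the array at clock c and then advances one processing element per clock); the symbols $c_L$ and $c_R$ for the left and right release clocks are introduced here for illustration only.

```latex
% Assumption: one processing element per clock, PE 0 on the left, PE N-1 on the right.
\begin{align*}
  \text{left element reaches PE } l \text{ at clock} &: \quad c_L + l, \\
  \text{right element reaches PE } l \text{ at clock} &: \quad c_R + (N - 1 - l), \\
  c_L + l = c_R + (N - 1 - l)
    &\;\Longrightarrow\; c_L - c_R = N - 1 - 2l .
\end{align*}
% Worked example for Algorithm-4a with N = 4, m = 5:
% X_2 is released on the left at c_L = 2s = 4 and Y_1 on the right at
% c_R = 2t + (m - N) = 3, so 4 - 3 = 3 - 2l gives l = 1,
% and indeed M(1,1) = X_{(5-1)/2 - 1 + 1} = X_2.
```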
Since both proofs above have covered all the possible cases for a
product to be formed, it is concluded that Algorithm-4a is valid.
Algorithm 4b : Commutative Systolic Multiplication of m-diagonal matrix
$M = [X_{\frac{m-1}{2}-i+j}]$ and vector $[Y_j]$ with $0 \le i < N$, $0 \le j < N$, and
$0 \le m < N$, and N even.
1. There are N processing elements connected in the linear systolic
array.
2. Each processing element in the array accumulates products of
data elements from the Left and the Right array input, provided
that those two inputs are of different types.
3. Following input sequences are used:
a. At clocks $2s$, Left Input $\leftarrow Y_{(N-1)-s}$ with $0 \le s < N/2$.
b. At clocks $2s + (N - m)$, Left Input $\leftarrow X_s$ with $0 \le s < m$.
c. At clocks $2t$, Right Input $\leftarrow Y_t$ with $0 \le t < N/2$.
d. At clocks $2t + (N - m)$, Right Input $\leftarrow X_{(m-1)-t}$ with $0 \le t < m$.
4. At the end of $1.5N$ cycles, the i-th processing element of the array
will hold the i-th component of the product vector.
Figure 3-8: m-diagonal Matrix and Vector multiplication for N = 4 and m = 3
A sample figure for a typical execution of this algorithm is shown in
FIG. 3-8. This figure displays the complete data flow in the array for every clock
cycle until a full product of a 4 × 4 3-diagonal matrix is formed.
The proof for Algorithm-4b is similar in many respects to that of
Algorithm-4a, so all the assumptions made here are the same as those made
for Algorithm-4a.
Case 1. (X elements released from left at even clocks)
At clocks $2s + (N - m)$, Left Input $\leftarrow X_s$ with $0 \le s < m$, and at clocks $2t$, Right
Input $\leftarrow Y_t$ with $0 \le t < N/2$. This is consistent with steps 3b and 3c of
Algorithm-4b. By using eqn. (2.1) one gets :
$$2s + (N - m) - 2t = N - 1 - 2l,$$
$$s = \frac{m-1}{2} - l + t.$$
As shown, $X_s = X_{\frac{m-1}{2}-l+t}$ meets with $Y_t$ precisely according to eqn.
(4.1). Now, using the range of s gives the following range of t that are involved
in the products: $\max\{l - \frac{m-1}{2},\, 0\} \le t \le \min\{l + \frac{m-1}{2},\, N/2 - 1\}$.
Case 2. (X elements released from right at odd clocks)
At clocks $2s$, Left Input $\leftarrow Y_{(N-1)-s}$ with $0 \le s < N/2$, and at
clocks $2t + (N - m)$, Right Input $\leftarrow X_{(m-1)-t}$ with $0 \le t < m$. The input sequences
considered correspond to steps 3a and 3d of Algorithm-4b. Now, using eqn.
(2.1), one obtains :
$$2s - (2t + (N - m)) = N - 1 - 2l,$$
$$(m-1) - t = \frac{m-1}{2} - l + p.$$
Thus $Y_p$ (where $p = N - 1 - s$) meets $X_{(m-1)-t} = X_{\frac{m-1}{2}-l+p}$ as required
by eqn. (4.1). Further, by including the range of s and the above proof, the range
of Y indices t (in the sense of eqn. (4.1)) that are involved in the products is
determined as
$$\max\{l - \tfrac{m-1}{2},\, N/2\} \le t \le \min\{l + \tfrac{m-1}{2},\, N - 1\}.$$
Since both possible cases for Algorithm-4b where X elements meet
the Y elements have been successfully proved, it is concluded that Algorithm-4b
is valid.
Algorithm 4c : Commutative Systolic Multiplication of m-diagonal matrix
$M = [X_{\frac{m-1}{2}-i+j}]$ and vector $[Y_j]$ with $0 \le i < N$, $0 \le j < N$, and
$N \le m < 2N$ and N odd.
1. There are N processing elements connected in the linear systolic
array.
2. Each processing element in the array accumulates products of
data elements from the Left and the Right array input, provided
that those two inputs are of different types.
3. Following input sequences are used:
a. At clocks $2s$, Left Input $\leftarrow X_s$ with $0 \le s < \frac{m-N}{2} + N$.
b. At clocks $2s + (m - N) + 1$, Left Input $\leftarrow Y_{(N-1)-s}$ with $0 \le s < \lfloor N/2 \rfloor$.
c. At clocks $2t + 1$, Right Input $\leftarrow X_{(m-1)-t}$ with $0 \le t < \frac{m-N}{2} + N$.
d. At clocks $2t + (m - N)$, Right Input $\leftarrow Y_t$ with $0 \le t < \lceil N/2 \rceil$.
4. At the end of $m+N-1$ cycles, the i-th processing element of the array
will hold the i-th component of the product vector.
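Step 3 of Algorithm 4c can be turned directly into the order in which elements are presented at the two ends of the array. The sketch below is illustrative (the function name and the lower-case `x3` / `y4` / `-` text format are assumptions chosen to resemble a simulator input listing); idle clocks are written as `-`.

```python
def algorithm_4c_orders(n, m):
    """Left and right input orders of Algorithm 4c, one token per clock."""
    span = (m - n) // 2 + n
    left, right = {}, {}
    for s in range(span):
        left[2 * s] = f"x{s}"                         # step 3a
    for s in range(n // 2):
        left[2 * s + (m - n) + 1] = f"y{n - 1 - s}"   # step 3b
    for t in range(span):
        right[2 * t + 1] = f"x{m - 1 - t}"            # step 3c
    for t in range((n + 1) // 2):
        right[2 * t + (m - n)] = f"y{t}"              # step 3d

    def as_line(stream):
        # One token per clock from 0 up to the last scheduled clock.
        return " ".join(stream.get(c, "-") for c in range(max(stream) + 1))

    return as_line(left), as_line(right)
```

For N = 5 and m = 7 the left order comes out as `x0 - x1 y4 x2 y3 x3 - x4 - x5`. The right order begins `- x6 y0 x5 y1 x4 y2 x3 - x2`; the bound in step 3c also schedules one further X element at clock 11.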
Figure 3-9: m-diagonal Matrix and Vector multiplication for N = 5 and m = 7

A sample data flow for a typical execution of this algorithm is shown
in FIG. 3-9. This figure displays the complete data flow in the array for every
clock cycle until a full product of a 5 × 5 7-diagonal matrix and a 5 × 1 vector
multiplication is formed. The proof for Algorithm 4c is provided below, using
the same procedures followed for the previous algorithms.
Case 1. (X elements released from left at even clocks)
At clocks $2s$, Left Input $\leftarrow X_s$ with $0 \le s < \frac{m-N}{2} + N$, and at
clocks $2t + (m - N)$, Right Input $\leftarrow Y_t$ with $0 \le t < \lceil N/2 \rceil$.
$$2s - (2t + (m - N)) = N - 1 - 2l,$$
$$s = \frac{m-1}{2} - l + t.$$
In this case, $X_s = X_{\frac{m-1}{2}-l+t}$ meets the $Y_t$ in processor l as specified in
eqn. (4.1). From the range of s, the range of t for data points that participate in
these products is computed to be : $\max\{l - \frac{m-1}{2},\, 0\} \le t \le \frac{N-1}{2}$.
Case 2. (X elements released from right at odd clocks)
At clocks $2s + (m - N) + 1$, Left Input $\leftarrow Y_{(N-1)-s}$ with $0 \le s < \lfloor N/2 \rfloor$, and at
clocks $2t + 1$, Right Input $\leftarrow X_{(m-1)-t}$ with $0 \le t < \frac{m-N}{2} + N$.
$$(2s + m - N + 1) - (2t + 1) = N - 1 - 2l,$$
$$(m-1) - t = \frac{m-1}{2} - l + (N - 1 - s).$$
Similarly in this case, $Y_p$ (where $p = N - 1 - s$) meets
$X_{(m-1)-t} = X_{\frac{m-1}{2}-l+p}$ as required in eqn. (4.1). Range of p for this case is found to be
$\frac{N-1}{2} < p \le \min\{\frac{m-1}{2} + l,\, N - 1\}$.
Now, since both cases have been successfully proved, it can be concluded
that Algorithm-4c is valid.
Algorithm 4d : Commutative Systolic Multiplication of m-diagonal matrix
$M = [X_{\frac{m-1}{2}-i+j}]$ and vector $[Y_j]$ with $0 \le i < N$, $0 \le j < N$, and
$0 < m < N$ with N odd.
1. There should be N systolic processors connected as a
linear array.
2. Each processing element in the array accumulates products of
data elements from the Left and the Right array input, provided
that those two inputs are of different types.
3. Following input sequences are used:
a. At clocks $2s + 1$, Left Input $\leftarrow Y_{(N-1)-s}$ with $0 \le s < \lfloor N/2 \rfloor$.
b. At clocks $2s + (N - m)$, Left Input $\leftarrow X_s$ with $0 \le s < m$.
c. At clocks $2t$, Right Input $\leftarrow Y_t$ with $0 \le t < \lceil N/2 \rceil$.
d. At clocks $2t + (N - m) + 1$, Right Input $\leftarrow X_{(m-1)-t}$ with $0 \le t < m$.
4. At the end of $1.5N-0.5$ cycles, the i-th processing element of the array
will hold the i-th component of the product vector.
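A numerical run of Algorithm 4d shows the accumulation itself rather than only the index bookkeeping. The sketch below is illustrative: the function name is hypothetical, and it assumes that an element released at clock c enters its end of the array at clock c and moves one processing element per clock, so a left and a right element meet in processing element l when their release clocks differ by $N - 1 - 2l$.

```python
def algorithm_4d_product(x, y):
    """Accumulate the m-diagonal product with the input schedule of Algorithm 4d.

    x holds the m diagonal values X_0..X_{m-1}, y the N vector values (N, m odd, m < N).
    """
    n, m = len(y), len(x)
    left, right = {}, {}
    for s in range(n // 2):
        left[2 * s + 1] = ("Y", n - 1 - s)               # step 3a
    for s in range(m):
        left[2 * s + (n - m)] = ("X", s)                 # step 3b
    for t in range((n + 1) // 2):
        right[2 * t] = ("Y", t)                          # step 3c
    for t in range(m):
        right[2 * t + (n - m) + 1] = ("X", m - 1 - t)    # step 3d

    acc = [0] * n                                        # one accumulator per PE
    for c_left, (kind_l, idx_l) in left.items():
        for c_right, (kind_r, idx_r) in right.items():
            twice_pe = (n - 1) - (c_left - c_right)      # assumed meeting rule
            if kind_l == kind_r or twice_pe % 2 or not 0 <= twice_pe < 2 * n:
                continue                                 # ignored pair or no meeting
            x_idx, y_idx = (idx_l, idx_r) if kind_l == "X" else (idx_r, idx_l)
            acc[twice_pe // 2] += x[x_idx] * y[y_idx]
    return acc
```

With `x = [1, 2, 3]` and `y = [1, 2, 3, 4, 5]` (N = 5, m = 3) the accumulators end at `[8, 14, 20, 26, 14]`, which is the product of the 3-diagonal matrix $M(i,j) = X_{1-i+j}$ with y.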
A sample figure for a typical execution of this algorithm is shown in
FIG. 3-10. This figure displays the complete data flow in the array for every
clock cycle until a full product of a 5 × 5 3-diagonal matrix multiplication is
formed.
The procedures followed for verifying Algorithm 4d are similar to
those of the previous cases. Detailed proofs are provided below :
Figure 3-10: m-diagonal Matrix and Vector multiplication for N = 5 and m = 3

Case 1. (X elements released from left at even clocks)
At clocks $2s + (N - m)$, Left Input $\leftarrow X_s$ with $0 \le s < m$, and at
clocks $2t$, Right Input $\leftarrow Y_t$ with $0 \le t < \lceil N/2 \rceil$. This is consistent with steps 3b
and 3c of Algorithm-4d. Now, by applying eqn. (2.1), one obtains:
$$2s + (N - m) - 2t = N - 1 - 2l,$$
$$s = \frac{m-1}{2} - l + t.$$
The above case has been successfully proved: X elements always meet the
desired Y elements at the required processor according to eqn. (4.1). Range
of t in this case is found to be: $\max\{l - \frac{m-1}{2},\, 0\} \le t \le \frac{N-1}{2}$.
Case 2. (X elements released from right at odd clocks)
At clocks $2s + 1$, Left Input $\leftarrow Y_{(N-1)-s}$ with $0 \le s < \lfloor N/2 \rfloor$, and at
clocks $2t + (N - m) + 1$, Right Input $\leftarrow X_{(m-1)-t}$ with $0 \le t < m$. This is consistent
with steps 3a and 3d of Algorithm-4d. By using eqn. (2.1), one gets:
$$(2s + 1) - (2t + (N - m) + 1) = N - 1 - 2l,$$
$$(m-1) - t = \frac{m-1}{2} - l + (N - 1 - s).$$
Here, it is proved that $Y_p$ ($p = N - 1 - s$) meets $X_{(m-1)-t} = X_{\frac{m-1}{2}-l+p}$ as
defined in eqn. (4.1). Appropriate range of p for this case is :
$\frac{N+1}{2} \le p \le \min\{N - 1,\, \frac{m-1}{2} + l\}$.
Algorithm-4d is also valid, since proofs for both possible cases have
been successfully constructed.
Chapter 4
Architectural Considerations
4.1 Sequencing Architecture
Chapter 3 has described the commutative systolic algorithms and,
in particular, the input data sequences necessary to produce the required result.
However, little attention was paid to the implementation of the array, that is, to
how one can produce the specified sequences.
This chapter is concerned with the actual implementation of these
algorithms. The exploitation of commutativity calls for three additions to the
hardware. Firstly, each data element has to be tagged with a flag bit identifying
its source sequence (either an element of the X matrix or of the Y vector). This implies
widening the data paths between the modules by one bit. However, in the case
of systolic arrays, the data paths are regular and connect only nearest
neighbors. Thus, the effect of this flag bit on the hardware complexity is
minimal.
Secondly, the array modules need to be modified such that meetings
between two elements from the same sequence are ignored. This is
easy to incorporate by simply XOR-ing the flag bits of the two elements and
ignoring the pair if the result of the XOR is 0 (both tag bits are identical).
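The flag-bit rule for a single array module can be sketched in a few lines. This is a minimal illustration, not the hardware design: the `Tagged` type, the encoding (0 for X, 1 for Y) and the function name are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tagged:
    tag: int       # assumed encoding: 0 = element of X, 1 = element of Y
    value: float


def pe_accumulate(acc: float, left: Optional[Tagged], right: Optional[Tagged]) -> float:
    """One clock of a processing element: multiply-accumulate only X-with-Y pairs."""
    if left is None or right is None:
        return acc                        # one of the inputs is idle this clock
    if (left.tag ^ right.tag) == 0:
        return acc                        # XOR is 0: same sequence, pair is ignored
    return acc + left.value * right.value
```

For example, `pe_accumulate(0.0, Tagged(0, 2.0), Tagged(1, 3.0))` returns 6.0, while a pair with equal tags leaves the accumulator unchanged.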
The major modification to the hardware comes because of the
intermixing of the elements of the X and Y sequences and the particular input orders
used in the different algorithms. Each data sequence to a specific side of the array
requires a unique circuit. Each circuit generally needs some buffers or shift
registers to hold the matrix and the vector elements. These shift registers are
initially filled with every element of the matrix and vector in a predetermined
order that may be calculated from the algorithm. One may safely assume that
the same matrix elements will be used repeatedly with many different data
vectors, so the order of the matrix elements in the shift register is not of major
importance. One only needs to ensure that the shift registers holding the matrix
elements are restored to their starting state at the end of each computing cycle
involving a matrix-vector product. The data vector, on the other hand, is
considered to be a variable changing at each computing cycle. Therefore, in this
case, it is important that its elements are stored in a simple and regular fashion
in order to guarantee simplicity in refilling the registers when a new vector is
input. Once all the shift registers have been filled with the required data,
a data shuffling network manipulates the data stored in the shift registers to
provide the appropriate data sequence for the systolic array. These shuffling
circuits mostly consist of a set of 2-to-1 multiplexers, each of which selects one of its
two data inputs according to the state of its select line. These select lines are
typically controlled by counters that keep track of the clock cycles applied to the
network.
It is worthwhile to inspect the pattern of the input sequences
required in each of the algorithms presented in Chapter 3. A careful examination of
Step 3 of each algorithm reveals that the inputs at even and odd clocks are
regular but may be very different from each other. Further, when restricted to only
even or only odd clocks, the input indices of X or Y either increase sequentially
(denoted by X↑ or Y↑ respectively), or decrease sequentially (denoted by X↓ or
Y↓ respectively). Thus, for example, in case of a Toeplitz product when the
vector length N is even, the left inputs at odd clocks are X↑ (actually $X_{N/2+s}$ at clock
2s+1), and at even clocks are X↑ (actually $X_s$ at clock 2s) and Y↓ (actually $Y_{N-1-s}$
at clocks 2s+N). Similar observations may be made about the right input.
These characteristics of the input sequences for all the algorithms in Chapter 3
are presented in Table 4-1.
                              Input Sequence Characteristic
Algorithm               Left Input     Left Input     Right Input    Right Input
                        even clocks    odd clocks     even clocks    odd clocks
Toeplitz product
  N even                X↑, Y↓         X↑             Y↑             X↓, Y↑
  N odd                 X↑             X↑, Y↓         Y↑             X↓, Y↑
Cyclic convolution
  N even                X↑, Y↓         X↑             Y↑             X↓, Y↑
  N odd                 X↑             X↑, Y↓         Y↑             X↓, Y↑
m-diagonal product
  N even, m < N         Y↓             X↑             Y↑             X↓
  N even, m > N         X↑             Y↓             X↓             Y↑
  N odd, m < N          X↑             Y↓             Y↑             X↓
  N odd, m > N          X↑             Y↓             Y↑             X↓
Table 4-1: Characterization of the Input Sequences in
Case of Commutative Systolic Algorithms
From this table, one can clearly see that each of the left and right
sequences should be obtained from 2-to-1 multiplexers which switch
at every clock cycle between the data streams to be applied at even and odd
clocks. Further, an examination of the algorithms in Chapter 3 shows that
whenever a data stream has both X and Y inputs (such as the left input at even
clocks of the Toeplitz product algorithm with N even: X↑, Y↓), the switch between X
and Y occurs at a fixed clock. Thus, for the even-clock left input of the Toeplitz
product (N even) algorithm, X↑ switches to Y↓ at the N-th clock. This changeover
could be managed through another 2-to-1 multiplexer that is controlled by a
counter keeping track of the number of clocks applied and how it compares with
N.
Due to the demand for X and Y components in sequentially
increasing as well as sequentially decreasing order, one generally will have to use
two shift-registers each for X and Y that rotate in either direction. The correct
taps on the shift-registers have to be determined by examining the corresponding
algorithm.
Thus, each sequencing hardware typically consists of four 2-to-1
multiplexers and four shift-registers to hold and rotate the X and Y sequences. As
mentioned before, two of the multiplexers are controlled by the system
clock, while the other two are controlled by the counters. Note that since the
m-diagonal matrix product algorithms never switch between the X and Y data
streams within the even-clock or the odd-clock inputs, their sequencing
architectures are relatively simple: the two multiplexers controlled
by the counters are eliminated.
FIG. 4-1 - FIG. 4-4 show some sample sequencing architectures. For
the sake of consistency, the sizes of the matrices and vectors used in these
figures are identical to those used in the previous two chapters. In all of these
architectures, the shift-registers are clocked only during the even clock cycles, in
order to maintain the synchronicity in the entire architecture. Note that the
shift-registers used in these architectures perform a circular-shift, so that the
starting X sequences are restored at the end of each computing cycle.
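The interplay of circular shift registers and a clock-controlled multiplexer can be modelled in software. The sketch below is a hypothetical behavioural model, not the circuit of the figures: it covers only the left input of the m-diagonal algorithm for N even and N - m = 1 (Y↓ at even clocks, X↑ at odd clocks), and it advances both registers once per pair of clocks.

```python
from collections import deque


def left_sequencer_4b(n, m):
    """Left input stream for N even, m = N - 1, plus the final X register state."""
    assert n - m == 1, "this sketch only covers the N - m = 1 case"
    x_reg = deque(f"X{i}" for i in range(m))               # tapped in ascending order
    y_reg = deque(f"Y{i}" for i in range(n - 1, -1, -1))   # tapped in descending order
    stream = []
    for clock in range(2 * m):
        pair = clock // 2                  # counter compared against N/2
        if clock % 2 == 0:                 # system clock selects the Y register
            stream.append(y_reg[0] if pair < n // 2 else "-")
        else:                              # system clock selects the X register
            stream.append(x_reg[0])
            x_reg.rotate(-1)               # circular shift, once per clock pair
            y_reg.rotate(-1)
    return stream, list(x_reg)
```

For N = 4 and m = 3 the stream is `Y3, X0, Y2, X1, -, X2`, and after the m shifts the X register is back in its starting order `X0, X1, X2`, which illustrates why a circular shift restores the matrix elements for the next computing cycle.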
Figure 4-1: Sequencing Architecture for Cyclic Matrix and Vector multiplication for N = 4

Figure 4-2: Sequencing Architecture for Cyclic Matrix and Vector multiplication for N = 5

Figure 4-3: Sequencing Architecture for a 5-diagonal Matrix and Vector multiplication for N = 4

Figure 4-4: Sequencing Architecture for a 3-diagonal Matrix and Vector multiplication for N = 4
4.2 Systolic Array Simulator
The primary tool used to aid the analysis of the effects of applying
different input data sequences to the commutative systolic array was
simulation software designed as part of this project. The package is written in the
C language to simulate the linear systolic array as closely as possible by
considering every aspect of the architecture. This is extremely important in light of
the fact that this simulator is a research tool and a user may desire to examine
the data flow at any given time during the process. All computations are
performed on the basis of an internal clock. This software is also designed to provide
the user with the flexibility to modify or to enter different patterns of data
sequence as input, thus allowing them to examine their effects on the outputs,
which are also generated at each clock cycle.
A typical simulation starts with the user entering all the necessary
information pertaining to the various parameters of the data sequence both for
the matrix and the vector elements, such as the initial delay, spacing delay,
length of vector, etc. This data is then processed, allowing the user to examine
the data flow and the output of each processing element at each clock cycle. At
the end of a simulation, a log file is created to contain the entire simulation
session. Using this file, the user can review all the steps that were taken and the
results on the product vector. Some samples of these log files are shown below.
SIM1.OUT

Parameters used for this simulation are :
  Number of Processing Elements   = 4
  Length of the X-Vector          = 7
  Length of the Y-Vector          = 4
  Order of L-Vector elements      = x0 x2 x1 x3 y3 x4 - x5
  Order of R-Vector elements      = y0 x6 y1 y0 y2 y1 y3
  Initial delay of the L-Vector   = 0
  Initial delay of the R-Vector   = 0
  Spacing delay of the L-Vector   = 0
  Spacing delay of the R-Vector   = 0
  Repetition of the L-Vector      = 1
  Repetition of the R-Vector      = 1

                   PE Number
          0         1         2         3
  0 :     -         -         -         -
  1 :     -         -         -         -
  2 :     -      [ 2, 0]      -         -
  3 :  [ 3, 0]      -      [ 2, 1]   [ 0, 0]
  4 :  [ 3, 6]   [ 3, 1]   [ 1, 0]   [ 2, 2]
  5 :  [ 4, 1]      -      [ 3, 2]   [ 1, 1]
  6 :     -      [ 4, 2]      -      [ 3, 3]
  7 :  [ 5, 2]      -      [ 4, 3]      -
  8 :     -      [ 5, 3]      -         -
  9 :     -         -         -         -
 10 :     -         -         -         -

By simulation
Constructed matrix :   3 4 5 6
                       2 3 4 5
                       1 2 3 4
                       0 1 2 3
SIM2.OUT

Parameters used for this simulation are :
  Number of Processing Elements   = 5
  Length of the X-Vector          = 7
  Length of the Y-Vector          = 5
  Order of L-Vector elements      = x0 - x1 y4 x2 y3 x3 - x4 - x5
  Order of R-Vector elements      = - x6 y0 x5 y1 x4 y2 x3 - x2
  Initial delay of the L-Vector   = 0
  Initial delay of the R-Vector   = 0
  Spacing delay of the L-Vector   = 0
  Spacing delay of the R-Vector   = 0
  Repetition of the L-Vector      = 1
  Repetition of the R-Vector      = 1

                        PE Number
          0         1         2         3         4
  0 :     -         -         -         -         -
  1 :     -         -         -         -         -
  2 :     -         -         -         -         -
  3 :     -         -         -      [ 0, 0]      -
  4 :     -      [ 4, 6]   [ 1, 0]      -      [ 0, 1]
  5 :  [ 3, 6]   [ 2, 0]   [ 4, 5]   [ 1, 1]      -
  6 :  [ 3, 0]   [ 3, 5]   [ 2, 1]   [ 4, 4]   [ 1, 2]
  7 :     -      [ 3, 1]   [ 3, 4]   [ 2, 2]   [ 4, 3]
  8 :  [ 4, 1]      -      [ 3, 2]   [ 3, 3]      -
  9 :     -      [ 4, 2]      -         -      [ 3, 2]
 10 :  [ 5, 2]      -         -         -         -
 11 :     -         -         -         -         -
 12 :     -         -         -         -         -
 13 :     -         -         -         -         -
 14 :     -         -         -         -         -

By simulation
Constructed matrix :   3 4 5 6 -
                       2 3 4 5 6
                       1 2 3 4 5
                       0 1 2 3 4
                       - 0 1 2 3
The simulator pictorially shows on the screen the dynamic movement
of each data element within the array, as well as the linear transformation
matrix as it gradually builds up at every clock cycle. In the two simulation
result examples above, a notation such as [ 4, 6] stands for the element with
index 4 on the left input meeting the element with index 6 on the right input (in
SIM2.OUT, the Y4 element of the input vector meeting the X6 element of the input
matrix). The row number refers
to the clock cycle, while the column number indicates the processing element of
the array where both elements meet. The constructed matrix itself is written by
just displaying the indices of the matrix elements for shortness.
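The meeting table of such a log can be regenerated from the two input orders alone. The sketch below is a Python illustration of the idea and not the C simulator described in this section; it assumes zero initial and spacing delays, one token per clock, and that a token released at clock c enters its end of the array at clock c and advances one processing element per clock. The function name is hypothetical.

```python
def simulate(n_pe, left_order, right_order):
    """Return one row per clock; each row lists the meeting seen by every PE."""
    left, right = left_order.split(), right_order.split()
    n_clocks = max(len(left), len(right)) + n_pe - 1
    table = []
    for clock in range(n_clocks):
        row = []
        for pe in range(n_pe):
            li = clock - pe                  # position in the left order now at this PE
            ri = clock - (n_pe - 1 - pe)     # position in the right order now at this PE
            a = left[li] if 0 <= li < len(left) else "-"
            b = right[ri] if 0 <= ri < len(right) else "-"
            if a != "-" and b != "-" and a[0] != b[0]:
                row.append(f"[{a[1:]},{b[1:]}]")   # elements of different types meet
            else:
                row.append("-")                    # idle slot or same-type pair
        table.append(row)
    return table
```

For instance, `simulate(4, "x0 x2 x1 x3 y3 x4 - x5", "y0 x6 y1 y0 y2 y1 y3")[4]` gives `['[3,6]', '[3,1]', '[1,0]', '[2,2]']`, the clock-4 row of the first log.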
It has been found that this simulator is not only a helpful
development tool, but also an aid to test the validity of an algorithm.
Chapter 5 
Conclusion 
5.1 Summary of Results
Systolic architectures are important in many signal processing
applications. Their high degree of multiprocessing and pipelining enables
them to achieve high processing speed. They are also simple and cost effective
for VLSI implementations because of their modular design and regular
communication. However, VLSI implementations are good only for applications
that are highly repetitive. In the area of digital signal processing, many
applications involving linear transformation problems such as digital filtering,
optimum linear filtering, Fourier, Hadamard and other transformations satisfy
the repetitivity requirement. In all these applications the resulting vector is
formed by repeatedly computing the inner product of two data vectors sliding over
each other.
Most linear systolic arrays used to solve digital signal processing
problems employ a number of processing elements linked into a chain and
communicating with nearest neighbors. In all of these arrays, the two data vectors
X and Y enter the array from the two ends of the array. Elements of X and Y are
never intermixed in a single sequence.
This thesis explores the possibility of mixing the elements of X and
Y to improve the time complexity of the systolic algorithms. In order that
unwanted products not be formed, we have proposed appending a tag bit to each
data point identifying it as belonging to a particular sequence. This work
reports new Commutative Systolic Algorithms for multiplication of matrices and
vectors in a general case as well as when the matrices have significant
structure. Table 5-1 lists the time complexities of various algorithms derived in this
thesis. The time complexities of equivalent Noncommutative Systolic
Algorithms are also listed in this table for comparison. An inspection of this
table shows that even though the complexity order is not changed by exploiting
the commutativity of the multiplication operation, the complexity itself is reduced
a great deal, sometimes by up to about 50%.
                                                Time Complexity
Application                            Noncommutative   Commutative
                                       Algorithm        Algorithm
Matrix-Vector product, length N even   N^2              (N^2 + 3N - 4)/2
Matrix-Vector product, length N odd    N^2 + N - 1      (N^2 + 4N - 3)/2
Toeplitz product, length N even        3N - 2           2.5N - 1
Toeplitz product, length N odd         3N - 2           2.5N - 1.5
Cyclic convolution, length N even      3N - 2           2.5N - 1
Cyclic convolution, length N odd       3N - 2           2.5N - 1.5

Table 5-1: Comparison of Algorithms Time Complexity
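
To make the size of the reduction concrete, the short Python sketch below evaluates the expressions of Table 5-1 for a given N and reports the fractional saving. It assumes that "N2" in the table denotes N squared; the function names are hypothetical and chosen only for this example.

```python
def matvec_steps(n):
    """(noncommutative, commutative) step counts for the matrix-vector product."""
    if n % 2 == 0:
        return n * n, (n * n + 3 * n - 4) // 2
    return n * n + n - 1, (n * n + 4 * n - 3) // 2


def sliding_steps(n):
    """(noncommutative, commutative) step counts shared by the Toeplitz
    product and cyclic convolution rows of the table."""
    noncommutative = 3 * n - 2
    # 2.5N - 1 for even N, 2.5N - 1.5 for odd N, written in integer form.
    commutative = (5 * n - 2) // 2 if n % 2 == 0 else (5 * n - 3) // 2
    return noncommutative, commutative


def reduction(noncommutative, commutative):
    """Fraction of the noncommutative step count that is saved."""
    return 1 - commutative / noncommutative
```

For example, `reduction(*matvec_steps(1000))` is just under 0.5, while `reduction(*sliding_steps(1000))` is roughly 0.17, so under these formulas the figure of about 50% applies to the matrix-vector case at large N.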
The mixing of the elements from the two sequences has the penalty
of additional hardware requirement. However, it has been shown that a proper
sequencing typically can be achieved through a circuit with a number of shift
registers and 2-to-1 multiplexers. The use of commutative algorithms also
requires the data paths to be widened to carry the additional flag bit. However,
this is simple in a systolic architecture where all the data paths are linear and
only travel between neighboring modules. Finally, the internal module
architecture needs to be modified to multiply only elements bearing distinct flags. In
practice, this could be achieved merely by XOR-ing the two flags and acting
accordingly. Thus the scheme proposed provides a considerable improvement in
time complexity with minimal increase in hardware complexity.
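
The flag test can be pictured with the following minimal Python sketch of a single multiply-accumulate step. It is a behavioral illustration under stated assumptions, not the module design of this thesis: the names `Tagged` and `pe_step` are hypothetical, flag 0 is taken to mark an element of X and flag 1 an element of Y, and the accumulator is assumed to be left unchanged when the flags match.

```python
from typing import NamedTuple


class Tagged(NamedTuple):
    """A data point together with its one-bit sequence flag."""
    value: float
    flag: int  # assumed encoding: 0 = element of X, 1 = element of Y


def pe_step(accumulator, a, b):
    """One multiply-accumulate step of a processing element.

    The product is formed only when the XOR of the two flags is 1, i.e.
    when the operands come from different sequences. Otherwise the
    accumulator is returned unchanged (an assumption of this sketch).
    """
    if a.flag ^ b.flag:
        return accumulator + a.value * b.value
    return accumulator
```

With this gating, `pe_step(0, Tagged(2, 0), Tagged(3, 1))` adds the product 6, while two operands carrying the same flag pass through without contributing an unwanted product.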
5.2 Future Extensions 
The work presented in this thesis opens up a new research direction
that exploits the commutativity of the operations performed in the processing
elements of the array. This provides an additional degree of freedom to the
designer of systolic algorithms. As shown in this thesis, this flexibility can be
used to decrease the time complexity of some 1-dimensional systolic arrays
solving various linear transformation problems commonly encountered in digital
signal processing.
The concept of trading the noncommutativity for time complexity is
not limited to linear arrays alone, and could be applied to other systolic
geometries such as two dimensional meshes and hexagonal arrays. It is quite
possible that in these different geometries, the application of this technique
could even reduce the order of time complexity.
This thesis has reported on various algorithms for solving a variety
of problems on commutative systolic arrays. A missing element in this
analysis is the derivation of bounds on the complexities of these algorithms.
Such an analysis, if completed, would provide valuable data regarding the
superiority of the commutative algorithms over the noncommutative algorithms.
REFERENCES 
. 
[1] Harold S. Stone, "Special-Purpose vs. General-Purpose Systems, A Posi-
tion Paper", in Algorithmically Specialized Parallel Computers, 
L.Snyder, L.H.Jamieson, D.B.Gannon, H.J.Siegel, eds., Academic Press, 
Massachusets, 1985, pp. 251. 
[2] H.T.Kung, C.E.Leiserson, "Algorithms for VLSI Processor Arrays", in 
Introduction to VLSI Systems, C.Mead, L.Conway, eds., Addison-Wesley, 
Massachusets, 1980, pp. 271. 
[3] Chin-Liang Wang, Che-Ho Wei, Sin-Horng Chen, "Efficient Bit-Level 
Systolic Array Implementation of FIR and IIR Digital Filters", IEEE 
Journal On Selected Areas In Communications, pp. 484, Vol. 6, No. 
3, Apr. 1988. 
[ 4] Ronald J.Cosentino, "Concurrent Error Correction in Systolic Architec-
tures", IEEE Transactions On Computer Aided Design, pp. 117, Vol. 
7, No. 1, Jan. 1988. 
[5] Rob A. Rutenbar, D.E.Atkins, "Systolic Routing Hardware: Performance 
Evaluation and Optimization", IEEE Transactions on Computer Aided 
Design, pp. 397, Vol. 7, No. 3, Mar. 1988. 
[6] Long Wen Chang, Ming Young Chen, "A New Systolic Array for Discrete 
'Fourier Transform", IEEE Transactions On Acoustics, Speech, and Sig-
nal Processing, pp. 1665, Vol. 36, No. 10, Oct. 1988. 
[7] H.T.Kung, L.M.Ruane, D.W.L.Yen, "A Two-Level Pipelined Systolic Ar-
ray for Convolutions", in VLSI Systems and Computations, 
H.T.Kung,B.Sproull,G.Steele, eds., Cqmputer Science Press, M9-ryland, 
1981, pp. 255. · 
[8] James V. Candy, Signal Processing, The Model-Based Approach, Mc-
Graw Hill, NewYork, 1986. 
[9] H.T.Kung, ''Why Systolic Architectures ?", IEEE Computers, pp. 9, Vol. 
15,No.1,Jan.1982. 
[10] Hon Keung Kwan, "Systolic Realization of Linear Phase FIR Digital Fil-
ters", IEEE Trans. Circuits & Systems, pp. 2, Vol. cas-34, No. 
12, Dec. 1987 . 
[11] J.J.N avarro,J.M.Llaberia,M.Valero, "Partitioning: An Essential Step in 
Mapping Algorithms Into Systolic Array Processors", . IEEE 
Computers, pp. 77, Vol. 20, No. 7, July 1987. 
[12] Guo-Jie Li & Benjamin W. Wah, "The Design of Optimal Systolic ~-
.-.- -; rays", IEEE Transactions on Computers, pp. 66, Vol. c-3tl, No. 
1, Jan. 1985. 
VITA 
The author, Ray A. Gunara, was born on September 9, 1966 in
Surabaya, Indonesia. He was the eldest of three children of Peter A. Gunara and
Maria Patricia Tien Kalangi. Following his graduation from the Canisius
Senior High School in Jakarta, Indonesia in 1984, he went to continue his study
at Villanova University, located in Villanova, Pennsylvania. After spending
three and a half years of study, he earned his Bachelor of Electrical Engineering
degree with a concentration in Computer Science in December 1987. Upon
graduation, he went directly to Lehigh University's graduate school in the
following Spring semester of 1988, where he would then earn his Master of Science
in Electrical Engineering degree in the summer of 1989. The author has been a
student member of IEEE and the Computer Society since 1986, and he is also a
member of Eta Kappa Nu.
