Systematic design of two level pipelined systolic arrays with data contraflow by Valero García, Miguel et al.
SYSTEMATIC DESIGN OF TWO-LEVEL PIPELINED SYSTOLIC ARRAYS 
WITH DATA CONTRAFLOW 
M. Valero-Garcia, J.J. Navarrro, J.M.LLaberia, M. Valero 
Facultad de Informatica de Barcelona (U.P.C) 
Pau Gargallo,5, 08028 Barcelona (Spain) 
ABSTRACT 
Many systolic algorithms and related design methodologies 
have been recently proposed. Frecuently, in these systolic 
algorithms practical considerations are not taken into account. 
Equitatively distributed load between processing elements, 
pipelined functional units etc, are desirable features when 
implementing systolic algorithms.In this paper we present a 
design methodology in which these features are considered. As 
an example, the methodology is applied to obtain a 
problem-size-independent, two-level pipelined 1D systolic 
algorithm with data contraflow to efficiently solve triangular 
systems of equations. 
INTRODUCTION 
The design of simple and feasible, high performance, and 
algorithmically specialized concurrent systems is a very 
desirable goal when we consider many applications which 
require a great amount of computations performed at high 
speed. A present example of such systems, which are nowadays 
actively researched, are Systolic Array Processors [l]. 
A large set of Systolic Algorithms (SAS), as well as SA design 
methodologies [ 21 have been recently published. Nevertheless, 
a large part of these SAS shows several characteristics which are 
not suitable for its direct implementation. In these cases, 
adecuate transformations must be done on the SA to map it 
conveniently onto the proposed architecture, accordingly to the 
restrictions imposed. In this paper three types of 
transformations applied to SAS are considered: 
I .-Partitioning. That is, problem-size-dependent SAS are 
transformed into problem-size-independent SAS. In a 
problem-size-dependent SA the number of processing elements 
(PES) depends on some dimension of the data structures of the 
problem to be solved. The number of PES is, on the contrary, 
previously fixed and is independent of the problem dimensions, 
when a problem-size-independent SA is appointed. Partitioning 
issues are important because the number of PES in the SA 
becomes a fixed value on account of technological and/or 
economical considerations. 
2.-Computational balance. This aspect means the need of 
transforming non-balanced SAS into balanced SAS. The 
non-balanced SAS are caracterized for the requirement imposed 
on some PES to perform calculations which are of greater 
complexity that those to be accomplished, in a single systolic 
cycle, by other PES. Nevertheless, in a balanced SA, the load is 
distributed between all the PES as much equitatively as possible, 
in order to obtain that the PES perform useful task the maximun 
amount of time attainable. 
This work was supported by CAICYT under contract PA85-0314 
3.-Two-level pipeline. It consist in the transformation of 
one-level pipelined SAS into two-level pipelined SAS, where the 
functional units inside the PES are pipelined. To implement this 
kind of SAS is important if we want to get systems with high 
throughput. 
Several  authors have undertaken the design of 
problem-size-independent SAS. Reference [3] shows enough 
information on this subject. 
We have not found, until the present, any report on research 
with a formal treatment of the design of balanced SAS to solve 
non-homogeneous problems. 
The design of two-level pipelined SA ha been presented by 
Kung and Lam in [4], where SAS without data contraflow [ 3 ]  
are considered. It is well known the existence of 
non-homogeneous problems with dependences between results 
(i.e. triangular linear systems, LU decomposition, ...) solved 
efficiently by means of SAS with data contraflow. Kung and 
Lam propose for these problems a new type of SAS without data 
contraflow, called "Systolic rings" [4]. It is to remark, 
nevertheless, that the implementation of SAS without data 
contraflow require a greater amount of hardware (for functional 
units and control) than SAS with data contraflow for these 
problems. For example, in a systolic ring to solve triangular 
system of equations, all the PES must perform divisions in some 
cycles and multiplications and additions in other cycles. 
In this paper, a systematic methodology to transform 
non-balanced and/or one-level pipelined SAS into balanced 
and/or two-level pipelined SAS, is to be presented. This 
methodology is based upon two transformation rules, to be 
applied on a model of the SA. The proposed model is an 
extension of one previously presented by Kung and Li [SI. The 
first rule is a combination of the two formal transformations 
proposed in [ 5 ] .  The second rule we propose is a formal 
transformation to the attainment of a 1-slow algorithm from a 
k-slow one, through adjacent PES coalescing. With these 
transformation rules we obtain balanced and/or two-level 
pipelined SAS with data contraflow. To reach 
problem-size-independent SAS, we use techniques of 
partitioning and DBT transformations recently published in [3]. 
This paper is structured as follows: Section 2 presents the 
formal  model fo r  1D SAS.  The  case of a 
problem-size-independent SA to solve triangular systems of 
equations is presented as an example. Section 3 enunciate the 
two proposed rules for transformations. In section 4, these rules 
are applied to obtain a problem-size-independent, balanced and 
two-level pipelined 1D SA with data contraflow from the 
non-balanced and one-level pipelined SA presented in section 2. 
ISCAS'88 
2521 
CH2458-8/88/0000-2521$1 .OO 0 1988 IEEE 
CHARACTERIZATION OF 1D SA 
A ID SA consists in the grouping of w Processing Elements 
(PES), which we shall name PE,, ..., PE,. These PES are 
interconnected through unidirectional links. Each PE is also 
possibly communicated with the outside world through IIO 
unidirectional links. The 110 links are used to input (or output) 
sequences of data to (or from) the SA; these sequences will be 
named here as input (or output) flows. 
Every PE simultaneously receive all data involved in the 
computation to be performed in each cycle. Any computation 
may produce one or several results. These results may leave the 
PE in different cycles, depending upon the time interval needed 
for the PE to produce them. A time delay is associated to each 
link which is used to connect any two PES. This delay is 
measured by the number of registers along the link. 
Some PES in the SA may perform certain operations during 
some cycles, and other type of operations during other cycles. 
If this is the case, we say that the PE present a 
non-homogeneous time behaviour. When each PE initiates one 
valid operation every k cycles, we say that the SA is k-slow. 
In order to simplify the required notation, we assume that only 
one link, at most, in each direction exists between any pair of 
PES in the SA. To extend this assumption to a general case is 
straighforward. 
Modelization of a 1D SA 
A ID SA with w PES may be modelled by the tuple A=(w, I, 
0, L, R, E, S, P, k) and by the definition of operations 
performed in each cycle by every PE. 
The value w is the number of PES of SA. I is the set I], ..., I,,, 
where p is the number of links entering into the SA from the 
outside. Ij, for j  E [l..p] is the data sequence {Ij('), Ij(2), ...} 
which inputs to the SA through the j-th input link. 0 is the set 
01, ..., 0 , where q is the number of links leaving the SA. O., 
P 1 
for j E [ I  ...SI is the data sequence {Oj(l), Oj(2), ...} which 
outputs the SA through the j-th output link. 
L, R, E and S are matrices in the form: X(i,j) = z-x(iyJ) or 
X(ij) = 0. L (latency matrix) has w-by-w elements. The value 
of I(ij) is the number of cycles needed by the PEj to calculate 
every data item to be sent to the PE,. If no communication exist 
between PEj and PEi, then we have L(ij)=O. R (register matrix) 
has w-by-w elements. The value of r(ij) is the delay associated 
with the link from PEj to PEi. If such link does not exist, then 
R(i,j)=O. 
When a non-homogeneous time behaviour is exhibited by PEj, 
and PEj is linked with PEi, then we have 
where nj is the number of different operations performed by 
PE.; we name these operations as OP ],..., OP,, . The value of 
1 ,(i,j) is the number of cycles needed by PE, to calculate, by 
means of OP,, the value to be sent to PEi. Similary, rS(i,j) is the 
J 
delay associated to the link between PEj and PEi, when PEj 
performs OP,. 
E (entrance matrix) has w-by-p elements, and e(ij)  is the 
number of cycles from the begining of SA operation until the 
PE, receives the first data item in the Ij sequence. If PE, receives 
no data of Ij, then E(i,j)=O. S (sally matrix) has w-by-q 
elements, and s(i,j) is the number of cycles from the begining 
of SA operation until PEi initiates the calculation of the first data 
item in the Oj sequence. If PE, produces no data for Oj, then 
S(i,j)=O. P (periodicity vector) has w elements. The element 
P(i)= p(i) is the number of cycles from the begining of an 
operation in PEi until the moment in which PEi may initiate the 
next operation. If PEi exhibits a non-homogeneous time 
behaviour, we shall write P(i)=(p'(i), p2(i), ...,p"y i)), where 
ps(i) is the periodicity of operation OP,. Finally, the value of 
integer k indicates the slow of the SA. 
Modelling of a 1D SA to solve triangular systems of 
linear equations. 
In figure 1 we show the structure of an special type of 1D 
SAS, which we namelD spiral SAS with data contraflow. This 
kind of SAS may serve to solve a broad range of matrix 
problems, such as matrix multiplication, triangular systems of 
linear equations, LU decomposition, QR decomposition by 
Givens rotations, etc. All these problems can be solved, for any 
value of the involved matrix dimensions, on a fixed-size SA 
(with a given number of PES), by means of the partitioning 
technique known as DBT [3]. Xb 
STEPS 
19 
17 
16 
15 
14 
13 
12 
11 
10 
9 
8 
7 
6 
5 
4 
3 
2 
1 
i a  
N 
* 1  
* *  
* 1  
* *  
+ 1  
t t  
x3 0 
x2 0 
x1 0 
t *  
t .  
t .  
* 1  
t .  
* 1  
* *  
+ 1  
* *  
t t  o 1 b i g  * t  
I PE1 I PE2 and PE3 I 
Figure 1. Problem-size-independent Systolic Algorithm for 
triangular system of equations. 
2522 
We shall now present the modelization of a I D  spiral SA with 
data contraflow to solve a triangular system of linear equations, 
with any size. In [3] a detailed description of this SA operation 
can be found. Figure 1 shows the SA in the particular case of 3 
PES, as well as the I/O data sequences to solve a system with 6 
unknowns. There is non-homogeneous time behaviour in PE,, 
because it performs one division and one sign change during 
some cycles, but one multiplication followed by one addition 
during other cycles. This PE, receives a control signal (C,) 
which can be treated as another input flow. C, determines, in 
each cycle, which one of the two possible operations are to be 
performed in the PE. All other PES, except PE,, always 
perform one multiplication followed by one addition. The model 
is as follows: 
I (i,i+l) = 0, i E [ l  ... w-11; 
1 (w,l) = 1 l(2,I) = I  2(2,1) = 0; 
because, in the original design, the assumption of zero cycles to 
perform any operation is made. On the other hand, 
r(i,i+l) = I ,  
r(w,l) = w+l; r1(2,1)=r2(2,1) = 1; R(ij) = 0 in the other cases; 
because all the links between PES have an associated delay equal 
to 1, except the link between PE, and PE,, which has a delay 
of value w+l .  
1 (i+l,i) = 0, i E [2 ... w-11 
L(i,j) = 0 in the other cases; 
i E [ I  ... w-I]; r(i+l,i) = 1, i E [2 ... w-11; 
We denote the input flows to the SA as follows: 
Ii = Ai i E [ I...w]; I,+,=B; I,+2=Xe; I,+-, =Cl; I,+, = C ,  
We denote the only output flow as: O,=X, . 
The Ai, B, X,, C,,  C ,  and X, sequences are specified 
according to the DBT transformation rules, when applied to this 
kind of problem (see figure I ) .  
Entrance and sally matrices are specified as: 
e(i,i) = w+i-2, i E [ I  ... w]; e(l,w+2)= 3w-1; e(l,w+3) = w-l 
e(w,w+l) = 0; e(w,w+4) = 0; E(ij) = 0 in the other cases. 
s(w,l) = - (2w-1); S( i j )  = 0 in the other cases 
The periodicity vector can be specified as: 
P(i) = 1, 
The slow of the SA is k=2. 
i E [2 ... w]; P( l )  = (1,l). 
TRANSFORMATION RULES IN ID SAS 
Now we shall present two transformation rules for 1D SAS, 
which are to be used in the proposed design methodology. 
Rule 1. H.T. Kung presents in [5] two transformations 
applicable to time homogeneous SAS. The first one is equivalent 
to the retiming lemma proposed by Leiserson [6]; i t  allows a 
register redistribution inside the SA. The second transformation 
allows to attain a c-slow version of the SA; this is accomplished 
through a multiplication by c of the number of registers 
associated to each link between PES, and through an adequate 
modification of the input and output data sequences. 
A rule, extending these two mentioned transformations, 
follows in this paper. Main features of this rule are: a)  
computation time intervals (L matrix) and communication time 
intervals (R matrix) are distinguished; b) specification of SAS 
with non-homogeneous time behaviour is alloved. Rule is: 
The A' = (w, I', 0', L', R', E', S', P', k') SA performs the 
same calculation as A = (w, I, 0, L, R, E, S, P, k) i f  
L' o R I  = D ( L O  R)C D-1 
E' = D EC 
k' = ck 
S' = (Sc) D-' 
I'i(k't+l) = Ii(kt+l) 
O';(k't+l) = Oi(kt+l) 
i E [ I  ...PI and t 2 0 
i E [1 ...q] and t 2 0 
where symbol "o" denotes a matrix operation defined as 
follows: if PE, exhibits non-homogenous time behaviour 
(L R) ( i j )  ~ z-(l'(i,j)+rA(i,j).. . . .~~j(i,j)+r~(i,j)) 
and if PEj always performs same operation (homogeneous time 
behaviour) 
(L 0 R) ( i j )  = z-(l ( id+r(ij))  
1 
Any matrix ,of the form Xc is defined as: Xc(i,j) = z-cx(i9j) if 
X(i j )  = z-'('J) or Xc(i,j) = 0 if X(i,j) = 0, D is a diagonal 
matrix, with w*w elements of the type D(i,i) = z-d$ where di 
and c belong to the set of rationale numbers. 
Rule 2 (coalescing rule).The goal of this second rule is to 
obtain a transformation of a k'-slow SA, in which any PE 
performs only one valid operation every k' cycles, into a 1-slow 
SA with lower number of PES, in which any PE performs 
useful task in all cycles. In this manner we perform the same 
computations, in the same time, but using less hardware more 
efficiently. This rule is expressed as follows: 
Given the A'= (w, I', 0', L', R', E', S', P', k') SA; where 
p''(i)=l, with s E [1 ... nil and i E [l..w], we shall obtain 
another SA, A*= (w*, L*, R*, E*, S*, P*, k*), able to 
perform the same calculation as A' if 
t(i) mod k' <> t(j) mod k'; i j  E [(q-l)k'+l, ...,q k']; i <> j 
q E [1 ... k*] (1) 
where t(i) is the cycle in which PEi of A' initiates the first valid 
operation. 
For different periodicities (#l)  this rule may be enunciated in a 
similar way, by modifying condition (1). 
The described condition guarantees that k' adjacent PES of SA 
A' initiate valid operations in k' different cycles. Accordingly, 
the PE of A* may perform, via coalescing, the calculations 
performed in A' by processing elements PE(q-l)kl+l ,...,PEqk', 
without any conflict in the utilization of functional units. 
q 
The structures of each PE in A*, the form of L*, R*, E* and 
S* matrices, and the form of vector P*(i) are obtained through a 
simple algorithm; details are omitted here for reasons of space, 
but may be found in [7]. The number of PES in the new SA is 
w*=w/k'; the new slow is k*=l; periodicity is equal to 1 for all 
operations. Input sequences to each PEq of A* are obtained by 
interleaving the input sequences to PES of A' whose task is 
performed now by PEq (figure 2). Output sequences are 
obtained in a similar way. Detailed specification of these 
sequences is also presented in [7]. 
2523 
* * 1 3 9 )  * I.g(5) 
I j (9)  Ij(5) 
Ii(5) * * * 
I j (5)  * I6(1) 
I j (5)  Ij(1) 
* I i (1)  * 
11'6) * 
I j ( l )  * t t 
t Ii(,) t t t t 
* t  
I i (1)  t *  I j (9)  * * 
* t  
t t * t + * t  
Figure 2. Example of sequences interleaving when applying 
rule 2 (coalescing) for k '  = 4, and w = 8. 
TWO-LEVEL PIPELINED 1D SA WITH DATA 
CONTRAFLOW FOR TRIANGULAR SYSTEMS OF 
EQUATIONS 
In this section a two-level pipelined 1D SA with data 
contraflow, which attains maximum hardware utilization, is to 
be obtained. This SA is designed by using transformation rules 
1 and 2 of previous section. These transformations are applied 
here to SA A of section 2.2. SA A only has one specified level 
of pipelining, and all the operations show an equal time cost. 
Nevertheless, in the attained design, some implementation 
features which are taken into account include: pipelining of 
functional units of each PE, periodicity and different calculation 
time intervals for every operation performed by each PE. The 
first applied transformation allows to pass from A to A'; A' 
includes the above mentioned implementation features. For that 
reason, in the new SA A', matrix L' and vector P' are 
predetermined. Afterwards, the second transformation is applied 
and we pass from A' to A* by means of rule 2. So, A' must 
satisfy the required condition to be guaranteed if rule 2 is used. 
Now, we have obtained a SA in which every PE is maximally 
utilized. 
That is, we have now the problem of how to obtain a D 
matrix and a value c, in order to achive the SA A' through 
application of rule 1. 
In the considered example, conditions to be satisfied by matrix 
D and by value c can be expresed as: 
c+di-di+,2 max (1,l '(i,i+l)) , 
c+di+l-di 2 max (lJ'(i+l,i)) , 
i E [1 ... w-I] 
i E [2 ... w-I] 
(a.1) 
(a.2) 
(w+l)c+d,-dl> max(1,l '(w,l)) (a.3) 
c+d,-d, 2 max(l,l"(2,1), 1 ',(2,1)) (a.4) 
(w+i-2)c+di > 0 , 
d , 2 0  
i E [I  ... w] 
Because t(i) = (w+i-2)c+di; 
((w+i-2)c+di) mod k' <> ((w+J-2)c+d.) mod k 
i j  E [(q-l)k'+l, ..., qk']; 
(c) 1 
i#j, qE [ l ,  ..., wk'];  and k' = 2c 
Conditions (a) and (b) determine that the obtained SA A' 
satisfies the implementation constrictions imposed by L' and P .  
More in detail, conditions (a) establish that the number of cycles 
elapsed from the begining of a calculation in one PE, and the 
arrival of its result to the destination PE must be greater than or 
equal to the number of cycles needed for the production of the 
data item in the PE. This value must be equal to 1, at least, to 
avoid global communication requeriments. Conditions (b), in 
the other hand, establish that the number of cycles elapsed from 
the begining of the operation in A' and the arrival to any PE of 
the first data item coming from any one of their input flows, is a 
positive number. (Obviously, the contrary has no physical 
sense). Condition (c) serves to that rule 2 may be applied to A'. 
SA implementation example. 
Let us assume that we have the following characteristics of a 
certain implementation of a 1D spiral SA with data contraflow, 
used to solve triangular systems of linear equations: 
- Multipliers and adders, used in the design of functional units 
are pipelined with respectively m and a stages. Consequently, 
one multiplication requires m cycles and one addition requires a 
cycles. 
- To perform divisions, the divisor inversion algorithm is used 
[8]: Q = AIB = -AR(2+RB); where R is an approximation to 
-1IB, obtained by indexing a table with some bits of B. This 
calculation requires one access to the table, besides 3 
multiplications and 1 addition. Two of these multiplications can 
be performed in parallel. Consecuently, if we use 2 multipliers 
and 1 adder, the number of required cycles to perform one 
division is 2m+a, if we negleet the time to access the table. 
About the SA A' which we are looking for, the following 
relations are known, in this case: 
1 '(i,i+l) = m+a, i E [ I  ... w-11; 1 '(i+l,i) = 0, i E [2 ... w-11 
/ ' (W,I )  = m+a; ~*1(2,1) = 2m+a; 1*2(2,1) = o 
L'(ij) = 0 in the other cases. 
P'(i) = 1, i E [2 ... w]; P'(1) = (1,l). 
One posible set of values di and c, satisfying conditions (a), 
(b) and (c) is : 
d l =  c(w-l)-w+l-2m-a; dj = c(w-i)-w+i-I, i E [2 ... m+a+l]; 
di = c(w-i)-w+i, i E [m+a+2 ... w]; c= (3m+2a)/2 ; 
As the found solution satisfies condition (c), we are able to 
transform the k'-slow SA A' into a I-slow SA A* (figure 3) 
with w*=wk' PES, and where the PEq of A* performs those 
operations performed by PE(q-,)k,+l,...,PE of A'. Figure 4 
shows, for instance, the intemal structure of PE1 of A* in the 
considered example. This PE, performs the same operations 
which were performed by PEI,...,PE3m+2a of SA A'. The 
selection signals of inultiplexors are generated from a module 
3m+2a counter and from the extemal signal Cl*. In figure 5 the 
intemal structure of any one of the remaining PES of A*, can be 
seen. 
The input sequence to each PE of A* is obtained by 
interleaving the input sequences of k' consecutive PES (k'= 
3m+2a) of A'. More precisely: 
qk 
2524 
qk' 
m(q) = min t(i) h=(q-l)k'+l 
A*q(t+l) = A',(hk'+l) if t-t(r)+m(q) = hk', 
The computational time required to solve in this SA a 
q E [ I  ... w*] 
B* = €3'; Xe* = Xe'; C,* = C1'; C,*= Cm'; Xs* = X,' 
triangular system of equations with N unknows, becomes: 
3m+2 N2 
T=[ - [ ~ + N + 2(3m+2a)w*-4] - (3m+2a)+l]tc 
2 (3m+2a)w* 
where tc is one cycle time. 
As an example, if our implementation has w*=lO, m=a=3, 
and tC = loons, a number of 500 triangular systems of 
equations with 600 unknowns each, can be solved in approx. 
1.22 seconds. 
X', C', A', A', A', A I N .  C',,, B'X' ,  
1 1 1 1 .It . I t ,  .rl 
-U Y 
3 m + l a  3 m r 2 a  3 m + 2 a  3m t 2a 
Figure 3. Array structure of A'. 
i A* ,  
I I ] Feed- 
b 
Figure 4. PEI internal structure of A'. 
CONCLUSIONS 
In this work we show how the use of a mathematical model 
and the application of adequate transformations rules, allows the 
design of new efficient SAS. 
Starting from simple SAS which are not well suited for a real 
implementation , we obtain SAS where practical features are 
taken into account, namely: two levels of pipelining, fixed 
number of processing elements, unequal computational times 
for different operations, etc. 
The proposed design technique has been applied in this paper 
to get the design of a problem-size-independent, computational 
balanced and two-level pipelined 1D SA with data contraflow to 
solve triangular systems of equations. 
This methodology can be easily generalized for any type of 1D 
as well as 2 0  SAS and is suitable for any interconnection 
topology. 
REFERENCES 
[ I ]  H.T. Kung and C.E. Leiserson, "Systolic Arrays (for VLSI)", Sparse 
Matrix Proc. 1978, 1979, Society for Industrial and Applied 
Mathematics (SIAM), pp. 256-282. 
[21 J.A.B. Fortes, S.Y. Fu, and B.W. Wah, "Systematic Approaches to 
the Design of Algorithmically Specified Systolic Array Arrays",Proc. 
Int'l Conf. Acoustics, Speech, and Signal Processing, 1985, pp. 
[31 J.J. Navarro, J.M. Llaberia, and M. Valero, "Partitioning:An 
Essential Step in Mapping Algorithms into Systolic Array 
Processors", Computer, V01.20, No. 7, July 1987, pp. 77-89. 
[41 H.T. Kung, and M. Lam, "Wafer-Scale Integration and Two-Level 
Pipelined Implementations of Systolic Arrays",J. Parallel and 
Distributed Processing. Vol. 1, No. 1, August 1984, pp. 32-63. 
[51 H.T. Kung, and W.T. Lin, "An Algebra for Systolic Computation", 
Elliptic Problem Solvers 11, Academic Press 1984, pp. 141-160. 
[61 C.E. Leiserson, and J.B. Saxe, "Optimizing Synchronous Systems", 
Proc. of the 22nd Annual Symp. on Foundations of Computer 
Science, October 1981, pp. 23-36. 
171 M. Valero-Garcia, J.M. LLaberia, J.J. Navarro, "Considering 
implementations features in the design of systolic array processors", 
Internal report, RR 88/01, Facultad Informitica de Barcelona. 
181 Floating Point Division/ Square Root/ IEEE Arithmetic WTL 
1032/1033 , Application Note, Weitek, 1983. 
300-303. 
I A *  
I 2 m + a - l  
1 1 
Figure 5. internal structure of PE, for i = 2 to w* of A' 
2525 
