Synthesis aspects in the design of efficient processor arrays from affine recurrence equations  by Clauss, Philippe & Mongenet, Catherine
J. Symbolic Computation (1993) 15, 547-569 
Synthesis aspects in the design of efficient processor 
arrays from affine recurrence equations 
PHIL IPPE  CLAUSS t AND CATHERINE MONGENET t 
! Universitd Robert Schuman, IECS, 47 avenue de la For6t Noire, 
67000 STRASBOURG, FRANCE 
I Universitd Louis Pasteur, Ddpartement d'Informatique, 7 rue Rend Descartes, 
67084 STRASBOURG, FRANCE 
(Received P6 February 1993) 
Efficient implementation of problems on processor arrays reqttires dedicated compiling 
techniques. This paper proposes ynthesis techniques in the design of processor array 
architectures from system of affme recurrence equations. They implement the space- 
time mapping which determine the scheduling of the calculations and the allocation of 
these calculations to a surface of processors. The paper emphazises the two-steps space 
mapping which is composed of a classical linear allocation followed by the application of 
two specific partitioning techniques. These techniques reduce the number of processors 
and therefore result in more efficient parallel solutions. The paper is illustrated with the 
Cholesky factorization. 
1. In t roduct ion  
More and more processor array architectures are available and the implementation of
a problem on such architectures i  a crucial question. The design of efficient solutions re- 
quires the mastering of dedicated compiling techniques. Many research works are tackling 
this difficult question. 
The formalism of recurrence quations is appropriate for the description of many prob- 
lems. The class of uniform recurrence quations (Karp, Miller and Winograd, 1967) was 
first extensively studied, for it is a suitable way to specify a problem to be mapped 
onto a systolic array (Moldovan (1983), Quinton (1984), Delosme and Ipsen (1985), 
Mongenet (1985), Fortes, Fu and Wah (1985)). However this class of equations is too 
restrictive: many problems cannot be expressed naturally in terms of systems of uni- 
form recurrence quations. But these problems can often be defined by a more general 
class of recurrence quations: the affine recurrence quations. Many research works are 
now tackling this larger class of equations (Mongenet and Perrin (1987), Mongenet and 
Perrin (1988), Kung (1988), Roychowdhury, Thiele and Kailath (1988), Wong and De- 
losme (1988), Yaacobi and Cappello (1988), Rajopadhye (1989), Quinton and Van Don- 
den (1989), Mongenet, Clauss and Perrin (1991)). 
The design of parallel architectures to implement he class of affine recurrence qua- 
tions often results in more general processor arrays. The aim of this paper is to present 
0747-7171/93/5-6547 + 23$08.00/0 © 1993 Academic Press Limited 
548 Ph. Clauss and C. Mongenet 
compiling techniques in order to synthesize processor array solutions from systems of 
affine recurrence quations (SAKE in the following). 
A SARE defines the elementary calculations of a problem as points over an integer- 
coordinates domain. The technique consists in the determination of a parallel execution 
of these points on an architecture. It therefore requires the determination of a space-lime 
mapping characterized by 
- -  a scheduling or timing of the calculation points, 
- -  a space organization of the computation, i.e. the allocation of each calculation point 
onto a processor of the parallel architecture. 
A space-time mapping often consists of an affine timing function and of a linear alloca- 
tion by projection. The resulting architectures usually require more processors than the 
parallelism rate of the solutions. For this reason, they can be considered as inefficient. 
To get more general and more efficient architectures, a 2-steps space mapping is pro- 
posed. It first consists of a usual linear allocation and next of the application of specific 
allocation or partitioning techniques. These techniques reduce the number of processors 
without modifying the global computation time of the solution. This approach differs 
from other works (Moldovan and Fortes (1986), Navarro, Llaberia and Valero (1987),Bu, 
Depretrerre and Dewilde (1990), Teich and Theile (1992)) which focus on the partition- 
ing of the problem in order to map it onto a fixed-size array. Such techniques result in 
slower computation time of the solutions. 
The paper is organized as follows. Section 2 recalls the concept of SARE. The space- 
time mapping requires first an appropriate characterization f the dependencies between 
the calculation points. This dependency analysis is also presented in section 2. Section 3 
focuses on the time and space mapping problem. Section 4 presents the synthesis aspects 
in the design of a solution. Two specific allocation techniques called the clustering tech- 
nique and the serializing technique are presented in section 5. These techniques require 
geometrical tools to model the activity of any architecture. For each technique we point 
out the synthesis aspects to compile efficient architectures. These points are illustrated 
with the example of the Cholesky factorization. 
2. The  sys tems of  aff ine recur rence  equat ions  
2.1. DEFINITION 
An affine recurrence quation is of the form : 
X[z] = f ( . . . , r [p (z ) ] , . . . ) ,  z • Dp, (E) 
where X and Y are variable names indexed with integral indexes over a convex bounded 
polyhedron t Dp C Z n, called the indez domain. Its bounds are characterized by the 
size parameterp • Z q. The computation function f is any function of complexity O(1). 
Function p is an affine function from Z n to Z '~' (n' < n). It is called the index function. 
It defines the item of variable Y used to calculate the item X[z]. Notice that if Y is a 
result variable, we have n' = n. 
t When several identical equations are associated with different domains, we simplify the presentation 
by defining one equation on a union of convex bounded polyhedra. 
Design of Efficient Processor Arrays 549 
The initialization of the recurrence defined by equation (E) is specified by the initial- 
ization or input equations. At least one such equation is associated with each result X of 
the problem. It is of the form : 
0 i n ' X[zo] = d, Zo E Dp C (Init) 
where d is either a data or a result item and Zo corresponds to the initial step of the 
recurrence. In practice the initialization domain D ° is adjacent to the calculation do- 
main Dp. 
EXAMPLE : The Cholesky factorization. Let A be an n x n symmetric positive definite 
matrix. This factorization calculates a lower triangular matrix L such that A = LL t. It is 
defined by the following SARE : 
L(i , j ,k) = L( i , j ,k  (2.1) 
1 
L(i , j ,k)  = L( i , j ,k  (2.2) 
1 
L(i , j ,k)  = L( i , j ,k  (2.3) 
1 
The input equation is : 
L( i , j , -1 )  = Aji, 1 < i < n, 1 <_ j <_ i. 
The resulting matrix L is composed of the elements : 
Lij =L( i , j , j -1 ) ,  1 <i<n,  1 < j  < i .  
The problem is characterized by 3 convex subdomains Da, D~ and D3 respectively asso- 
ciated with equations (1), (2) and (3). 
-1 ) -L ( i , k+ l ,k )  x L ( j , k  + l , k ) ,  
<i<n,  l<__j<_i, 0<k<j -2 ;  
- 1 ) /L ( j , j , j  - 1), 
<i<n,  l<_ j<_ i -1 ,  k=j -1 ;  
- 1)1/~, 
<i<n,  j= i ,  k= i -1 .  
D1 = {(i, j ,k) eZa l l< i<n,  l< j<_ i ,  O<k<j -2} ,  
D2 = {( i , j , j - l )  EZa l l  < i<n,  l _< j_< i -1} ,  
D3 = {( i , i , i -1 )~Za l l< i<n}.  
2.2. DEPENDENCY ANALYSIS 
The space-time mapping lies on a dependency analysis which automatically extracts 
from the SARE the causal dependencies between its elementary calculation points. 
Considering equation (E), we characterize the set of points using a given item Y[(o], 
(~0 E Z n'. This set is called the utilization set and defined by : 
DEF IN IT ION 2.1. The utilization set associated with a given item Y[ff0] relative to equa- 
tion (E) is: 
UtilE(Y, C0) ---- {z E Dp ]p(z) = C0}- 
550 Ph. Clauss and C. Mongenet 
Since the index mapping p is an affine function, the solutions to any p(z) = ~0 form 
an affine subspace of dimension m < n of Z '~. Therefore UtilE(Y, ~'0) is the intersection 
of this affine subspace and the convex bounded polyhedron domain Dp. Hence we have 
the following property :
PROPERTY 2.1. Every utilization set Util~(Y,(0) is a convex bounded polyhedron. 
DEFINITION 2.2. The utilization vectors associated with variable Y relative to equation 
(E) are any basis vectors of UtilE(Y,~o) for any ~o. They are denoted by ~E,Y,i for i = 
l, . . . ,m.  
In the compiling process of a parallel architecture, these vectors modelize the way data 
are used, i.e. their streams organization. 
REMARK 1: When the utilization set is of dimension 0, i.e. when it is reduced to one 
point, we characterize it by a null utilization vector. 
Notice that the causal dependencies imposed by the problem are only related to the 
result variables. Hence a data does not define any a priori dependency between calcula- 
tions. For any result item Y[(0] the set of points which causally depend on point if0 is 
UtilE(Y, ~0). The causal dependency related to this utilization set is classically expressed 
by the notion of dependency vectors. 
DEFINITION 2.3. A dependency vector associated with result variable Y relative to equa- 
tion (E) is any vector qY, p(z) = z -  p(z). 
The causal dependency associated with a result variable is then characterized by the 
dependency vectors q and by the utilization vectors ~. The dependency vectors connect a
point (0 = p(z) computing the result item Y[(0] to all the points z, z E UtilE(Y,~0), that 
causally depend on ~0, i.e. that use the result provided by ~'0 for their own computation. 
Notice that there are as many sets of dependency vectors associated with a given result 
variable Y as there are occurrences of Y in the right hand side of equations (E). 
All dependency vectors qY,¢o of a given set have the same origin ~0 and their extremities 
belong to the convex bounded polyhedron Utile(Y, ~0). From classical convex geometrical 
results (Schrijver, 1986) we deduce the following property :
PROPERTY 2.2. The set of dependency vectors qV,¢o associated with result item Y[(0] 
relative to equation (E) defines a cone which is characterized by its finite set ofextremal 
vectors. These extremal vectors are denoted by qY,(o,ex, for i E N. 
There are as many q-cones as there are result variables Y appearing in the right hand 
side of any equation (E). 
REMARK 2: Any dependency vector of a cone is defined as a linear combination of the 
extremal vectors: q = ~'~i a tYez~ where the coefficients ai are non-negative and at least 
one of them is greater than zero. 
EXAMPLE : The Cholesky factorization. In the 3 subdomains there is a constant 
dependency vector q0 = (0,0, 1) due to the use of L( i , j ,  k - 1) in all three equations. 
Design of Efficient Processor Arrays 551 
In equation (1) element L(i, k + 1, k) defines utilization set Util l(L, ff0), while element 
L(j, k + 1, k) defines utilization set Vtil2(L,~0). In equation (2) we call Util3(L, if0) the 
utilization set due to element L( j , j , j -  1). Let if0 = (io,Jo, ko). These different sets are : 
Vtil l(n, ff0) = {(io,j, j o -1 )  l j o+ l  <j<i0} ,  
Util2(n,(o) = {(i, io, Jo-  1) l io < i_< n), 
Util3(L,(o) = {(i, io , io -  1) l io < i <_ n}. 
The corresponding utilization vectors are respectively ¢1 = (0, 1, 0), <b2 = (1, 0, 0) 
and ¢3 = (1,0, 0). The dependency vectors respectively denoted by @t, @2 and ~3 are 
defined by : 
~1 - -  (O, j - jo,O) with l < j - jo<_ io - Jo ,  
~2 = ( i - i o , io - jo ,0 )  with 0_<i - io_<n- io ,  
~3 = ( i - i o ,0 ,0 )  with l< i - io<n- io .  
3. Space-T ime mapp ing  parameters  
3.1. AFFINE TIMING FUNCTION 
Classical synthesis techniques only consider affine timing functions. The timing func- 
tion defines, for any point z of the domain, its execution instant t(z) in the following 
way. 
DEFINITION 3.1. An affine timing function associated with a problem is a function of 
the form : 
t: DCZ n ~ N 
z , , t ( z )=A.z+o~ 
where A = (A1,.. . ,An) and A1, . . . ,An,a  are integral constants and. denotes the scalar 
product. 
Vector A is orthogonal to hyperplanes of Z" such that all the points belonging to one 
of them have the same execution instant. We call these hyperplanes the timing surfaces. 
This notion is illustrated in figure 1. 
To satisfy the causal dependencies, any affine timing function must satisfy : 
PROPERTY 3.1. An affine timing function exists if and only if its linear part A satisfies 
A.kO~= >0 
for all the extremal vectors kO~ associated with the problem. 
PROOF. A causal constraint expresses that if point z depends causally on point ~0, it 
must be executed after (0, i.e. t(z) > t(¢0). This is equivalent o A.(z - ~0) > 0. By 
definition 2.3, we have z -~0 = ko. Hence the causal constraint is A. @ > 0. By remark 2, 
any dependency vector @ is equal to 7]i ai@e~ with ai > 0. Therefore if the extremal 
vectors meet the causal constraint, so does any dependency vector, m 
552 Ph. Clauss and C. Mongenet 
k 
convex 
timing surfaces 
Figure 1. Timing surfaces 
Moreover, the resulting architecture will be free of any broadcast if the following pro- 
perty is satisfied. 
PROPERTY 3.2. An affine timing function removes any broadcast on the resulting archi- 
tecture if and only if 
~.¢#0 
for all the utilization vectors ~ associated with the problem. 
PROOF. The utilization vector • directs lines whose points use the same item of given 
variable. To avoid a broadcast, these points must belong to different timing surfaces, 
i .e . .~ .¢~0,  m 
Notice that the computability of a SARE has been proved undecidable by Saouter and 
Quinton(1990). Therefore an affine timing does not always exist. Some problems can only 
be scheduled by an affine timing function related o the variables (Rag, 1985), (Rag and 
Kailath, 1988), Yaacobi and Cappello, 1989)) and some others can only be scheduled 
after domain transformations (Mongenet, 1991). These important problems related to 
the timing are not discussed in this paper. 
3.2. LINEAR SPACE MAPPING 
A classical space allocation technique is realized by projecting the domain along a 
direction called allocation direction. It is defined by a unimodular vector ~ G Z". From 
this direction ~, an allocation function alloc is computed using Blankinship's pivoting 
algorithm (1963). It projects all the points of Dp C Z '~ onto a set of processors in Z n-1. 
An allocation direction defines a set of lines parallel to ~ intersecting the domain. 
Each segment is called an allocation segment. This notion is illustrated in figure 2. All 
the points of an allocation segment define the calculations to be executed on the same 
processor. Therefore they cannot be executed at the same time. Hence we have the 
following property. 
PROPERTY 3.3. Let )~ be the linear part of an aj~ne timing function. A vector ~ defines 
an allocation direction if and only if ~. ~ • O. 
Design of Efficient Processor Arrays 553 
/ 
convex domain 
~ ~ocation segments 
systolic array surface 
Figure 2. Allocation segments associated with a given allocation direction 
It is always possible to choose ¢ such that A -¢ > 0. From now on we assume that any 
chosen ¢ satisfy this later condition. 
In order to compile a target architecture with precise characteristics, such as no broad- 
cast, no programmable processors, locality constraint, etc., architectural constraints are 
expressed as conditions either on the timing or on the allocation function. In a practical 
way, these constraints can be taken into account o determine convenient time and space 
parameters (Mongenet, Clauss and Perrin, 1991). 
4. Systo l ic  a rch i tec tures  ynthes is  
Once the dependency analysis has been automatically determined and the time and 
space parameters have been chosen, an architecture can be synthesized. The allocation 
function is applied on the equations defining the domain and determines the shape of the 
architecture. The data and results streams organization is then computed by applying this 
function on the different utilization and ependency vectors. This compilation technique 
is detailed in (Mongenet, 1985) and (Mongenet, Clauss and Perrin, 1991). 
EXAMPLE : The Cholesky factorization. The problem can be scheduled by a timing 
A = (AI, A2, A3) which must satisfy )~ • @ > 0, i.e. for this problem: -~1 > 0, ~2 > 0 and 
A3 > 0. A possible timing is therefore characterized by )~ = (1, 1, 1). Notice that this 
timing automatically removes the broadcast since A • ~t = A • ~2 = A "¢b3 = 1. Choosing 
= (1,0,0) as allocation direction the synthesis gives the architecture presented in 
figure 3 for n = 8. Choosing ¢ = (1, 1, 1) as allocation direction the synthesis gives the 
architecture presented in figure 4 for n = 8. 
5. Towards  o ther  synthes is  techn iques  
Arrays synthesized from a linear allocation present many qualities of regularity, lo- 
cality of the connections, juxtaposition of many identical elementary cells. However the 
disadvantage of most architectures is that they do not have an optimal parallelism ratio: 
the number of active processors at any instant is much smaller than the total number of 
554 Ph. Clauss and C. Mongenet 
a18 a17 a16 a15 a14 at3 a12 
a2s a27 a26 az~ a24 a23 az2 
a3s a37 a36 a35 a34 a33 
a4s a47 046 a~5 o44 • 
ass  a~7 a56 a55 . 
a6s  a~ a66 
a78 aTz . 
a88 • 
F igure  3. Array solution for the Cholesky factorization with ~ = (1,0,0) 
: : : : : : : a l l  
: : : : : : a12 : 
: : : : : al.3 : a22 
: : : a14  . a23 : 
: : : a15 : a24 : a33 
: : a16 :a25  : a34 : 
: a17  : a26 : a t5  : 
o-44 
a18 : a27 : a36  : a45 : 
a28 : a37  : a46 : 
a55 
038 : a47  : a56 : 
048 : a57 : a66  
a58 : a67 : 
a68 : a77  
a78 . 
a88 
F igure  4. Array solution for the Cholesky factorization with ~ = (1, 1,1) 
Design of Efficient Processor Arrays 555 
processors. This phenomenon can be caused by two situations. The first one is character- 
ized by processors rythmically active (for example, each processor is active every other 
instant). The second one occurs when some processors have disjoint activity periods, 
i.e. one processor becomes d finitely inactive while the other has not been active yet. 
In both cases, different echniques map on a single processor the calculations of several 
processors of the initial architecture, t In many successful cases the resulting processor 
array is size-optimal, i.e. its number of processors i  equal to the parallelism rate of the 
algorithm. 
In the first situation, the clustering technique can be used. It consists of grouping 
together onto a single processor neighbouring cells of the initial architecture which have 
different idle cycles, i.e. which are active at different instants. In the second situation, 
the serializing technique applies. It maps cells which have disjoint activity periods on a 
single processor. 
These techniques require the determination of activity parameters from which new 
architectures synthesis methods can operate. These parameters rely on the modeling of 
the problem, its domain, the timing surfaces and the initial architecture. They can be 
automatically deduced from the previous architectural solution. Their determination is 
realized with geometrical tools on convex polyhedra presented hereafter. 
5.1. GEOMETRICAL TOOLS 
The convex polyhedral domain D is modelled by its faces, edges and vertices. We call 
edge vectors the vectors generating the edges of the polyhedron. 
EXAMPLE : The Cholesky factorization. The domain of this problem is visualized in 
figure 5 for n = 8. It is characterized by its vertices and its edges: 
v ,=( l , i ,0 ) ,  v2 = (n, l, 0), v3=(n ,n ,0 ) ,  vd=(n ,n ,n - l ) ,  
e1=(1 ,1 ,1) ,  e2=(0,0 ,1) ,  e3=(0,1 ,0) ,  e4 = (1, 0, 0), 
es = (0, 1, 1), e6 = (1, 1,0). 
The calculations computed on a given cell correspond to the points belonging to a 
given allocation segment. Using the allocation function computed from the allocation 
direction ~, the segments are defined as follows. 
DEFINITION 5.1. The allocation segment associated with cell indexed s,  s E Z n- l ,  is 
A, = {z E D [ alloc(z) = s}. 
DEFINITION 5.2. We call first faces and denote them Fq, the faces of the convex polyhe- 
dral domain D whose points z verify z - ~ ~ D. We call ast faces and denote them Lq, 
the faces of the convex polyhedral domain D whose points z verify z + ~ ~ D. 
Notice that the points of any first or last face are extremities of an allocation segment. 
Since an allocation segment characterizes the calculations computed on one cell, these 
points correspond either to the first or to the last computation on the associated cell. 
However, as illustrated in figure 6, it may happen that the so-defined first or last faces 
t To avoid confusion, we call cells tile processors of the initial architecture. 
556 Ph. Clauss and C. Mongenet 
~ °  ~ 
_ - : '7_  - : z -  
g- ' .  ¢-"  
i ~ ~  ~"  I,. ' -  
e4 .,~e6 : : : .  
• " " ' " -  e " "-"~ • • 
I . °  • * 
| . • • ~ 
e3 v3 
Figure 5. Domain for the Cholesky factorization 
do not contain the extremit ies  of all the al location segments.  In this s i tuat ion,  in order 
to character ize all the points of first or last computat ion  on any cell, we int roduce the 
notion of thick faces. 
DEFINITION 5.3. Let Fq be a first face of equation fq • z + Uq = O. t The corresponding 
thick first face Fq is defined by : 
Th (k'~,) 
U Fqk 
k=l 
where Fq~ = {z • D J f f .  z + ~q + k = 0) and Th(Fq) defines the thickness of the face. 
Let Lq, be a last face of equation gq, . z + rlq, = O. The corresponding thick last face Lq, 
is defined by : 
T h ( L q , )  
L,z= (J Lq,  
k=l 
where Lq,~ = {z • D [ gq, . z q - r /q , -  k = O} and Th(Lq,)  defines the thickness of the face. 
PROPERTY 5.1. Let Fq be a first face of equation fq . z + uq 
associated first thick face Fq is given by 
Th(Fq) =l fq' ~ J. 
The thickness of a last face is characterized in the same way. 
= O. The thickness of the 
PROOF. The thickness is equal to the number of hyperp lanes paral lel  to Fq defining first 
computat ions .  Any hyperp lane paral le l  to Fq can be defined by an equat ion of  the form 
t Notice that in fact fq • z + Uq = 0 is the equation of the hyperplane containing the considered face. 
In order to simplify the notation we use the same contraction i~ the following. 
Design of Efficient Processor Arrays 557 
ib 
\~ \ \~k\~\ \~~\ \~\  } Thick face 
, , , , , , , , , , , , , , , , , , , , , ,  \ , , , , , , , , , , , , , ,  
,,,,,, \ , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ,  
F igure  6. Thick face with allocation direction ~ = (2,1) 
fq • z + Uq 4- k = 0 where k is a coefficient of translation relatively to Fq. Let z0 be a point 
in Fq. The point z0 + ~ does not define a first computation (but a second one). Therefore 
this point belongs to the Th(Fq)-th hyperplane parallel to Fq and defined by equation 
fq.  z + uq -6 Th(Fq) = 0. This~ point therefore verifies fq. (z0 + ~) + uqL-6 Th(Fq) = 0, 
i.e. fq .zo + uq + fq .~ -6 Th(Fq) = 0. Since zo E Fq, it simplifies as Th(Fq) =1 fq .~ [. u 
To simplify the presentation, we consider from now on that there are no thick faces. 
This case is fully detailed in (Clauss, 1990). 
As mentioned before, the linear part A of any affine timing function defines a set of 
timing surfaces. The intersections between one timing surface and one first (respectively 
last) face, characterize all the processors having their first (respectively last) activation 
at the same instant. These intersections are either segments or the faces themselves. In 
the following we consider the former case where the intersections define segments. 
DEFINITION 5.4. Let Fq (respectively Lq,) be the q-th first face (respectively the q'-th 
last face). The intersection segment of his face with the timing surface at instant to 
is denoted by Y:q,to "respectively •q',to). Their union for all first and last faces are 
respectively defined by 
S,o=UZ .,o and 
q q" 
We consider the projection of these sets brto and £,o along the allocation direction 
on the initial architecture. When the cells are active at any instant (A. ~ = 1), their 
projection alloc(grto+l) and alloc(/2to) describe ither some cones or two parallel ines as 
shown in figure 7 in the case of only one first and one last face. These lines delimit three 
areas of systolic cells : the dead cells, the currently active ones and the previously unused 
ones.The cells which are inside the surface defined by these two lines correspond to the 
active cells at instant to. When the cells are not active at any instant ()~. ~ > 1) this 
interpretation generalizes while considering/:to and ~'to+a.~. 
EXAMPLE : A solution of the Cholesky factorization is given by the linear allocation 
= (1,0,0). This allocation direction defines one first face and one last face on the 
558 Ph. Clauss and C. Mongenet 
alloc ( L to) 
all~ £tJ y alloc( ~rt0+l )
~.  g/ alloc ( Ft0+l ) / "  / 
m : Dead cells 
[] : Active ceils 
[] : Previously unused cells 
Figure 7. Schema ~ 
Figure 8. First and last faces corresponding to ~ = (1,0,0) 
domain as it is shown in figure 8. Their equations are respectively f • z + • -- 0 with 
f = (1 , -1 ,0 )  and t, = 0 and l . z  + r /=  0 with l = (1,0,0) and 77 = -n .  The thickness of  
both faces is 1 since f .~  = ( t , -1 ,0 ) - (1 ,0 ,0 )  = 1 and l - (  = (1 ,0 ,0 ) - (1 ,0 ,0 )  = 1. The 
activity of the cells is characterized by a cone presented in figure 9 for to = 14 and n = 8. 
This geometrical information on the active cells is now used to apply the specific allo- 
cation techniques (in particular the serializing technique) whose objective is to minimize 
the processor count, i.e. reduce the total number of processors of the resulting architec- 
ture. In many successful cases, we shall reach size-optimal architectures which are defined 
as follows. 
DEFINITION 5.5. For any given timing, we call parallelism rate pp the max imum number 
of simultaneous calculations. 
pp = max X<to<T P(~0), 
Design of Efficient Processor Arrays 559 
q P [] 
m[ ]  
[ ]mm 
alloc( £.14) 
alloc( Fl15) 
D 7 
F igure  9. Act iv i ty of the cells for to = 14 
where p(to) is the number of simultaneous calculations at instant to and r is the number 
of timing surfaces, i.e. the latency of the algorithm. 
DEFINITION 5.6. For any given timing, an array is size-optimal if its number of proces- 
sors is equal to the induced parallelism rate. 
The quantities p(to) used to determine the parallelism rate are computed using the 
following properties. 
PROPERTY 5.2. Let A be the linear part of a timing function such that A • ~ = 1. For a 
given instant to, the number of simultaneous calculations is computed by : 
to to--1 
tt~-I tt=l 
or else 
p(t0)= f i  r ig(p) -  f i  n f (p )  
p----to p----to+l 
where n f (p )  is the number of cells having their first activation at instant p and ng(l~) is 
the number of cells having their last activation at instant I~. 
PROOF. When A. ~ = 1, the array is efficient : all the processing units are busy at every 
instant from their activation until their completion. The parallelism rate of instant to is 
equal to the number of active processing units at this instant. It can be deduced that this 
number is equal to the number of processing units having their first activations during 
the time interval [1,t0] minus the number of processing units being no more active. [] 
This property can be generalized as follows to any allocation vector ~. Recall that 
if A • ~ > 1 the cells are active 1 instant out of A • ~. 
560 Ph. Clauss and C. Mongenet 
PROPERTY 5.3. For a given instant o, the number of simultaneous calculations is com- 
puted by : 
p(to) = E n f ( t ° -  /'tA "~) - E ng(to- /~A.,~),  
p=0 p=l  
or else 
Lr ,~|  iZ-=-~l ),-~ a k ,~.( J 
p(to)= ~ ng(to+IZA.~)-  E n f ( to+,A .~) .  
/J=0 p=l  
PROOF. In a general way, there is A • ~ kinds of cells : the cells active at instants qA. ~, 
the cells active at instants qA. ~-  1, . . . ,  the cells active at instants (q -  1)A. ~ + 1, where 
q is a positive integer. Therefore, for any given instant to, the cells which are active at 
to are cells which are active at instants to - qA -~. [] 
Notice that when A. ( = 1, property 5.3 expresses a summation equivalent o the 
one given by property 5.2. The quantities n£ and nf  are computed using the following 
definition and property. 
DEFINITION 5.7. The number of cells having their first activation at instant to is : 
nf(to) = ns (to)- Z .sqq,(to), 
q q ql>q 
where nfq(to) is the number of first activations included in first face q and nfqq,(to) is 
the number of first activations included in the intersection of first faces q and ¢. 
P~OPERTY 5.4. The function nfq is defined at intervals Iql,n = [t(vO,t(vm)] C [1 . . . r ]  
where vt and vm are two distinct vertices of set Fq such that any other vertex vk of Fq 
verifies t(v~) ~]t(vl),t(vm)[. On every interval Iqtm, nfq is determined by: 
l ' o - '{v ,  )1 I .~ . l  
(I to-t(v,s)I-r .e,) mod 
nfq  ( to )  ~-,~ 
e j  / 
r=o A • ej J 
where vii is the intersection vertex of the egdes respectively generated by the edge vectors 
ei and ej. These two edyes delimit the q-th first face considered at interval Iqtm. 
Note that in this expression, i and j may be exchanged. Note also that the function ngq 
is characterized in the same way. 
PROOF. Vectors ei and ej define a basis of the hyperplane containing the set Fq. There- 
fore any point z of this hyperplane can be expressed as a linear integral combination of 
these vectors :
, z=ae iWbe j+v i j  
with a > 0 and b _> 0. Let to be t(z). Then to = aA-ei+bA .ej +t(vij), which is equivalent 
to  : 
to - bA • e~ - t (v i j  ) a= 
A "el 
Design of Efficient Processor Arrays 561 
with 0 < b < Po-t(v,DI We deduce from this last statement a definition of z depending 
A.ej " 
on the considered instant t0 : 
to t (vu)  -- rA • 
-- " e~.ei + rej  + vii z= ~.e i  
with 0 < r < [t°-t(v'i)[ and to > t (v i j ) .  Therefore for any given instant to of the con- 
A-ej 
sidered interval and for any given value of r, there exists a point z associated with a 
computation at this instant to if and only if 
(to - t (v i j )  - rA. ej) mod A- ei : 0. 
In the same way, we can write : 
( to - t (v i j ) - rA .e i )  mod a .e j=0.  
The number of first computation points of instant to is equal to the number of points 
which exist at to, i.e. to the number of values of r which satisfy the equality above. 
The counting of these values can be made from the following expression : 
[A .e i - ( to - t (v i j ) - rA .e j )modA.e i ]  
A .e i  
which is equal to 1 if (to - t(vlj) - rA • ej)  mod A - ei --- 0 and is equal to 0 otherwise. 
In the same way, the expression : 
A • ej - (to - t (v i j ) -  rA. ei) mod A- ej I A • ej 
is equal to 1 if (to--t(vij)--rA.ei) mod A.ej = 0 and is equal to 0 otherwise. The property 
can then be easily deduced, o 
EXAMPLE : The Cholesky factorization. From the domain modeling in terms of its 
edges and vertices described in figure 5 the parallelism rate is computed as follows. Recall 
that A = (1, 1, 1) and that we consider the initial architecture obtained with allocation 
direction ~ = (1,0, 0). 
The function nf  is first defined at interval [A • Vl . . .  A • va] = [1. . .  2n - 1]. The two 
considered edges are el and e6 whose intersection is vertex vl characterized by t (v l )  = 1. 
The application of property 5.4 gives for all to E [1. . .  2n - 1], 
nf ( to )  = ~a"' J L A- e6 - (to -- 1 A--e6rA" el) mod A .e6j 
r=0 
= ~-~, [L'°~-'J 2 - (to - 3r2- 1) mod 2j. 
r :0  
The function nf  is then defined at interval ]A -v3 . . .A -v4]  =]2n-  1 . . .3n -  2]. The 
two considered edges are el and e2 whose intersection is vertex v4 characterized by 
t(v4) = 3n-  2. Hence for all to G [2n . . .3n -  2], 
L ~ J  A .e~- (3n-2 - t0  rA el) modA 
L " - • .e j nf ( to )  = A.e~. 
r-~O 
562 Ph. Clauss and C. Mongenet 
3h- -2 - -1  o 
L ~ l = ~,  [1-(3n-2-t0-3r) modlj 
1 
r=0 
3n-- 2-  IQ 
= E 1 
r=O 
= [3n-2  3 -t°J +1 
The function ng is determined in the same way : 
Vto • [ l . . .n -  1], rig(to) = O, 
Vto • [,,... 2~ - 1], he(t0) = L~-o-~j + a, 
vt0 • [2n . . .  3 ,  - 2], ,~(t0)  = L~J  + 1. 
The determination of the maximum number of simultaneous calculations gives the follow- 
ing result: p(t0) is maximum when t0 = ½(3n-2) i fn  rood 2 = 0 and when to = ½(3n- 1) 
i fnmod2=l .  
Since A • ~ = 1 the parallelism 
If n mod 2 = 0 : 
3n - -  2 - - -  ( -v - )  -- 
rate is given by property 5.2. 
3n- -2  3n.- -4 
#=I ~=I 
p=l r=O p=l 
3n-4 
_ ~ (L u - nj + z). 
• 2 
#=n 
3n- -3  
L ) -  (L% J + l) 
D=r~ 
I fnmod 2 = 1 : 
3n -  1 ~ L~-----~J 
D=I r=O 
5.2 .  ARCHITECTURE SYNTHESIS  BY  SERIAL IZ ING 
The basic ideas of this serializing technique have been initially presented in (Clauss, 
Mongenet and Perrin,1990). A more formal and constructive approach is developped in 
the following. It is based on affine transformations applied on the initial architecture. 
The set of active cells of the initial architecture is delimited by the lines alloc(£q,,to) 
and alloc(gVq:o+l) (or in the general case a11oc(£¢:o) and alloc(.T'q:o+~.~) ). The idea 
of the serializing technique is to map on a single processor a cell belonging to the line 
alloc(£¢:o) and a cell belonging to the line alloc(.T'q:o+:~.e). Their respective computa- 
tions will therefore be serialized on the resulting processor. 
This allocation technique call systematically give size-optimal solutions (Clauss, Mon- 
genet and Perrin,1990). But an empirical serialization of the initial cells often leads to 
arrays with non-local and crossing connections. To avoid these problems, a constructive 
Design of Efficient Processor Arrays 563 
application of the technique is necessary. It consists in partitioning the initial architecture 
into sub-parts characterized by the following properties : 
• each sub-part is a convex bounded polyhedron of Z n-1. 
• all these parts have the same surface s, i.e. they contain the same number of cells, 
some of them can be virtual. 
• the surface is close to the parallelism rate pp with s > pp. 
• all the sub-parts $1, S~, . . . ,  Sm covering the initial architecture can be ordered fol- 
lowing a sequence of affine unimodular bijective relations (7~1,7~2,..., 7~m-1), such that 
Si+x is the successor of Si if and only if : 
Vz E S i ,3y  e SI+I,y = ~i(z )  and x E alloc(/:to) ::¢, TQ(x) E alloc(.T'to+X.¢). 
Affine transformations conserve geometrical properties such as alignment of processors, 
convexity and parallellism of lines. Therefore they preserve these regularity properties. 
The resulting arrays are toroi'dal according to the following definition. 
DEFINITION 5.8. A toroi'dal array is an array that can be drawn on a torus. 
Among the possible affine relations characterizing a partitioning of the initial architec- 
ture, the simplest one is the translation. It occurs when the unique first and last faces are 
parallel. In this case the two parallel ines alloc(Eto) and alloc(Uto+X.e) delimit a sub-part 
which partitions the initial architecture : all the sub-parts are identical and each sub-part 
is obtained from the previous one by a unique translation. This situation is illustrated in 
figure 10. 
In the general case, when the activity is not characterized by parallel ines but by cones, 
more general ffine relations are needed. These relations 7~i consist in a constructive 
application of the serializing technique and can be characterized by the following property. 
PROPERTY 5.5. Let ~ i  be an affine relation defined by a n x n transformation matrix Ri 
and a n-vector ri. Let cq be a cell associated with first face Fq of equation fq • z + uq = 0 
and cq, be a cell associated with last face Lq, of  equation gq, • z + rtq, = O. The relation T~i 
defines a constructive application of the serializing technique b tween cells cq and cq, , i.e. 
serializes the two cells on the same processor, if : 
(1) eq, = fq . Ri = )~. Ri - ~, 
(2) r/q, = fq • r i + b,q = )l  . ri -- ~ • ~, 
(3) Det(Ri) = +1, 
where Det(Ri) denoles the determinant of matrix Ri.  
PROOF. Let zq, be any point of last face Lq,. It verifies the equation of Lq,: gq, • zq, + 
r/q, = 0. Since zq, belongs to last face L¢ ,  it characterizes the last computation of 
the cell defined by alloc(z¢). This cell belongs to the line alloc(Lq,,,o) on the initial 
architecture for some instant to. With the serializing technique this cell must be mapped 
by relation ~i  onto a cell belonging to some line alloc(.T'q,to+X.~). Therefore the image of 
zq, by Ri defined by zq = Ri • zq, + r i belongs to first face Fq of equation fq • z + uq = 0 
and the two points zq and zq, must satisfy A • zq, = to and A • zq = to + A • ~. It gives (a) 
fq" (R i .  zq, + 7"/) + uq = 0 and (b) A. (R/- zq, + ri) = A .zq, + A. ~. Since eq,. z¢ + ~?q, = O, 
it results from (a) that gq, = fq • Ri and Oq' = fq " ri + uq and it results from (b) that 
564 Ph. Clauss and C. Mongenet 
/ / 
Initial Architecture 
c ( Yto+ 1 ) ~.-.---~translation 
/ 
LJ 
partitioning of 
the architecture 
timal 
Figure 10. Partitioning and serializing with a translation 
gq, = A - Rq - A and r/q, = A • ri - A • ~. Statements (1) and (2) are then easily deduced. 
Statement (3) expresses that all the successive sub-parts must be convex, i.e. relation 7"¢i 
must be a unimodular elation and Ri a unimodular  matrix, m 
EXAMPLE : The Cholesky factorization. The parallelism rate is pp = 10 for n = 8. The 
initial architecture is partit ioned into n + 1 regular sub-parts having the same surface 
s = pp. This partit ion is presented in figure 11. The initial 8 × 8 tr iangular array is 
decomposed into sub-parts of 10 cells. Notice that  to respect the serializing constraints, 
virtual cells have to be introduced in some sub-parts. These sub-parts are ordered fol- 
lowing the sequence of affine unimodular elations (TQ)i=I..,, defined by : 
~i  : Si , S i+l  
a l loc ( : )  ~ a l loc (R i  . z + ,-,) 
where (101) (:) 
R1 = 0 0 1 and rl = 
0 1 -1  -1  (1 0) (08) 
/~ = 0 I 0 and ri = 
0 -1  1 9 
for i=2 . . .8 .  
Any sub-part  can be chosen as resulting architecture. When choosing the first one, 
Design of Efficient Processor Arrays 565 
[ ]  : Virtual processing units 
. . . . . . . . . . . . . . .  
Figure 11. Partitioning of the initial architecture 
u)l u) u 
Figure 12. Size-optimal toroidal array 
the synthesis gives the array presented in figure 12. Notice that to clarify the picture, the 
toro'/dal connections have been drawn outside the architecture area. Notice that since the 
parameter s is exactly equal to the parallelism rate pp, this architecture is size-optimal. 
Moreover since the timing function has not been changed, the resulting algorithm has 
the same latency as the initial one. For the particular case n = 8 presented in figure 12, 
the number of processing units reduces from 36 to 10. 
5.3. ARCHITECTURES SYNTHESIS BY CLUSTERING 
This technique has been independently proposed by Bu, Deprettrere and Dewilde 
(1990) (passive clustering) and ourselves (Clauss, Mongenet and Perrin, 1990) (group- 
ing method). It has since been studied by different authors (Darte, 1991), (Zhong and 
Rajopadhye, 1991). 
This technique applies when A. ~ > 1. In this case the initial architecture can be 
decomposed into k types of cells where k = A. ~ : the active cells at instants qk, the 
active cells at instants qk - 1, ..., the active cells at instants (q - 1)k + 1, with q E N ' .  
Formally patterns of k adjacent cells are constructed and assembled to cover the whole 
initial architecture. Notice that the architecture can be covered by patterns of different 
shapes. A pattern must correspond to cells with different idle cycles. The construction of 
566 Ph. Clauss and C. Mongcnet 
such a pattern must then be done in the following way. The initial allocation direction 
is decomposed into a sequence of elementary allocation directions ~i, ~2,... ~k satisfying 
the following properties: 
For all q E [1.. .k],A.~q = 1, 
The allocation function is then applied to each of these directions in order to determine 
the geometrical shape of the pattern. The position of cell 2 relatively to cell 1 is given by 
alloc(~1), the position of cell 3 relatively to cell 2 is given by alloc(~2), . . . ,  the position 
of cell k relatively to cell k - 1 is given by alloc(~k-1), and finally alloc(~k) gives the 
position of cell k relatively to cell 1. Notice that such a decomposition is equivalent to a 
quasi-linear allocation function. 
Optimisations of the final architecture can be done by the choice of the patterns : 
• the number of connections between the processors can be minimized if the elementary 
allocation directions are chosen parallel to some data streams. This choice results in 
clustering interconnected cells and connections are therefore transformed into internal 
memory registers. 
• the number of I /O streams can be minimized if the elementary allocation directions 
are chosen parallel to the borders of the initial array. This choice results in clustering 
cells which communicate with the host and several I /O streams are therefore merged into 
only one. 
Notice that once the patterns have been chosen, the resulting architecture can be auto- 
matically synthesized. Moreover the clustering technique divides the number of processing 
units by a factor A • ~. 
EXAMPLE : For the Cholesky factorization, consider the solution corresponding to allo- 
cation direction ~ = (1, 1, 1). Two different patterns are needed to cover the whole initial 
architecture. In order to minimize the number of I /O streams, the following sequences 
of elementary allocation directions are chosen : 
((0, 1, 0), (0, 1, 0), (1, -1,  1)), 
((0, 1, 0), (0, 0, 1), (1,0, 0)). 
The clustering of the initial cells and the synthesized final architecture are shown in 
fgure 13 for n = 8. Since the timing function has not been changed, this resulting 
solution has the same latency as the initial one. Notice that number of processing units 
is divided by 3 (in general A-~). It decreases from ½n(n + 1) to ln(n + 1). 
It is proved in (Clau~ss,1990) that size-optimal arrays can not always be reached with 
the clustering technique but that, combined with the serializing technique, size-optimal 
arrays can always be computed. Moreover, when the serializing technique can not be 
applied, the arrays after clustering are always size-optimal. 
6. Conc lus ion 
A crucial problem in concurrent programming is concerned with the efficient imple- 
mentation of problems on processor array architectures. 
The formalism of systems of a/fine recurrence quations and their space-time transfor- 
mations are essential components of compiling techniques for this kind of architectures. 
Design of Efficient Processor A rays 567 
lOOn 
[ffJ-~lD [] [] 
lOOm 
In [] olin [] [] 
~J-~In [] []I10 [] [] 
IO [] olIo [] [] 
a17 
a18 
a28 
a14 
a15 
a25 
a16 
a26 
a36 
a27 
a37 
a47 
o38 
a48 
a58 
a l l  
a12 
a22 
a13 
a23 
a33 
a24 
a34 
a44 
a35 
a45 
a55 
a46 
a56 
a66 
a57 
a67 
a77 
a68 
a78 
a88 
Figure 13. Application of the clustering technique to the Cholesky factorization 
We have shown in this paper that appropriate geometrical modeling of the causal 
dependencies, the problem domain and the architecture activity allow the determination 
of various parallel solutions and precise the way automatic synthesis can operate. 
Using the modeling of the initial architecture activity, we have proposed two different 
allocation techniques to minimize the processor count. 
The clustering technique which have been studied by several authors (Clauss, Mon- 
genet and Perrin, 1990), (Bu, Deprettrere and Dewilde, 1990), (Darte, 1991), (Zhong 
and Rajopadhye, 1991)) only operate when the cells of the initial architecture are active 
one instant out of i > 1, i.e. A • ~ > 1. Moreover this technique does not always lead to 
size-optimal solutions. 
When the clustering does not operate, i.e. A-~ = 1, we propose a new technique: 
the serializing technique. It has initially been introduced in (Clauss, Mongenet and Per- 
568 Ph. Clauss and C. Mongenet 
rin, 1990) under the name of pipel ining technique. In order to preserve as much as possible 
the regularity of the result ing solution, a new constructive appl ication of the technique is 
presented in this paper. It consists in a part i t ioning of the init ial archtecture using affine 
un imodu lar  bijective relation to reach size-optimal arrays. 
Re ferences  
Blankinship W.A, (1963). A New Version of the Eucliean Algorithm. American Mathematical Monthly 
742-745. 
Bu J., Deprettere E.F., Dewilde P. (1990). A design methodology for fixed-size systohc arrays. In 
S.Y.Kung and E. Swartzlander Editors, International Conference on Application-Specific Array 
Processors, ASAP'90, Princeton, New Jersey, IEEE Computer Society. 591--602. 
Ciauss Ph. (1990). Synthbse d'algorithmes systoliques et implantation optimale n place sur r~seaux de 
processeurs synchrones. PhD Thesis, University of Franchc-Comtd, France 
Claws Ph., Mongenet C., Perrin G.R. (1990). Calculus of Space-Optimal Mappings of Systolic Algo- 
rithms on Processor Arrays. In S.Y.Kung and E. Swartzlander Editors, International Conference 
on Application-Specific Array Processors, ASAP'90, Princeton, New Jersey, IEEE Computer So- 
ciety. 4-18. 
Darte A. (1991). Regular partitioning for synthesizing fixed-size systolic arrays. Integration, the VLSI 
journal 12, 293-304. 
Delosme J.M., Ipsen I.C.F. (1985). An Illustration of a Methodology for the Construction of Efficient 
Systolic Architectures inVLSI. Sd. Int. Symposium on VSLI Technology, Systems and Applications, 
Taipei, Taiwan, R.O.C. 268-273. 
Fortes J.A.B., Fu K.S., Wah B.W. (1985). Systematic Approaches to the Design of Algorithmically 
Specified Systolic Arrays. Int. Conf. on Acoustics, Speech and Sig al Processing. 
Karp R.M., Miller R.E., Winograd S. (1967) Tile Organization ofComputations for Uniform Recurrence 
Equations. JA CM, 14,3. 
Kung S.Y. (1988) VLSI Array Processors. Prentice Hall. 
Mongenet C., Claws Ph., Perrin G.R. (1991) Geometrical Coding to Compile Affine Recurrence Equa- 
tions on Regular Arrays. Int. Parallel Processing Symposium, 1EEE, Anaheim, CA, USA. 
Mongenet C., Claws Ph., Perrin G.R (1991) Geometrical Tools to Map Systems of Affine Recurrence 
Equations on Regular Arrays. Research Report, Universite de Franche-Comtd, LIB-30-90 
Moldovan D.I., Fortes J.A.B. (1986) Partitioning and Mapping Algorithms into Fixed Size Systolic 
Arrays. IEEE Transactions on Computers, 35-1, 1-12. 
Moldovan D.I. (1983) On the Design of Algorithms for VLSI Systolic Arrays. Procedings of IEEE, 71-1, 
113-120. 
Mongenet C. (1985). Use Mdthode de Conception d'Algorithmes Systoliques, Rdsultats Thdoriques et 
r~alisation. PhD. Thesis, National Polytechnic Institute, Nancy. 
Mongenet C. (1991). Affine Timings for Systems of Affine Recurrence Equations. Parallel Architectures 
and Languages Europe Conf., PARLE'M, LNCS 505, Springer Verlag Ed. 
Mongenet C., Perrin G.R. (1987). Synthesis of Systolic Arrays for Inductive Problems. Parallel Archi- 
tectures and Languages Europe Conf., PARLE'87, LNCS 259, Springer Verlag Ed. 
Mongenet C., Perrin G.R. (1988) Synthesis of Heterogeneous Systolic Arrays. Phoenix Conference on 
Computers ass Communications, IEEE Press. 39-44. 
Navarro J.J., Llaberia J.M., Valero M. (1987). Partitioning:ms e sential step in mapping algorithms into 
systolic array processors. IEEE Computer. 
Quinton P. (1984). Automatic Synthesis of Systolic Arrays from Uniform Recurrence Equations. 11th 
Int. Syrup. on Computer Architecture, Ann Arbor, MI, USA, IEEE Computer Society. 208--214. 
Quinton P., Van Dongen V. (1989). The Mapping of Linear Recurrence Equations on Regular Arrays. 
The Journal of VLSI Signal Processing, 1, 95-113. 
FLajopadhye S. (1989). Synthesizing Systolic Arrays with Control Signals from Recurrence Equations. 
Distributed Computing. 88-105. 
Rao S.K. (1985). Regular Iterative Algorithms and their Implementation Processor Arrays. Ph.D. 
Thesis, Information Systems Lab., Stanford University. 
Rwo S.K., Kailath Th. (1988). Regular Iterative Algorithms and their Implementation Processor 
Arrays. Procedings of IEEE, 76,3, 259-269. 
Roychowdhury V., Thiele L., Kailath Th. (1988). On the localization f algorithms for VLSI processor 
arrays. VLSI Signal Processing 111, IEEE Press,NY. 459-470. 
Saouter Y., Quinton P. (1990). Computability of Recurrence Equations. TR-1203, 1RISA, Rennes. 
Schrijver A. (1986). Theory of Linear and Integer Programming. Wiley. 
"reich J., Thiele L. (1992). A traltsformative approach to the partitioning of processor arrays. Interna- 
tional Conference on Application Specific Array Processors, ASAP'92, Berkeley, CA. 4-20. 
Design of Efficient Processor Arrays 569 
Wong Y., Delosme J.M. (1988). Broadcast removal in systolic algorithms. Int. Conference on systolic 
arrays. 
Yaacobi Y., Cappello P.R. (1988). Converting Affine Recurrence Equations to Quaai-Uniform Recurrence 
Equations. Third Int. Workshop on Parallel Computation and VLSI Theory. 373-382. 
Yaacobi Y., Cappello P.R. (1989). Scheduling a System of Nonsingular Affine Recurrence Equations onto 
a Processor Array. Journal of VLSI Signal Processing, 1 115-125. 
Zhong X. Rajopadhye S. (1991). Deriving fully et~cient systolic arrays by quasi-linear llocation func- 
tiorm. Parallel Architectures and Languages Europe, PARLE gl, LNCS 505, Springer Verlag Ed. 
