Topological transformations as a tool in the design of systolic networks  by Culik, Karel & Fris, Ivan
Theoretical Computer Science 37 (1985) 183-216 
North-Holland 
183 
TOPOLOGICAL TRANSFORMATIONS AS A TOOL IN THE 
DESIGN OF SYSTOLIC NETWORKS* 
Karel CULIK, II 
Department ofComputer Science, University of Waterloo, Waterloo, Ontario N2L 3GI, Canada 
Ivan FRIS** 
Department ofComputer Science, University of New England, Armidale, N.S.W. 2351, Australia 
Communicated by A. Salomaa 
Received May 1984 
Revised December 1984 
Abstract. We introduce the notion of  computational network (CN) which is a general model of 
an arbitrary (finite or infinite) system of parallel synchronized processors (systolic network). Our 
basic and very useful tools are topological transformations of the space-time diagrams (unrollings) 
of computations on CN. We show that the topological transformations on unrollings can be used 
to design systolic networks, to give simple proofs of their correctness, and to demonstrate the 
equivalence of different networks. For example, we use the transformation technique to give a 
concise proof of a strengthened version of  Leiserson's and Saxe's Retiming Lemma and Systolic 
Conversion Theorem. As a practical application we show the correctness of a simple algorithm 
for distributed sorting on a systolic ring. Many other examples are given. 
Contents 
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  184 
1. Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  186 
2. Computational schemas and networks . . . . . . . . . . . . . . . . . . . . . . . .  186 
3. Retiming and systolic conversion . . . . . . . . . . . . . . . . . . . . . . . . . .  190 
4. Simulation of networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  195 
5. Further examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  203 
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  215 
* This research was supported by the Natural Sciences and Engineering Research Council of  Canada 
under Grant No. A-7403. 
** This work was done during the second author's visit at the Department of Computer Science, 
University of Waterloo. 
0304-3975/85/$3.30 O 1985, Elsevier Science Publishers B.V. (North-Holland) 
184 K. Culik, II, L Fris 
Introduction 
Systolic systems are arrays of synchronized processors which process data in 
parallel by passing them from one processor to neighboring ones in a regular 
rhythmical pattern. Most systolic systems use only a few different types of processors 
arranged in a regular pattern. The principal idea is to perform the required computa- 
tions with minimal input-output communication. Systolic algorithms were explicitly 
introduced by Kung and Leiserson [19, 24] but many algorithms of this type were 
designed earlier, see for example [5, 11, 16]. 
Recently, systolic systems have been studied extensively; see [17] for a list of 
references. Most of this work has been devoted to the design of individual algorithms 
for many different areas, but the efficient layout of systolic systems (and VLSI in 
general) has also been well studied, see for example [20, 21]. Other topics that have 
been studied are: the development of general programming (design) techniques for 
systolic algorithms [3, 18, 22, 25, 28, 30], and the systematic study of the power and 
limitations of certain types of networks from an automata-theoretic point of view 
[1, 4-7, 9-11, 13, 14, 26, 29]. 
Our approach is also automata-theoretic, but rather than the study of a specific 
class of language recognizers or transducers we introduce precise notions to the 
study of arbitrary systolic networks. For example, we want to make precise the 
statement that the square grid and hexagonal grid are essentially equivalent, or that 
the bidirectional linear array and the unidirectional ring are essentially equivalent. 
Our main goal is to give a general framework for the study of arbitrary systolic 
systems and computations on them. 
We introduce the notion of a 'computational network" which in general is an 
arbitrary finite or infinite (synchronous) network of processors connected by com- 
munication lines with arbitrary integer (even negative) 'delays' and no queuin~ 
capability. We are mainly interested in homogeneous (identical processors) and 
regular networks, but our definitions do not make any such assumption. 
Our main tool is the space-time diagram, called the unrolling, of a computation 
on a computational network in [7]. The unrolling of a computational network is 
simple fomJ of a data-flow diagram; it motivates our definition of a 'computationaJ 
diagram'. We consider two networks to be equivalent when they have isomorphic 
unrollings. This is a much stronger equivalence than equivalence based on the sam~ 
input-output function. Two different networks can have isomorphic unrollings, tha 
is, the unrolling of one network can be topologically transformed to the unrollin~ 
of the other. Such a topolotical transformation is a useful tool when designing nev 
networks and in proving their correctness. Topological transformations a such ar~ 
not new, see [4], but we introduce a general model of a computational network ant 
a computational diagram which allow concise proofs using the transformatior 
technique for a broad class of parallel networks. We demonstrate hat most of th~ 
known techniques for systolic system design, such as systolic conversion [18, 22] 
Design of systolic networks 185 
folding [3], and speed up [26], are special cases of topological transformations on 
unrollings. This also holds for the wavefront technique [30] and geometric transfor- 
mations [2] but lack of space prevents their discussion here. 
A particularly simple type of a computational network is the pure computational 
network in which for each node n the paths from all inputs to node n have the 
same delay. In practice such a network always allows pipelining of inputs with 
pipelining of period one. We show that a pure computational network and its 
unrolling are isomorphic. 
In Section 3 we study the semisystolic and systolic networks introduced in [22] 
and generalize the Retiming Lemma and Systolic Conversion Theorem from [22]. 
Our proofs using unrolling are simpler than the original proofs and at the same 
time they are more general, since we do not restrict ourselves to a 'single host' and 
throughout this paper we study not only finite but also infinite networks. 
The systolic conversion preserves not only the structure, that is the underlying 
graph, of a computational network but also the functions performed by the individual 
processors. Only the timing, that is the initiation of the processors, and the delays 
between them are changed. Thus we obtain a very strong equivalent network. In 
our definition of equivalent networks we require that the networks have the same 
unrolling, that is they perform step by step identical computations but not necessarily 
in pairwise matching processors. In Section 4 we consider networks which are not 
equivalent in this sense but which still perform essentially the same computations. 
We introduce the notion of one network being (m, k)-simulated by another network. 
Intuitively this means that the simulating network performs identical computations 
using processors each of which simulates m original processors and requiring k 
steps on the simulating network to simulate one step of the original one. 
The notion of simulation allows us to compare precisely the power of various 
well-known networks. It is generally known, even though not stated explicitly in 
the literature, that the square grid is equivalent to the hexagonal grid. We make this 
comparison precise and give a number of further practical examples. One of them 
generalizes a result from [29] and shows that any network on a bidirectional linear 
array can be transformed into a unidirectional ring of the same size which is half 
as fast. Similarly, a bidirectional two-dimensional array can be converted into a 
unidirectional toroid, and similarly for higher dimensions. 
We close Section 4 by describing a simple efficient algorithm for distributed sorting 
on a unidirectional ring of processors. This algorithm is obtained and proved correct 
by transforming the well-known odd-even transposition sort algorithm from the 
linear bidirectional array to the unidirectional ring. 
In the last section we give further applications of the transformation technique. 
For instance, we give a simple proof of the result from [25] that global control does 
not increase the power of m-dimensional iterative arrays. As a new result 
we demonstrate that the same result also holds for m-dimensional cellular 
automata. 
186 K. Culik, II, L Fris 
1. Preliminaries 
Given a possibly infinite set V, the set of all finite sequences of elements from V 
(words) is denoted by V*. For x ~ V* the length of the sequence x is denoted by ]x[. 
For a finite set of M the cardinality of M is denoted by IMI. We use z to denote 
the set { . . . ,  -1 ,  0, 1, . . .}  and [~ to denote the set {0, 1, 2, . . .},  0 denotes the empty 
set. 
General ly we omit double parentheses in cases where f is a function whose 
argument is a pair (x, y), i.e., we write f(x, y) rather than f((x, y)). Similarly for 
functions returning functions, we prefer to use simpler dp(v)(x, y) to more precise 
(ep(v))(x, y). An ordered digraph is a structure H =(V, , r )where V is a (finite or 
infinite) set of nodes and rr: V--> V* is a function. I f  ,r(v) = Vl . . .  Vk, for vi in V, 
then (vl, v), .... , (Vk, V) are (directed) edges (in that order) from nodes vi to the 
node v, for i = 1 , . . . ,  k. To stress that the edges are directed we will often use the 
notation vi--> v rather than (vi, v), or even e: v~--> v, if we want to name the edge. 
We denote the set of all edges of H by E, and will often talk about the underlying 
digraph (V, E). Note that parallel edges and self-loops are allowed in H. 
The meaning of terms such as (directed) path, cycle, indegree, outdegree, start 
node, end node (of a path) will be applied to H meaning the corresponding terms 
for (V, E). Thus, for example the indegree indeg(v) is 17r(v)], i.e., the length of the 
word or(v). A path p=(/)O--~l)l, t~1-->/)2,..., l)k_l-->l)k) (k>0)  will be written as 
p: Vo-> Vl ->" • •--> vk or, simply as p: vo ->+ vk. The length o fp  will be denoted by ]Pl. 
I f  u --> v we also call u the (immediate) parent of v, and if u -->+ v we call v the 
descendant of u. 
For the ordered digraph H = (V, or) according to our definition, the indegree of 
every node is finite. Thus if we define V~ ={v~ vllrr(v)[ = i} for i=0 ,  1 , . . . ,  then 
(Vo, V~,.. .)  is a partion of V. Note that there is no requirement on the finiteness 
of an outdegree. 
2. Computational schemas and networks 
Let Q be a set (finite or infinite). A pre-computational diagram (preCD) S is a 
structure S = (V, 7r, Q, ~b) where (V, ,r) is an ordered digraph and ~b is a map (a 
collection of  maps) ~b: Vk-'> IQk  QI for k>0.  A preCD is called a computational 
diagram (CD) if the length of any path (of its underlying digraph) with a given 
end-point is bounded. Formally, for every v in V, there is b/> 0 such that if p is a 
path with end(p)  = v, then Ipl-< b. Clearly, there cannot be cycles in a CD. 
A computation a on a preCD is a map a:  V--> Q such that for each v ~ V with 
7r(v)= vl . . .  Vk (Vie V), k>0,  we have 
a(vk)), (1) 
i.e., the 'value' a(v) in Q is computed from the values at all nodes w ~ V for which 
there is an edge (w, v) in E. 
we may call these nodes the 
Design of systolic networks 187 
Note that (1) imposes no condition on nodes o E Vi,; 
inputs of the preCD. 
2.1. Example. Let us consider the CD A = ( V, n, 4) where V = {Q, 6, c, d, e}, T(U) = 
v(b) = T(C) = 1, a = ab, w(e) = dc, +(d)(x, y) = x+y, +(e)(x, y) =x- y. CD A 
is shown in Fig. 1. In the other examples we will show the names of the processors 
inside the circles, since usually no specific functions will be considered. This very 
simple CD computes the value X,(X, + X,) in node e from the inputs X,, X, and 
X, given in nodes a, b, and c, respectively. 
Fig. 1. 
2.2. Example. (Signal processing). Adopted from [ 181 is an example of an infinite 
computation. This is a CD that given wl, w2, w,, w, and inputs x1, x2, x3, . .‘. , 
computes ~1, ~2, ~3, . . . where _Yi = WlXi + WzXi+l+ W3Xi+2+ W4Xi+3 (see Fig. 2). 
There are five columns; if we number them 0 to 4 (left-to-right), then, in column 
i for i = 1, 2, 3, C#I( v) is a pair (gf, g:) where gf( S, x) = S + w5+x, g:(s, x) = x and, 
in column 0, +(v)(y) = y, where S and x are the left and the right component, resp., 
of the values (pairs) at the previous nodes. Finally, in column 4, ~(U)(X) = wlx. 
In this example there are five kinds of nodes, one in each column. It is easy to 
design a CD for the same computation in which all the functions are the same 
(homogeneous CD). 
The important property of CD is that given values at inputs there is a unique 
computation. Formally, we have the following theorem. 
2.3. Theorem. If S = ( V, 1~, Q, 4) is a CD, then for every a0 : Vo+ Q there is a unique 
computation Q on S which extends LYO, i.e., for which a(v) = ao( v) for all v in V,. 
Proof. For v E V put Iv1 = max(lpl Ip is a path with the end node v}. By the assump- 
tion of boundedness of paths in S, Iv1 is well defined for each v in K There is Iv1 = 0 
if and only if v E V,. Eq. (1) defines cy recursively. To see this we define partial 








maps a,: V~ Q as follows: 
a , (  v )  = ¢ , (  v ) (  a , _ l (  V , ) ,  . . . , a,-,(vk)) 
and 
for t--Ivl 
at(v)=at-l(v) for t>lvl 
(a,(v) is undefined if t < [vl). Finally, a(v) is defined as al,,l(v). [] 
Actually, we have proved a somewhat stronger esult, namely that given a preCD 
and a map tZo: Vo--> Q, map t~ o can be extended to all nodes in V for which Ivl<oo. 
Note that alternatively we could have defined a CD with qb defined everywhere 
on V, that is also on Vo, and interpret (1) for v e Vo as tz(v)= ~b(v). Then there 
would had been exactly one computation on every CD. 
A computation etwork (CN) is a structure N=(V,  It, Q, ~b, A, r) where H= 
(V, 7r, Q, th) is a preCD with underlying digraph (V, E), A is a map E ~ Z labeling 
each edge with an integer A(e). We interpret A(e) as the delay (in some time 
units clock cycles). Finally, r: V--> Z is a partial function which determines when 
a computation (see below) begins at a node v in V. 
Note: The interpretation of A (e) as time-delay makes sense only when A (e)i> 0; 
however, we are not assuming this in general because our results are valid also 
when A is possibly negative. In many examples of CN's the functions A and r will 
be defined by A(e) = 1 for each e~ E and r(v) = 0 for each v~ V, that is, each edge 
involves a unit delay, and initial conditions are specified for each processor (node) 
at time 0 (and thus the computation starts everywhere at time 1). 
Design of systolic networks 189 
We shall often omit the functions A and/or  r from the description of a network. 
Such an omission means that the 'default'  functions A(e )= l  and r (v )=O are 
considered. In most examples Q and 4, are left unspecified, in that case it is 
understood that arbitrary Q and 4, are considered. 
To define computat ions on CN, we associate with N = ( V, It, Q, 4,, A, z) the ordered 
graph GN=(V×7/ ,  ~') where 7r' is defined as follows. I f  7r(v)=vlv2... l)k, and 
di = A(v,, v) (vis V, 1 <~ i<~k) then ~r'(v, t )=  (vl, t-d,)(v2, t -d2) .. .  (vk, t -dk) .  
S = ( V x Z, ~', Q, 4,'), where 4''(v, t) = 4' (v), is a preCD which is meant o describe 
computations on N spread in time. The problem is that S is a proper preCD with 
all paths of infinite length, and thus there are no computations in our sense on S. 
We are interested in computations tarting at a particular time and continue from 
that time on. This is why we have the function r. 
Let us consider the following three subsets of V xZ: 
(i) S, the starting set is {(v, z (v ) ) lv~ dom(r)}, 
(ii) D, the descendants of S, i.e., the set {(v, t) l(v, S and there is (s, to)~ S 
such that (s, to) ~+ (v, t)}, 
(iii) P, the parents of D is the set {(v, t)[(v, (u, t') for some (u, t ' ) s  D}. 
In both cases -->+ and --> are paths and edges in GN. Denote by 1~" = S u D u P. 
A computation o~ on N is a function a- I~'~ Q such that, for all (v, t) ~ D, 
a (  v, t) = 4 ' (v ) (a (  Vl, t -  a l l ) , . . . ,  ct(vk, t -  dk)). (2) 
Note that it is often convenient o consider a as a partial function V x Z--> Q. As 
this cannot lead to ambiguity we shall do so, when convenient. 
Intuitively a computat ion on a network N proceeds as follows. Arbitrary 'initial' 
values are chosen at the processors in dom(z).  More specifically, an initial value 
from Q is chosen for each processor v in dom(z) at the time ~'(v). Then new values 
are successively computed according to (2). During this computation arbitrary 'input" 
values are supplied, when needed, at the nodes without parents. 
We are not concerned with formalizing how outputs are produced. In any par- 
ticular network outputs can be taken from suitable processors at suitable times to 
realize a desired input-output function. This is the same situation as for gate networks 
where for many considerations it is not relevant whether the output of a gate is 
external or not. 
We are mainly interested in computational networks. The computational diagrams 
are auxiliary constructs, they are important because for every CN N there is a CD 
H, called the unroll ing of  N, such that each computation a on N has an ' isomorphic' 
computation on H. 
Let N= (V, zr, Q, qS, r) be a CN. The unrolling 1Q of N is the preCD /V= 
(V, ~, Q, ~) where ~- is the restriction of ,r' to 1~" and ~(v, t) = 4,(v) for (v, t) in I~" 
(I~' and ~' are defined above). It is easy to see that /Q is not always a CD;  for 
example, /V might contain cycles, and thus computations need not exist on every 
N. On the other hand, it is obvious that a :  V x Z ~ Q is a computation on a CN N 
if and only if it is a computat ion on its unrolling N. 
190 K. Culik, II, L Fris 
Given a network N = (V, m Q, 4), A, ~'), let (V, E) be the underlying graph. We 
may extend A from edges to paths by putting, for p:Vo~ v l~ ' ' '~vk ,  A(p)= 
x ( v,_,, v,). 
For each network N there is a preCD, namely IV which has (by definition) the 
same computations. It is easy to show that conversely, given an arbitrary CD 
S = (V, ¢r, Q, 4)) we can construct a CN with the same computations as S. To do so, 
we use the function Iv[: v~N defined in the proof of Theorem 2.3, and define 
A(u, v)=lv l - lu l  for all u- v. If  we also define ~" by ~-(v)=0 for v~ Vo, then 
N = (V, ¢r, Q, d;, A, ~') is a network with the same computations as S. 
The network N has a useful "property, it generalizes the notion of  pure network 
of [6]. We say that a network N is pure if: 
(i) dom(~-)= Vo, the set of nodes with indegree 0, and ~'(v)= 0 for v ~ Vo, 
(ii) for each node v ~ V-  Vo there is a path p: Vo ~+ v, Vo~ Vo, and 
(iii) all such paths p: Vo ~+ v where VoZ 1Io and v~ V-  Vo have the same weight 
A(p), i.e., A(p) depends only on the end node ofp.  
It is easy to see, that more generally, for every two paths pl :u  ~+v and 
pz:U-~+v there is always A(p0=A(p2) .  Now we show that for a pure network 
N = (V, zr, Q, 4', A, z) the computations on N are the same as on the preCD 
(V, m Q, 4~). 
2.4. Theorem. Let N = (V, 7r, Q, ~b, A, z) be a pure CN. Then its unrolling iV= 
( V, ¢r, Q, ~ ) is isomorphic to preCD ( V, 7r, Q, ~b ). 
Proof. Let d(v) be the uniquely defined distance of a node v in-V from V0 (nodes 
with indegree 0). Clearly, I~" = {(v, d(v))] v ~ V} and thus the mapping v ~ (v, d(v)) 
establishes the isomorphism. [] 
3. Retiming and systolic conversion 
An important property of computations on CN is that if they exist they are 
uniquely defined given some 'initial values', i.e. an analog of  Theorem 2.3 holds for 
CN's. 
First' we observe that in general there might not exist any computat ion on a CN. 
This can happen on networks with zero or negative 'delays' in A(u, v). Following 
[22] we define: A CN N=(V,  zr, Q, q~,a, z) is semisystolic if A(v, w)~>0 for each 
edge v-~ w. It is systolic if ;t(v, w)>0 for each v~ w in E. 
Note: In [22] a semisystolic network (simpler version) was called a synchronous 
system. 
It is easy to see that there always exist computations on a systolic network on 
which ~'(v) is bounded from below. However, we will not restrict ourselves to this 
case and will consider even networks which a/'e not systolic. 
Design of systolic networks 191 
3.1. Example. Consider N = (V, m Q, ~b, h, z) where V= {1, 2,...}, ~r(k) : k+ 1 for 
k~>l, Q and ~b are arbitrary, A(e)=0 for all eeE and ~'(k) = k for all k~ V (see 
Fig. 3). 
Fig. 3. 
Clearly, there is no computation on the related preCD S = ( V, or, Q, ~b); however, 
for the • defined above (and suitable ~b) there are computations on N. 
Now, we want to find conditions under which a general CN can be transformed 
to a semisystolic CN or even a systolic CN with the same computations. First, we 
have to make precise the notion of equivalence for CN. We say that two CN's are 
equivalent if their unrollings are isomorphic as preCD's. 
3.2. Lemma. (Retiming Lemma of [22]). Given a CN N = (V, zr, Q, qb, A, r) and a 
function 8, called lag, 8: V ~ Z, define A8 : E ~ Z by 
X (u, v)= X(u, v)-8(u)+ 8(v) (3) 
for each e=u~v in E, and z~: V~Z by rs (v )=r (v )+8(v) .  Consider CN Ns= 
( V, zr, Q, qb, Aa, ~'8). I f  a: ( /~  Q is a computation on N, and as" ~"~  Q is defined by 
as(v, t) = a(v, t - 8(v)) for  all (v, t) ~ Ys, then t~ is a computation on Ns. 
Proof. Clearly, a and a8 are identical computations when considered as computa- 
tions on unrollings 1V and 1Qs. [] 
3.3. Corollary. Networks N and N8 are equivalent. 
The following two theorems are generalizations of the Systolic Conversion 
Theorem from [22]. They give the necessary and sufficient conditions for the existence 
of a lag-function 8 which converts a CN network to an equivalent semisystolic 
(systolic) network. Given a CN N = (V, m Q, ~b, A, ~) we define 
p :V×V-~Zu{-oo ,  oo} byp(u ,v )= in f{a(p) [p :u -~+v} 
for all u, v in V. (By definition, inf(~)= oo.) 
3.4. Theorem. Let N = ( V, ~r, Q, ~b, h, z) be a CN. There exists a lag-function 8: V ~ 7/ 
such that CN N~ =(V, ~-, Q, ~b, As, ~-~) (equivalent to N)  is semisystolic if and only if 
p(u, v)> -oo for all u, v in V. (4) 
Proof. Assume (4) holds. First we show three properties of p following from (4). 
192 K. Culik, lI, L Fris 
3.5. Claim. 
(i) p(u, v)+ p(v, w)>~ p(u, w) for all u, v, w in V. 
(ii) p(u, u)>~O for each u in V. 
(iii) p(u, v) + p(v, u) >I 0 for all u, v in V. 
Proof. (i) I f  u ->+ v and v -->+ w, i.e. p(u, v) and p(v, w) are finite, then u -->+ w, 
thus p(u, w) is finite and the path e" u-->+w is considered when calculating the 
infimum defining p(u, w). Thus p(u, w)<~p(u, v)+p(v, w). I f  there is no path u -->+ v 
or v ->+ w, then the right-hand side of (i) is oo and (i) holds trivially. 
(ii) Let d = p(u, u) and d <0.  This means that there is a path p: u ->+u with 
h(p)  = d, but then p followed by p is also a path u ->+ u and A(p -p)= 2d < d. This 
contradicts the assumption that 
p(u, u)=inf{A(p)]p:u -->+ u}. 
(iii) The last inequality immediately follows from (i) and (ii). This completes the 
proof of the claim. [] 
Proof of Theorem 3.4 (continued). Now we show that it is possible to define a 
lag-function 8: V-,  Z so that As(u, v) I> 0 for each edge u -~ v, and where A8 is defined 
by (3.) Let U be a maximal subset of V on which 8: U-* Z can be defined so that 
8(u)+ p(v, v) (5) 
holds for all u, v in U. If  U ~ V, then consider w ~ V-  U. By claim 3.5 (iii) we have 
p(w, u)>~-p(u, w)for each u in U. Therefore 8(u)+p(w, u)>~8(u)-p(u, w), and 
we can choose 8(w) so that 
8(u) +p(w, u) I> 8(w) I> 8(u) -o(u,  w). 
From this we immediately have 8(w)+p(u, w) >- 8(u) and 8(u)~ > 8(w) -p (w,  u) 
which contradicts the fact, that U is a maximal subset of V for which (5) holds. 
Thus U = V, and 8 can be defined on the whole V. By the Retiming Lemma, N8 
and N are equivalent and, by (5), 
 8(u, v) = v)-p(u,  v). 
For every edge u--> v, p(u, v)~< A (u, v) by the definition of p, therefore As(u, v)i> 0, 
which means that N8 is semisystolic. 
To prove the converse, assume that there is a 8 such that N8 is semisystolic. That 
means ha(p) I> 0 for each path p: u -->+ v in N~ and, by (3), A (p) I> 8(u) - 8(v). This 
gives a lower bound for p(u, v)>I 8(u) -  8(v)> -oo, hence (4) holds. [] 
Given a CN N=(V,w,Q,d~,A,~') ,  let [N - l ]  be the CN [N- l ]=  
(V, It, Q, d~, [A - l ] ,  ~')where [A- - l ] :  E~Z is given by [A -1] (e )=A(e) -1 .  More 
generally, we can define [ f (N) ] fo r  every function f :  2-* Z; however, we shall need 
only [N-1] ,  [N+ 1], and [kN] for k = 2, 3, . . . .  Clearly, N is systolic if and only 
Design of systolic networks 193 
if [N - l ]  is semisystolic. For CN N=(V,  m Q, ~b,A, r )we define tr: Vx V~Zw 
{oo,-oo} by tr(u, for all u, v in V. Now we are ready 
for the following result. 
3.6. Theorem. (Systolic Conversion Theorem). Let N = (V, ~r, Q, ~b, A, ~-) be a CN. 
There exists a lag-function 8: V--> 7_ such that CN N8 = (V, 17, Q, dp, As, 7"8) (equivalent 
to N)  is systolic if and only if 
tr(u, v )> -~ for all u, v in V. (6) 
Proof. We have already noted that N8 is systolic if and only if[N8 - 1] is semisystolic. 
Clearly [N8-  1] = IN-118. By Theorem 3.4 there exists a 8 such that [N-118 is 
semisystolic f and only ifptN_lj(u, v) -> -oo for all u, v in V. However, pEN_ll(U, V) = 
ON(U, V) for all u, v in V. Thus there exists a 8 such that N8 is systolic if and only 
if (6) holds. [] 
3.7. Corollary. Let N = ( V, 7r, Q, qb, A, ~') be a semisystolic CN in which paths of  weight 
zero are bounded. Then there exist k> 0 and a lag-function 8: V-* Z such that [ kN]8 
is systolic. 
3.8. Corollary. Let N = ( V, 7r, Q, dp, A, ~') be a finite semisystolic CN. Then there exist 
k>0 and a lag-function 8: V~7/  such that [kN]8 is systolic if and only if there are 
no cycles of  weight zero in N. 
I I I I 
Fig. 4. 
3.9. Example. Consider CN N given by the diagram in Fig. 4. N is an example of 
a semisystolic network which can be transformed in a systolic one without any 
change of time (speed). The resulting network is shown in Fig. 5. 
I I i I 
Fig. 5. 
In the next example the network M itself cannot be converted; however, the 
network [2M] can be. 
194 K. Culik, II, L Fris 
I I I I 
0 0 0 0 
Fig. 6. 
3.10. Example. Consider network M given in Fig. 6. The network [2M] which may 
be interpreted as M running twice slower, can be converted to an equivalent network 
[2M]~, for g(Uk)------k, k=0,  1 ,2 , . . . ,  shown in Fig. 7. 
I I I I 
I I I I 
Fig. 7. 
The next example shows that arbitrary large k (slowdown) may be needed in 
Corollary 3.8. 
3.11. Example. Consider the CN URk with k nodes given inFig. 8, called unidirec- 
tional ring. Here the network [ iURk]s  for i >/k is systolic for suitable 8; however, 




Finally, the following example shows that for an infinite network a conversion 
from semisystolic to systolic is not always possible. 
Design of systolic networks 195 
I )o I 
0 0 0 
0 0 0 
Fig. 9. 
3.12. Example. Consider the network N given in Fig. 9. It is easy to verify that for 
no k the network [kN] satisfies condition (6) from Theorem 3.6. 
4. Simulation of networks 
In this section we shall investigate networks performing essentially identical 
computations, however not necessarily equivalent in the sense of the previous section. 
Given two CD's  G~ = ( V~, 7ri, Q,, ~b~) (i = 1, 2) we say that (;2 simulates G1 or that 
G1 is simulated on (;2 if there are two maps p: 111-> V2 and ~b: V~ × Q2--~ Q~ such that 
for every computat ion t~ 1 on G1 there exists a computation a 2 on  G 2 for which 
= az(p(v)). 
We say that the pair of  maps p, ff establishes the simulation, or that (;2 simulates 
GI through p, qJ. 
The notion of simulation is too general; for example, every finite CD (i.e., V and 
Q finite) can be simulated on a finite automaton. Therefore we restrict p by requiring 
the existence of a bound m on the number of nodes that can be merged by p, i.e., 
I p - l (u ) [~ m for all u ~ V 2. I f  G 2 simulates G~ through p, ~ where p meets this 
restriction then we say that G2 m-simulates G1. 
Now, given two networks N1, N2 we extend the definition of simulation to CN 
by saying that a network N2 simulates NI if and only if the unrolling of N2 simulates 
the unrolling of N1. In more details let N~ = (V~, Iri, Q,, ~b~, a,, ~'~), i = 1, 2, be two 
networks. I f  N2 m-simulates N~ through p, ~ we can relate time of the computation 
on N2 to the time of  the computation on N1. Thus if p: (v, t)~--~(v', t') we denote 
the second component of  p by pz, i.e. pt(v, t)~--~ t'. 
I f  
pt(vz, (1)1 , tz), (v2, tz)E ]Q1} k, tl) - pt(1)2, 
t2) ) 
sup -- - = 
t I - -  t 2 
we say that N2 (m, k)-simulates N1. 
Intuitively, k corresponds to the slowdown of N1 in order to enable N2 to simulate 
it. 
196 IC Culik, II, L Fris 
4.1. Lemma. Let N1 = ( V1, zr~, QI, dp~, h 1, zl) be a systolic CN and G = ( V2, E2, h2) be 
a labeled digraph with A2(e) > O for all e ~ E2. Let there be a map y: V~ --> V2 such that: 
(i) There is a fixed k> 0 such that for every edge e: u--> v ~ E~ there is a path 
p(e): ~/(u) -->+ 3,(v) in G of  weight A2(p(e)) = khl(e). 
(ii) For all u, v~ V1, if y(u)  = y(v),  then either z(u) = z(v) or ~" is undefined for 
both u and v. 
(iii) There is m such that m for all v~ V2. 
Then we can construct a CN N2 = ( V2, 7r2, Q2, tk2, A2, ~'2) such that the underlying raph 
of  N2 is (V2, E2) and N2 ( m, k )-simulates N1. Moreover, if the indegree of  (11"1, E~) is 
finite, then finite QI implies finite Q2. 
Proof. Since the idea of the proof is fairly straightforward we will omit the lengthy 
technical details and give only the outline of the construction of N2. 
First, ~r2 is chosen arbitrarily so that ( V2, E2) is the underlying raph of N2. Next, 
given this ~r2, we construct Q2 (tuples of elements of Q1) and ~b2 as follows. Each 
processor of N2, i.e. ~b2(v) for each v ~ V2, performs two kinds of tasks. It simulates 
concurrently all the processors in y-~(v) (at most m) and if v is a node other than 
the end-node of any path p(e),  e ~ E~, then v passes values (from Q~) along that 
path. Ifa node lies on several paths (not end-node), then it passes, generally different, 
values along each path. 
z2 is defined as follows. If u ~ dom(z~), then y(u) ~ dom(z2) and r2(y(u)) = zl(u). 
Condition (ii) assures that 72 is well defined. Using the conditions (i) and (iii) it is 
easy to verify that CN N2 (m, k)-simulates CN N1. 
Finally, if the indegree of (II1, El) is finite (as is the case in all practical applica- 
tions), then the number of all paths p(e) going through any fixed node is uniformly 
bounded, so the length of the tuples in Q2 is bounded too implying the finiteness 
of Q2. [] 
Note that if we are interested in networks with processors of limited complexity, 
that is, functions ~b restricted to a certain class of functions, then we notice that the 
construction in the above proof preserves those classes of functions that contains 
all projections and are closed under composition. 
Two special cases of Lemma 4.1 often occur. 
4.2. Corollary. Let G1 = ( It"1, El) be a graph and let { U~ IJ v2} be a partition of VI. 
We let G2 = ( V2, E2) where ( u, v ) ~ E2 if  and only if  u ~ Ua, v ~ Ub and there is an 
edge in E1 f rom some node in Ua to a mode in Ut, For every network NI on GI there 
is a network N2 on (32 such that NI is (m, 1)-simulated on N2. Here again m = 
maxs~ v~ Iujl. 
4.3. Corollary. Let G~ be a subgraph of  G2, i.e., G i=(  V~Ei), i=1 ,2 ,  and V1 ~- V2 
and E1 c E2. Then every network NI on G1 can be (1, 1)-simulated by a network N2 
on (32. 
Design of systolic networks 197 
Finally, the following lemma is easy to prove. 
4.4. Lemma. Let N1, N2, N3 be networks uch that N2(il,jl)-simulates N1 and N3 
(i2, J2)-simulates N2. Then N3 (il i2, jlj2)-simulates N1. 
The following lemma is useful in proving for given two networks, that one cannot 
simulate the other. 
4.5. Lemma. Let Ni = ( V~, ~'i, Qi, A~, z~), i = 1, 2, be two networks. I f  N2 ( m, n )-simulates 
N1 through p and ~, then for every path p: Vo--> v~ -->. • • --> Vk in the underlying raph 
ofN~ p(p)  : p(v0) -->+ p(vl) -->+" • • -->+ p(Vk) is a path of N2. If, moreover, A1 = A2 = 1, 
then Ip(p)l/Ipl<  n. 
In the examples throughout the paper we will frequently use bidirectional com- 
munications between processors (nodes), and processors (nodes) with selfloops. We 
introduce the notational abbreviations for these cases as shown in Fig. 10. Another 
abbreviation also shown in Fig. 10 is the omission of 0-degree nodes (input pro- 
cessors), only the 'half edges' entering the other nodes are shown. 
© © stands for 
stands for Q(~ i 
stands for Q 
Fig. 10. 
Finally, we would like to recall the following conventions. Omitted edge-label 
implies the value of h for this edge is one. When ~- is not explicitly given, we assume 
that z is defined for all the nodes and is zero everywhere. 
4.6. Example (Simulations between hexagonal nd square grid networks). A hexagonal 
grid network, i.e. a grid as in Fig. l l (a)  can be drawn as in Fig. l l (b);  that is, as a 
square grid network with some connections omitted. Therefore, a hexagonal grid 
network can be (1, 1)-simulated by a square grid network. 
To show the converse, consider any two nodes connected by an edge of the square 
grid. Clearly, there is a path between them in the hexagonal grid of the length at 
most 3, and because of the presence of self-loops there is a also a path of length 
exactly 3. Thus every square grid network can be (1, 3)-simulated by a hexagonal 
grid network. Note that if a pair of nodes can be connected by a path of length 2 








on the square grid, then the corresponding pair of nodes on the hexagonal grid can 
be connected by a path of length 4. Thus a more elaborate construction would allow 
slowdown of only 2 rather than 3 as in our construction above. 
4.7. Example (Simulations between square and triangular grid networks). When we 
draw the triangular grid as shown in Fig. 12, we see that similar considerations as 
in Example 4.6 show that a square grid network can be ( 1, 1 )-simulated by a triangular 





4.8. Example (Simulations between bidirectional ring and bidirectional linear array 
(each consisting of n nodes)). Let us denote these networks BRn and BAn, respec- 
tively. (1, 1)-simulation of BAn on BR, is trivial, since BR, is obtained from BAn 
by omitting one edge. 
Fig. 13. 
To prove the converse, we first note that it follows by Lemma 4.1 that the network 
M, given (for n = 7) in Fig. 13 can be (1, 2)-simulated on BA,. Now BRn can be 
(1, 1)-simulated On Mn since it is obtained from Mn by removing all but two 'short' 
edges, which is demonstrated by drawing BR7 as in Fig. 14. 
Design of systolic networks 199 
Fig. 14. 
4.9. Example (Simulations between homogeneous BR, and the homogeneous nidirec- 
tional ring with n nodes (URn)). Again BRn trivially simulates URn. To show the 
converse we must assume that BRn is homogeneous, i.e., all its processors perform 
identical functions (~b(u) = ~b(v) for u, v in V). We show the simulation in two steps. 
Let BRn = ( V, 7r, Q, ~b) where V = {v~,.. . ,  v,} (BRn has the 'default" A, r). Consider 
network Cn = (V, 7re, Q, ~b) where ~c: V--> V* is defined by 7r~(vi) = viv~elVi~2, where 
~) means addition mod n. C7 is shown in Fig. 15. Clearly, since the networks BR, 
and C, are homogeneous p"(v~, t)~->(v~t, ), for i= 1 , . . . ,  n and t~>0, establishes 
the isomorphism of the unrollings of BRn and Cn. Thus BRn and Cn are equivalent, 
and also each (1, 1)-simulates the other. By Lemma 4.1 Cn can be (1, 2)-simulated 
on URn and therefore also BRn can be (1, 2)-simulated on UR,. 
i 
Fig. 15. 
4.10. Example (Simulations between bidirectional linear array (BAn) and bidirectional 
linear array without selfloops (Wn)). Network W7 is shown in Fig. 16. Trivially, BAn 
(1, 1)-simulates IV,. We show that BAn can be (1, 2)-simulated on W2n-1 and W2n-1 
can be (2, 1)-simulated on BAn. Let Vn = {1, . . . ,  n} be the nodes of both BAn and 
W,. The map y: i~-->2i- 1, i = 1 , . . . ,  n, maps Vn into V2n-~. Clearly, for every edge 
of BAn there is a path of length 2 in Wn connecting the corresponding nodes. Thus 
W2n-1 simulates BA, by Lemma 4.1. The converse is easy to see by mapping pairs 
of nodes 2 i -  1, 2i of Wn, into one node i of BA, (except he last node n). 
0 0 O. 0 0 0 0 
Fig. 16. 
4.11. Example. Note that the unrolling of W, has two disjoint components, only 
one of them is used in our (1, 2)-simulation of BA, on W:n-1. Here we assumed 
the default definition of function z, i.e., • defined everywhere and equal to zero. If 
200 BL Culik, II, L Fris 
Fig. 17. 
we restrict ~" only to the 'old' nodes of W2,-1 then the unrolling of this modified 
I~r2,_1 has only one component. The unrolling of I~¢'7 is shown in Fig. 17. We can 
also consider pure homogeneous network T2,-1 = (V, 7r, Q, ~b, A, ~') where V, 7r, Q, 
~b are as in the unrolling of vfV2,_1, A = 1 for all edges and z is defined (and equal 
to zero) only on the top row. Clearly, the unrollings of T2~-~ and l~'2n-~ are 
isomorphic and therefore T2,-1 and ITV2n_~ are equivalent. 
We can summarize the results demonstrated in Examples 4.8 to 4.11 in the 
following Theorem 4.12. This theorem generalizes similar results for various types 
of cellular, iterative or trellis automata defined on structures like unidirectional 
linear arrays, bidirectional arrays, and rings. 
4.12. Theorem. Any of the homogeneous CN of the following type can be ( i,j)-simulated 
on any other with i, j <~ 2. 
(i) bidirectional ring BR,,, 
(ii) unidirectional ring URn, 
(iii) bidirectional linear array BA,,, 
(iv) bidirectional linear array without selfloops (memory) W2.-1, 
(v) trellis T2.-1. 
In more details, Table 1 shows the values of (i,j). A pair i, j in the intersection 
of a row and column means that a network named in the row (i,j)-simulates the 
network named in the column. The values not shown in Examples 4.8 to 4.11 follow 
from transitivity of simulation. The exception is the pair 2, 2. From the transitivity 
we get 1, 4, but the value 2, 2 can be shown easily. This result means that a solution 
Table 1. 
BA,, BR,, UR. W2,,_ 1 7"2,,_ 1 
BA~ 1.1 1,2 1.2 2,1 2,1 
BR. 1,1 1,1 1,1 2.1 2,1 
UR~ 1,2 1,2 1,1 1,1 1,1 
~. - i  1.2 2.2 2,2 1,1 1,1 
~._~ 1,2 2,2 2,2 1.1 1,1 
Design of systolic networks 201 
(systolic algorithm) for a problem for a network of one of the above types can be 
easily converted into a network (systolic algorithm) of any of the other type. Also 
note that the simulations between networks of type (iii)-(v) are also valid for 
corresponding (potentially) infinite networks. 
Note that we are implicitly assuming that the 'cost' of computing a new value 
~b(v)(q) is constant. In some applications this is not the case. For example, if a 
processor with selfloop is implemented as a device with memory then it is typically 
much cheaper to update one location than to rewrite the whole memory. Thus if a 
single update means to change the value q to q', then this can be easily done in 
one clock step inside one processor but might be impossible to do in our step in 
the neighboring processor, since this could require to communicate the whole 
contents of the memory in one step. 
Theorem 4.12 can be used to design systolic algorithms and prove their correctness. 
For example, the unidirectional (systolic) ring has been implemented by Ostlund 
and used specifically for computations in molecular dynamics (see [23]). The 
algorithms described in [23] were designed directly for the unidirectional ring. 
However, having Theorem 4.12, it is typically easier to program such algorithms, 
and in particular to prove their correctness, for bidirectional linear arrays and then 
transform them to unidirectional rings. The algorithm for distributed sorting is 
designed using this method in the following example. 
4.13. Example. A distributed sorting algorithm is designed and proved correct for 
the unidirectional ring of microprocessors each of which can store the same number 
of records (numbers). The well-known odd-even transportation sort for 2n records 
[15, p. 241] can be easily implemented on the network T2,-1 in time 2n-  1. Each 
processor sorts two records (numbers), with suitable modification for the 'end- 
processors'. Now we use the result hat every sorting network also works for multisets 
when we start with sorted multisets and replace the operation of sorting two elements 
by merging two multisets, see [15, p. 241]. By Theorem 2.4 the unrolling of 1~¢2,_~ 
is isomorphic to I~2,_, itself. Thus both lg'7 and its unrolling are shown in Fig. 17. 
We compare it with the unrolling of UR4 shown in Fig. 18. We see that by omitting 
Fig. 18. 
fPf 
/ / /  
f 
202 K. Culik, II, L Fris 
one node in each even row (and therefore also the dotted edges) in Fig. 18 we 
obtain a subgraph isomorphic to the one shown in Fig. 17. 
It is now easy to verify that the following is a correct algorithm for (2n - 1)-step 
sorting on URn. We assume that initially there is a sorted multiset of at most 2k 
elements in each processor (if necessary local sorting is performed first). Then in 
each step each processor performs the following. It sends the 'left half' of the 
multiset, i.e., the k smallest elements (or all of them if there are less than k of them) 
to the left neighbor and merges the remaining elements with those coming from the 
right neighbor. The only exception is that no sorting is done across 'the fence'. The 
fence is originally between the processors vn and Vl. The fence is sent at 'half  speed' 
through the processor, i.e., it is in the 'middle' of processor Vn-, at the time 2t -  1 
and that processor is 'inactive'. After 2n -  1 steps the 'fence' returns to its original 
position and the sorting is completed as shown in the following example of 3 
processors, each containing 4 numbers. The bar represents the fence. 
8 6 3 4 5 6 3 1 9 1 4 6 
5 6 6 8 1 1 3 4 6 913  4 
1 1 6 8 3 4 6 9 I 3 4 5 6 
3 4 6 8 6 9 [3  4 1 1 5 6 
6 6 8 9 I 1 1 3 4 3 4 5 6 
8 911  1 3 3 4 4 5 6 6 6 
1 1 3 3 4 4 5 6 6 6 8 9 
Note that if some processor is not 'full', i.e. contains less than 2n elements, it still 
sends n elements to the left, which means that it pretends that it contains additional 
dummy elements considered larger than all the other, and therefore these dummies 
are always retained. Thus the algorithm is correct also in this case. Alternatively we 
can retain the n largest elements and send the rest to the left thus treating the 
nonexistent elements as the smallest. 
Except for the pure network T2n_l, our Theorem 4.12 deals with one-dimensional 
structures. This result can be generalized to m-dimensional structures (arrays and 
'toroidal structures'). We formulate it for the most interesting case of the two- 
dimensional structures, the other cases are left for the reader. 
4.14. Example (Simulations between homogeneous bidirectional two-dimensional array 
(BArn, n) and homogeneous nidirectional two-dimensional toroid (UTm, n)). BA4,3 is 
shown in Fig. 19 and UT4,3 in Fig. 20. A straightforward generalization of the 
technique used in Example 4.9 (rotation of the nodes both horizontally and vertically) 
shows that the homogeneous bidirectional two-dimensional toroid BTm, n can be 
(1, 2)-simulated on UTm, n. Since BAm, n can trivially be (1, 1)-simulated on BT,,,n 
we conclude that BA,.,n can be (1, 2)-simulated on UT,.,.. To show the converse is 
easier. Using the argument as in Example 4.9 we conclude that not only BA,~n but 
Design of systolic networks 203 
,) ,) ~,) 
~,) 
Fig. 19. Fig. 20. 
w~ 
even homogeneous bidirectional two-dimensional toroid BTm.n can be (1, 2)-simu- 
lated on UTm.n. 
Therefore, we have another useful design tool; namely, every algorithm for 
bidirectional two-dimensional rray (mesh-connected processors) can be easily 
modified to run at half speed on a unidirectional two-dimensional toroid of the 
same size. 
5. Further examples 
We start this section by examining a relation between etworks M and kM as 
far as simulation is concerned. Obviously M can be (1, k)-simulated on kM. To 
see this, it is enough to take  p(v, t )=(v,  kt) and ~b(v, q)=q. If we modify the 
definition of (j, k)-simulation so that k can be a rational number, then conversely 
kM can be (1, 1/ k) -simulated by M. As was already mentioned, the intuitive 
interpretation of 'N2 (m, k)-simulates Nl"is 'N2 can do what N1 does, but k-times 
more slowly'. The concept of time is particularly important when considering how 
networks receive their inputs. Informally, a network working k-times more slowly 
needs to get its input k-times more slowly to do the same work. 
In Example 5.1 below we confirm this interpretation by considering a well-known 
example of a linear iterative array which recognizes palindromes. 
First we describe an iterative array (see [5]) as CN. 
Consider the network L in Fig. 21(a), with ~" defined for node 1 only (~'(1) = 1). 
The unrolling of L is the CD U in Fig. 21(b). In order to define a computation on 
U (and thus on L) initial values must be given in processors (input nodes) with 
indegree 0. These are the nodes 
(a) (0, t) for t = 0, 1, 2 , . . . ,  which, because they are given to the same processor 
at different imes, are called serial inputs, and 
(b) ( j+  1,j) and ( j+2 , j )  for j = 0, 1, 2 , . . . ,  which are called parallel inputs. 
Now, L is an iterative array if Q is a finite set which contains a special element, 
say, #,  and the parallel inputs are 'quiescent', i.e., only those computations ot on 
L are considered for which a(j+ 1, j )=  c t ( j+2 , j )=  # for all j =0, 1, . . . .  
204 IC Culik, II, L Fris 





5.1. Example (Speed up). Consider the example of a linear iterative array of finite-state 
machines recognizing a palindrome. Such an iterative array was given in [5]. In 
[22], Cole's result was reproved using the Systolic Conversion Theorem. It is easy 
to construct a semi-systolic linear array Po recognizing palindromes. Its underlying 
graph is shown in Fig. 22, ~" is defined (as zero) in the leftmost node only. The 
description of 4> can be found in [22], but is unimportant for the following discussion. 
To convert his semisystolic network to a systolic network, we need to 'slow-down' 
the network, i.e., [2P0] rather than P0 is equivalent to some linear iterative array P. 
Unfortunately, P running at half speed, as explained above, needs its input coming 
0 0 
~ --Q @ 0__ 
Fig. 22. 
Design of systolic networks 205 
at half speed--otherwise, under strict interpretation of what iterative arrays recog- 
nize, as stated in [22], P recognizes a palindrome 'in 2 clock ticks per character'. 
Thus P does not recognize palindromes. Instead, given a string x~.. .  xn, P will 
recognize whether x1x3x  5 . . . is a palindrome or not. In this particular case, the small 
gap in the proof of [22] can be filled easily. It can be shown that in general for an 
n-dimensional iterative array, if N2 (1, k)-simulates N~, then there is an N3 which 
(k, 1/k)-simulates N2, and thus N3 (k, 1)-simulates N~. We shall show this only for 
the example of a one-dimensional palindrome recognizer (k = 1), but the generaliz- 
ation is obvious. 
In Fig. 22 Po represents a semisystolic network---a recognizer of palindromes. 
Network P~ is obtained by the Conversion Theorem (edges with delay 3 are omitted 
because they are redundant), and P~ is equivalent to [2Po]. Note that (V × Z, E), 
the graph underlying the network P~, has two components, one of which is the 
unrolling of P1, the other component absorbs the odd-numbered inputs, as discussed 
above. P~ is not a linear iterative array, since the selfloops have delay 2. However, 
the function ~b can be easily modified to obtain a network with the delay 1 on each 
selfloop of a linear array. Calling this modified network P~, we see that P~ is a 
linear iterative array, and also that any computation on P1 can be done on P~. 
However, P~ still does not recognize palindromes (because P~ does not). 
Consider now the network P2- Informally, we may say that two steps of a 
computation on P1 correspond to one step on P2. Formally, we may only say that 
P~(1,2)-simulates P2, this follows from Lemma 4.1. We did not define (1,½)- 
simulation, but according to the discussion above, there is some justification in 
saying that P2 (1, ½)-simulates P~. Regardless of whether (1, ½)-simulation is defined 
or not, comparing computations on P~ and P2 we see that P2 is now a proper 
palindrome recognizer, however, P2 fails to be a linear iterative array. Thus, one 
more step is needed. By Lemma 4.1, P3 (2, 1)-simulates P2Nthis is seen by taking 
the function p from the lemma as p: i~  [( i+1)/2] ,  where the processors (nodes) 
in both P2, P3 are numbered 0, 1, . . .  from left to right. We can conclude that P3 is 
a linear iterative array which does recognize palindromes. 
Note that it would be possible to develop an alternative theory for retiming. 
Instead of using networks like 2N we could have allowed retiming by 1, 1, 3 
however, as it does not seem to give any significant advantages, this route has not 
been taken. 
Also note, that it was essential in this example that we consider iterative (linear 
or more general n-dimensional) arrays. The 'speed up' step from P2 to P3 cannot be 
done in a general case. In particular [10, Theorem 6.4] shows that speed up is not 
possible for iterative tree automata. 
It is shown in [25] that for d-dimensional iterative arrays the power of a system 
is not increased by allowing Direct Central Control (or Global Control). In the 
following example we give another proof of this result using the Systolic Conversion 
Theorem. For simplicity we consider the case d = 2 only. 









5.2. Example. The network A in Fig. 23 is a square grid network with the usual 
connections with delay 1, and with additional connections with delays 0. Clearly, 
these additional connections can be used to accomplish the global control. Note 
that the 0-connections are quite arbitrary as long as they connect he origin with 
each other node, and no 0-loops are introduced. 
The network [2A] can be converted, just as in the previous example, into the 
systolic network B (Fig. 23). It is useful to compare A in Fig. 23 with Po in Fig. 22. 
The difference, apart from Po being a one-dimensional rray while A is two- 
Design of systolic networks 207 
dimensional is that the 0-connections are oriented in opposite directions. In Fig. 23 
they lead from a fixed node to every node, while in Fig. 22 the)/lead from every 
node to a fixed node. Despite of this difference our initial transformations are the 
same. 
Since between every pair of nodes connected by an edge in B there is a path of 
weight 3 we can apply Lemma 4.1. Thus, we conclude the network C in Fig. 23 
(I, 3)-simulates network B. The initial function r for all three networks A, B, C is 
defined (as zero) for the origin only and therefore not affected by the modifications. 
The following example demonstrates the influence of input and output consider- 
ations on geometric transformations. Cellular automata of [26] are Superficially 
similar to iterative arrays investigated in the previous example. C and U in Fig. 24 
illustrate a network, and its unrolling, which corresponds toone-dimensional version 
of cellular automaton (CA). Some nodes and edges of U are drawn dotted--this 
indicates that we consider eal-time computations on CA. The figure pictures the 
computation with four inputs and one output. Formally, the finite input is accommo- 
dated on an infinite network C by requiring again a fixed symbol, say # in Q, and 
extending any finite input by appending # # . . .  on the right. 
C" 
U" 
output x .~,  ./, l~-x 
= -" I 
! ! 
? -  
CG" 
I I I 
Fig.  24. 
208 K. Culilq 11, L Fris 
The network CG in Fig. 24 represents a possible definition of a (one-dimensional) 
CA with global (central) control. It could have been more natural to use also 
connection of delay 1 going in parallel with 0-delays, but, obviously these connections 
would be redundant. 
5.3. Example. This example shows that any computation (in real time) on a CA with 
global control can be done (in real time) on the standard CA. Network C in Fig. 
24 represents a CA, its unrolling is CD U in the same figure. The part of the 
unrolling which is irrelevant for real-time computations i drawn in dotted lines. 
Finally, CG in Fig. 24 represents a cellular automaton with additional 0-connections 
implementing the global control. Since CG has selfloops and 0-connections from 
left to right, the 1-connections from left to right become redundant and are omitted. 
We proceeed initially in the same way as in Example 5.2 for iterative arrays, 
namely [2CG] is retimed. The resulting network C' and its unrolling U' are shown 






In the following simple steps we demonstrate hat any computation a on U' can 
be simulated (in fact (2, 1)-simulated) on a CA. 
(1) For the price of doubling the size of Q, the delays on selfloops in C' (Fig. 
25) can be changed to 1. 
C": 
Design of systolic networks 
r J r  - - ,  r~-  , r -~-  . . . . .  1 r -~- -  - 
I I I t I I 
L l L J L J L . . . .  
209 
output 
- ; 3  - 
, J  i , . ,  : , 
| I 
/ I ~'~ i / 
! I 
/~T<x I " /  
% 
5 





(2) The unrolling U' of C' after this modification is now a subgraph (more 
precisely a preCD) of the unrolling of a network C" which is like C, but for input 
of double length. This is shown in Fig. 26 where the subgraph corresponding to U' 
is drawn in bold. It is easy to see that the network C" whose unrolling is U" can 
do the computation a (as modified already in (1)) as long as instead of the original 
input i~i2i3i4 we use ii # /2# i3# i5 (or similar). This is so because the diagonal path 
going from the top left node down right, the path on which the inputs of a are 
needed, can be 'computed' by sending a signal along it. 
(3) The dashed boxes in the network C" in Fig. 26 show which processors are 
mapped together in the final (2, 1)-simulation. 
Note that a slightly simpler proof of the simulation of CG on CA would be 
possible had we not wanted to preserve the real-time. It is slightly simpler to show 
that CA (1, 2)-simulates C' than to show the (2, 1)-simulation; however, (i, j)- 
simulation preserves time only if j = 1. 
We have just proved the following result for real time-CA (languange recognizer). 
This result can be easily generalized to n-dimensional cellular automata. 
210 K. Culi~II, L Ffis 
5.4. Theorem. Given a real-time cellular automaton with central (global) control, there 
effectively exists a real time cellular automaton (without central control) recognizing 
the same language. 
5.5. Example. We now briefly discuss two more results about one-way (unidirec- 
tional) cellular automata. One-way cellular automaton allows communication 
between two nodes in only one direction. Here we consider two-dimensional 
triangular grid in which the communication is in three directions. Such a one-way 
cellular automaton M]  is shown in Fig. 27 for n = 3. We are interested in the output 
produced at the node with outdegree zero (lower left corner in Fig. 27). Similarly, 
we have networks M,  b, M~, and M,  a shown for n = 3 in Fig. 27. 
As a direct generalization of the result of [4] that one-dimensional one-way CA 
languages are closed under reversal we can show that networks M,  ~, M,  b, M~,, and 
















Design of systolic networks 211 
networks, that is, the input coming to node i, j in one network comes to node i, j 
in any other network. The following result from [4] can be generalized to higher 
dimensions. Real time bidirectional cellular automata (working in time n) are 
equivalent to one-way (unidirectional) CA working in time 2n. One possible gen- 
eralization of this is that the functions computable on network M e (Fig. 27) in real 
time are the same as those computable on network M~ d in time 2n. 
5.6. Example. It was shown in [9] that regular sets can be recognized by a parallel 
algorithm on an unidirectional binary tree network. To recognize a string of length 
n we need any tree (not necessarily balanced) with at least n nodes; thus to accept 
arbitrary long input, this algorithm requires a potentially infinite tree. If the tree is 
(almost) balanced the recognition is in logarithmic time. A network based on such 
a tree is shown in Fig. 28(b). Here, the initial function ~" is defined (as zero) at all 
the input nodes. In [8] it has been shown that we can use finite tree-like network 
Mk illustrated in Fig. 28(a) for depth k = 2. Here the initial function z is defined 
(as zero) for the processors of the top row. On this network we recognize strings 
of length n, n <~ 2 k, in time k~ Longer strings are cut into pieces of length 2 k and 
recognized in time n/2k+ k. The correctness of the modified algorithm for Mk 
follows easily from the fact that the unrolling of Mk is the infinite binary tree shown 
in Fig. 28(b). 
5.7. Example. In Example 5.2 we considered a 'quarter plane' iterative array. Here 
we show that it is not important what regular infinite section of plane is used. We 
show this for the case of the 'full plane' iterative array A and the 'quarter plane' 
iterative array B (see Fig. 29). 
The idea of showing two systolic systems equivalent by folding has been introduced 
in [3]. Clearly, the result of folding A twice is B. To illustrate here that folding is 
a special case of our technique we show that the two networks in Fig. 29 can simulate 
each other. (1, 1)-simulation of B by A is trivial (and follows from Corollary 4.3). 
Conversely, B can (4, 1)-simulate A. This is established by Corollary 4.2 with the 
help of the following partition of nodes V1 = {(i, j)[i, j  ~ Z} of the network A: nodes 
(il,jl), (i2,j2) belong to the same partition if and only if ]i1[=]i2[ and IJll--IAI. 
Obviously, B is the network obtained by just described partition of A. 
Clearly, this example generalizes to n-dimensional rrays. For less trivial examples 
of folding see [3]. 
5.8. Example. It is shown in [10] that an iterative tree network Tk (where 1 + k is 
the number of nodes connected with each node) is more powerful than T~ for k > I. 
This paper considers Tk as a language recognizer and thus their result is stronger 
than what is shown in this example. 
It immediately follows from Lemma 4.5 that T~ cannot (1, 1)-simulate Tk (for 
l < k); however, (m, 1)-simulation is possible. To simplify the notation, we show a 





specific result, namely that 7"2 can (3, 1)-simulate T4; the generalization is straight- 
forward. 
Fig. 30(a) (ignoring the boxes) shows T2. The boxes around nodes represent 
partition of nodes of this network. Corollary 4.2 shows that T4 of Fig. 30(b) can be 
(3, 1)-simulated on 7"2. 
We will conclude this paper by showing the relation between various types of 
'shuffle' networks. Perfect shuffle network has been introduced in [27]. It is a difficult 
network to layout but many algorithms can be efficiently implemented on it. 















Fig.  29. 
J , 1 ~" . . . . . . .  
/ \ ~ ~ j \ ' I , I / \ 
I I I \ 
I l , I . 
/~ / - \ -~ I -V  ~/ ~- /-Y ~/-\---I-~' 
(a) 
(b) 
F ig .  30. 
















Design of systolic networks 215 
Perfect shuffle with 2 n processors, n > 1, is the network PS, = (V, m Q, ~b, A, 7") 
where V = {0 , . . . ,  2" - 1}, ~r(2k) = 7r(2k + 1) = k, 2 n-1 + k for k = 0 , . . . ,  2 "-1 - 1 and 
Q, ~b depend on particular algorithm (with default A, r). PS3 is shown in Fig. 31 (a). 
Shuffle xchange (cf. [28]) with 2" processors in the network SE, = ( V, m Q, ~b, A, r) 
where V={0, . . . ,2" - l} ;  ~(2k)=k,  2k, 2k+l ;  ~r(2k+l)=2k, 2k+l ,2" - l+k ,  for 
k = 0, 1 , . . . ,  2 "-1 - 1. (Again default A, r are assumed.) SE3 is shown in Fig. 31(b). 
Note that each processor has a selfloop (indicated by double circles in Fig. 31 (b)) 
which, in other words, means that the processors have memory. 
It is known that algorithms for PS, can be modified to run on SE, and vice versa. 
Formally we have the following theorem. 
5.9. Theorem. For any.n > 1, PS, can be (1,2)-simulated on SE,. 
Proof. Since each processor of SE, has memory (selfloop), we see that if (u, v) ~ E 
in PSn than there is a path u -->+v of length exactly 2 in SEn. Thus the result follows 
by Lemma 4.1. [] 
Now, we will consider three networks obtained as 'partial unrolling' of perfect 
shuffle. Let networks N~, N d, and N e be networks with n. 2" processors connected 
as shown for n = 3 in Fig. 31(c)-(e) with r defined (as zero) in the top row. Note 
that network N a is the unidirectional version of the butterfly network from [28]. 
The numbering of the nodes in Fig. 31(c)-(e) shows how the unrollings of the 
networks N~, N3 b, and N~ are isomorphic. We leave it for the reader to specify 
these networks for arbitrary n, and to verify that the unrollings are again isomorphic. 
Thus we have the following theorem. 
5.10. Theorem. For each n > 1, the networks N~, N d, and N e are equivalent. 
From Theorems 5.9 and 5.10 we have the following corollary, which in the case 
of N~ a is the unidirectional version of [28, Theorem 6.3]. 
5.11. Corollary. Any of networks N~, Nan, N~ can be (1, 2)-simulated network on SEn. 
Note that in the proof of [28, Theorem 6.3] it is also implicitly assumed that all 
the processors of the shuffle-exchange n twork have memory; however in [28, Fig. 
6.1 ] the selfloops are shown for the leftmost and rightmost processors only. This is 
misleading since if these processors have memory, why the selfloops? 
References 
[1] W. Bucher and K. Culik II, On real time and linear time cellular automata, RAIRO Inform. Thror. 
18 (1984) 307-325. 






























P.R. Cappello and K. Steiglitz, Unifying VLSI design with geometric transformations, Proc. IEEE 
1983 Internat. Conf. on Parallel Processing (1983) 448-457. 
C. Choffrut and K. Culik II, Folding of the plane and the design of systolic arrays, Inform. Process. 
Letters 17 (1983) 149-153. 
C. Choffrut and K. Culik II, On real-time cellular automata nd trellis automata, Acta Inform. 21 
(1984) 393-407. 
N. Cole Stephen, Real-time computation by n-dimensional iterative arrays of finite-state machine, 
IEEE Trans. Comput. C-18 (1969) 349-365. 
K. Culik II, J. Gruska and A. Salomaa, Systolic trellis automata; Part I, Internat. J. Comput. Math. 
15 (1984) 195-212; Part II, ibid. 16 (1984) 3-22. 
K. Culik II and J. Pachl, Folding and unrolling systolic arrays, ACM SIGACT-SIGOPS Symp. on 
Principles of Distributed Computing, Ottawa (1982) 254-261. 
K. Culik II and H. Jiirgensen, Programmable finite automata for VLSI, Internat. J. Comput. Math. 
14 (1983) 259-275. 
K. Culik II, A. Salomaa nd D. Wood, Systolic tree acceptors, RAIRO Inform. Theoret. 18 (1984) 
53-69. 
K. Culik II and S. Yu, Iterative tree automata, Theoret. Comput. Sci. 32 (1984) 227-247. 
C.R. Dyer, One-way bounded cellular automata, Inform. and Control 44 (1980) 261-281. 
P.C. Fischer, Generation of primes by a one-dimensional real-time iterative array, J. ACM 12 
(1965) 388-394 
O.H. Ibarra, S.M. Kim and S. Moran, Sequential machine characterizations of trellis and cellular 
automata nd applications, Unpublished manuscript. 
O.H. Ibarra and S.M. Kim, A characterization f systolic binary tree automata nd applications, 
Unpublished manuscript. 
D.E. Knuth, The Art of Computer Programming, Vol. 3 (Addison-Wesley, Reading, MA, 1973). 
S.R. Kosaraju, Speed of recognition of context-free languages by array automata, SIAM J. Comput. 
4 (1975) 331-340. 
H.T. Kung, Why systolic architectures?, Comput. Magazine (January 1982) 37-46. 
H.T. Kung and W.T. Lin, An algebra for VLSI algorithm,Proc. Conf. on Elliptic Problems, 1983. 
H.T. Kung and C.E. Leiserson, Systolic arrays (for VLSI), Proc. Sparse Matrix Conf., 19878. 
F.T. Leighton, Complexity Issues in VLSI (MIT Press, Cambridge, MA, 1983). 
C.E. Leiserson, Area-Efficient VLSI Computations (MIT Press, Cambridge, MA, 1982). 
C.E. Leiserson and J.B. Saxe, Optimizing synchronous systems, Proc 22rid Ann. FOCA Syrup. (1981) 
23-36. 
N.S. Ostlund and R.A. Whiteside, A machine architecture for molecular dynamics: The systolic 
loop, Conf. on Macromolecular Structure and Specificity: Computer Assisted Modeling (New York 
Academy of Sciences, 1983). 
C.A. Mead and L. Conway, Introduction to VLSI (Addison-Wesley, Reading, MA, 1980). 
J.I. Seiferas, Iterative array with direct central control, Acta Inform. $ (1977) 177-192. 
A.R. Smith III, Real-time language recognition by one-dimensional cellular automata, J. ACM 6 
(1972) 233"-253. 
H.S. Stone, Parallel processing with perfect shuffle, IEEE Trans. Comput. C-20 (1971) 153-161. 
J.D. Ullman, Computational Aspects of VLSI (Computer Science Press, Rockville, MD, 1984). 
H. Umeo, K. Morita and K. Sugato, Deterministic one-way simulation of two-way real-time cellular 
automata, Inform. Process. Letters 14 (1982) 158-161. 
U. Weiser and A. Davis, A wavefront notation tool for VLSI array design, in: H.T. Kung, R.F. 
Sproull and G.L. Steele, Jr., eds., VLSI Systems and Computing (Carnegie-Mellon University, 
Computer Science Press, 1981) 226-234. 
