Abstract. We introduce the notion of computational network (CN) which is a general model of an arbitrary (finite or infinite) system of parallel synchronized processors (systolic network). Our basic and very useful tools are topological transformations of the space-time diagrams (unrollings) of computations on CN. We show that the topological transformations on unrollings can be used to design systolic networks, to give simple proofs of their correctness, and to demonstrate the equivalence of different networks. For example, we use the transformation technique to give a concise proof of a strengthened version of Leiserson's and Saxe's Retiming Lemma and Systolic Conversion Theorem. As a practical application we show the correctness of a simple algorithm for distributed sorting on a systolic ring. Many other examples are given.
Introduction
Systolic systems are arrays of synchronized processors which process data in parallel by passing them from one processor to neighboring ones in a regular rhythmical pattern. Most systolic systems use only a few different types of processors arranged in a regular pattern. The principal idea is to perform the required computations with minimal input-output communication. Systolic algorithms were explicitly introduced by Kung and Leiserson [19, 24] but many algorithms of this type were designed earlier, see for example [5, 11, 16].
Recently, systolic systems have been studied extensively; see [17] for a list of references. Most of this work has been devoted to the design of individual algorithms for many different areas, but the efficient layout of systolic systems (and VLSI in general) has also been well studied, see for example [20, 21]. Other topics that have been studied are the development of general programming (design) techniques for systolic algorithms [3, 18, 22, 25, 28, 30], and the systematic study of the power and limitations of certain types of networks from an automata-theoretic point of view [1, 4-7, 9-11, 13, 14, 26, 29].
Our approach is also automata-theoretic, but rather than the study of a specific class of language recognizers or transducers we introduce precise notions to the study of arbitrary systolic networks. For example, we want to make precise the statement that the square grid and hexagonal grid are essentially equivalent, or that the bidirectional linear array and the unidirectional ring are essentially equivalent. Our main goal is to give a general framework for the study of arbitrary systolic systems and computations on them.
We introduce the notion of a 'computational network', which in general is an arbitrary finite or infinite (synchronous) network of processors connected by communication lines with arbitrary integer (even negative) 'delays' and no queuing capability. We are mainly interested in homogeneous (identical processors) and regular networks, but our definitions do not make any such assumption.
Our main tool is the space-time diagram, called the unrolling in [7], of a computation on a computational network. The unrolling of a computational network is a simple form of a data-flow diagram; it motivates our definition of a 'computational diagram'. We consider two networks to be equivalent when they have isomorphic unrollings. This is a much stronger equivalence than equivalence based on the same input-output function. Two different networks can have isomorphic unrollings, that is, the unrolling of one network can be topologically transformed to the unrolling of the other. Such a topological transformation is a useful tool when designing new networks and in proving their correctness. Topological transformations as such are not new, see [4], but we introduce a general model of a computational network and a computational diagram which allow concise proofs using the transformation technique for a broad class of parallel networks. We demonstrate that most of the known techniques for systolic system design, such as systolic conversion [18, 22], folding [3], and speed up [26], are special cases of topological transformations on unrollings. This also holds for the wavefront technique [30] and geometric transformations [2], but lack of space prevents their discussion here.
A particularly simple type of a computational network is the pure computational network in which for each node n the paths from all inputs to node n have the same delay. In practice such a network always allows pipelining of inputs with pipelining of period one. We show that a pure computational network and its unrolling are isomorphic.
In Section 3 we study the semisystolic and systolic networks introduced in [22] and generalize the Retiming Lemma and Systolic Conversion Theorem from [22] . Our proofs using unrolling are simpler than the original proofs and at the same time they are more general, since we do not restrict ourselves to a 'single host' and throughout this paper we study not only finite but also infinite networks.
The systolic conversion preserves not only the structure, that is the underlying graph, of a computational network but also the functions performed by the individual processors. Only the timing, that is the initiation of the processors, and the delays between them are changed. Thus we obtain a very strong equivalent network. In our definition of equivalent networks we require that the networks have the same unrolling, that is they perform step by step identical computations but not necessarily in pairwise matching processors. In Section 4 we consider networks which are not equivalent in this sense but which still perform essentially the same computations. We introduce the notion of one network being (m, k)-simulated by another network. Intuitively this means that the simulating network performs identical computations using processors each of which simulates m original processors and requiring k steps on the simulating network to simulate one step of the original one.
The notion of simulation allows us to compare precisely the power of various well-known networks. It is generally known, even though not stated explicitly in the literature, that the square grid is equivalent to the hexagonal grid. We make this comparison precise and give a number of further practical examples. One of them generalizes a result from [29] and shows that any network on a bidirectional linear array can be transformed into a unidirectional ring of the same size which is half as fast. Similarly, a bidirectional two-dimensional array can be converted into a unidirectional toroid, and similarly for higher dimensions.
We close Section 4 by describing a simple efficient algorithm for distributed sorting on a unidirectional ring of processors. This algorithm is obtained and proved correct by transforming the well-known odd-even transposition sort algorithm from the linear bidirectional array to the unidirectional ring.
In the last section we give further applications of the transformation technique. For instance, we give a simple proof of the result from [25] that global control does not increase the power of m-dimensional iterative arrays. As a new result we demonstrate that the same result also holds for m-dimensional cellular automata.
Preliminaries
Given a possibly infinite set $V$, the set of all finite sequences of elements from $V$ (words) is denoted by $V^*$. For $x \in V^*$ the length of the sequence $x$ is denoted by $|x|$. For a finite set $M$ the cardinality of $M$ is denoted by $|M|$. We use $\mathbb{Z}$ to denote the set $\{\ldots, -1, 0, 1, \ldots\}$ and $\mathbb{N}$ to denote the set $\{0, 1, 2, \ldots\}$; $\emptyset$ denotes the empty set.
Generally we omit double parentheses in cases where $f$ is a function whose argument is a pair $(x, y)$, i.e., we write $f(x, y)$ rather than $f((x, y))$. Similarly, for functions returning functions, we prefer the simpler $\varphi(v)(x, y)$ to the more precise $(\varphi(v))(x, y)$.
An ordered digraph is a structure $H = (V, \pi)$ where $V$ is a (finite or infinite) set of nodes and $\pi: V \to V^*$ is a function. If $\pi(v) = v_1 \ldots v_k$, for $v_i$ in $V$, then $(v_1, v), \ldots, (v_k, v)$ are (directed) edges (in that order) from nodes $v_i$ to the node $v$, for $i = 1, \ldots, k$. To stress that the edges are directed we will often use the notation $v_i \to v$ rather than $(v_i, v)$, or even $e: v_i \to v$, if we want to name the edge.
We denote the set of all edges of H by E, and will often talk about the underlying digraph (V, E) . Note that parallel edges and self-loops are allowed in H.
Terms such as (directed) path, cycle, indegree, outdegree, start node, end node (of a path) will be applied to $H$, meaning the corresponding terms for $(V, E)$. Thus, for example, the indegree $\mathrm{indeg}(v)$ is $|\pi(v)|$, i.e., the length of the word $\pi(v)$. A path $p = (v_0 \to v_1, v_1 \to v_2, \ldots, v_{k-1} \to v_k)$ ($k > 0$) will be written as $p: v_0 \to v_1 \to \cdots \to v_k$ or, simply, as $p: v_0 \to^+ v_k$. The length of $p$ will be denoted by $|p|$. If $u \to v$ we also call $u$ the (immediate) parent of $v$, and if $u \to^+ v$ we call $v$ a descendant of $u$.
For the ordered digraph $H = (V, \pi)$ according to our definition, the indegree of every node is finite. Thus if we define $V_i = \{v \in V \mid |\pi(v)| = i\}$ for $i = 0, 1, \ldots$, then $(V_0, V_1, \ldots)$ is a partition of $V$. Note that there is no requirement on the finiteness of an outdegree.
Computational schemas and networks
Let $Q$ be a set (finite or infinite). A pre-computational diagram (preCD) $S$ is a structure $S = (V, \pi, Q, \varphi)$ where $(V, \pi)$ is an ordered digraph and $\varphi$ is a map (a collection of maps) $\varphi: V_k \to [Q^k \to Q]$ for $k > 0$. A preCD is called a computational diagram (CD) if the length of any path (of its underlying digraph) with a given end-point is bounded. Formally, for every $v$ in $V$, there is $b \ge 0$ such that if $p$ is a path with $\mathrm{end}(p) = v$, then $|p| \le b$. Clearly, there cannot be cycles in a CD.
A computation $\alpha$ on a preCD is a map $\alpha: V \to Q$ such that for each $v \in V$ with $\pi(v) = v_1 \ldots v_k$ ($v_i \in V$), $k > 0$, we have
$$\alpha(v) = \varphi(v)(\alpha(v_1), \ldots, \alpha(v_k)), \qquad (1)$$
i.e., the 'value' $\alpha(v)$ in $Q$ is computed from the values at all nodes $w \in V$ for which there is an edge $(w, v)$ in $E$.
Note that (1) imposes no condition on nodes $v \in V_0$; we may call these nodes the inputs of the preCD.
2.1. Example. Let us consider the CD $A = (V, \pi, \varphi)$ where $V = \{a, b, c, d, e\}$, $\pi(a) = \pi(b) = \pi(c) = \varepsilon$ (the empty word), $\pi(d) = ab$, $\pi(e) = dc$, $\varphi(d)(x, y) = x + y$, $\varphi(e)(x, y) = x \cdot y$. CD $A$ is shown in Fig. 1. In the other examples we will show the names of the processors inside the circles, since usually no specific functions will be considered. This very simple CD computes the value $x_c(x_a + x_b)$ in node $e$ from the inputs $x_a$, $x_b$ and $x_c$ given in nodes $a$, $b$, and $c$, respectively.
A second example is a CD that, given $w_1, w_2, w_3, w_4$ and inputs $x_1, x_2, x_3, \ldots$, computes $y_1, y_2, y_3, \ldots$ where $y_i = w_1 x_i + w_2 x_{i+1} + w_3 x_{i+2} + w_4 x_{i+3}$ (see Fig. 2).
There are five columns; if we number them 0 to 4 (left-to-right), then, in column $i$ for $i = 1, 2, 3$, $\varphi(v)$ is a pair $(g_i^1, g_i^2)$ where $g_i^1(s, x) = s + w_{5-i} x$, $g_i^2(s, x) = x$ and, in column 0, $\varphi(v)(y) = y$, where $s$ and $x$ are the left and the right component, resp., of the values (pairs) at the previous nodes. Finally, in column 4, $\varphi(v)(x) = w_1 x$.
In this example there are five kinds of nodes, one in each column. It is easy to design a CD for the same computation in which all the functions are the same (homogeneous CD). The important property of CD is that given values at inputs there is a unique computation.
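As an illustration of this uniqueness property, the following Python sketch evaluates the CD of Example 2.1. The dictionary encoding, the names `pi`, `phi` and `compute`, and the sample input values are assumptions made for illustration only; $\varphi(e)$ is read here as multiplication, which is what the stated result $x_c(x_a + x_b)$ requires.

```python
# Example 2.1 encoded as data (an illustrative encoding, not from the paper):
# pi maps each node to its ordered tuple of parents,
# phi maps each non-input node to the function applied to the parents' values.
pi = {"a": (), "b": (), "c": (), "d": ("a", "b"), "e": ("d", "c")}
phi = {"d": lambda x, y: x + y, "e": lambda x, y: x * y}


def compute(pi, phi, inputs):
    """Extend an assignment on the input nodes to a map alpha on all nodes.

    Each non-input value is obtained by applying rule (1) to the values of
    the parents. The recursion terminates because all paths ending in a
    given node of a CD have bounded length (here the graph is finite and
    acyclic).
    """
    alpha = dict(inputs)

    def value(v):
        if v not in alpha:
            alpha[v] = phi[v](*(value(u) for u in pi[v]))
        return alpha[v]

    for v in pi:
        value(v)
    return alpha


# Hypothetical input values x_a = 2, x_b = 3, x_c = 4.
alpha = compute(pi, phi, {"a": 2, "b": 3, "c": 4})
```

Since every non-input value is forced by rule (1), any two computations that agree on the inputs agree everywhere, which is the content of the uniqueness claim.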
Formally, we have the following theorem. [...] Actually, we have proved a somewhat stronger result, namely that given a preCD and a map $\alpha_0: V_0 \to Q$, the map $\alpha_0$ can be extended to all nodes $v$ in $V$ for which $|v| < \infty$.
Note that alternatively we could have defined a CD with $\varphi$ defined everywhere on $V$, that is also on $V_0$, and interpreted (1) for $v \in V_0$ as $\alpha(v) = \varphi(v)$. Then there would have been exactly one computation on every CD.
A computational network (CN) is a structure $N = (V, \pi, Q, \varphi, \lambda, \tau)$ where $H = (V, \pi, Q, \varphi)$ is a preCD with underlying digraph $(V, E)$, and $\lambda$ is a map $E \to \mathbb{Z}$ labeling each edge with an integer $\lambda(e)$. We interpret $\lambda(e)$ as the delay (in some time units, such as clock cycles). Finally, $\tau: V \to \mathbb{Z}$ is a partial function which determines when a computation (see below) begins at a node $v$ in $V$.
Note: The interpretation of $\lambda(e)$ as a time delay makes sense only when $\lambda(e) \ge 0$; however, we are not assuming this in general because our results are valid also when $\lambda$ is possibly negative. In many examples of CN's the functions $\lambda$ and $\tau$ will be defined by $\lambda(e) = 1$ for each $e \in E$ and $\tau(v) = 0$ for each $v \in V$, that is, each edge involves a unit delay, and initial conditions are specified for each processor (node) at time 0 (and thus the computation starts everywhere at time 1).
We shall often omit the functions $\lambda$ and/or $\tau$ from the description of a network. Such an omission means that the 'default' functions $\lambda(e) = 1$ and $\tau(v) = 0$ are considered. In most examples $Q$ and $\varphi$ are left unspecified; in that case it is understood that arbitrary $Q$ and $\varphi$ are considered.
To [...] $(v_2, t - d_2) \ldots (v_k, t - d_k)$. [...] $Q, \varphi')$, where $\varphi'(v, t) = \varphi(v)$, is a preCD which is meant to describe computations on $N$ spread in time. The problem is that $S$ is a proper preCD with all paths of infinite length, and thus there are no computations in our sense on $S$. We are interested in computations starting at a particular time and continuing from that time on. This is why we have the function $\tau$.
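Let us consider the following three subsets of $V \times \mathbb{Z}$:
(ii) $D$, the descendants of $S$, i.e., the set $\{(v, t) \mid (v, t) \notin S$ and there is $(s, t_0) \in S$ such that $(s, t_0) \to^+ (v, t)\}$,
(iii) $P$, the parents of $D$, i.e., the set $\{(v, t) \mid (v, t) \to (u, t')$ for some $(u, t') \in D\}$.
In both cases $\to^+$ and $\to$ are paths and edges in $G_N$. Denote $\hat{V} = S \cup D \cup P$.
A computation $\alpha$ on $N$ is a function $\alpha: \hat{V} \to Q$ such that, for all $(v, t) \in D$,
$$\alpha(v, t) = \varphi(v)(\alpha(v_1, t - d_1), \ldots, \alpha(v_k, t - d_k)), \qquad (2)$$
where $\pi'(v, t) = (v_1, t - d_1) \ldots (v_k, t - d_k)$.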
Note that it is often convenient to consider $\alpha$ as a partial function $V \times \mathbb{Z} \to Q$. As this cannot lead to ambiguity we shall do so, when convenient.
Intuitively a computation on a network $N$ proceeds as follows. Arbitrary 'initial' values are chosen at the processors in $\mathrm{dom}(\tau)$. More specifically, an initial value from $Q$ is chosen for each processor $v$ in $\mathrm{dom}(\tau)$ at the time $\tau(v)$. Then new values are successively computed according to (2). During this computation arbitrary 'input' values are supplied, when needed, at the nodes without parents.
We are not concerned with formalizing how outputs are produced. In any particular network outputs can be taken from suitable processors at suitable times to realize a desired input-output function. This is the same situation as for gate networks where for many considerations it is not relevant whether the output of a gate is external or not.
We are mainly interested in computational networks. The computational diagrams are auxiliary constructs, they are important because for every CN N there is a CD H, called the unrolling of N, such that each computation a on N has an 'isomorphic' computation on H.
Let $N = (V, \pi, Q, \varphi, \lambda, \tau)$ be a CN. The unrolling $\hat{N}$ of $N$ is the preCD $\hat{N} = (\hat{V}, \hat{\pi}, Q, \hat{\varphi})$ where $\hat{\pi}$ is the restriction of $\pi'$ to $\hat{V}$ and $\hat{\varphi}(v, t) = \varphi(v)$ for $(v, t)$ in $\hat{V}$ ($\hat{V}$ and $\pi'$ are defined above). It is easy to see that $\hat{N}$ is not always a CD; for example, $\hat{N}$ might contain cycles, and thus computations need not exist on every $N$. On the other hand, it is obvious that $\alpha: V \times \mathbb{Z} \to Q$ is a computation on a CN $N$ if and only if it is a computation on its unrolling $\hat{N}$.
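The construction of the unrolling can be sketched in Python for a finite network and a bounded time window. This is a minimal sketch under stated assumptions: the start set is taken to be $S = \{(v, \tau(v)) \mid v \in \mathrm{dom}(\tau)\}$, all delays are non-negative, there are no parallel edges (so delays can be keyed by node pairs), and the function name `unroll` and the dictionary encoding are hypothetical.

```python
from collections import deque


def unroll(pi, lam, tau, t_max):
    """Return the parent map of the unrolling, restricted to times <= t_max.

    pi  : node -> ordered tuple of parents
    lam : (parent, child) -> integer delay (assumed >= 0, no parallel edges)
    tau : partial map node -> start time (assumed to define the start set S)
    """
    children = {}
    for v, parents in pi.items():
        for u in parents:
            children.setdefault(u, []).append(v)

    S = set(tau.items())          # assumed start set {(v, tau(v))}
    D = set()                     # descendants of S that are not in S
    queue = deque(S)
    while queue:
        u, s = queue.popleft()
        for v in children.get(u, []):
            node = (v, s + lam[(u, v)])
            if node[1] <= t_max and node not in S and node not in D:
                D.add(node)
                queue.append(node)

    # Parents of (v, t) are (v_i, t - d_i), in the order given by pi(v).
    hat_pi = {(v, t): tuple((u, t - lam[(u, v)]) for u in pi[v]) for (v, t) in D}
    P = {p for parents in hat_pi.values() for p in parents}
    for node in (S | P) - D:
        hat_pi[node] = ()         # start and parent-only nodes act as inputs
    return hat_pi


# A unidirectional ring of three nodes with the default delay 1 and tau = 0.
ring_pi = {0: (2,), 1: (0,), 2: (1,)}
ring_lam = {(2, 0): 1, (0, 1): 1, (1, 2): 1}
ring_tau = {0: 0, 1: 0, 2: 0}
hat = unroll(ring_pi, ring_lam, ring_tau, t_max=2)
```

For this ring the result has three nodes at each of the times 0, 1 and 2, and every node at time $t > 0$ has exactly one parent at time $t - 1$.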
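Given a network $N = (V, \pi, Q, \varphi, \lambda, \tau)$, let $(V, E)$ be the underlying graph. We may extend $\lambda$ from edges to paths by putting, for $p: v_0 \to v_1 \to \cdots \to v_k$,
$$\lambda(p) = \sum_{i=1}^{k} \lambda(v_{i-1} \to v_i).$$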
For each network $N$ there is a preCD, namely $\hat{N}$, which has (by definition) the same computations. It is easy to show that conversely, given an arbitrary CD $S = (V, \pi, Q, \varphi)$ we can construct a CN with the same computations as $S$. To do so, we use the function $|v|: V \to \mathbb{N}$ defined in the proof of Theorem 2.3, and define $\lambda(u, v) = |v| - |u|$ for all $u \to v$. If we also define $\tau$ by $\tau(v) = 0$ for $v \in V_0$, then $N = (V, \pi, Q, \varphi, \lambda, \tau)$ is a network with the same computations as $S$.
The network $N$ has a useful property; it generalizes the notion of pure network of [6]. We say that a network $N$ is pure if: [...]
Then its unrolling $\hat{N} = (\hat{V}, \hat{\pi}, Q, \hat{\varphi})$ is isomorphic to the preCD $(V, \pi, Q, \varphi)$.

Retiming and systolic conversion
An important property of computations on CN is that if they exist they are uniquely defined given some 'initial values', i.e. an analog of Theorem 2.3 holds for CN's.
First we observe that in general there might not exist any computation on a CN. This can happen on networks with zero or negative 'delays' in $\lambda(u, v)$. Following [...] It is easy to see that there always exist computations on a systolic network on which $\tau(v)$ is bounded from below. However, we will not restrict ourselves to this case and will consider even networks which are not systolic.
3.1. Example. Consider $N = (V, \pi, Q, \varphi, \lambda, \tau)$ where $V = \{1, 2, \ldots\}$, $\pi(k) = k + 1$ for $k \ge 1$, $Q$ and $\varphi$ are arbitrary, $\lambda(e) = 0$ for all $e \in E$ and $\tau(k) = k$ for all $k \in V$ (see Fig. 3). Clearly, there is no computation on the related preCD $S = (V, \pi, Q, \varphi)$; however, for the $\tau$ defined above (and suitable $\varphi$) there are computations on $N$.
Now, we want to find conditions under which a general CN can be transformed to a semisystolic CN or even a systolic CN with the same computations. First, we have to make precise the notion of equivalence for CN. We say that two CN's are equivalent if their unrollings are isomorphic as preCD's.
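Lemma (Retiming Lemma of [22]). Given a CN $N = (V, \pi, Q, \varphi, \lambda, \tau)$ and a function $\delta: V \to \mathbb{Z}$, called lag, define $\lambda_\delta: E \to \mathbb{Z}$ by
$$\lambda_\delta(u, v) = \lambda(u, v) - \delta(u) + \delta(v) \qquad (3)$$
for each $e = u \to v$ in $E$, and $\tau_\delta: V \to \mathbb{Z}$ by $\tau_\delta(v) = \tau(v) + \delta(v)$. Consider the CN $N_\delta = (V, \pi, Q, \varphi, \lambda_\delta, \tau_\delta)$. If $\alpha: \hat{V} \to Q$ is a computation on $N$, and $\alpha_\delta: \hat{V}_\delta \to Q$ is defined by $\alpha_\delta(v, t) = \alpha(v, t - \delta(v))$ for all $(v, t) \in \hat{V}_\delta$, then $\alpha_\delta$ is a computation on $N_\delta$.
Proof. Clearly, $\alpha$ and $\alpha_\delta$ are identical computations when considered as computations on the unrollings $\hat{N}$ and $\hat{N}_\delta$. $\square$
3.3. Corollary. Networks $N$ and $N_\delta$ are equivalent.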
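The retiming rule (3) can be tried out on a small example. The Python sketch below is illustrative only: the three-node cycle, its delays, the lag values, and the function names `retime` and `path_delay` are assumptions. It shows two consequences of (3) that follow by telescoping: the delay of a path changes by $\delta(\text{end}) - \delta(\text{start})$, and the delay of every cycle is unchanged.

```python
def retime(lam, tau, delta):
    """Apply a lag function delta to the delays and start times, as in (3)."""
    lam_d = {(u, v): d - delta[u] + delta[v] for (u, v), d in lam.items()}
    tau_d = {v: t + delta[v] for v, t in tau.items()}
    return lam_d, tau_d


def path_delay(lam, path):
    """Sum of edge delays along a path given as a list of nodes."""
    return sum(lam[(u, v)] for u, v in zip(path, path[1:]))


# Hypothetical network: a directed cycle a -> b -> c -> a.
lam = {("a", "b"): 1, ("b", "c"): 0, ("c", "a"): 2}
tau = {"a": 0, "b": 0, "c": 0}
delta = {"a": 0, "b": 1, "c": -1}

lam_d, tau_d = retime(lam, tau, delta)
cycle = ["a", "b", "c", "a"]
path = ["a", "b", "c"]
```

Note that a retimed network may contain negative delays (here the edge from b to c), which is why the model allows arbitrary integer delays.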
The following two theorems are generalizations of the Systolic Conversion Theorem from [22]. They give the necessary and sufficient conditions for the existence of a lag function $\delta$ which converts a CN to an equivalent semisystolic (systolic) network.
Proof. Assume (4) holds. First we show three properties of $\rho$ following from (4). [...]
Proof. (i) If $u \to^+ v$ and $v \to^+ w$, i.e. $\rho(u, v)$ and $\rho(v, w)$ are finite, then $u \to^+ w$, thus $\rho(u, w)$ is finite and the path $p: u \to^+ w$ (through $v$) is considered when calculating the infimum defining $\rho(u, w)$. Thus $\rho(u, w) \le \rho(u, v) + \rho(v, w)$. If there is no path $u \to^+ v$ or $v \to^+ w$, then the right-hand side of (i) is $\infty$ and (i) holds trivially.
(ii) Let $d = \rho(u, u)$ and $d < 0$. This means that there is a path $p: u \to^+ u$ with $\lambda(p) = d$, but then $p$ followed by $p$ is also a path $u \to^+ u$ and $\lambda(p \cdot p) = 2d < d$. This contradicts the assumption that [...]
(iii) The last inequality immediately follows from (i) and (ii). This completes the proof of the claim. $\square$
Proof of Theorem 3.4 (continued). Now we show that it is possible to define a lag function $\delta: V \to \mathbb{Z}$ so that $\lambda_\delta(u, v) \ge 0$ for each edge $u \to v$, where $\lambda_\delta$ is defined by (3). Let $U$ be a maximal subset of $V$ on which $\delta: U \to \mathbb{Z}$ can be defined so that [...]
holds for all $u, v$ in $U$. If $U \ne V$, then consider $w \in V - U$. By Claim 3.5 (iii) we have [...]
we can choose $\delta(w)$ so that
$$\delta(u) + \rho(w, u) \ge \delta(w) \ge \delta(u) - \rho(u, w).$$
From this we immediately have
$$\delta(w) + \rho(u, w) \ge \delta(u) \quad \text{and} \quad \delta(u) \ge \delta(w) - \rho(w, u),$$
which contradicts the fact that $U$ is a maximal subset of $V$ for which (5) holds. Thus $U = V$, and $\delta$ can be defined on the whole $V$. By the Retiming Lemma, $N_\delta$ and $N$ are equivalent and, by (5),
$$\lambda_\delta(u, v) = \lambda(u, v) - \delta(u) + \delta(v) \ge \lambda(u, v) - \rho(u, v).$$
For every edge $u \to v$, $\rho(u, v) \le \lambda(u, v)$ by the definition of $\rho$, therefore $\lambda_\delta(u, v) \ge 0$, which means that $N_\delta$ is semisystolic. To prove the converse, assume that there is a $\delta$ such that $N_\delta$ is semisystolic. [...]
In the next example the network $M$ itself cannot be converted; however, the network $[2M]$ can be. [...]
The next example shows that arbitrarily large $k$ (slowdown) may be needed in Corollary 3.8. [...]
Finally, the following example shows that for an infinite network a conversion from semisystolic to systolic is not always possible.
3.12. Example. Consider the network $N$ given in Fig. 9. It is easy to verify that for no $k$ the network $[kN]$ satisfies condition (6) from Theorem 3.6.
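Returning to the semisystolic conversion argued in the proof above: for a finite network, a lag function with $\lambda(u, v) - \delta(u) + \delta(v) \ge 0$ on every edge can be computed with a standard Bellman-Ford relaxation of these difference constraints. The sketch below is not the construction used in the proof (which extends $\delta$ node by node and also covers infinite networks); the function name `semisystolic_lag`, the dictionary encoding, and the sample networks are assumptions. It returns `None` when some cycle has negative total delay, in which case no such lag exists because retiming preserves cycle delays.

```python
def semisystolic_lag(nodes, lam):
    """Find delta with lam[u, v] - delta[u] + delta[v] >= 0 for every edge.

    Works for finite networks only. Returns None if some cycle has a
    negative total delay.
    """
    delta = {v: 0 for v in nodes}
    for _ in range(len(nodes)):
        changed = False
        for (u, v), d in lam.items():
            # Enforce delta[u] <= delta[v] + lam(u, v).
            if delta[v] + d < delta[u]:
                delta[u] = delta[v] + d
                changed = True
        if not changed:
            return delta
    return None  # still changing after |V| passes: a negative cycle exists


# Hypothetical network with one negative delay but a non-negative cycle.
nodes = ["a", "b", "c"]
lam = {("a", "b"): -1, ("b", "c"): 2, ("c", "a"): 0}
delta = semisystolic_lag(nodes, lam)
retimed = {(u, v): d - delta[u] + delta[v] for (u, v), d in lam.items()}

# Hypothetical network whose only cycle has total delay -1.
bad = semisystolic_lag(["a", "b"], {("a", "b"): -1, ("b", "a"): 0})
```

The computed $\delta(u)$ is the minimum of 0 and the smallest delay of a path starting at $u$, so the triangle inequality gives the required bound on every edge.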
Simulation of networks
In this section we shall investigate networks performing essentially identical computations, however not necessarily equivalent in the sense of the previous section.
Given two CD's $G_i = (V_i, \pi_i, Q_i, \varphi_i)$ ($i = 1, 2$) we say that $G_2$ simulates $G_1$, or that $G_1$ is simulated on $G_2$, if there are two maps $\rho: V_1 \to V_2$ and $\psi: V_1 \times Q_2 \to Q_1$ such that for every computation $\alpha_1$ on $G_1$ there exists a computation $\alpha_2$ on $G_2$ for which
$$\alpha_1(v) = \psi(v, \alpha_2(\rho(v))).$$
We say that the pair of maps $\rho, \psi$ establishes the simulation, or that $G_2$ simulates $G_1$ through $\rho, \psi$.
The notion of simulation is too general; for example, every finite CD (i.e., $V$ and $Q$ finite) can be simulated on a finite automaton. Therefore we restrict $\rho$ by requiring the existence of a bound $m$ on the number of nodes that can be merged by $\rho$, i.e., $|\rho^{-1}(u)| \le m$ [...]
we say that $N_2$ $(m, k)$-simulates $N_1$.
Intuitively, $k$ corresponds to the slowdown of $N_1$ in order to enable $N_2$ to simulate it. [...]
Proof. Since the idea of the proof is fairly straightforward we will omit the lengthy technical details and give only the outline of the construction of $N_2$.
First, $\pi_2$ is chosen arbitrarily so that $(V_2, E_2)$ is the underlying graph of $N_2$. Next, given this $\pi_2$, we construct $Q_2$ (tuples of elements of $Q_1$) and $\varphi_2$ as follows. Each processor of $N_2$, i.e. $\varphi_2(v)$ for each $v \in V_2$, performs two kinds of tasks. It simulates concurrently all the processors in $\gamma^{-1}(v)$ (at most $m$) and, if $v$ is a node (other than the end-node) of any path $p(e)$, $e \in E_1$, then $v$ passes values (from $Q_1$) along that path. If a node lies on several paths (not as end-node), then it passes, generally different, values along each path.
$\tau_2$ is defined as follows. If $u \in \mathrm{dom}(\tau_1)$, then $\gamma(u) \in \mathrm{dom}(\tau_2)$ and $\tau_2(\gamma(u)) = \tau_1(u)$. Condition (ii) assures that $\tau_2$ is well defined. Using the conditions (i) and (iii) it is easy to verify that CN $N_2$ $(m, k)$-simulates CN $N_1$.
Finally, if the indegree of $(V_1, E_1)$ is finite (as is the case in all practical applications), then the number of all paths $p(e)$ going through any fixed node is uniformly bounded, so the length of the tuples in $Q_2$ is bounded too, implying the finiteness of $Q_2$. $\square$
Note that if we are interested in networks with processors of limited complexity, that is, functions $\varphi$ restricted to a certain class of functions, then the construction in the above proof preserves those classes of functions that contain all projections and are closed under composition.
Two special cases of Lemma 4.1 often occur.
Corollary. Let $G_1 = (V_1, E_1)$ be a graph and let $\{U_j \mid j \in V_2\}$ be a partition of $V_1$. We let $G_2 = (V_2, E_2)$ where $(a, b) \in E_2$ if and only if there is an edge in $E_1$ from some node $u \in U_a$ to a node $v \in U_b$. For every network $N_1$ on $G_1$ there is a network $N_2$ on $G_2$ such that $N_1$ is $(m, 1)$-simulated on $N_2$.
Here again $m = \max_{j \in V_2} |U_j|$.
Corollary. Let $G_1$ be a subgraph of $G_2$, i.e., $G_i = (V_i, E_i)$, $i = 1, 2$, with $V_1 \subseteq V_2$ and $E_1 \subseteq E_2$. Then every network $N_1$ on $G_1$ can be $(1, 1)$-simulated by a network $N_2$ on $G_2$.
Finally, the following lemma is easy to prove.
Lemma. Let $N_1, N_2, N_3$ be networks such that $N_2$ $(i_1, j_1)$-simulates $N_1$ and $N_3$ $(i_2, j_2)$-simulates $N_2$. Then $N_3$ $(i_1 i_2, j_1 j_2)$-simulates $N_1$.
The following lemma is useful in proving, for two given networks, that one cannot simulate the other.
Lemma. Let $N_i = (V_i, \pi_i, Q_i, \varphi_i, \lambda_i, \tau_i)$, $i = 1, 2$, be two networks. If $N_2$ $(m, n)$-simulates $N_1$ through $\rho$ and $\psi$, then for every path $p: v_0 \to v_1 \to \cdots \to v_k$ in the underlying graph of $N_1$, $\rho(p): \rho(v_0) \to^+ \rho(v_1) \to^+ \cdots \to^+ \rho(v_k)$ is a path of $N_2$. If, moreover, [...]
In the examples throughout the paper we will frequently use bidirectional communications between processors (nodes), and processors (nodes) with selfloops. We introduce the notational abbreviations for these cases as shown in Fig. 10. Another abbreviation also shown in Fig. 10 is the omission of 0-degree nodes (input processors); only the 'half edges' entering the other nodes are shown. Finally, we would like to recall the following conventions. An omitted edge label implies that the value of $\lambda$ for this edge is one. When $\tau$ is not explicitly given, we assume that $\tau$ is defined for all the nodes and is zero everywhere.
Example (Simulations between hexagonal and square grid networks).
A hexagonal grid network, i.e. a grid as in Fig. 11(a), can be drawn as in Fig. 11(b); that is, as a square grid network with some connections omitted. Therefore, a hexagonal grid network can be $(1, 1)$-simulated by a square grid network.
To show the converse, consider any two nodes connected by an edge of the square grid. Clearly, there is a path between them in the hexagonal grid of length at most 3, and because of the presence of self-loops there is also a path of length exactly 3. Thus every square grid network can be $(1, 3)$-simulated [...] on the square grid, then the corresponding pair of nodes on the hexagonal grid can be connected by a path of length 4. Thus a more elaborate construction would allow slowdown of only 2 rather than 3 as in our construction above.
Example (Simulations between square and triangular grid networks).
When we draw the triangular grid as shown in Fig. 12, we see that similar considerations as in Example 4.6 show that a square grid network can be $(1, 1)$-simulated by a triangular grid network and that $(1, 2)$-simulation is possible in the reverse direction.
Example (Simulations between bidirectional ring and bidirectional linear array (each consisting of $n$ nodes)). Let us denote these networks $BR_n$ and $BA_n$, respectively. $(1, 1)$-simulation of $BA_n$ on $BR_n$ is trivial, since $BA_n$ is obtained from $BR_n$ by omitting one edge. To prove the converse, we first note that it follows by Lemma 4.1 that the network $M_n$ given (for $n = 7$) in Fig. 13 can be $(1, 2)$-simulated on $BA_n$. Now $BR_n$ can be $(1, 1)$-simulated on $M_n$ since it is obtained from $M_n$ by removing all but two 'short' edges, which is demonstrated by drawing $BR_7$ as in Fig. 14.
4.9. Example (Simulations between homogeneous $BR_n$ and the homogeneous unidirectional ring with $n$ nodes ($UR_n$)). Again $BR_n$ trivially simulates $UR_n$. To show the converse we must assume that $BR_n$ is homogeneous, i.e., all its processors perform identical functions ($\varphi(u) = \varphi(v)$ for $u, v$ in $V$). We show the simulation in two steps.
Let $BR_n = (V, \pi, Q, \varphi)$ where $V = \{v_1, \ldots, v_n\}$ ($BR_n$ has the 'default' $\lambda$, $\tau$). Consider the network $C_n = (V, \pi_C, Q, \varphi)$ where $\pi_C: V \to V^*$ is defined by $\pi_C(v_i) = v_i v_{i \oplus 1} v_{i \oplus 2}$, where $\oplus$ means addition mod $n$. $C_7$ is shown in Fig. 15. Clearly, since the networks $BR_n$ and $C_n$ are homogeneous, $\rho: (v_i, t) \mapsto (v_{i \ominus t}, t)$, for $i = 1, \ldots, n$ and $t \ge 0$ (with $\ominus$ denoting subtraction mod $n$), establishes the isomorphism of the unrollings of $BR_n$ and $C_n$. Thus $BR_n$ and $C_n$ are equivalent, and also each (1, 1) [...]
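The rotation argument can be checked mechanically for small $n$. The Python sketch below assumes that each node of $BR_n$ reads its left neighbor, itself, and its right neighbor with unit delay, that each node $v_i$ of $C_n$ reads $v_i$, $v_{i+1}$ and $v_{i+2}$ (indices mod $n$), and that nodes are numbered from 0; the function name `check_rotation` is hypothetical. It verifies that shifting the node index by the elapsed time maps parent sets onto parent sets, ignoring the order of parents.

```python
def check_rotation(n, t_max):
    """Check that (i, t) -> ((i - t) mod n, t) maps the unrolling of the
    bidirectional ring onto the unrolling of C_n, for times 1..t_max."""

    def rho(i, t):
        return ((i - t) % n, t)

    for t in range(1, t_max + 1):
        for i in range(n):
            # Parents of (i, t) in the unrolling of BR_n (assumed wiring).
            br_parents = {((i + d) % n, t - 1) for d in (-1, 0, 1)}
            # Parents of the image node in the unrolling of C_n.
            j, _ = rho(i, t)
            c_parents = {((j + d) % n, t - 1) for d in (0, 1, 2)}
            if {rho(u, s) for (u, s) in br_parents} != c_parents:
                return False
    return True


ok_7 = check_rotation(7, 10)
ok_4 = check_rotation(4, 6)
```

Because the comparison uses sets, it confirms the isomorphism of the underlying digraphs only; matching the order of the parents as well is where homogeneity of the processors is needed.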
Example (Simulations between bidirectional linear array ($BA_n$) and bidirectional linear array without selfloops ($W_n$)). Network $W_7$ is shown in Fig. 16. Trivially, $BA_n$ $(1, 1)$-simulates $W_n$. We show that $BA_n$ can be $(1, 2)$-simulated on $W_{2n-1}$ and $W_{2n-1}$ can be $(2, 1)$-simulated on $BA_n$. Let $V_n = \{1, \ldots, n\}$ be the nodes of both $BA_n$ and $W_n$. The map $\gamma: i \mapsto 2i - 1$, $i = 1, \ldots, n$, maps $V_n$ into $V_{2n-1}$. Clearly, for every edge of $BA_n$ there is a path of length 2 in $W_{2n-1}$ connecting the corresponding nodes. Thus $W_{2n-1}$ simulates $BA_n$ by Lemma 4.1. The converse is easy to see by mapping pairs of nodes $2i - 1$, $2i$ of $W_{2n-1}$ into one node $i$ of $BA_n$ (except the last node $n$).
[...] we restrict $\tau$ only to the 'old' nodes of $W_{2n-1}$ then the unrolling of this modified $\tilde{W}_{2n-1}$ has only one component. The unrolling of $\tilde{W}_7$ is shown in Fig. 17. We can also consider the pure homogeneous network $T_{2n-1} = (V, \pi, Q, \varphi, \lambda, \tau)$ where $V, \pi, Q, \varphi$ are as in the unrolling of $\tilde{W}_{2n-1}$, $\lambda = 1$ for all edges and $\tau$ is defined (and equal to zero) only on the top row. Clearly, the unrollings of $T_{2n-1}$ and $\tilde{W}_{2n-1}$ are isomorphic and therefore $T_{2n-1}$ and $\tilde{W}_{2n-1}$ are equivalent.
We can summarize the results demonstrated in Examples 4.8 to 4.11 in the following Theorem 4.12. This theorem generalizes similar results for various types of cellular, iterative or trellis automata defined on structures like unidirectional linear arrays, bidirectional arrays, and rings.
In more detail, Table 1 shows the values of $(i, j)$. A pair $i, j$ in the intersection of a row and column means that a network named in the row $(i, j)$-simulates the network named in the column. The values not shown in Examples 4.8 to 4.11 follow from transitivity of simulation. The exception is the pair 2, 2. From the transitivity we get 1, 4, but the value 2, 2 can be shown easily.
This result means that a solution [...] processor with selfloop is implemented as a device with memory then it is typically much cheaper to update one location than to rewrite the whole memory. Thus if a single update means to change the value $q$ to $q'$, then this can be easily done in one clock step inside one processor but might be impossible to do in one step in the neighboring processor, since this could require communicating the whole contents of the memory in one step.
Theorem 4.12 can be used to design systolic algorithms and prove their correctness. For example, the unidirectional (systolic) ring has been implemented by Ostlund and used specifically for computations in molecular dynamics (see [23]). The algorithms described in [23] were designed directly for the unidirectional ring. However, having Theorem 4.12, it is typically easier to program such algorithms, and in particular to prove their correctness, for bidirectional linear arrays and then transform them to unidirectional rings. The algorithm for distributed sorting is designed using this method in the following example.
4.13. Example. A distributed sorting algorithm is designed and proved correct for the unidirectional ring of microprocessors each of which can store the same number of records (numbers).
The well-known odd-even transposition sort for $2n$ records [15, p. 241] can be easily implemented on the network $T_{2n-1}$ in time $2n - 1$. Each processor sorts two records (numbers), with suitable modification for the 'end processors'. Now we use the result that every sorting network also works for multisets when we start with sorted multisets and replace the operation of sorting two elements by merging two multisets, see [15, p. 241]. By Theorem 2.4 the unrolling of $T_{2n-1}$ is isomorphic to $T_{2n-1}$ itself. Thus both $T_7$ and its unrolling are shown in Fig. 17. We compare it with the unrolling of $UR_4$ shown in Fig. 18. We see that by omitting one node in each even row (and therefore also the dotted edges) in Fig. 18 we obtain a subgraph isomorphic to the one shown in Fig. 17.
Theorem. Any of the homogeneous CN of the following type can be $(i, j)$-simulated [...]
It is now easy to verify that the following is a correct algorithm for $(2n - 1)$-step sorting on $UR_n$. We assume that initially there is a sorted multiset of at most $2k$ elements in each processor (if necessary local sorting is performed first). Then in each step each processor performs the following. It sends the 'left half' of the multiset, i.e., the $k$ smallest elements (or all of them if there are fewer than $k$ of them), to the left neighbor and merges the remaining elements with those coming from the right neighbor. The only exception is that no sorting is done across 'the fence'. The fence is originally between the processors $v_n$ and $v_1$. The fence is sent at 'half speed' through the processors, i.e., it is in the 'middle' of processor $v_{n-t}$ at the time $2t - 1$ and that processor is 'inactive'. After $2n - 1$ steps the 'fence' returns to its original position and the sorting is completed as shown in the following example of 3 processors, each containing 4 numbers. The bar represents the fence.

8 6 3 4   5 6 3 1   9 1 4 6
5 6 6 8   1 1 3 4   6 9 | 3 4
1 1 6 8   3 4 6 9 | 3 4 5 6
3 4 6 8   6 9 | 3 4   1 1 5 6
6 6 8 9 | 1 1 3 4   3 4 5 6
8 9 | 1 1   3 3 4 4   5 6 6 6
1 1 3 3   4 4 5 6   6 6 8 9
Note that if some processor is not 'full', i.e., contains fewer than $2k$ elements, it still sends $k$ elements to the left, which means that it pretends to contain additional dummy elements considered larger than all the others; these dummies are therefore always retained. Thus the algorithm is correct in this case as well. Alternatively, we can retain the $k$ largest elements and send the rest to the left, thus treating the nonexistent elements as the smallest.
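The step rule of the ring algorithm can be simulated with a short Python sketch. It rests on assumptions that are not part of the text: every processor is taken to be full (exactly $2k$ numbers), the names `ring_sort`, `boundary` and `middle` are illustrative, and the result is read off cyclically, starting immediately to the right of the fence after $2n-1$ steps.

```python
from heapq import merge

def ring_sort(blocks, k):
    """Simulate 2n-1 steps on a ring of n processors holding 2k numbers each."""
    n = len(blocks)
    assert all(len(b) == 2 * k for b in blocks)   # assumed: all processors full
    # Local sort; each processor is stored as (left part, right part).
    left = [sorted(b)[:k] for b in blocks]
    right = [sorted(b)[k:] for b in blocks]
    boundary = n - 1   # fence lies between the last and the first processor
    middle = None      # processor that currently has the fence inside it
    for _ in range(2 * n - 1):
        # Every processor sends its left part to its left neighbour.
        incoming = [left[(i + 1) % n] for i in range(n)]
        new_left, new_right = [], []
        for i in range(n):
            if i == boundary:
                # The fence separates kept and incoming parts: no merging.
                new_left.append(right[i])
                new_right.append(incoming[i])
            else:
                merged = list(merge(right[i], incoming[i]))
                new_left.append(merged[:k])
                new_right.append(merged[k:])
        left, right = new_left, new_right
        # The fence advances by half a processor to the left in each step.
        if boundary is not None:
            boundary, middle = None, boundary
        else:
            boundary, middle = (middle - 1) % n, None
    # Read the ring cyclically, starting just to the right of the fence.
    out = list(right[middle])
    for j in range(1, n):
        i = (middle + j) % n
        out += left[i] + right[i]
    return out + left[middle]

result = ring_sort([[8, 6, 3, 4], [5, 6, 3, 1], [9, 1, 4, 6]], k=2)
```

For this input, `result` is the sorted list of all twelve numbers. The intermediate states of the sketch need not coincide row by row with any particular printed example, since they depend on how the initial configuration is laid out.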
Except for the pure network $T_{2n-1}$, our Theorem 4.12 deals with one-dimensional structures. This result can be generalized to $m$-dimensional structures (arrays and 'toroidal structures'). We formulate it for the most interesting case of two-dimensional structures; the other cases are left to the reader.
Example (Simulations between the homogeneous bidirectional two-dimensional array $BA_{m,n}$ and the homogeneous unidirectional two-dimensional toroid $UT_{m,n}$). $BA_{4,3}$ is shown in Fig. 19 and $UT_{4,3}$ in Fig. 20. A straightforward generalization of the technique used in Example 4.9 (rotation of the nodes both horizontally and vertically) shows that the homogeneous bidirectional two-dimensional toroid $BT_{m,n}$ can be $(1, 2)$-simulated on $UT_{m,n}$. Since $BA_{m,n}$ can trivially be $(1, 1)$-simulated on $BT_{m,n}$, we conclude that $BA_{m,n}$ can be $(1, 2)$-simulated on $UT_{m,n}$. To show the converse is easier. Using the same argument as in Example 4.9 we conclude that not only $BA_{m,n}$ but [...]
Therefore, we have another useful design tool: every algorithm for a bidirectional two-dimensional array (mesh-connected processors) can easily be modified to run at half speed on a unidirectional two-dimensional toroid of the same size.
Further examples
We start this section by examining the relation between the networks $M$ and $kM$ as far as simulation is concerned. Obviously $M$ can be $(1, k)$-simulated on $kM$. To see this, it is enough to take $p(v, t) = (v, kt)$ and $\varphi(v, q) = q$. If we modify the definition of $(j, k)$-simulation so that $k$ can be a rational number, then conversely $kM$ can be $(1, 1/k)$-simulated by $M$. As was already mentioned, the intuitive interpretation of '$N_2$ $(m, k)$-simulates $N_1$' is '$N_2$ can do what $N_1$ does, but $k$ times more slowly'. The concept of time is particularly important when considering how networks receive their inputs. Informally, a network working $k$ times more slowly needs to get its input $k$ times more slowly to do the same work.
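The correspondence $p(v, t) = (v, kt)$ can be checked on a toy network. The sketch below assumes a simplified semantics that is not taken from the text: the value of a node at time $t > 0$ is a function of the values of its predecessors at time $t$ minus the edge delay, and every node holds its initial value at all times $t \le 0$. The helper `network_values`, the 3-node ring and the transition function `f` are all hypothetical.

```python
from functools import lru_cache

def network_values(preds, init, f, k=1):
    """Return value(v, t) for the network with every delay multiplied by k."""
    @lru_cache(maxsize=None)
    def value(v, t):
        if t <= 0:                       # assumed: initial value before any step
            return init[v]
        return f(tuple(value(u, t - k * d) for u, d in preds[v]))
    return value

# Hypothetical ring M: node v reads itself and its left neighbour, delay 1.
preds = {v: [(v, 1), ((v - 1) % 3, 1)] for v in range(3)}
init = {0: 1, 1: 0, 2: 0}
f = lambda xs: sum(xs) % 5

M = network_values(preds, init, f)          # the network M
kM = network_values(preds, init, f, k=3)    # the network kM with k = 3

# Node v of kM at time kt should hold what node v of M holds at time t.
agree = all(M(v, t) == kM(v, 3 * t) for v in range(3) for t in range(10))
```

Under these assumptions `agree` is true: $kM$ reproduces the computation of $M$, only $k$ times more slowly.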
In Example 5.1 below we confirm this interpretation by considering a well-known example of a linear iterative array which recognizes palindromes.
First we describe an iterative array (see [5]) as a CN. Consider the network $L$ in Fig. 21(a), with $\tau$ defined for node 1 only ($\tau(1) = 1$). The unrolling of $L$ is the CD $U$ in Fig. 21(b). In order to define a computation on $U$ (and thus on $L$), initial values must be given in the processors (input nodes) with indegree 0. These are the nodes (a) $(0, t)$ for $t = 0, 1, 2, \ldots$, which, because they are given to the same processor at different times, are called serial inputs, and (b) $(j+1, j)$ and $(j+2, j)$ for $j = 0, 1, 2, \ldots$, which are called parallel inputs. Now, $L$ is an iterative array if $Q$ is a finite set which contains a special element, say $\#$, and the parallel inputs are 'quiescent', i.e., only those computations $\alpha$ on $L$ are considered for which $\alpha(j+1, j) = \alpha(j+2, j) = \#$ for all $j = 0, 1, \ldots$.
5.1. Example (Speed up).
Consider the example of a linear iterative array of finite-state machines recognizing palindromes. Such an iterative array was given in [5]. In [22], Cole's result was reproved using the Systolic Conversion Theorem. It is easy to construct a semisystolic linear array $P_0$ recognizing palindromes. Its underlying graph is shown in Fig. 22; $\tau$ is defined (as zero) in the leftmost node only. The description of $\varphi$ can be found in [22], but is unimportant for the following discussion.
To convert this semisystolic network to a systolic network, we need to 'slow down' the network, i.e., $[2P_0]$ rather than $P_0$ is equivalent to some linear iterative array $P$. Unfortunately, $P$ running at half speed, as explained above, needs its input coming [...] We shall show this only for the example of a one-dimensional palindrome recognizer ($k = 1$), but the generalization is obvious.
In Fig. 22, $P_0$ represents a semisystolic network, a recognizer of palindromes. Network P~ is obtained by the Conversion Theorem (edges with delay 3 are omitted because they are redundant), and P~ is equivalent to $[2P_0]$. Note that $(V \times Z, E)$, the graph underlying the network P~, has two components: one is the unrolling of $P_1$, and the other absorbs the odd-numbered inputs, as discussed above. P~ is not a linear iterative array, since its selfloops have delay 2. However, the function $\varphi$ can easily be modified to obtain a network with delay 1 on each selfloop of a linear array. Calling this modified network P~, we see that P~ is a linear iterative array, and also that any computation on $P_1$ can be done on P~. However, P~ still does not recognize palindromes (because P~ does not).
Consider now the network $P_2$. Informally, we may say that two steps of a computation on $P_1$ correspond to one step on $P_2$. Formally, we may only say that P~ $(1, 2)$-simulates $P_2$; this follows from Lemma 4.1. We did not define $(1, \tfrac12)$-simulation, but according to the discussion above, there is some justification in saying that $P_2$ $(1, \tfrac12)$-simulates P~. Regardless of whether $(1, \tfrac12)$-simulation is defined or not, comparing computations on P~ and $P_2$ we see that $P_2$ is now a proper palindrome recognizer; however, $P_2$ fails to be a linear iterative array. Thus, one more step is needed. By Lemma 4.1, $P_3$ $(2, 1)$-simulates $P_2$; this is seen by taking the function $p$ from the lemma as $p\colon i \mapsto [(i+1)/2]$, where the processors (nodes) in both $P_2$ and $P_3$ are numbered $0, 1, \ldots$ from left to right. We can conclude that $P_3$ is a linear iterative array which does recognize palindromes. Note that it would be possible to develop an alternative theory for retiming. Instead of using networks like $2N$ we could have allowed retiming by $\tfrac12, 1, \tfrac32, \ldots$; however, as this does not seem to give any significant advantages, this route has not been taken.
Also note that it was essential in this example that we consider iterative (linear or more general $n$-dimensional) arrays. The 'speed up' step from $P_2$ to $P_3$ cannot be done in the general case. In particular, [10, Theorem 6.4] shows that speed up is not possible for iterative tree automata.
It is shown in [25] that for $d$-dimensional iterative arrays the power of a system is not increased by allowing Direct Central Control (or Global Control). In the following example we give another proof of this result using the Systolic Conversion Theorem. For simplicity we consider the case $d = 2$ only.
5.2. Example. The network $A$ in Fig. 23 is a square grid network with the usual connections with delay 1, and with additional connections with delay 0. Clearly, these additional connections can be used to accomplish the global control. Note that the 0-connections are quite arbitrary as long as they connect the origin with every other node and no 0-loops are introduced. The network $[2A]$ can be converted, just as in the previous example, into the systolic network $B$ (Fig. 23). It is useful to compare $A$ in Fig. 23 with $P_0$ in Fig. 22. The difference, apart from $P_0$ being a one-dimensional array while $A$ is two-dimensional, is that the 0-connections are oriented in opposite directions. In Fig. 23 they lead from a fixed node to every node, while in Fig. 22 they lead from every node to a fixed node. Despite this difference, our initial transformations are the same.
Since between every pair of nodes connected by an edge in $B$ there is a path of weight 3, we can apply Lemma 4.1. Thus, we conclude that the network $C$ in Fig. 23 $(1, 3)$-simulates network $B$. The initial function $\tau$ for all three networks $A$, $B$, $C$ is defined (as zero) for the origin only and is therefore not affected by the modifications.
The following example demonstrates the influence of input and output considerations on geometric transformations. The cellular automata of [26] are superficially similar to the iterative arrays investigated in the previous example. $C$ and $U$ in Fig. 24 illustrate a network, and its unrolling, which corresponds to a one-dimensional version of a cellular automaton (CA). Some nodes and edges of $U$ are drawn dotted; this indicates that we consider real-time computations on CA. The figure pictures the computation with four inputs and one output. Formally, the finite input is accommodated on an infinite network $C$ by again requiring a fixed symbol, say $\#$ in $Q$, and extending any finite input by appending $\# \# \ldots$ on the right.
The network $CG$ in Fig. 24 represents a possible definition of a (one-dimensional) CA with global (central) control. It might have been more natural to also use connections of delay 1 running in parallel with the 0-delays but, obviously, these connections would be redundant.
5.3. Example. This example shows that any computation (in real time) on a CA with global control can be done (in real time) on the standard CA. Network $C$ in Fig. 24 represents a CA; its unrolling is the CD $U$ in the same figure. The part of the unrolling which is irrelevant for real-time computations is drawn in dotted lines. Finally, $CG$ in Fig. 24 represents a cellular automaton with additional 0-connections implementing the global control. Since $CG$ has selfloops and 0-connections from left to right, the 1-connections from left to right become redundant and are omitted.
We proceed initially in the same way as in Example 5.2 for iterative arrays, namely $[2CG]$ is retimed. The resulting network $C'$ and its unrolling $U'$ are shown in Fig. 25. In the following simple steps we demonstrate that any computation $\alpha$ on $U'$ can be simulated (in fact $(2, 1)$-simulated) on a CA.
(1) At the price of doubling the size of $Q$, the delays on selfloops in $C'$ (Fig. 25) can be changed to 1.
(2) The unrolling $U'$ of $C'$ after this modification is now a subgraph (more precisely a preCD) of the unrolling of a network $C''$ which is like $C$, but for input of double length. This is shown in Fig. 26, where the subgraph corresponding to $U'$ is drawn in bold. It is easy to see that the network $C''$, whose unrolling is $U''$, can do the computation $\alpha$ (as modified already in (1)) as long as instead of the original input $i_1 i_2 i_3 i_4$ we use $i_1 \# i_2 \# i_3 \# i_4$ (or similar). This is so because the diagonal path going from the top left node down to the right, the path on which the inputs of $\alpha$ are needed, can be 'computed' by sending a signal along it.
(3) The dashed boxes in the network $C''$ in Fig. 26 show which processors are mapped together in the final $(2, 1)$-simulation.
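The input of double length used in step (2) can be produced by interleaving the quiescent symbol with the original input. The following is a minimal sketch; the function name `pad_input` is hypothetical, and since the text allows the padded input to be "similar", the exact placement of the blanks here is only one possible choice.

```python
def pad_input(symbols, blank="#"):
    """Interleave blanks: [i1, i2, i3] becomes [i1, '#', i2, '#', i3]."""
    symbols = list(symbols)
    # One blank between each pair of consecutive input symbols.
    padded = [blank] * (2 * len(symbols) - 1) if symbols else []
    padded[::2] = symbols
    return padded

padded = pad_input(["i1", "i2", "i3", "i4"])
```

On the infinite network the padded input would still be followed by further blanks on the right, as for any finite input.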
Note that a slightly simpler proof of the simulation of $CG$ on CA would be possible had we not wanted to preserve real time. It is slightly simpler to show that CA $(1, 2)$-simulates $C'$ than to show the $(2, 1)$-simulation; however, $(i, j)$-simulation preserves time only if $j = 1$.
We have just proved the following result for real-time CA (language recognizers). This result can be easily generalized to $n$-dimensional cellular automata.
5.5. Example. We now briefly discuss two more results about one-way (unidirectional) cellular automata. A one-way cellular automaton allows communication between two nodes in only one direction. Here we consider a two-dimensional triangular grid in which the communication is in three directions. Such a one-way cellular automaton $M_n^a$ is shown in Fig. 27 for $n = 3$. We are interested in the output produced at the node with outdegree zero (lower left corner in Fig. 27). Similarly, we have networks $M_n^b$, $M_n^c$, and $M_n^d$, shown for $n = 3$ in Fig. 27.
[...] As networks, that is, the input coming to node $(i, j)$ in one network comes to node $(i, j)$ in any other network. The following result from [4] can be generalized to higher dimensions: real-time bidirectional cellular automata (working in time $n$) are equivalent to one-way (unidirectional) CA working in time $2n$. One possible generalization is that the functions computable on network $M_n^c$ (Fig. 27) in real time are the same as those computable on network $M_n^d$ in time $2n$.
5.6. Example. It was shown in [9] that regular sets can be recognized by a parallel algorithm on a unidirectional binary tree network. To recognize a string of length $n$ we need any tree (not necessarily balanced) with at least $n$ nodes; thus, to accept arbitrarily long input, this algorithm requires a potentially infinite tree. If the tree is (almost) balanced, the recognition takes logarithmic time. A network based on such a tree is shown in Fig. 28(b). Here, the initial function $\tau$ is defined (as zero) at all the input nodes. In [8] it has been shown that we can use the finite tree-like network $M_k$ illustrated in Fig. 28(a) for depth $k = 2$. Here the initial function $\tau$ is defined (as zero) for the processors of the top row. On this network we recognize strings of length $n$, $n \le 2^k$, in time $k$. Longer strings are cut into pieces of length $2^k$ and recognized in time $n/2^k + k$. The correctness of the modified algorithm for $M_k$ follows easily from the fact that the unrolling of $M_k$ is the infinite binary tree shown in Fig. 28(b).
5.7. Example. In Example 5.2 we considered a 'quarter plane' iterative array. Here we show that it is not important what regular infinite section of the plane is used. We show this for the case of the 'full plane' iterative array $A$ and the 'quarter plane' iterative array $B$ (see Fig. 29).
The idea of showing two systolic systems equivalent by folding was introduced in [3]. Clearly, the result of folding $A$ twice is $B$. To illustrate that folding is a special case of our technique, we show that the two networks in Fig. 29 [...] specific result, namely that $T_2$ can $(3, 1)$-simulate $T_4$; the generalization is straightforward. Fig. 30(a) (ignoring the boxes) shows $T_2$. The boxes around nodes represent a partition of the nodes of this network. Corollary 4.2 shows that $T_4$ of Fig. 30(b) can be $(3, 1)$-simulated on $T_2$.
We conclude this paper by showing the relation between various types of 'shuffle' networks. The perfect shuffle network was introduced in [27]. It is a difficult network to lay out, but many algorithms can be efficiently implemented on it.
