Synthesis, structure and power of systolic computations  by Gruska, Jozef
Theoretical Computer Science 71 (1990) 47-77
North-Holland
SYNTHESIS, STRUCTURE AND POWER OF SYSTOLIC
COMPUTATIONS
Jozef GRUSKA 
Institute of Technical Cybernetics, Slovak Academy of Sciences, Dúbravská 9, 842 37 Bratislava,
Czechoslovakia
Abstract. A variety of problems related to systolic architectures, systems, models and computations
are discussed. The emphases are on theoretical problems of a broader interest. Main motivations
and interesting/important applications are also presented. The first part is devoted to problems
related to synthesis, transformations and simulations of systolic systems and architectures. In the
second part, the power and structure of tree and linear array computations are studied in detail.
The goal is to survey main research directions, problems, methods and techniques in not too
formal a way.
1. Introduction
Systolic architecture has been one of the most attractive ideas in computer 
architecture so far. It is very appealing from the design point of view, because it is
based on repetitions of simple processors, and on regularity, modularity and
simplicity of interconnections. Moreover, many systolic systems can be designed using
only very few types of processors, and also the repeated use of each input data in 
systolic systems significantly minimizes interaction with the memory of the host 
computing environment. All these are also reasons why systolic architecture is so 
suitable to make full use of the great potential of VLSI technology. Due to its 
simplicity, modularity and repeatability, systolic architecture also offers transparent, 
understandable and manageable, but still quite powerful, parallelism. The intricacy of
data and communication flow in many systolic systems offers the magic so useful
and necessary to attract larger groups of designers to the idea. At the same time, a
large variety of interesting theoretical problems of fundamental importance, and 
broader implications, have arisen in connection with the synthesis, analysis, and 
implementation of systolic and related systems and computations. 
Systolic systems can be seen as an interesting and useful modification (sim- 
plification mostly), of the cellular automata concept of von Neumann, with more 
emphasis on regularity and transparency of data and computation flow. New models 
and new problems have been motivated mainly by advances in technology. Similarity 
with cellular automata has immediately brought into use a whole bunch of theoretical 
techniques to study systolic architectures, systems and computations. On the other
hand, systolic architecture is a natural modification (generalization mostly) of
0304-3975/90/$03.50 © 1990, Elsevier Science Publishers B.V. (North-Holland)
pipelining, the architectural concept that has contributed so much in recent years
to the significant increase in performance of modern computers. The creation of a
quite general model of a practically important concept has been once again an
important tool leading to a new quality, this time in computer architecture and
supercomputing. This has, in turn, brought a new set of powerful formal methods 
into computer architecture. 
Two of the most powerful recent concepts in computer architecture, systolic 
systems by H. T. Kung and RISC architecture by J. Cocke, have quite similar genesis. 
Both authors arrived at their ideas, which are of major importance for computer
architecture, by doing mainly theoretical research in quite remote areas of computer 
science and having in common mainly a long term search for improving efficiency. 
The first and nowadays seminal systolic systems by Kung and Leiserson, for
matrix multiplication and LR-decomposition, appeared in 1978. Since then various
systolic systems have been designed. Several systolic system implementation patents 
have been taken in various countries. 
Most of the systolic systems have been designed using ad hoc methods and in 
many cases very similar systolic systems have been designed for seemingly quite 
different problems. This has naturally led to intensive research oriented to developing
systematic and sufficiently automatizable methods to synthesise systolic systems
from high level specifications. Significant progress has been achieved in this direction
and several systolic system design methodologies are also discussed in this paper.
The method due to Ibarra and his coworkers [49, 48] deserves special attention.
In spite of being theory oriented and inspired, this method is very powerful and
allows the design of systolic systems of various architectures to be reduced to the
design of sequential programs for various simple sequential machines, a long time
desire in the area of parallel computations. In addition, various theoretically justified
systolic system transformation techniques and systolic architecture simulation
techniques have been developed. They also contribute significantly to the improvement
and efficiency of systolic system design methodologies [71, 8].
The path from systolic systems, represented by attractive but still very high level 
abstract networks, to real and efficient implementations is long, indirect, and far 
from easy. In order to obtain really useful VLSI implementations, various tradeoffs 
and design modifications have to be considered [34, 29, 68, 85], and really very
high density integrated circuits are required. All this can be seen in perhaps the
main project in this area, the Warp machine at CMU. This machine consists of a
linear array of programmable processors that have been designed in such a way 
that the whole array can implement efficiently various systolic systems especially 
for vision and signal processing computations. This project indicates especially well 
how long and complicated is the way from a simple though powerful idea to a really 
successful machine. Implementation problems have also led to some interesting
theoretical problems. 
VLSI implementations are not the only way to make use of the advantages of 
systolic systems. For several reasons they are also very suitable for effective simula- 
tions on current multiprocessor systems, for example multitransputer systems. This 
has led to attempts to develop systolic programming methodologies, programming 
languages and environments [32]. 
There have also been numerous attempts to develop and study various abstract 
models of systolic systems and computations and this will be our main interest in 
this paper. The models are of various degrees of abstraction and often far more 
general than a naive view of systolic systems, deduced from main examples, would
suggest. The range of problems to be investigated for a particular model is partly
determined by the generality and the abstraction of the model, but mainly by our
desire to improve our insight and knowledge concerning synthesis, behavior and 
analysis of systolic architectures, systems and computations and to improve our 
methodology for dealing with them. 
The concept of topological transformation, as a formalization of time-space
transformations, is a very useful tool to study various problems in the area of systolic
computations, especially in the area of simulations of systolic architectures and
transformations of systolic systems. Various results useful for the design and
synthesis of systolic systems have also been obtained using this tool. Of special
importance are results concerning the removal of broadcasting and the replacement
of two-way communications by one-way, with minimal time overhead. Two special
transformations, retiming and slowdown, are nowadays powerful tools for design
of networks. Simulation results presented here concentrate on simulation of one
parallel architecture on another one, on emulation of large networks on smaller
ones and on a “universal” one.
Parallel automata of several types, mainly array, trellis and tree-like networks
of finite automata (often memoryless), have been the main theoretical models used
to study various basic problems concerning systolic architectures, systems and
computations: the power of various interconnections, the power of one-way and
two-way communications, the power of different communication modes with the
environment (input/output) and the power of various types of nonhomogeneity.
Some of these networks (two-dimensional arrays, trellises and trees) and some
types of inputs are shown in Fig. 1.
[Fig. 1. Two-dimensional arrays, trellises and trees, with some types of inputs.]
Time-space transformations of linear array computations led to the study of other
and even more abstract models of systolic computations, two-dimensional infinite
words of special types, that are also of a more general interest. Main attention has
been devoted so far to the study of infinite two-dimensional words that form so
called generalized Pascal triangles over arbitrary algebras of the signature (1, 1, 2).
They have a rich structure that has allowed various interesting results to be derived 
concerning linear array computations and their relations with the properties of the 
underlying algebras. 
Cellular automata in general and linear arrays in particular have also been 
considered as models suitable to study behaviour of complex systems and also as 
new models of the physical world [84, 90, 92]. Models and results motivated
originally by our desire to master parallel computations may also be very useful to 
obtain deeper results in these new and fundamental investigations. 
There are several other surveys on theoretical issues related to systolic
computations [37, 43, 87].
2. Systolic systems
There have been several attempts to find a formal definition of systolic systems, 
for example in [26]. The most fruitful so far is the one given in [70], though at first 
sight it does not look that way because it is very general. A slight modification of 
this definition and related theoretical developments and applications are discussed 
in this section. 
Definition 2.1. A semisystolic system $S = (V, E, \delta, D, \pi)$ is an oriented multigraph,
with the set of nodes $V$, the set of edges $E$, the data domain $D$, the processor
mapping $\pi$ that associates with each node $v \in V$ of the indegree $p > 0$ a function
$\pi(v) : D^p \to D$, and the delay mapping $\delta : E \to \mathbb{N}$ (the set of nonnegative integers)
such that if $\delta$ is extended in a natural way to map also paths of $S$ into $\mathbb{N}$, then
$\delta(p) > 0$ for any cycle $p$. ($\delta(e)$ represents the number of the delay-one registers on
the edge $e$, and the requirement $\delta(p) > 0$ excludes the existence of networks with
“unlimited rippling”.) The nodes of $V$ of the indegree (outdegree) 0 are called the
input (output) nodes or processors of $S$. If $\delta(e) > 0$ for any $e \in E$, then $S$ is called
a systolic system.
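As an illustration only, the delay conditions of Definition 2.1 can be sketched in Python. The representation below (edges as `(u, v, delay)` triples) and all function names are assumptions made for this sketch and do not come from [70]; the data domain and the processor mapping are left out because only the conditions on delays are checked.

```python
from collections import defaultdict

def has_zero_delay_cycle(nodes, edges):
    """edges: list of (u, v, delay) triples with nonnegative integer delays.

    A cycle has total delay 0 exactly when every edge on it has delay 0,
    so it is enough to search for a cycle in the zero-delay subgraph.
    """
    zero = defaultdict(list)
    for u, v, d in edges:
        if d == 0:
            zero[u].append(v)

    WHITE, GREY, BLACK = 0, 1, 2
    colour = {v: WHITE for v in nodes}

    def dfs(u):
        colour[u] = GREY
        for w in zero[u]:
            if colour[w] == GREY or (colour[w] == WHITE and dfs(w)):
                return True
        colour[u] = BLACK
        return False

    return any(colour[v] == WHITE and dfs(v) for v in nodes)

def is_semisystolic(nodes, edges):
    """All delays are nonnegative and every cycle carries at least one register."""
    return all(d >= 0 for _, _, d in edges) and not has_zero_delay_cycle(nodes, edges)

def is_systolic(nodes, edges):
    """Semisystolic, and in addition every edge carries at least one register."""
    return is_semisystolic(nodes, edges) and all(d > 0 for _, _, d in edges)
```

For example, a network with edges `in -> a` (0 registers), `a -> b` (1), `b -> a` (0) and `b -> out` (0) passes `is_semisystolic` but not `is_systolic`; removing the single register on `a -> b` creates a delay-free cycle and the network is no longer semisystolic.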
An informal description of the behaviour of a semisystolic system S is similar to 
that of a synchronous network working in discrete time. At each moment of the 
discrete time, each node and also each register outputs either a value from $D$ or a
don’t care element, say ‘#’. In the case of the register it is the same value as it
was at the previous time step on the output of its predecessor (i.e. of the preceding
register on the same edge, if there is one, or of the outgoing node of that edge). In
the case of the input node it is a value submitted from the environment. Finally, in
the case of a node $v$ of the indegree $p > 0$, the corresponding processor outputs the
value of the function $\pi(v)$ applied to the arguments produced by its predecessors
(registers or nodes) on all ingoing edges at the previous time steps. Moreover, it is
assumed that all registers have an “initial value” to start a computation.
It has turned out that in order to be able to deal more precisely with semisystolic 
systems, especially to formulate precisely the impacts of various transformations, 
and to develop synthesis techniques, a more rigorous treatment of their semantics 
is needed.
One way of defining semantics of a semisystolic system $S$ is to associate a “time
function” with each node of $S$ [S]. By that we mean an arbitrary partial function
from $\mathbb{Z}$ (the set of all integers) to $D$, that is defined only for finitely many negative
integers. (The intended interpretation is that the value of the time function associated
with a node $v$, and an argument $t$, is exactly the value produced by the processor $v$
at the time $t$.) This can be done as follows.
Let $V_I$ be the set of input nodes of $S$. Let us first assign a time function $f_v$ to
each input node $v \in V_I$. The semantics of $S$, with respect to the given input time
functions, is then a mapping $\Phi$ of nodes of $V$ into time functions, such that $\Phi(v) = f_v$
for any $v \in V_I$, and for all other nodes $v$, $\Phi(v)$ is the minimal fix-point solution of
the semantic equations of $S$. These semantic equations, one for each non-input node
of $S$, relate, in a natural way, through functions associated with node-processors,
the time function of each node with time functions of its predecessors on all ingoing
edges.
Two types of transformations on semisystolic systems are of special importance:
slowdown transformations and retiming transformations.
If $k \in \mathbb{N}^+ = \mathbb{N} - \{0\}$, then the $k$-slowdown transformation of a semisystolic system
$S = (V, E, \delta, D, \pi)$ results in the semisystolic system $S^{(k)} = (V, E, \delta^{(k)}, D, \pi)$, where
$\delta^{(k)}(e) = k\delta(e)$, for any $e \in E$. A retiming transformation of $S$ is given by any mapping
$r : V \to \mathbb{Z}$ such that for any edge $e : u \to v$, $\delta(e) + r(v) - r(u) \geq 0$. The resulting
semisystolic system is $S_r = (V, E, \delta_r, D, \pi)$, where $\delta_r(e) = \delta(e) + r(v) - r(u)$ for any
edge $e : u \to v$.
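The two transformations act only on the delay mapping, which the following Python sketch makes explicit. The edge-list representation and the three-node ring used as an example are assumptions chosen for illustration, not taken from the paper. Since a retiming leaves the total delay of every cycle unchanged, a ring of three edges carrying one register in total cannot be retimed into a systolic system, whereas its 3-slowdown can.

```python
def slowdown(edges, k):
    """k-slowdown: multiply the number of registers on every edge by k."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return [(u, v, k * d) for u, v, d in edges]

def retime(edges, r):
    """Retiming by r (a dict node -> integer): delta_r(e) = delta(e) + r(v) - r(u)."""
    retimed = []
    for u, v, d in edges:
        new_d = d + r[v] - r[u]
        if new_d < 0:
            raise ValueError(f"r is not a retiming: edge {u}->{v} would get {new_d} registers")
        retimed.append((u, v, new_d))
    return retimed

# Hypothetical example: a ring a -> b -> c -> a with one register in total.
RING = [("a", "b", 1), ("b", "c", 0), ("c", "a", 0)]

# After a 3-slowdown the ring carries three registers, and this retiming
# spreads them so that every edge gets exactly one.
SYSTOLIC_RING = retime(slowdown(RING, 3), {"a": 0, "b": -2, "c": -1})
```

Here `SYSTOLIC_RING` has one register on each of the three edges, while every legal retiming of `RING` itself still has a register count of 1 around the cycle.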
In order to define the effect of the slowdown and the retiming transformations
we need to introduce two sets of operators on time functions, parametrized by
positive integers $k$:
(1) delay operators $\Theta^k$ such that $(\Theta^k f)(t) = f(t - k)$ for any $t$;
(2) spread operators $\Omega^k$ such that $(\Omega^k f)(kt) = f(t)$ for any $t$, and undefined
otherwise.
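A small Python sketch may help to fix the meaning of the two operators. It assumes, purely for illustration, that a partial time function is stored as a dictionary from time points to values, so that "undefined" simply means "key absent".

```python
def delay(f, k):
    """Delay operator: (Theta^k f)(t) = f(t - k), i.e. every value appears k steps later."""
    return {t + k: value for t, value in f.items()}

def spread(f, k):
    """Spread operator: (Omega^k f)(k t) = f(t); times not divisible by k stay undefined."""
    return {k * t: value for t, value in f.items()}

# A hypothetical input stream arriving at times 0, 1, 2.
f = {0: "x0", 1: "x1", 2: "x2"}
delayed = delay(f, 2)    # defined at times 2, 3, 4
spaced = spread(f, 3)    # defined at times 0, 3, 6 only
```

The spread operator corresponds to feeding a k-slowed system with one genuine input every k-th step, and the delay operator to shifting the activity of a retimed node in time.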
Theorem 2.2. Let $S = (V, E, \delta, D, \pi)$ be a semisystolic system. Let $\Phi_I = \{f_v \mid v \in V_I\}$
be a set of time functions associated with input nodes of $S$. Let $\Phi = \{f_v \mid v \in V\}$ be the
least fix-point semantics of $S$ with respect to $\Phi_I$.
(1) If $k > 0$, then $\Phi^{(k)} = \{\Omega^k f_v \mid v \in V\}$ is the least fix-point semantics of the
$k$-slowdown semisystolic system $S^{(k)}$ with respect to the input time functions $\Omega^k f_v$, for
$v \in V_I$.
(2) If $r$ is a retiming transformation of $S$, then $\Phi_r = \{\Theta^{r(v)} f_v \mid v \in V\}$ is the least
fix-point semantics of the retimed semisystolic system $S_r$, with respect to the input time
functions $\Theta^{r(v)} f_v$, for $v \in V_I$.
It is often easier to design a semisystolic system for solving a problem than a
systolic one. The main reason for that is that in semisystolic systems one can use
broadcasting to transmit data without any delay, wherever they are needed. Because
of that, semisystolic systems are often more transparent. On the other hand, the
existence of delay-free interconnections requires an increase in the length of the
clock cycle. It is therefore very desirable to be able to transform semisystolic systems
into systolic ones with the same interconnection structure. Necessary and sufficient
conditions for the existence of retiming transformations capable of doing that are
well known [71]. Removing broadcasting is the main use of retiming
transformations. For some semisystolic systems $S$ there is no retiming transformation to
obtain a systolic system from $S$, but if at first a proper slowdown transformation is
applied to $S$, then the resulting semisystolic system can be retimed to obtain a
systolic one. This is the main use of slowdown transformations. In this way quite
a few well known and tricky systolic systems can be easily obtained from very
natural networks.
Example 2.3. Let $A = (a_{ij})$ be an $n \times n$ matrix such that its LU-decomposition into
a lower triangular matrix $L = (l_{ij})$ with 1s in the main diagonal, and into an upper
triangular matrix $U = (u_{ij})$ can be computed by Gaussian elimination without pivoting.
The elements of matrices $L$ and $U$ can be computed according to the following
recurrences:
$$a_{ij}^{(1)} = a_{ij}, \qquad a_{ij}^{(k+1)} = a_{ij}^{(k)} - l_{ik} u_{kj}.$$
It is now easy to see that LU-decomposition of $A$, for the case $n = 4$, can be computed
by the semisystolic system and the flow of data as shown in Fig. 2 [62]. Computations
performed by four types of processors are shown in Fig. 2(b). Each diagonal has
one register, vertical and horizontal edges have none.
[Fig. 2. (a) Semisystolic system and data flow for the LU-decomposition of a 4 x 4 matrix; (b) computations performed by the four types of processors.]
There is no way to retime this semisystolic system to obtain a systolic one. On 
the other hand if at first the 3-slowdown transformation is applied, then it is easy 
to retime the resulting system in such a way that each edge has exactly one register.
After some cosmetic changes one then quite easily obtains the well known systolic 
system of Kung and Leiserson [54] with quite tricky flow of computation. 
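The recurrences behind Example 2.3 can be checked with a short sequential Python sketch. It uses the usual textbook form of Gaussian elimination without pivoting, in which $u_{kj} = a^{(k)}_{kj}$ for $k \leq j$ and $l_{ik} = a^{(k)}_{ik} / u_{kk}$ for $i > k$; these two case definitions are an assumption of the sketch, since only the update rule for $a^{(k)}_{ij}$ is spelled out above. The code says nothing about the systolic data flow of Fig. 2, it only evaluates the recurrences.

```python
from fractions import Fraction

def lu_by_recurrences(A):
    """Return (L, U) for a square matrix A, assuming no pivoting is needed."""
    n = len(A)
    a = [[Fraction(x) for x in row] for row in A]      # a^(1) = A
    L = [[Fraction(0)] * n for _ in range(n)]
    U = [[Fraction(0)] * n for _ in range(n)]
    for k in range(n):
        for j in range(k, n):
            U[k][j] = a[k][j]                          # u_kj = a^(k)_kj for k <= j
        if U[k][k] == 0:
            raise ZeroDivisionError("zero pivot: elimination without pivoting fails")
        L[k][k] = Fraction(1)
        for i in range(k + 1, n):
            L[i][k] = a[i][k] / U[k][k]                # l_ik = a^(k)_ik / u_kk for i > k
        # a^(k+1)_ij = a^(k)_ij - l_ik * u_kj
        a = [[a[i][j] - L[i][k] * U[k][j] for j in range(n)] for i in range(n)]
    return L, U

def matmul(X, Y):
    """Plain matrix product, used only to check that L * U reproduces A."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
```

Exact rational arithmetic (`Fraction`) is used so that the check `matmul(L, U) == A` is not disturbed by rounding.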
Retiming transformations can also be used to optimize synchronous networks
with respect to various cost criteria. For example, to minimize the clock cycle for
the case when time needed to perform processing in each node is given, or to
minimize the total number of registers. These optimization problems lead to various
linear programming problems for which efficient algorithms are known [71].
Retiming transformations can also be used to deal with various problems concerning two
level pipelining and fault-tolerance [52, 53].
Based on the model of semisystolic systems discussed in this section [70], the
concept of a systolic flowchart scheme has been introduced in [2] to study syntactical
properties of systolic systems. Equational axiomatization of such systolic flowchart 
schemes has been presented in [2]. Systolic flowchart schemes and their interpreta- 
tions have been considered to be an algebraic framework useful for the study of 
systolic systems [3]. 
In [39], quite a different type of transformations of systolic networks is considered. 
They preserve timing of systolic computation flows but they may change topology 
of the underlying network. These transformations can also be used as a quite
powerful systolic system design methodology. 
3. Systolic system design methodologies
Transformations presented in the previous section represent powerful systolic 
system design tools. On a different level of abstraction, several, though quite 
restricted, design methodologies can be abstracted from the proofs of theorems 
dealing with the power of various classes of systolic automata (see Sections 5 and 
6). These results allow, for example, the automatic design of a systolic trellis
automaton to recognize any language being an intersection of a finite set of linear 
languages, directly from linear grammars that generate these languages. 
Starting with [88] a variety of more or less formal techniques for systematic design 
of systolic systems has appeared [S, 15,23,28,42,69,72-75,77,80]. Three of them 
are discussed now in more detail. 
Ibarra and his co-workers have developed several characterizations of networks 
(of various architectures) of finite automata in terms of variants of sequential Turing 
machines. These characterizations are a base for a powerful design methodology. 
One characterization is in terms of so-called full scan Turing machines (STM). 
There are actually several such machines, each for a different class of finite automata 
networks. They differ in the initial configurations, in positions where they read the
input symbols, and how they react to endmarkers [49, 37]. We shall consider here
only two of them, STM$^d$ and STM$^v$. A STM $M$ is a one-tape, one-head Turing
machine with the external input (Fig. 3) to receive input words $a_1 \ldots a_n \$$, where
$a_i \in \Sigma$, and \$ is not in the input alphabet $\Sigma$. $M$ begins a computation in the initial
state $q_0$, and keeps performing right-to-left and left-to-right sweeps. An input is
accepted if $M$ writes an accepting symbol (from a set $F \subseteq \Gamma$, $\Gamma$ the tape alphabet). Figure
3b shows the initial configurations for STM$^d$ and STM$^v$ and complete sweeps. “R”
shows where the reading of the external input takes place. “w” above a square
emphasizes that some writing into that square occurs and “w(\$)” above a square
indicates that “\$” is written into that square as soon as it is read from the external
input. During their left-to-right sweeps STM always stay in a special state $q_0$, do
not change tape contents and keep moving right until they meet \$ or the blank
symbol $\lambda$. During right-to-left sweeps STM may change tape contents and states
but they enter the state $q_0$ if and only if they come to the square with \$ or $\lambda$; then
always a left-to-right sweep starts. The sweep complexity of $M$ on an input is the least
number of complete sweeps needed to accept the input.
The following theorem [49, 50] relates STM$^d$ and STM$^v$ with systolic trellis
automata with diagonal and vertical acceptance [37] (Fig. 4). These automata have
the form of an infinite network of identical memoryless processors, with the transition
function $g$ such that $g(\lambda, \lambda) = \lambda$. They actually represent time-space transformation
of linear array computations, and therefore the following theorem can easily be
reformulated in such terms.
[Fig. 4. Systolic trellis automata with diagonal and vertical acceptance.]
Theorem 3.1. (1) A language $L$ is accepted by a systolic trellis automaton with vertical
acceptance in time $2T(n) - 1$, if and only if it is accepted by a STM$^v$ with the sweep
complexity $T(n)$.
(2) A language $L$ is accepted by a systolic trellis automaton with diagonal acceptance
in time $T(n)$, if and only if it is accepted by a STM$^d$ with the sweep complexity $T(n)$.
While the previous theorem characterizes systolic automata (or linear arrays) as
acceptors, the following one characterizes one-way two-dimensional cellular
automata (O2DCA) as transducers (Fig. 5) in terms of so-called two-dimensional
sequential machines (2DSM) (Fig. 6) [43, 48].
A 2DSM $M$ with the input of $n$ symbols operates on a two-dimensional tape of
$n \times n$ squares. Initially all squares contain the symbol $\lambda$. Similarly as for STM,
2DSM also operate in sweeps. A sweep begins with $M$ in a distinguished state $q_0$,
and with the head on the leftmost square of the topmost row. $M$ then reads an
input symbol and moves through squares of the first row, from left to right, rewriting
symbols and changing states (except into $q_0$) as a Turing machine does. After the
rightmost square is visited, the head is reset to the first square of the next row in
the state $q_0$. This action is repeated for all rows. At each step, new symbol and new
state depend on the previous state, on the old symbol in the square being scanned
and, if there is any, also on the symbol stored in the square just above the scanned
square. After scanning the last square of the last row an output symbol is produced,
and $M$ is reset to the first square of the first row to start the next sweep. $M$ is said
to have sweep complexity $S(n)$ on an input $a_1 \ldots a_n \$$ if it outputs \$ after at most
$S(n)$ sweeps.
[Fig. 6. Two-dimensional sequential machine (2DSM).]
Theorem 3.2 (Ibarra [43]). Let $S(n) \geq n + 1$. A 2DSM with sweep complexity $S(n)$
can be simulated by a O2DCA in time $S(n) + 2n - 2$, and conversely.
There are many other characterizations of array computations in terms of
sequential machines. They have been used to derive new and efficient systolic systems for
those tasks for which no such systolic systems are yet known. (No other design
methodology has been so successful.) For example, the last theorem has been used
to show that recognition and parsing of context-free languages can be done on
O2DCA in linear time [16]. These characterizations have also been used to obtain
new theoretical results. For example a generalization of Theorem 3.2 to higher
dimension has been used to show [43] that $(k+1)$-head nondeterministic finite
automata can be simulated by a O2DCA in time $(k + 1)n + k - 1$.
The second important class of systolic system design methodologies consists of 
various data dependence graph manipulating strategies. The main idea is to analyse 
a data dependence graph and then to transform it into the equivalent one that 
satisfies certain constraints that are natural abstractions of systolic and VLSI require- 
ments. Methodologies of this type [36,77,79,80,83] have been especially successful 
for the design of systolic systems for matrix computations because in such cases
one can naturally associate operations of computations with integer points of three-
or four-dimensional space (see Fig. 7 for the data dependence graph for
multiplication of matrices of degree 2, i.e. $c_{ij} = \sum_{k=1}^{2} a_{ik} b_{kj}$).
In order to obtain more easily manipulable data dependence graphs, with only
local interconnections, a slight modification of basic algorithms is usually useful;
this results mainly in the introduction of some forms of pipelining of input data.
[Fig. 7. Data dependence graph for multiplication of matrices of degree 2.]
For matrix multiplication it has, for example, the form:
$$c(i, j, k) = c(i, j, k-1) + a(i, j-1, k)\, b(i-1, j, k),$$
$$a(i, j, k) = a(i, j-1, k),$$
$$b(i, j, k) = b(i-1, j, k).$$
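The pipelined recurrences can be evaluated directly, which gives a quick check that they still compute the matrix product. The boundary conditions used below, $a(i, 0, k) = a_{ik}$, $b(0, j, k) = b_{kj}$, $c(i, j, 0) = 0$, with the result read off as $c_{ij} = c(i, j, n)$, are assumptions of this sketch; the text does not state them.

```python
def matmul_pipelined(A, B):
    """Evaluate the pipelined recurrences on the index cube 1..n in each coordinate."""
    n = len(A)
    idx = range(1, n + 1)
    a, b, c = {}, {}, {}
    # Assumed boundary values: inputs enter on the faces j = 0 and i = 0.
    for i in idx:
        for k in idx:
            a[i, 0, k] = A[i - 1][k - 1]
    for j in idx:
        for k in idx:
            b[0, j, k] = B[k - 1][j - 1]
    for i in idx:
        for j in idx:
            c[i, j, 0] = 0
    # Lexicographic order respects all three dependences, each of which
    # reaches only to a neighbouring point of the cube.
    for i in idx:
        for j in idx:
            for k in idx:
                a[i, j, k] = a[i, j - 1, k]      # a is passed along the j direction
                b[i, j, k] = b[i - 1, j, k]      # b is passed along the i direction
                c[i, j, k] = c[i, j, k - 1] + a[i, j - 1, k] * b[i - 1, j, k]
    return [[c[i, j, n] for j in idx] for i in idx]
```

Each value is used only at a neighbouring point, which is exactly the locality that the pipelined form is meant to provide.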
Various geometric transformations, e.g. affine transformations of data dependence 
graphs [23], followed by their projection into a plane or a line, result in a variety 
of two- or one-dimensional systems.
The main problem is how to find, in a sufficiently systematic way, suitable 
transformations and projections. There have been attempts to solve this problem 
for computation tasks specified by some restricted specification languages. Perhaps
the best known is the method developed by Quinton [77]. It can be applied to the 
design of systolic systems for computations that can be expressed as a set of uniform 
recurrent equations
$$U_i(z) = f(U_1(z - \theta_1), \ldots, U_p(z - \theta_p)) \qquad (1)$$
over a convex set $D$ of integer coordinates of the $n$-dimensional space; $f$ is a $p$-nary
function, and $\theta_1, \ldots, \theta_p$ are “dependence vectors” from $\mathbb{Z}^n$.
Quinton’s method consists of two steps:
(i) to find a timing function $t : D \to \mathbb{N}$, i.e. a schedule of computation that is
compatible with dependences resulting from the equations (1);
(ii) to find an allocation function $a : \mathbb{Z}^n \to \mathbb{Z}^{n-1}$ that maps $D$ into a finite set of
integer points, that represent positions of processors of a systolic system, in such a
way that concurrent computations are mapped into different processors and resulting
interconnections of processors, as well as data flows, are sufficiently regular.
In [77] necessary and sufficient conditions are given for a quasi-affine timing
function
$$t(z) = \lfloor \pi^T z - \alpha \rfloor, \quad \pi \in \mathbb{Q}^n,\ \alpha \in \mathbb{Q} \text{ (set of rationals)} \qquad (2)$$
to satisfy all above mentioned requirements. These conditions allow $\pi$ and $\alpha$ to be
chosen.
Once $t$ is fixed, the task is to find an allocation function $a$ such that $a(D)$ is
finite, and $a(x) = a(y) \Rightarrow t(x) \neq t(y)$ if $x, y$ are distinct points in $D$. In [77] it is shown how to find
a quasilinear allocation function. This function actually represents a projection of
$D$ along a properly chosen vector, a ray for U.
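The two steps can be tried out on a small example in Python. Everything concrete below is an assumption made for illustration: the domain is the cube of a 3 x 3 matrix multiplication, the dependence vectors are those of its pipelined form, the timing vector is (1, 1, 1) with offset 0, and the condition that the timing vector has scalar product at least 1 with every dependence vector is used as a simple sufficient test. The exact conditions of [77] are not reproduced here.

```python
from itertools import product
from math import floor

def dot(x, y):
    return sum(p * q for p, q in zip(x, y))

def timing(pi, alpha, z):
    """Quasi-affine timing t(z) = floor(pi . z - alpha)."""
    return floor(dot(pi, z) - alpha)

def respects_dependences(pi, deps):
    """Sufficient test: the point z - theta is scheduled at least one step before z."""
    return all(dot(pi, theta) >= 1 for theta in deps)

def conflict_free(domain, pi, alpha, alloc):
    """True if no processor is asked to perform two computations at the same time."""
    seen = set()
    for z in domain:
        slot = (alloc(z), timing(pi, alpha, z))
        if slot in seen:
            return False
        seen.add(slot)
    return True

# Assumed example data.
DOMAIN = list(product(range(1, 4), repeat=3))
DEPS = [(0, 0, 1), (0, 1, 0), (1, 0, 0)]      # dependences of c, a and b
PI, ALPHA = (1, 1, 1), 0

def along_k(z):
    """Projection along (0, 0, 1): one processor per entry c_ij."""
    return (z[0], z[1])

def along_diagonal(z):
    """Projection along (1, 1, 1)."""
    return (z[0] - z[2], z[1] - z[2])

def along_antidiagonal(z):
    """Projection along (1, -1, 0), a direction lying in the timing hyperplane."""
    return (z[0] + z[1], z[2])
```

With these choices the first two allocations are conflict-free, while the third is not: the points (1, 2, k) and (2, 1, k) are scheduled at the same time and projected onto the same processor.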
There have been many modifications and generalizations of this approach [36,
79, 80, 83]. Interactive software systems for the design of systolic systems have also
been built on this base (system DIASTOL in [28] and system S$^4$ in [83]). Of special
interest is the approach developed by Sedukhin [80-83]. His goal has been to develop
a methodology for finding all (in a reasonable sense) systolic systems for a given
computational problem specified in some more general specification language. It is
the language of recurrence equations with linear dependences
where again $p$ and $\theta_1, \ldots, \theta_k$ are vectors from $\mathbb{Z}^n$ and
$$p - \theta_j = A_j p + b_j \quad \text{for } j = 1, \ldots, k,$$
where $A_j$ are constant $n \times n$ matrices and $b_j$ are vectors from $\mathbb{Z}^n$. In the case that
the rank of the matrix is $n - 1$, which is the case for practically all known examples,
then there is a method [80] to transform (3) into the pipelined form (1) with constant
dependence vectors.
The vector $\pi$ and the constant $\alpha$ from the timing function in (2) can also be
obtained by solving an appropriate number of equations
$$t(z_j) = \lfloor \pi^T z_j - \alpha \rfloor$$
where $t(z_j)$, for points $z_j$, can be determined from the dependence graph by the
maximal distance of the point $z_j$ from the node representing the start of the
computation process. After the timing function is specified, the next step is to project
the data dependency graph along all directions that are not parallel to the hyperplane
defined by the timing function and thereby all possible systolic architectures are
obtained. For that it is of course important that projections preserve the nearest
neighbour property and for that it is sufficient to consider as projection vectors $\eta$
only vectors with coordinates $-1$, 0, 1. After excluding the null and symmetric
directions of projection vectors, then the number of projections, and therefore of
potentially different systolic systems, is $(3^n - 1)/2$ which gives 13 for the most
common case $n = 3$ and 4 for $n = 2$. The projections of nodes of the data dependence
graph have to be conflict-free, with respect to the timing function, but this is achieved
if the projection vectors are not parallel to the hyperplane defined by the timing
function, i.e. if the scalar product of $\pi$ and $\eta$ is not zero. In [80-83], all systolic systems
for matrix multiplication (13), LU-decomposition (13), the algebraic path problem (9),
and discrete Fourier transformation (4) are derived, including some not previously
known. Software system
S$^4$ (System of Systolic Structure Synthesis) (see [83]) generates the set of possible
transformations at each step of the systolic system synthesis processes and also
provides tools for selecting “the best” systolic system.
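The count $(3^n - 1)/2$ of candidate projection directions is easy to confirm by enumeration. The Python sketch below is an illustration only and is not taken from [80-83]; its one design decision, an assumption of the sketch, is to pick from each symmetric pair $\eta$, $-\eta$ the vector whose first nonzero coordinate is $+1$.

```python
from itertools import product

def projection_directions(n):
    """All vectors with coordinates -1, 0, 1, without the null vector and
    with only one representative of each pair (eta, -eta)."""
    directions = []
    for eta in product((-1, 0, 1), repeat=n):
        if not any(eta):
            continue                       # skip the null vector
        first_nonzero = next(x for x in eta if x != 0)
        if first_nonzero == 1:             # representative of the pair {eta, -eta}
            directions.append(eta)
    return directions
```

`len(projection_directions(3))` gives 13 and `len(projection_directions(2))` gives 4, in agreement with the numbers quoted above.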
Systolic system verification techniques are also an important part of systolic system 
design methodologies. One very natural idea [55] is to make use of the regularity 
of network interconnections and the regularity of data and computation flow. In 
many cases both the position of processors and the positions of data in data streams 
can, at any time moment, be naturally represented by integer points in the
two-dimensional plane. This allows expression of the relation between the data and their
positions at different time-moments by so-called space-time-data equations and these
equations can then be used to show that correct data arrive at the processors at the
correct time and on this basis the correctness of the whole systolic system is shown.
[Fig. 8. Systolic system for multiplication of band matrices; the dotted lines show the coordinate system.]
Example. The well-known systolic system [54] for multiplication of matrices of an
arbitrary degree but with bandwidth 4 is shown in Fig. 8. If the proper coordinate
system is chosen (represented by the dotted lines in Fig. 8), then space-time-data
equations for the movement of elements $a_{ij}$ of the matrix $A$, which relate the indices
$i$, $j$ and the coordinates $(x, y)$ of the position of $a_{ij}$ in time $t$, have the form
$$x - j - i = 0,$$
$$y - i - 2j + 2 - t = 0.$$
Similar space-time-data equations can easily be derived for the movement of the
elements of the matrices $B$ and $C$. Using these equations one can show that whenever
a $c_{ij}$ arrives at a processor, then it meets there the proper elements of matrices $A$
and $B$ to make the computation that is needed.
More formal and semantics theory based development of systolic system design 
and verification framework is given in [38,67,78]. In [67,89], it is shown how to 
develop and prove correctness of some systolic systems in the framework of algebra 
of communicating processes. In [78] trace theory framework is used to discuss and
develop systolic systems in terms of their input/output behaviour. 
4. Simulations
Simulation problems of three types are of great importance for the design of 
parallel networks. 
(1) How to simulate efficiently networks of one parallel architecture on networks
of another parallel architecture. It often happens that it is easier to design an 
algorithm for implementation on a particular architecture (e.g. on two-directional 
cellular rings), than on a slightly different architecture (on one-directional cellular 
rings), networks of which are physically easier to implement. Therefore any technique 
that shows how to simulate systematically and efficiently, networks of one parallel 
architecture on another architecture, represents an important network design tool. 
(2) How to simulate efficiently large networks on smaller networks of the same 
parallel architecture. If one has to design a network for solving a particular problem, 
then it is usually very convenient to choose a network of the size that just matches 
the size of a given problem. This requires arbitrarily large networks to be considered. 
On the other hand, the size of available multiprocessor networks is, in practice, 
either fixed or with a severe upper bound. 
(3) How to simulate, time and/or space efficiently, networks of a given parallel 
architecture on various models of sequential machines. 
Quite a general concept of simulation (of one network on another) has been
defined in [9]. It describes the case where one processor of a network $N_1$ may be
simulated at different time moments by different processors of a network $N_2$ and,
moreover, an edge connection of $N_1$ is simulated by a whole path in $N_2$. In
many cases it is, however, sufficient to use two simpler concepts of simulation:
partial emulation and emulation.
Informally, a network $N_2$ partially emulates a network $N_1$ if to each processor
$P$ of $N_1$ a processor $P'$ in $N_2$ can be associated in such a way that any computation
on $P$ in $N_1$ is simulated by a computation on $P'$ in $N_2$. Similarly, $N_2$ emulates $N_1$
if to each processor $P$ in $N_1$ a processor $p(P)$ in $N_2$ can be associated in such a
way that for each edge $e : P_1 \to P_2$ in $N_1$, any communication along $e$ is simulated
by a communication along an edge from $p(P_1)$ to $p(P_2)$.
An emulation of $N_1$ on $N_2$ is called computationally uniform if the same number
of processors of $N_1$ are mapped into each processor of $N_2$, and also the same
number of edges of $N_1$ are mapped into each edge of $N_2$.
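The definition of a computationally uniform emulation can be checked mechanically on small examples. The following is a minimal sketch in Python, assuming the strict reading in which every edge must be mapped onto an edge (not collapsed into a single processor); the function names and the choice of two-directional rings as test networks are illustrative assumptions, not taken from the cited papers.

```python
from collections import Counter

def ring_edges(n):
    """Directed edges of a two-directional ring with n nodes (both directions)."""
    edges = set()
    for i in range(n):
        edges.add((i, (i + 1) % n))
        edges.add(((i + 1) % n, i))
    return edges

def is_uniform_emulation(n_big, n_small, f):
    """True if f maps every edge of the big ring onto an edge of the small ring,
    with equally many processors per processor and equally many edges per edge."""
    big, small = ring_edges(n_big), ring_edges(n_small)
    images = [(f(u), f(v)) for (u, v) in big]
    if any(e not in small for e in images):
        return False  # some edge is not simulated by an edge
    node_load = Counter(f(u) for u in range(n_big))
    edge_load = Counter(images)
    uniform_nodes = len(node_load) == n_small and len(set(node_load.values())) == 1
    uniform_edges = len(edge_load) == len(small) and len(set(edge_load.values())) == 1
    return uniform_nodes and uniform_edges
```

Under these assumptions the map i -> i mod 4 wraps a ring of 8 processors twice around a ring of 4 and passes the check, whereas i -> i div 2 fails it because the edge between processors 0 and 1 would be collapsed into one processor.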
The concepts of unrolling and of the isomorphism of unrollings are important to
establish simulation results. Informally, the unrolling of a network $N$ with a set of
nodes $V$ is the time-space transformation of the computational process or, in other
words, an infinite data dependence graph with nodes $(v, t)$, where $v \in V$ and $t$
represents time. Isomorphism of unrollings is then the usual graph isomorphism.
Example (Culik and Fris [8]). Simulation between homogeneous networks on two-
directional cellular rings (CR$_n$) (Fig. 9a) and homogeneous one-directional cellular
rings (OCR$_n$) of $n$ processors. Emulation of OCR$_n$ on CR$_n$ is trivial. In order to
obtain a simulation in the opposite direction we proceed as follows. Let $C_n$ be the
network with $n$ nodes, where each node $v_j$, $j = 1, 2, \ldots, n$, is connected with the nodes
$v_j$, $v_{j \oplus 1}$, $v_{j \oplus 2}$ by edges with the delay 1 (where $\oplus$ means addition modulo $n$)
(see Fig. 9b). The unrollings of CR$_n$ and $C_n$ are isomorphic; the isomorphism
is established by the mapping $\delta : (v_i, t) \to (v_{i \oplus t}, t)$. This implies that each
homogeneous network on CR$_n$ is simulated in real time on $C_n$ and vice versa.
Fig. 9. 
Moreover, $C_n$ can clearly be partially emulated on OCR$_n$ in such a way that some
edges of $C_n$ are simulated by paths of length 2 on OCR$_n$.
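The isomorphism of unrollings used in this example can be tested on a finite time window. The sketch below, in Python, builds the dependence edges of both unrollings and applies a shift of node indices by the time coordinate. It assumes nodes numbered 0, ..., n-1 and assumes that in the auxiliary network a node sends to itself and to the next two nodes (so it reads from itself and the two preceding nodes); both conventions are assumptions made for the illustration.

```python
def unrolling_edges(n, offsets, steps):
    """Edges ((u, t), (v, t + 1)) of a finite part of the unrolling:
    node v at time t + 1 reads node u = v + d (mod n) at time t, for d in offsets."""
    return {(((v + d) % n, t), (v, t + 1))
            for t in range(steps) for v in range(n) for d in offsets}

def delta(node, n):
    """The candidate isomorphism: shift the node index by the time coordinate."""
    v, t = node
    return ((v + t) % n, t)

def check_isomorphism(n, steps):
    # Two-directional ring: each node reads its left neighbour, itself, its right neighbour.
    ring = unrolling_edges(n, (-1, 0, 1), steps)
    # Auxiliary network: each node reads itself and the two preceding nodes.
    aux = unrolling_edges(n, (-2, -1, 0), steps)
    mapped = {(delta(a, n), delta(b, n)) for (a, b) in ring}
    return mapped == aux
```

On every finite window the shifted edge set of the ring coincides with the edge set of the auxiliary network, which is the finite counterpart of the isomorphism claimed for the infinite unrollings.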
Similar simulations have been established between two-directional cellular arrays,
one-directional cellular toroids and two-directional cellular toroids [8].
In connection with the study of complex systems the concept of totalistic CA has
been introduced [91, 92] and investigated in various papers. A totalistic CA is a
CA whose states are integers and in which the new state of a processor depends only
on the sum of the old states of the processor and of its neighbours. It has been shown in
[1] that for each CA $C$ there exists a totalistic CA $C'$ which simulates $C$ without
loss of time. This result has been generalized in [21] for cellular automata over
arbitrary graphs.
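The defining property of a totalistic CA is easy to state in code. The following Python sketch performs one synchronous step of a totalistic CA on a ring; the ring topology, the neighbourhood of radius 1 and the parity rule used in the usage note are assumptions chosen for illustration, and the sketch shows only the definition, not the simulation result of [1].

```python
def totalistic_step(cells, rule):
    """One synchronous step of a totalistic CA on a ring.

    cells: list of integer states; rule: mapping from a neighbourhood sum to a new state.
    The new state of a cell depends only on the sum of its own old state
    and the old states of its two neighbours.
    """
    n = len(cells)
    return [rule[cells[i - 1] + cells[i] + cells[(i + 1) % n]] for i in range(n)]

def parity_rule():
    """Hypothetical example rule over states {0, 1}: new state = sum mod 2."""
    return {s: s % 2 for s in range(4)}
```

With the parity rule, a single 1 in the configuration [0, 0, 1, 0, 0] spreads to its two neighbours in one step, giving [0, 1, 1, 1, 0].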
Another interesting problem is to determine all possible computationally uniform
emulations of large networks on smaller ones. It has been shown [4] that the number
of computationally uniform emulations of CR$_n$ on CR$_{n/2}$ (as well as of two-dimensional
cellular toroids of the size $n \times n$ on toroids of the size $\frac{1}{2}n \times \frac{1}{2}n$) is
exponential (at least exponential). On the other hand, there are exactly six computationally
uniform emulations of perfect shuffle networks of $2^n$ nodes on networks of
$2^{n-1}$ nodes.
Another important problem is to find, for important classes of networks, say $C$,
another class of networks, say $C'$, such that on any network from $C'$, every network
from $C$ can be emulated in a computationally uniform, or almost uniform, way. If
such a class $C'$ exists, then a fixed-size multiprocessor system with a network from
$C'$ can be used to emulate almost uniformly any multiprocessor system with a
network from $C$; it is only necessary to sufficiently enlarge the memory of the
processors and the width of the interconnections.
For rectangular arrays, such a class of networks, so-called polymorphic arrays, has
been shown in [35]. They are borderless networks $B_n$ of arrays of the size $S_n = F_n \times L_n$,
where
    $F_1 = 1$, $F_2 = 1$, $F_j = F_{j-1} + F_{j-2}$ for $j > 2$
are Fibonacci numbers and
    $L_1 = 1$, $L_2 = 3$, $L_j = L_{j-1} + L_{j-2}$ for $j > 2$
are Lucas numbers, and a processor of $B_n$ in the position $(i, j)$ is connected with
processors in the nodes
    $((i+1) \bmod F_n, (j+1) \bmod L_n)$, $((i-1) \bmod F_n, (j-1) \bmod L_n)$,
    $((i+1) \bmod F_n, (j-1) \bmod L_n)$, $((i-1) \bmod F_n, (j+1) \bmod L_n)$.
It has been shown in [35] that on any such network $B_n$, any rectangular array $C$
with at most $S_n/\sqrt{5}$ processors can be emulated in such a way that into each
processor of $B_n$ at most one processor of $C$ is mapped and, moreover, an arbitrarily
large rectangular array $C'$ can be emulated by $B_n$ in such a way that the numbers
of processors of $C'$ mapped into the processors of $B_n$ differ by at most $O(\log S_n)$.
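The interconnection pattern of a polymorphic array can be written down directly from the definition. The sketch below, in Python, assumes the standard Lucas convention with first two values 1 and 3, numbers positions from 0, and uses hypothetical function names; it only constructs the four diagonal neighbours of a processor and does not implement the emulations of [35].

```python
def fib(n):
    """n-th Fibonacci number with fib(1) = fib(2) = 1."""
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a

def lucas(n):
    """n-th Lucas number, assuming lucas(1) = 1 and lucas(2) = 3."""
    a, b = 1, 3
    for _ in range(n - 1):
        a, b = b, a + b
    return a

def neighbours(i, j, n):
    """Processors connected with position (i, j) in the borderless array
    of size fib(n) x lucas(n): the four diagonal neighbours, taken modulo the sizes."""
    f, l = fib(n), lucas(n)
    return {((i + di) % f, (j + dj) % l) for di in (1, -1) for dj in (1, -1)}
```

For n = 5 the array has 5 x 11 positions, and the corner position (0, 0) is connected, because of the wrap-around, with (1, 1), (4, 10), (1, 10) and (4, 1).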
The last issue we deal with in this section concerns simulations of parallel networks 
on sequential machines. The following theorems summarize some results concerning 
simulations of arrays, trellises and tree networks on Turing machines and RAMs.
Theorem 4.1 (Chang et al. [17]). (1) Any language accepted by a one-directional
cellular tree network can be accepted by a deterministic Turing machine in space
(log’ n),/(log log n). 
(2) Any language accepted by a one-directional cellular trellis network can be
accepted by a deterministic Turing machine in space n&.
(3) Any language accepted by a k-dimensional cellular array (of $n^k$ processors) can
be accepted by a deterministic Turing machine in space n'- Iih.
Theorem 4.2 (Ibarra [43], Černý and Gruska [10]). (1) Any language accepted by
a two-way k-dimensional cellular array can be accepted by a RAM in time
O(rP+‘/log n’ifl’r). 
(2) Any language accepted by a regular {modular} [regular and modular]
(homogeneous) real-time trellis automaton can be accepted by a one-tape Turing
machine in time O(n’), {O(n’logn)}, [O(n’m)], (O(n’)). 
(Regular and modular trellis automata are defined in Section 6.) 
5. Power of tree computations
One of the main goals in parallel computations is to do as much as possible in
(poly)logarithmic time. From this point of view tree networks of finite automata
are perhaps the very basic model to investigate.
Two basic types of networks of finite automata have been intensively investigated.
In both cases the underlying interconnection structure is an infinite leafless tree that
is regular and nondegenerate in some reasonable way. In the case of iterative tree
automata (ITA) [24], sequential input (output) goes to (from) the root processor
(Fig. 1c) and only nonhomogeneous networks are of interest. We shall consider
here only the case where the underlying tree is balanced (i.e. each node has the
same number of sons) and, for all relevant $i$, the $i$th sons of all nodes are identical. In
the case of systolic tree automata (STA) [94], the input is parallel (Fig. 1c), one
input symbol per processor, and goes to the leftmost processors of the first level of
processors with enough processors. Processors are memoryless, the flow of computation
is one-directional to the root, and most of the research has concentrated on the
study of STA as acceptors. The regularity condition from [94] requires that there are
only finitely many nonisomorphic subtrees, and the nondegeneracy condition requires
that there is a constant $a > 1$ such that the $j$th level has at least $a^j$ nodes.
The following theorem [19, 24, 251 shows that ITA are very powerful. In this 
theorem, by ITA we denote ITA over a k-nary balanced tree and by ITM a 
modification of ITA with Turing machines instead of finite automata as processors. 
Theorem 5.1. (1) The family of languages accepted by ITA in time $T(n)$ is the same
as the family of languages accepted by ITM in time $T(n)$.
(2) Any language accepted by a nondeterministic Turing machine in time $T(n)$ can
be accepted by an ITA(2) in time $O(T(n))$.
(3) The families of languages accepted by linear time and real time ITA are identical.
(4) If $2 \le s < t$, then the family of languages accepted by ITA($s$) in linear time (in
real time) is identical to (is smaller than) the family of languages accepted in linear time
(in real time) by ITA($t$).
(5) The family of languages accepted in linear time by ITA contains all CFL, and
it is closed under Boolean operations, concatenation, Kleene closure and morphism.
In the above-mentioned model of ITA, and also of ITM, no restriction has been
made concerning the depth of the trees really involved in particular computations,
and therefore actually exponentially many processors can be active during a computation.
It is therefore of importance to investigate how much can be done within
depth-bounded ITA computations, i.e. computations on ITA where only processors
within a restricted distance from the root can be used. The main results from [19] are
summarized in the following theorem, where $D(f(n))$-bounded ITA($k$) denotes
ITA($k$), the depth of computation of which is bounded by $f(n)$ for inputs of length
$n$.
Theorem 5.2. (1) A $D(T(n))$-bounded ITM with each processor being an $S(n)$-space
bounded Turing machine can be simulated by a $D(O(T(n)) + \log S(n))$-bounded
ITA.
(2) $S(n)$-space bounded on-line Turing machines are equivalent to $D(\log S(n))$-
bounded ITA.
(3) Every CFL can be accepted by a $D(O(\log n))$-bounded ITA. Every deterministic
CFL can be accepted in linear time by an ITA.
The proof of the main results of the previous theorem is based on a clever simulation
of pushdown stacks of size $S(n)$ on $D(\log S(n))$-bounded ITA. A similar idea is
used in [6] to implement such data structures as stacks, queues, priority queues,
deques, and dictionaries on $D(\log n)$-bounded ITA in such a way that, except for
the dictionary, all data structures have a unit response time. For dictionary operations
the response time is $O(\log n)$ but instructions can be pipelined to the root at constant
speed.
In the case of systolic tree automata [31, 94], attention has been paid so far to
automata over trees with a finite base. (A base is defined as a finite tree, all leaves of
which have the same distance from the root (Fig. 10a-c).) An infinite leafless tree
$T$ is said to be over the finite base $b$, in short a $T(b)$-tree, if it can be obtained from
$b$ by an infinite process at each step of which all leaves of the tree designed at the
previous step are replaced by $b$-trees (see Fig. 10a1-c1 for trees with bases from
Fig. 10a-c, respectively).
Fig. 10.
t-nary balanced STA, in short t-STA, are a special case of STA over $T(b)$-trees (see
Fig. 10a1). Let $T(b)$-STA denote the class of all systolic tree automata over $T(b)$ and
$\mathcal{L}(T(b)\text{-STA})$ the family of languages accepted by $T(b)$-STA.
The following theorem summarizes relations between the recognition power of
$T(b)$-STA for various bases [30].
Theorem 5.3. (1) If $s$ and $t$ are (are not) powers of the same integer, then the families
$\mathcal{L}(s\text{-STA})$ and $\mathcal{L}(t\text{-STA})$ are identical (are incomparable).
(2) If the base $b$ has $s$ leaves, then $\mathcal{L}(s\text{-STA}) \subseteq \mathcal{L}(T(b)\text{-STA})$, and the equality
holds if and only if for any $0 \le i < h(b)$ (where $h(b)$ is the depth of $b$), the set of
prime divisors of the number of nodes of the $i$th level of $b$ is a subset of the prime
divisors of $s$.
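The equality condition in part (2) is purely arithmetical and can be tested for a given base once the number of nodes on each level is known. The following Python sketch assumes that a base is described only by its list of level sizes (root first, leaves last); this encoding and the function names are assumptions made for the illustration.

```python
def prime_divisors(m):
    """Set of prime divisors of a positive integer m (empty for m = 1)."""
    primes, d = set(), 2
    while d * d <= m:
        while m % d == 0:
            primes.add(d)
            m //= d
        d += 1
    if m > 1:
        primes.add(m)
    return primes

def equality_condition(level_sizes):
    """level_sizes[i] is the number of nodes on level i of the base;
    level 0 is the root and the last level holds the s leaves.
    Returns True if every level above the leaves has only prime divisors of s."""
    s = level_sizes[-1]
    allowed = prime_divisors(s)
    return all(prime_divisors(k) <= allowed for k in level_sizes[:-1])
```

For example, a base with level sizes 1, 2, 4 satisfies the condition, while a base with level sizes 1, 3, 4 does not, since 3 does not divide the number of leaves.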
The main results concerning the languages accepted by 2-STA are now presented.
Theorem 5.4. (1) The family $\mathcal{L}(2\text{-STA})$ contains all regular languages, and also some
languages very high in the language hierarchies. It is closed under Boolean operations,
right concatenation with regular sets, restricted concatenation and selective concatenation.
It is not closed under left concatenation with regular sets, Kleene closure, morphism
and $\varepsilon$-free morphism.
(2) Nondeterministic 2-STA are as powerful as deterministic ones.
(3) The emptiness, finiteness and equivalence problems are decidable.
Decidability of the emptiness problem for general STA is an open problem closely
related to well-known decision problems from formal power series [11].
There is a characterization of balanced STA in terms of special Turing machines
[46]. This characterization allows closure properties of $\mathcal{L}(2\text{-STA})$ to be proved using
standard sequential computation techniques.
There are various modifications of the STA concept. Some of them consider different
input modes, for example stable and superstable STA, at which inputs can be
submitted to processors of any level with sufficiently many processors and also to
any chosen subsequence of processors at that level [11,30]. STA where each input
is first permuted (by a host) are considered in [40], and STA with an (infinite)
program (represented by an infinite word, the initial part of which, of the same
length as a given input word, is supplied to inputs of processors in parallel with
the input word) are studied in [41]. Nonhomogeneous STA where each node-
processor uniquely determines its sons' processors are shown to be as powerful as
homogeneous ones. STA where each node is also connected to the left brother of
its left son and to the right brother of its right son, so-called PC-trees, are investigated
in [27,33].
Fast recognition of regular languages by STA is of special interest. They can also
be recognized by STA over finite trees with a feedback (which leads to an interesting
recognition of regular languages by programmable systolic trees) [22]. Problems
related to the optimization of STA as regular language recognizers are studied in
[51]. STA have also been investigated as transducers [20] to obtain fast implementations
of functions realisable by finite state automata.
6. Power and structure of linear array computations
One-dimensional arrays of finite automata and their computations have been
intensively investigated in recent years, and the results show that they are not only
a very basic model of parallel architecture but also a model with many interesting
properties and of broader importance for the theory of computation.
It has also turned out that many practically important computational problems 
can be solved sufficiently fast on linear arrays of simple processors [63,65]. 
One-dimensional cellular automata (CA) have also been used as a basic model
to study general problems of complexity because they seem to capture, in a reasonable
sense, essential features responsible for complex behaviour of systems composed
from simple elements. This is also closely connected with the approach considering
cellular automata as an alternative model of the physical world [84]. The underlying
goal is to extract from such a study some general features of such phenomena as
self-organising behaviour, chaos and so on.
The importance of CA for such fundamental investigations and for the key
applications in computing makes it very desirable to obtain more insight into the
power and structure of linear array computations and to develop models and
concepts suitable for this purpose.
The power of cellular automata depends on the amount of computational resources
available (time, space, number of processors used), on the type of interconnections,
and on the type of input.
Two basic models investigated are shown in Fig. 11a, c. They are one-dimensional
cellular automata (with parallel input) and one-dimensional iterative arrays (with
serial input). Their one-directional versions (OCA and OIA) are shown in Fig. 11b,
d. Perhaps the most interesting case is the one in which there is no additional
memory, i.e. for the input of the length $n$ only $n$ finite automata are used. Such CA
and IA are called linear and denoted LCA and LIA (Fig. 11e, f), and their
one-directional versions are denoted OLCA and OLIA.
Fig. 11. (a) CA; (b) OCA; (c) IA; (d) OIA; (e) LCA; (f) LIA.
The basic problem is to determine how powerful is one-way communication 
(especially comparing with two-way communication) and what is the relation 
between computation in real-time (i.e. in time n for the input of the length n), 
pseudo-real time (in the time 2n) and in linear time (in time cn for a constant c). 
The following theorem summarizes recent results that show the surprising power of
one-way communication and the relation of some of the open problems of this type 
to other well-known problems (for example, to some closure properties). 
Theorem 6.1. (1) OLCA accept exactly the same family of languages as OLIA [45].
(2) The family of languages accepted by OLIA is an AFL, it is closed under
intersection, complementation and reversal; it contains some PSPACE-complete
languages, all languages accepted by linear-time bounded alternating TM, by
n'-space bounded nondeterministic TM and by multihead two-way nondeterministic
pushdown automata operating in c ““*” time (and therefore all CFL) [18].
(3) If $1 \le r < s$, then the family of languages accepted by OLIA in time $n^r$ is smaller
than the family of languages accepted by OLIA in time $n^s$ [18].
(4) The family of languages accepted by OLCA in linear time is the same as the
family of languages accepted by OLIA in pseudo-real time, and the same as the family
of languages accepted by LCA in real time [45].
(5) The families of languages accepted by LCA in linear time and by LCA in
real time are identical if and only if the class of languages accepted by LCA in real time
is closed under reversal [44].
(6) If LCA are more powerful than OLCA, then nonlinear-time LCA are more
powerful than linear-time LCA [44].
The basic open problem is whether LCA are more powerful than OLCA. 
A natural modification of the models of linear CA and IA is to consider models
where the number of processors used is $S(n)$ for an input of the length $n$. Various
results for these models have been obtained in [18], also for the case $S(n) < n$.
A time-space transformation of CA is shown in Fig. 12. By discarding vertical
edges and adding processors on edge crossings we obtain the trellis network of
memoryless processors shown in Fig. 4. In the case of the vertical acceptance in
real time we obtain so-called real-time trellis automata (RTTA) (see Fig. 13). The
following theorem summarizes basic properties of the families of languages accepted
by RTTA and their nondeterministic versions [13, 15, 46].
Fig. 12.
Theorem 6.2. (1) The family of languages accepted by deterministic RTTA is an
abstract family of deterministic languages which contains all linear context-free
languages and some languages complete for polynomial time with respect to log-space
reducibility, and it is closed under Boolean operations and injective length-multiplying
morphism but not under morphism.
(2) The family of languages accepted by nondeterministic RTTA is an AFL that
contains all context-free languages and some NP-complete languages, and it is contained
in DSPACE($n \log n$).
Real-time trellis automata are also a good model to study the power of various
input modifications. For example, it has been shown that superstable RTTA are as
powerful as ordinary ones [14]. Superstability requires that the acceptance depends
neither on the level (or row) of processors at which an input is submitted nor
on the choice of processors to whose inputs the symbols of an input word are submitted,
provided the order is preserved. In [40], it is shown that the power of RTTA can be
increased if the input can be permuted (for example, by the host) before processing
begins, and basic properties of languages accepted by RTTA with permuted inputs are
investigated. The power of RTTA can also be increased if with any input of length $n$, the
prefix of the length $n$ of a fixed infinite word (called program) is also submitted in
parallel to inputs of processors [41]. In [fM], a special input mode for linear arrays, 
called pipeline processing, has been considered. 
A trellis automaton can formally be defined as a quintuple $A = (\Sigma, \Gamma, \Gamma', \lambda, g)$,
where $\Sigma$ is the input alphabet, $\Gamma$ is the operating alphabet, $\lambda \in \Gamma$, $\Gamma' \subseteq \Gamma$ is the
accepting alphabet, and $g : \Gamma \times \Gamma \to \Gamma$ is the transition function such that $g(\lambda, \lambda) = \lambda$.
With the automaton $A$ one can associate the algebra $E = (\Gamma, l, r, *)$ of signature (1,
1, 2), where $l(a) = g(\lambda, a)$ and $r(a) = g(a, \lambda)$ for $a \in \Gamma - \{\lambda\}$, and $a * b = g(a, b)$ for
$a, b \in \Gamma - \{\lambda\}$. Similarly, with any algebra of signature (1, 1, 2) one can associate a
class of trellis automata that differ only in the input alphabet and in the accepting
alphabet.
With any algebra $E = (\Gamma, l, r, *)$ of signature (1, 1, 2) and any word $w = w_0 \ldots w_n \in
\Gamma^{n+1}$, one can associate the mapping GPT($E$, $w$) that is defined on $\{(i, j) \mid i \in N,
j \in N, i + j \ge n\}$ as follows:
    GPT($E$, $w$)$(i, j) = w_i$  if $i + j = n$,
    GPT($E$, $w$)$(0, j) = l(\mathrm{GPT}(E, w)(0, j - 1))$  if $j > n$,
    GPT($E$, $w$)$(i, 0) = r(\mathrm{GPT}(E, w)(i - 1, 0))$  if $i > n$,
    GPT($E$, $w$)$(i, j) = \mathrm{GPT}(E, w)(i - 1, j) * \mathrm{GPT}(E, w)(i, j - 1)$
        if $i + j > n$, $ij \ne 0$.
The mapping GPT($E$, $w$) is called the generalized Pascal triangle over (the
algebra) $E$ and (the word) $w$, and it can be depicted in the form shown in Fig. 14,
where $w_{ij} = \mathrm{GPT}(E, w)(i, j)$. GPT($E$, 1) for the algebra $E = (N, \mathrm{id}, \mathrm{id}, +)$ is the usual
Pascal triangle, and for the algebra $E = (\{0, 1\}, \mathrm{id}, \mathrm{id}, \oplus)$, with $\oplus$ being the addition
modulo 2, the GPT($E$, 1) is shown in Fig. 15.
    $w_{0,n}$  $w_{1,n-1}$  ...  $w_{n-1,1}$  $w_{n,0}$
    $w_{0,n+1}$  $w_{1,n}$  ...  $w_{n,1}$  $w_{n+1,0}$
    $w_{0,n+2}$  $w_{1,n+1}$  ...  $w_{n+1,1}$  $w_{n+2,0}$
    $w_{0,n+3}$  $w_{1,n+2}$  ...  $w_{n+2,1}$  $w_{n+3,0}$
    ...
Fig. 14.
1 
11 
101 
1111 
10001 
110011 
1010101 
11111111 
100000001 
Fig. 15.
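The recurrence defining a generalized Pascal triangle translates into a short row-by-row computation. The following Python sketch assumes that a row is stored as a list indexed by the first coordinate, so that each new row is obtained from the previous one by applying the left unary operation at the left border, the right unary operation at the right border, and the binary operation to adjacent pairs in between; the function name and this row encoding are assumptions made for the illustration.

```python
def gpt_rows(left, right, op, w, num_rows):
    """First num_rows rows of a generalized Pascal triangle over the initial word w.

    left, right: unary operations applied along the two borders;
    op: binary operation applied to the two entries above an inner position.
    """
    rows = [list(w)]
    for _ in range(num_rows - 1):
        prev = rows[-1]
        new = [left(prev[0])]                    # left border entry
        for i in range(1, len(prev)):
            new.append(op(prev[i - 1], prev[i]))  # inner entries
        new.append(right(prev[-1]))              # right border entry
        rows.append(new)
    return rows

def identity(x):
    return x

def pascal(num_rows):
    """Usual Pascal triangle: natural numbers with identity borders and addition."""
    return gpt_rows(identity, identity, lambda a, b: a + b, [1], num_rows)

def pascal_mod2(num_rows):
    """Pascal triangle modulo 2: {0, 1} with identity borders and addition mod 2."""
    return gpt_rows(identity, identity, lambda a, b: (a + b) % 2, [1], num_rows)
```

Joining the entries of each row of `pascal_mod2(9)` into a string gives the nine rows 1, 11, 101, ..., 100000001 of the triangle modulo 2.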
Generalized Pascal triangles can also be seen as special two-dimensional infinite
words that can be considered as a new abstract model of linear array computations.
This new model naturally gives rise to a new set of problems, for example:
(1) What is the structure of diagonals, columns and other components of GPT?
(2) What can be said about the density of occurrences of elements in GPT?
(3) When are the basic decision problems concerning the structure of GPT
decidable? (For example the problem to determine whether a given symbol (a word
or a pattern) occurs at least once (or infinitely often) in a given GPT.)
(4) What are the conditions for a complete GPT($E$, $w$) to have the so-called
self-embedding property (i.e. that there exist integers $p$, $q$ such that GPT($E$, $w$)$(i + p,
j + q)$ = GPT($E$, $w$)$(i, j)$ for any $i$, $j$ in $N$)?
(5) With any GPT one can associate the language of row-words. What are the
properties of such languages?
All these problems are related to basic problems concerning linear array computations.
In this way the model of generalized Pascal triangles brings new techniques
into the study of linear array computations. It also relates structural and combinatorial
properties of GPT (and thereby also of linear array computations) and
algebraic properties of the underlying algebras.
Generalized Pascal triangles have recently been intensively investigated, especially
by Korec [59]. Some of the results are now presented.
In order to study properties of diagonals of GPT, it is convenient to consider,
instead of algebras $E = (A, l, r, *)$ of the signature (1, 1, 2), the algebras $E' = (A, K,
l, r, *)$ of the signature (0, 1, 1, 2), and to consider GPT($E'$, $K$), in short GPT($E'$),
as the (complete) GPT over $E'$.
Diagonals of GPT are clearly ultimately periodic. Let LPER($E'$, $k$) denote the
smallest period of the $k$th left diagonal of the GPT($E'$). Clearly LPER($E'$, $k$) $\le |A|^k$.
The algebra $E'$ is said to be the algebra with maximal left periods if LPER($E'$, $k$) = $|A|^k$
for any $k \in N$.
Theorem 6.3 (Kochol [56]). (i) $E' = (A, K, l, r, *)$ is the algebra with maximal left
periods if and only if every word $u \in A^*$ is the prefix of at least one (of infinitely many)
rows of GPT($E'$).
(ii) There is an algebra of the signature (0, 1, 1, 2) of cardinality $n$ with maximal
left periods if and only if $n$ is odd. There are exactly six different algebras of cardinality
3 with maximal left periods.
(iii) If $E' = (A, K, l, r, *)$ is the algebra with maximal left periods, then the operation
$l$ is a cyclic permutation of $A$, the operation $*$ is not associative, and for every $a, b \in A$
there exists just one $x \in A$ such that $a * x = b$.
The next problem we shall deal with is the density of occurrences of particular
elements in GPT.
For an algebra $E = (A, l, r, *)$, for $w \in A^*$ and $x \in A$, let us define the density of
the occurrence of $x$ in the GPT($E$, $w$) as follows:
$$\mathrm{DENS}(E, w, x) = \lim_{n \to \infty} \frac{\text{number of } x \text{ in the first } n \text{ rows of GPT}(E, w)}{\text{number of all elements in the first } n \text{ rows of GPT}(E, w)}.$$
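The ratio under the limit can be evaluated exactly for any finite number of rows. As a hedged illustration, the Python sketch below computes it for the symbol 1 in the Pascal triangle modulo 2; the function name is hypothetical, and finite ratios of course only suggest, and do not prove, the value of the limit.

```python
from fractions import Fraction

def ratio_of_ones(num_rows):
    """Exact ratio (number of 1s) / (number of all entries) in the first
    num_rows rows of the Pascal triangle modulo 2."""
    row, ones, total = [1], 0, 0
    for _ in range(num_rows):
        ones += sum(row)
        total += len(row)
        # Next row: borders stay 1, inner entries are sums modulo 2 of adjacent pairs.
        row = [1] + [(row[i - 1] + row[i]) % 2 for i in range(1, len(row))] + [1]
    return Fraction(ones, total)
```

For this particular triangle the first $2^k$ rows contain $3^k$ ones among $2^{k-1}(2^k + 1)$ entries, so the ratios decrease towards 0 as $k$ grows.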
The next theorem says that from the point of view of density, there exists, surprisingly, 
a “universal” algebra at which one can achieve any feasible density of elements by 
choosing a proper initial word. 
Theorem 6.4 (Korec [60]). For any $s \in N$ there exists a finite algebra $E = (A, l, r, *)$
such that $\{1, 2, \ldots, s\} \subseteq A$, and for every ordered $s$-tuple $(a_1, \ldots, a_s)$ of nonnegative
rationals such that
there is a $w \in A^*$ such that DENS($E$, $w$, $i$) = $a_i$.
Decision problems, the self-embedding problems, and row-language characterization
problems are related, and they are also nicely related to algebraic properties
of the underlying algebras, as the following theorem shows [59, 60]. (The concept
of simple semilinear language (of degree $k$) that is used in the following theorem
is defined as follows. A language $L \subseteq \Sigma^*$ is called simple linear (of degree $k$) if it
has the form $L = \{u_1 v_1^i u_2 v_2^i \ldots u_k v_k^i u_{k+1} \mid u_1, v_1, \ldots, u_k, v_k, u_{k+1}$ are fixed words in
$\Sigma^*$ and $i \ge 0\}$. A language is called simple semilinear (of degree $k$) if it is a union
of finitely many simple linear languages (of degree $k$).)
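Membership in a simple linear language is easy to decide by brute force. The following Python sketch assumes the form given above, with one common exponent shared by all repeated words; the function names and the list-based encoding of the fixed words are assumptions made for the illustration.

```python
def simple_linear_word(us, vs, i):
    """The word u_1 v_1^i u_2 v_2^i ... u_k v_k^i u_{k+1} for fixed words us, vs."""
    assert len(us) == len(vs) + 1
    parts = []
    for u, v in zip(us, vs):
        parts.append(u)
        parts.append(v * i)
    parts.append(us[-1])
    return "".join(parts)

def in_simple_linear(word, us, vs):
    """Membership test. If some repeated word is nonempty, the generated word has
    length at least i, so only exponents i <= len(word) need to be tried; if all
    repeated words are empty, the language is a single word, covered by i = 0."""
    return any(simple_linear_word(us, vs, i) == word for i in range(len(word) + 1))
```

For instance, with empty fixed words around the repeated words a and b (degree 2), the language consists of the words with equally many a's followed by equally many b's, so aabb is accepted and aab is rejected. A simple semilinear language can be tested by trying each of its finitely many simple linear components in turn.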
Theorem 6.5 (Korec [58,59]). (i) The problem whether a given element occurs at least
once (infinitely often) [and many other related problems] is undecidable even for GPT
over finite algebras with a commutative binary operation $*$.
(ii) All decision problems mentioned in (i) [and many other related problems] are
decidable for those GPT, row-languages of which are simple semilinear.
(iii) If a GPT has the self-embedding property, then the corresponding row-language
is simple semilinear and context-free.
(iv) If the binary operation of the algebra $E$ is idempotent and associative, then all
GPT over $E$ have the self-embedding property.
Some GPT can be designed from simple modules (actually finite and rectangular
two-dimensional words) using simple recurrences; for example the GPT from Fig. 15
can also be obtained using the recurrences from Fig. 16. This has led to the investigation
of so-called modular two-dimensional words, in short modular trellises, as mappings
$T : N \times N \to \Gamma$ that can be designed in such a modular way. Formally, modular
trellises can be defined in a way that is a two-dimensional generalization of the way
a Thue-Morse infinite word is defined using iterated morphisms.
A trellis $T$ over an alphabet $\Gamma$ is said to be strictly $(p, q)$-modular if
$$T = \lim_{j \to \infty} \rho^j(a),$$
where $\rho$ maps $\Gamma$ into rectangular two-dimensional words
$$\rho(b) = \begin{pmatrix} b_{11} & \cdots & b_{1p} \\ \vdots & & \vdots \\ b_{q1} & \cdots & b_{qp} \end{pmatrix}$$
such that $a_{11} = a$. A trellis $T$ over the alphabet $\Gamma$ is said to be $(p, q)$-modular if $T$
can be obtained from a strictly $(p, q)$-modular trellis $T'$ over an alphabet $\Gamma'$ by
renaming (i.e. if $T(i, j) = \theta(T'(i, j))$ where $\theta : \Gamma' \to \Gamma$ is a mapping). A trellis $T$ is
said to be [strictly] modular if it is [strictly] $(p, q)$-modular for some $p$, $q$. For
example the trellis depicted in Fig. 14 is (2, 2)-strictly modular.
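Iterating a two-dimensional morphism is straightforward to program. As a hedged illustration, the Python sketch below iterates a 2 x 2 morphism over the alphabet {0, 1} and compares the result with the Pascal triangle modulo 2 written as a square trellis, with the entry at position (i, j) equal to the binomial coefficient of i + j over i, taken modulo 2. The particular morphism, the names `RHO`, `apply_morphism` and `iterate`, and the square indexing are assumptions chosen for the example.

```python
from math import comb

# Hypothetical morphism: 1 is replaced by the block (1 1 / 1 0), 0 by the zero block.
RHO = {1: [[1, 1], [1, 0]], 0: [[0, 0], [0, 0]]}

def apply_morphism(word):
    """Replace every letter of a rectangular word by its 2 x 2 image."""
    out = []
    for row in word:
        top, bottom = [], []
        for letter in row:
            block = RHO[letter]
            top.extend(block[0])
            bottom.extend(block[1])
        out.extend([top, bottom])
    return out

def iterate(start, times):
    """Apply the morphism the given number of times, starting from one letter."""
    word = [[start]]
    for _ in range(times):
        word = apply_morphism(word)
    return word

def pascal_mod2_square(size):
    """Square trellis whose entry (i, j) is binomial(i + j, i) modulo 2."""
    return [[comb(i + j, i) % 2 for j in range(size)] for i in range(size)]
```

Since the upper left entry of the image of 1 is again 1, every iterate is a prefix of the next one, which is what makes the limit well defined; after three iterations the 8 x 8 word agrees with the corresponding part of the Pascal triangle modulo 2.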
Fig. 16.
Modular and strictly modular trellises have been investigated in [9], where two
quite different characterizations of them have also been derived. The first one
characterizes $(p, q)$-modular trellises as exactly those trellises $T$ for which $T(i, j)$
can be computed by a so-called $(p, q)$-sorting automaton. (It is a finite automaton
such that if it receives on its input (in parallel) integers $i$ and $j$ (in $p$-nary and
$q$-nary notation, respectively), then it comes to the state that uniquely determines
$T(i, j)$.) Due to this characterization one can determine $T(i, j)$ for a modular trellis
in time $O(\log(i + j))$, while one seems to need, in general, time $\Theta(i + j)$ to compute
$T(i, j)$ for a GPT $T$. The second characterization characterizes modular trellises as
exactly those trellises that are fix-points of special substitutions on trellises. (A
substitution first partitions a trellis into rectangular words of the same size and then
replaces each such two-dimensional subword by another two-dimensional finite
word, in general of a different size, but in all cases replacement is by subwords of
the same size.) It has been shown that the family of modular trellises is quite robust;
it is closed under various operations on trellises. Modularity of trellises of a special
type has also been studied. For example it has been shown [61] that a Pascal
triangle modulo $p$ is modular (strictly modular) if and only if $p$ is a prime power
(a prime).
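The idea of computing an entry of a modular trellis from the digits of its coordinates can be illustrated on the Pascal triangle modulo 2. The Python sketch below is a two-state automaton that reads the binary digits of both coordinates in parallel; reading the least significant digits first and the square indexing, with the entry at (i, j) equal to the binomial coefficient of i + j over i modulo 2, are assumptions made for the example, and the sketch is not claimed to match the formal definition of a sorting automaton in [9] in every detail.

```python
from math import comb

def pascal_mod2_entry(i, j):
    """Two-state automaton: starts in state 1 and moves to the trap state 0
    as soon as both coordinates have digit 1 in the same binary position.
    The number of steps is proportional to the number of binary digits."""
    state = 1
    while i or j:
        if (i & 1) and (j & 1):
            state = 0
        i >>= 1
        j >>= 1
    return state

def agrees_with_binomials(limit):
    """Compare the automaton with directly computed binomial coefficients."""
    return all(pascal_mod2_entry(i, j) == comb(i + j, i) % 2
               for i in range(limit) for j in range(limit))
```

The automaton needs a number of steps logarithmic in i + j, whereas filling the triangle row by row up to the same entry takes a number of rows proportional to i + j.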
RTTA are also a suitable model to study the power of nonhomogeneity in systolic
trellis computations. Three types of nonhomogeneous systolic RTTA have been
investigated in more detail [10,13,46,61]. Regular RTTA are such nonhomogeneous
RTTA that their processor distribution forms a GPT. In an analogous way modular
RTTA are defined as nonhomogeneous RTTA such that their processor distribution
forms a modular trellis. It has been shown that in the deterministic case both regular
and modular RTTA are more powerful than homogeneous ones. On the other hand
nondeterministic regular RTTA are as powerful as nondeterministic homogeneous
RTTA. (Whether the same holds true for modular RTTA is an open problem.)
Regular and also modular RTTA may have a large number of different types of
processors and quite a complicated distribution of them. It is therefore interesting
that for each regular (modular) RTTA one can design an equivalent regular
(modular) RTTA that uses only processors of two types [10, 57]. This result also
indicates that the concepts of regular and modular RTTA are a reasonable generalization
of the concept of homogeneous RTTA.
In [61] the third type of nonhomogeneous RTTA has been investigated, semilinear
RTTA. They are more powerful than homogeneous RTTA, and they seem to be less
powerful than regular and modular RTTA. A nonhomogeneous RTTA is said to be
semilinear of the degree $k$ if the set of the row words of the underlying trellis is
semilinear of degree $k$.
It has been shown [61] that to every semilinear RTTA there is an equivalent
semilinear RTTA that is both regular and modular. Moreover every semilinear RTTA
is equivalent to an RTTA that is both regular and semilinear of degree 3 and to
a semilinear RTTA of degree 2.
We have considered here systolic tree and trellis automata as two very basic
models of parallel computations. A natural question is how their powers are
related. It has been shown in [13] that the families of languages accepted by systolic
tree automata over balanced trees and by homogeneous RTTA are incomparable.
On the other hand, it has recently been shown [93] that each language accepted by
a systolic tree automaton over a balanced tree can be accepted by a regular RTTA,
and that the simulation of systolic tree automata over a balanced tree on regular RTTA
can be done in quite a universal way.

References

[1] J. Albert and K. Culik II, Simple universal cellular automaton and its one-way and totalistic version,
Complex Systems 1 (1987) 1-16.
[2] M. Bartha, An equational axiomatization of systolic systems, Theoret. Comput. Sci. 55 (1987) 265-289.
[3] M. Bartha, Interpretations of systolic flowchart schemes, TR, Bolyai Institute, Szeged, 1989.
[4] H. L. Bodlaender and J. van Leeuwen, Simulation of large networks on smaller networks, Dept.
of Computer Science, TR-RUU-CS-84-4, University of Utrecht, 1984.
[5] S. D. Brooks, Reasoning about synchronous systems, Dept. of Computer Science,
TR-CMU-CS-84-145, Carnegie-Mellon University, 1984.
[6] J. H. Chang, M. J. Chung, O. H. Ibarra and K. K. Rao, Systolic tree implementations of data
structures, Dept. of Computer Science, TR-85-32, University of Minnesota, 1985.
[7] K. Culik II and C. Choffrut, On real-time cellular automata and trellis automata, Acta Inform.
21 (1984) 393-407.
[8] K. Culik II and I. Fris, Topological transformation as a tool in the design of systolic systems,
Theoret. Comput. Sci. 37 (1985) 183-216.
[9] A. Černý and J. Gruska, Modular trellises, in: G. Rozenberg and A. Salomaa (eds.), The Book of
L (Springer, Berlin, 1985) 45-61.
[10] A. Černý and J. Gruska, Modular real time trellis automata, Fund. Inform. IX (1986) 253-282.
[11] K. Culik II, J. Gruska and A. Salomaa, On a family of L languages resulting from systolic tree
automata, Theoret. Comput. Sci. 23 (1983) 231-242.
[12] K. Culik II, J. Gruska and A. Salomaa, Systolic automata for VLSI on balanced trees, Acta Inform.
18 (1983) 335-344.
[13] K. Culik II, J. Gruska and A. Salomaa, Systolic trellis automata, Parts I and II, Internat. J. Comput.
Math. 15 (1984) 195-212; 16 (1984) 3-22.
[14] K. Culik II, J. Gruska and A. Salomaa, Systolic trellis automata: stability, decidability and
complexity, Inform. and Control 71 (1986) 218-230.
[15] M. Chen, Very-high level programming in Crystal, Dept. of Computer Science, TR 506, Yale
University, 1986.
[16] J. H. Chang, O. H. Ibarra and M. A. Palis, Parallel parsing on a one-way array of finite state
machines, Dept. of Computer Science, TR-85-20, University of Minnesota, 1986.
[17] J. H. Chang, O. H. Ibarra and M. A. Palis, Efficient simulations of simple models of parallel
computation by time-bounded ATM's and space bounded TM's, in: Proc. ICALP'88, Lecture Notes
in Computer Science 317 (Springer, Berlin, 1988) 119-132.
[18] J. H. Chang, O. H. Ibarra and A. Vergis, On the power of one-way communications, Dept. of
Computer Science, TR-86-11, University of Minnesota, 1986.
[19] K. Culik II, O. H. Ibarra and S. Yu, Iterative tree arrays with logarithmic depth, Internat. J. Comput.
Math. 20 (1986) 187-204.
[20] K. Culik II, H. Jürgensen and K. Mak, Systolic tree architecture for some standard functions, Dept.
of Computer Science, TR-140, University of Western Ontario, 1985.
[21] K. Culik II and J. Karhumäki, On totalistic systolic networks, Inform. Process. Lett. 26 (1988) 231-236.
[22] K. Culik II and H. Jürgensen, Programmable finite automata for VLSI, Internat. J. Comput. Math.
14 (1983) 259-275.
[23] P. Cappello and K. Steiglitz, Unifying VLSI design with geometric transformations, in: Proc. IEEE
Internat. Conf. on Parallel Processing (1983) 448-451.
[24] K. Culik II and S. Yu, Iterative tree automata, Theoret. Comput. Sci. 32 (1984) 227-247.
[25] K. Culik II and S. Yu, Real-time, pseudo real-time and linear time ITA, Theoret. Comput. Sci. 47
(1986) 15-26.
[26] D. de Baer and J. Paredaens, A formal definition for systolic systems, Dept. of Mathematics, TR,
University of Antwerp, 1984.
[27] E. Fachini, R. Francese, M. Napoli and D. Parente, BC-tree systolic automata: characterization
and properties, Comput. Artificial Intelligence 8 (1989) 53-82.
[28] P. Frison, P. Gachet and P. Quinton, Designing systolic arrays with DIASTOL, RR-578, INRIA, 1986.
[29] A. L. Fischer and H. T. Kung, Synchronising large VLSI processor arrays, IEEE Trans. Comput.
34 (1985) 734-740.
[30] E. Fachini, A. Maggiolo Schettini, G. Resta and D. Sangiorgi, Nonacceptability criteria and closure
properties for the class of languages accepted by binary systolic tree automata, Dept. Informatica
ed Applicazioni, TR-44-88, University of Salerno, 1988.
[31] E. Fachini, A. Maggiolo Schettini, G. Resta and D. Sangiorgi, Some structural properties of systolic
tree automata, Dept. Informatica ed Applicazioni, TR-55-88, University of Salerno, 1988.
[32] M. A. Frumkin, Systolic programming (in Russian), Voprosy Kibernet. (1988) 101-120.
[33] E. Fachini and M. Napoli, C-tree systolic automata, Theoret. Comput. Sci. 56 (1988) 155-186.
[34] A. L. Fischer, Memory and modularity in systolic array implementation, in: Proc. 1985 Conf. on
Parallel Processing (1985) 99-101.
[35] A. Fiat and A. Shamir, Polymorphic arrays: a novel VLSI layout for systolic computers, in: Proc.
FOCS (1984) 37-45.
[36] N. Faroughi and M. A. Shanblatt, Systematic generation and enumeration of systolic arrays from
the algorithms, in: Proc. Internat. Conf. on Parallel Processing, University Park (1987) 844-847.
[37] J. Gruska, Systolic automata: power, characterizations, nonhomogeneity, in: Proc. MFCS'84, Lecture
Notes in Computer Science 176 (Springer, Berlin, 1984) 32-49.
[38] M. Hennessy, Proving systolic systems correct, in: Proc. TOPLAS (1986) 344-387.
[39] H. W. Hornick, A unified approach to the analysis and synthesis of systolic arrays, Dept. of Electrical
Engineering, TR-1039, University of Illinois, 1985.
[40] J. Hromkovič and D. Pardubská, Some complexity aspects of VLSI computations. On the power
of input bit permutations in tree and trellis automata, Comput. Artificial Intelligence 7 (1988) 397-412.
[41] J. Hromkovič and D. Pardubská, Some complexity aspects of VLSI computations, VLSI circuits
with programs, Comput. Artificial Intelligence 7 (1988) 481-496.
[42] C. H. Huang and C. Lengauer, An implemented method for incremental systolic design, in: Proc.
PARLE, Lecture Notes in Computer Science 258 (Springer, Berlin, 1987) 160-177.
[43] O. H. Ibarra, Systolic arrays: characterization and complexity, in: Proc. MFCS, Lecture Notes in
Computer Science 233 (Springer, Berlin, 1986) 140-153.
[44] O. H. Ibarra and T. Jiang, On some open problems concerning cellular arrays, Dept. of Computer
Science, TR, University of Minnesota, 1987.
[45] O. H. Ibarra and T. Jiang, On one-way cellular arrays, SIAM J. Comput. 16 (1987) 1135-1154.
[46] O. H. Ibarra and S. M. Kim, Characterizations and computational complexity of systolic trellis
automata, Theoret. Comput. Sci. 29 (1984) 123-153.
[47] O. H. Ibarra and S. M. Kim, A characterization of systolic binary tree automata and applications,
Acta Inform. 21 (1984) 193-207.
[48] O. H. Ibarra and M. A. Palis, Two-dimensional systolic arrays: characterizations and applications,
Theoret. Comput. Sci. 57 (1988) 47-86.
[49] O. H. Ibarra, M. A. Palis and S. M. Kim, Designing systolic algorithms using sequential machines,
in: Proc. FOCS (1984) 46-55.
[50] O. H. Ibarra, M. A. Palis and S. M. Kim, Some results concerning linear iterative (systolic) arrays,
J. Parallel Distributed Comput. 2 (1985) 182-218.
[51] H. Jürgensen and A. Salomaa, Syntactic monoids in the construction of systolic tree automata,
Internat. J. Comput. Inform. Sci. 14 (1985) 35-49.
[52] H. T. Kung and M. S. Lam, Fault-tolerance and two-level pipelining in VLSI systolic arrays, in:
Proc. Conf. Advanced Research in VLSI, MIT (1984).
[53] H. T. Kung and M. S. Lam, Wafer-scale integration and two level pipelined implementations of
systolic arrays, J. Parallel Distributed Process. 1 (1984).
[54] H. T. Kung and C. E. Leiserson, Systolic arrays (for VLSI), in: Sparse Matrix Proceedings (Soc.
for Industrial and Applied Mathematics, 1978) 256-282.
[55] C. J. Kuo, B. C. Levy and B. R. Musicus, The specification and verification of systolic wave
algorithms, TR, Dept. of Electrical Engineering and Computer Sciences, MIT, Cambridge, 1984.
[56] M. Kochol, Generalized Pascal triangles with maximal left periods, Comput. Artificial Intelligence
6 (1987) 54-76.
[57] I. Korec, Two kinds of processors are sufficient and large operating alphabets are necessary for
regular trellis automata languages, Bull. EATCS 23 (1984) 35-42.
[58] I. Korec, Generalized Pascal triangles, decidability results, Acta Math. Univ. Comenian. 46-47
(1985) 93-130.
[59] I. Korec, Generalized Pascal triangles (in Slovak), Doctoral Thesis, Comenius University.
[60] I. Korec, Asymptotical densities in generalized Pascal triangles, Comput. Artificial Intelligence 5
(1986) 187-198.
[61] I. Korec, Semilinear real-time trellis automata, in: Proc. FCT'89, Lecture Notes in Computer Science
(Springer, Berlin, 1989).
[62] S. Y. Kung, On supercomputing with systolic/wavefront array processors, Proc. IEEE 72 (1984)
867-884.
[63] H. T. Kung, Systolic algorithms for the CMU Warp processor, Dept. of Computer Science,
TR-CMU-CS-84-158, 1984.
[64] H. T. Kung, Memory requirements for balanced computer architectures, Dept. of Computer Science,
TR, Carnegie-Mellon University, 1985.
[65] H. T. Kung, Warp Demo, Dept. of Computer Science, Carnegie-Mellon University, 1986.
[66] H. T. Kung, Special-purpose supercomputers, in: Information Processing IFIP'86, Participants
Edition (1986) 565-570.
[67] L. Kossen and W. P. Weijland, Correctness proofs of systolic algorithms: palindromes and sorting,
Dept. of Computer Science, TR-WI-87-04, University of Amsterdam, 1987.
[68] F. T. Leighton and C. E. Leiserson, Wafer-scale integration of systolic arrays, IEEE Trans. Comput.
34 (1985) 448-461.
[69] M. Lam and J. Mostow, A transformational model of VLSI systolic design, in: Proc. 6th Internat.
Symp. on Computer Hardware Description Languages and their Applications, IFIP (1983) 65-67.
[70] C. E. Leiserson and J. B. Saxe, Optimizing synchronous systems, in: Proc. FOCS (1981) 23-36.
[71] C. E. Leiserson, F. M. Rose and J. B. Saxe, Optimising synchronous circuitry by retiming, in: Proc.
Caltech Conf. on Very Large Scale Integration (1983) 87-116.
[72] G.-J. Li and B. W. Wah, The design of optimal systolic arrays, IEEE Trans. Comput. 34 (1985) 66-77.
[73] B. Lisper, Description and synthesis of systolic arrays, Dept. of Numerical Analysis and Computing
Science, The Royal Institute of Technology, Stockholm, 1986.
[74] D. I. Moldovan, On the design of algorithms for VLSI systems, Proc. IEEE 71 (1983).
[75] W. L. Miranker and A. Winkler, Space-time representations of computational structures, Computing
32 (1984) 93-114.
[76] D. Pardubská, Closure properties of the family of languages defined by systolic tree automata,
Comput. Artificial Intelligence 7 (1988) 59-64.
[77] P. Quinton, The systematic design of systolic arrays, TR-193, IRISA, 1983.
[78] M. Rem, Trace theory and systolic computations, Dept. of Mathematics and Computer Science,
TR, Eindhoven University of Technology, 1988.
[79] S. Rajopadhye, S. Purushothaman and R. Fujimoto, On synthesising systolic arrays from recurrent
equations with linear dependencies, in: Proc. Foundations of Software Technology and Theoretical
Computer Science, Lecture Notes in Computer Science 241 (Springer, Berlin, 1986) 488-503.
[80] S. G. Sedukhin, Systematic approach to the design of VLSI networks (in Russian), Preprint, Academy
of Sciences, Novosibirsk, 1985.
[81] S. G. Sedukhin, Design and analysis of systolic algorithms for algebraic path problem, TR,
Computing Center, Academy of Sciences, Novosibirsk, 1987.
[82] S. G. Sedukhin, Design and analysis of systolic algorithms for the algebraic path problems, TR,
Computing Center, Academy of Sciences, Novosibirsk, 1988.
[83] S. G. Sedukhin and E. V. Trishina, From the set of recurrent equations to the set of systolic wavefront
algorithms, TR, Computing Center, Academy of Sciences, Novosibirsk, 1989.
[84] T. Toffoli, Cellular automata as an alternative to (rather than an approximation of) differential
equations in modelling physics, in: Cellular Automata, Proc. Interdisciplinary Workshop, Los Alamos
(North-Holland, Amsterdam, 1983) 117-127.
[85] P. J. Varman and I. V. Ramakrishnan, A fault-tolerant VLSI matrix multiplier, Dept. of Computer
Science, TR-85-29, SUNY at Stony Brook, 1985.
[86] R. Vollmar, Some remarks on pipelined processing by cellular automata, Comput. Artificial
Intelligence 6 (1987) 263-278.
[87] R. Vollmar, Basic research for cellular processing, in: Proc. PARCELLA (Akademie-Verlag, Berlin,
1988) 205-222.
[88] U. Weisser and A. Davis, A wavefront notation tool for VLSI array design, in: VLSI Systems and
Computations (Computer Science Press, 1981) 226-234.
[89] W. P. Weijland, A systolic algorithm for matrix-vector multiplication, in: Proc. Computing Science
in The Netherlands (1987) 143-160.
[90] S. Wolfram, Statistical mechanics of cellular automata, Rev. Modern Phys. 55 (1983) 601-644.
[91] S. Wolfram, Computation theory of cellular automata, Comm. Math. Phys. 96 (1984) 15-57.
[92] S. Wolfram, Universality and complexity in cellular automata, Physica 10D (1984) 1-35.
[93] E. Fachini, J. Gruska, A. Maggiolo Schettini and D. Sangiorgi, Simulation of systolic tree automata
on trellis automata, TR, Dipartimento di Informatica, Università di Pisa.
[94] K. Culik II, A. Salomaa and D. Wood, Systolic tree acceptors, RAIRO Inform. Théor. 18 (1984)
53-69.
