# Analysis of data flow for SIMD systems 

Reinhard Klette

## 0. Introduction

A general approach to characterizing the inherent complexity of computational problems is given by the quantitative analysis of the extent of the data flow that has to be performed during the solution of these problems. On the other hand, any parallel processing system possesses a restricted ability for fast data transfer determined essentially by the interconnection pattern of the processing elements. In the present paper, these general observations, as previously mentioned by Gentleman (1978), Siegel (1979), Abelson (1980), or Klette (1980), will be transformed into precise definitions of local, global and total data transfer within SIMD systems, and the corresponding definitions of local, global and total data dependencies for computational problems as well. The basic relation between these corresponding notions - the computational time must at least be sufficient for realizing the necessary extent of data transfer - will be represented in a so-called data transfer lemma that outlines the starting point of our formalized method of obtaining lower time bounds by data flow analysis. This approach will be illustrated by application to a variety of different parallel processing architectures where the unifying feature will be that we shall use SIMD models that employ an interconnection network and use no shared memory. Our parallel processing systems will be abstract models of computation where the level of abstraction may be compared with that of a random access machine (RAM); cp. Aho et al. [2] for this model of serial computation. For computational problems such as those mentioned in the present paper the author was inspired by the digital image processing area, where reference is made to Rosenfeld et al. [9] and Klette [5]. But, of course, this does not represent a serious restriction; e.g., matrix multiplication or pattern matching are computational problems of general importance.

The general SIMD model as used in this paper is characterized by a finite or infinite set of processing elements (PEs), an interconnection network, and a central processing unit (CPU). For a rough scheme of an SIMD system which the reader may have in mind throughout this paper, see Fig. 1.

Figure 1. Scheme of an SIMD system

CPU. The CPU has a (central) random access memory which consists of a finite or infinite sequence of registers $r_{0}, r_{1}, r_{2}, \ldots$ with a distinguished accumulator $r_{0}$. Let $D_{\text{CPU}}$ be the depth of this random access memory, i.e., the number of CPU registers, for $1 \leqq D_{\text{CPU}} \leqq \infty$. Furthermore, let $W_{\text{CPU}}$ be the word length of these registers (number of bit positions), which is assumed to be constant for all CPU registers, for $1 \leqq W_{\text{CPU}} \leqq \infty$. The CPU spreads a single instruction stream to the synchronized working PEs. The programs of the system are stored in a potentially size-unlimited special program memory of the CPU. Part of any instruction addressed to the PEs is an enable/disable mask to select a subset of the PEs that are to perform the given instruction; the remaining PEs will be idle. The CPU may read the accumulator contents of any one PE of a specified subset of all PEs, and is able to transfer its accumulator contents to some of the PE accumulators. Any data transfer between CPU and PEs is restricted to serial mode.

PEs. Each PE has some (local) random access memory which consists of a finite or infinite sequence of registers $r_{0}, r_{1}, r_{2}, \ldots$ with a distinguished register $r_{0}$ called the accumulator. Let $D_{\mathrm{PE}}$ be the depth of these random access memories, i.e., this depth is assumed to be constant for all PEs of a given system, for $1 \leqq D_{\mathrm{PE}} \leqq \infty$. Furthermore, let $W_{\text {PE }}$ be the unique word length of the PE registers, for $1 \leqq W_{\text {PE }} \leqq \infty$. Each PE is capable of performing some basic operations which take place in its accumulator. Direct data access is restricted to its own registers, to the accumulators of the directly connected PEs in the sense of the given interconnection network, and, possibly, to the accumulator of the CPU. The PEs are indexed by integers or tuples of integers. Each PE knows its index. Let $N_{\text {PE }}, 0 \leqq N_{\text {PE }} \leqq \infty$, be the number of PEs of a given system, and ind $=\left\{j_{1}, j_{2}, \ldots, j_{N_{P E}}\right\}$ be the set of all PE indices of a given SIMD system.

Interconnection network. Each PE is located in a node of a given undirected graph representing the two-way interconnection scheme. Any PE may uniquely identify the different edges connected to its node by using a given coding scheme. Let $N_{\text {IN }}$ be the branching degree of the network, i.e., the maximum degree of the nodes of the given graph, for $0 \leqq N_{\text {IN }}<\infty$.

For the selection of a specialized SIMD model the following system features may be concretely specified:

- off-line or on-line communication with the outside world,
- special values for $N_{\mathrm{PE}}, N_{\mathrm{IN}}, D_{\mathrm{CPU}}, D_{\mathrm{PE}}, W_{\mathrm{CPU}}$, or $W_{\mathrm{PE}}$,
- the set ind,
- the interconnection network structure including the edge coding scheme,
- the CPU instruction set including the available set of enable/disable masks as well as the method of the data exchange between CPU and PEs, and
- the restrictions on the system in communication with the outside world, i.e., input and output management.

Note that as regards the technical realization of an SIMD computing facility, in principle, one implementation may offer different ways to run such a system, i.e., the working principles of several SIMD models as considered in the present paper may be unified within one implementation. Essentially, this is the problem of constructing a flexible interconnection network with reconfigurability, and/or of running a system using different modes.

The outline of this paper is as follows. In the first section we shall present some standardized system description features for specifications of SIMD models. In Section 2 we shall describe how the data flow of an SIMD system may be measured by functions in a quantitative way. Then, in Section 3 the corresponding notions of data dependencies will be explained for computational problems. In Section 4 the data transfer lemma will be given as well as some applications of this lemma to different models of computation for lower time bound determination. Our concluding remarks are given at the end of the paper.

The standard SIMD models as described in Section 1 constitute the framework of a parallel simulation system (PARSIS) presently under implementation; cp. Legendi [7] for a similar project for simulation of cellular processors.

## 1. OFF-NETs and ON-NETs

In our experience in parallel program design the exclusion of given technical restrictions, e.g., on $N_{\text {PE }}, N_{\text {IN }}$, etc., in the first steps of problem solutions, enables us to find important methods of parallelization of solution processes as well as general features for system description. Of course, for concrete implementation quite a lot of time must be spent in taking given restrictions for $N_{\mathrm{PE}}, N_{\mathrm{IN}}$, etc. into consideration. The present paper is concerned with the first phase, the theoretical preparation for the second phase, which is the concrete implementation. In this sense, we shall deal with abstract SIMD models throughout this paper. More detailed discussion will be the subject of forthcoming papers, depending on the progress of the PARSIS project.

The common one-accumulator computer, e.g., the random access machine (RAM) in the sense of Aho et al. [2], may be considered as the simplest example of an abstract SIMD system - $N_{\text {PE }}=0$ and $D_{\text {CPU }}=W_{\text {CPU }}=\infty$. We shall use the RAM as the underlying model for serial data processing where, in distinction to [2], infinite precision, real number arithmetic is assumed, which is convenient for our theoretical considerations of computational problems such as the Fourier transform, or for operations on finite sets of points in the real plane, by avoiding discussions of round-off errors. In this sense, our standardized system description features start with the declaration of abstract registers.

Abstract registers. For an SIMD system with abstract registers we assume that any register may store one real number at a time, without any special encoding tricks. For our theoretical considerations in this paper, it is not important to specify how the reals are stored in these abstract registers by special bit representations.

Standard register enumeration. We assume a unique enumeration of all registers as follows. For registers $r_{m}$ of the PE with index $j$ or $(j, k)$, called $\mathrm{PE}(j)$ or $\mathrm{PE}(j, k)$ in the sequel, we use the integer tuples $(j, m)$ or $(j, k, m)$, respectively, and for register $r_{m}$ of the CPU just the integer $m$.

Uniform network structure. Either $N_{\mathrm{IN}}=0$, or $N_{\mathrm{IN}}=p \geqq 1$ and the network structure is characterized by $p$ different functions $f_{0}, f_{1}, \ldots, f_{p-1}$ on the set ind of all PE indices in the following way. For $j, k \in \operatorname{ind}, \mathrm{PE}(j)$ and $\mathrm{PE}(k)$ are directly connected iff there exists an $i, 0 \leqq i \leqq p-1$, such that $f_{i}(j)=k$. Because of our assumption that all connections are two-way it follows that

$$
(\wedge j, k \in \text { ind })\left[(\vee i \in\{0,1, \ldots, p-1\}) f_{i}(j)=k \equiv(\vee h \in\{0,1, \ldots, p-1\}) \quad f_{h}(k)=j\right]
$$

In [10] the functions $f_{0}, f_{1}, \ldots, f_{p-1}$ were called interconnection functions. With the exception of a fixed set of PEs at the network border, we also require that all PEs are directly connected to exactly $p$ different PEs. When $f_{i}(j)=k$, $\mathrm{PE}(k)$ is called the $i$th neighbor of $\mathrm{PE}(j)$. In this way, the edge coding scheme for uniform networks is defined. For each PE, the neighborhood consists of all (i.e., at most $p$) neighbor PEs. Examples of infinite networks as well as finite networks matching our uniformity demand are given in Table 1. In the sequel we shall use these networks as defined here.

Some remarks are necessary regarding Table 1. The left-right $2^{i}$ (LR2I) network and the left-right-up-down $2^{i}$ (LRUD2I) network were used for vector machines in Pratt et al. [8] and Klette et al. [6], respectively, without the restriction by an integer $m$ as stated in Table 1. Note that we have restricted ourselves to interconnection networks with finite branching degree. The special form of the set ind in the Quadtree network is determined by our standard PE address masking scheme as defined later on. The finite uniform networks mentioned in Table 1 were studied by Siegel [10] - the perfect shuffle (PS), the ILLIAC, the Cube, the plus-minus $2^{i}$ (PM2I), and the wrap-around plus-minus $2^{i}$ (WPM2I) network, with the modification that the PS network is an undirected graph to match our uniform network convention, i.e., for the PS network the inverse shuffle function was added in comparison to [10]. For $j \in \mathbf{ind}=\left\{0,1, \ldots, 2^{m}-1\right\}$ let $a_{m-1} \ldots a_{1} a_{0}$ denote the binary representation of $j$ and $\bar{a}_{i}$ denote the complement of $a_{i}$. Then

$$
\begin{gathered}
\operatorname{exch}\left(a_{m-1} \ldots a_{1} a_{0}\right)=a_{m-1} \ldots a_{1} \bar{a}_{0}, \\
\operatorname{shuf}\left(a_{m-1} \ldots a_{1} a_{0}\right)=a_{m-2} \ldots a_{1} a_{0} a_{m-1}, \\
\operatorname{shuf}^{-1}\left(a_{m-1} \ldots a_{1} a_{0}\right)=a_{0} a_{m-1} \ldots a_{2} a_{1}, \\
\operatorname{cube}_{i}\left(a_{m-1} \ldots a_{i+1} a_{i} a_{i-1} \ldots a_{0}\right)=a_{m-1} \ldots a_{i+1} \bar{a}_{i} a_{i-1} \ldots a_{0}, \\
\text { WPM }_{+i}\left(a_{m-1} \ldots a_{i} \ldots a_{0}\right)=b_{m-1} \ldots b_{i} \ldots b_{0},
\end{gathered}
$$

where $b_{i-1} \ldots b_{0} b_{m-1} \ldots b_{i+1} b_{i}=\left(a_{i-1} \ldots a_{0} a_{m-1} \ldots a_{i+1} a_{i}\right)+1 \bmod 2^{m}$,

$$
\mathrm{WPM}_{-i}\left(a_{m-1} \ldots a_{i} \ldots a_{0}\right)=b_{m-1} \ldots b_{i} \ldots b_{0}
$$

where $b_{i-1} \ldots b_{0} b_{m-1} \ldots b_{i+1} b_{i}=\left(a_{i-1} \ldots a_{0} a_{m-1} \ldots a_{i+1} a_{i}\right)-1 \bmod 2^{m}$, for $0 \leqq i<m$ and $m \geqq 1$.
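The bit-level definitions above can be sketched in Python as follows. This is a minimal illustration, not part of the original paper: the helper names (`rot_left`, `rot_right`, `wpm`, `is_two_way`) and the argument order are assumptions made here. Indices are taken as non-negative integers below $2^{m}$, and `is_two_way` checks the two-way condition stated for uniform networks.

```python
def rot_left(j, i, m):
    """Rotate the m-bit word j left by i positions."""
    i %= m
    return ((j << i) | (j >> (m - i))) & ((1 << m) - 1)


def rot_right(j, i, m):
    """Rotate the m-bit word j right by i positions."""
    i %= m
    return ((j >> i) | (j << (m - i))) & ((1 << m) - 1)


def exch(j, m):
    """Complement the rightmost bit a_0."""
    return j ^ 1


def shuf(j, m):
    """Perfect shuffle: a_{m-1} moves to the rightmost position."""
    return rot_left(j, 1, m)


def shuf_inv(j, m):
    """Inverse shuffle: a_0 moves to the leftmost position."""
    return rot_right(j, 1, m)


def cube(i, j, m):
    """Complement bit a_i only."""
    return j ^ (1 << i)


def wpm(i, j, m, delta):
    """WPM_{+i} (delta=+1) or WPM_{-i} (delta=-1).

    The word is rotated so that a_i becomes the rightmost bit,
    delta is added modulo 2^m, and the rotation is undone.
    """
    rotated = rot_right(j, i, m)
    return rot_left((rotated + delta) % (1 << m), i, m)


def is_two_way(functions, indices):
    """True iff f_i(j) = k for some i implies f_h(k) = j for some h."""
    for j in indices:
        for f in functions:
            k = f(j)
            if not any(g(k) == j for g in functions):
                return False
    return True
```

For $m=3$, `is_two_way` returns `True` for the list `[exch, shuf, shuf_inv]` (each wrapped to fix `m`), but `False` when `shuf` is used alone, which illustrates why the inverse shuffle is added to obtain an undirected PS network.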

Table 1. Uniform networks

| Network | ind | $N_{\text{IN}}$ | Case | Edge code 0 | Edge code 1 | Edge code 2 | Edge code 3 | Edge code 4 | Edge code 5 | Edge code 6 | Edge code 7 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| LINEAR | integers | 2 | all | $j-1$ | $j+1$ | - | - | - | - | - | - |
| LR2I$^{m}$ | integers | $2m$ | all | $f_{2i}(j)=j+2^{i}$ and $f_{2i+1}(j)=j-2^{i}$ for $0 \leqq i<m$ and $m \geqq 2$ |  |  |  |  |  |  |  |
| BINTREE | positive integers | 3 | $j \geqq 2$ | $\lfloor j/2 \rfloor$ | - | - | - | - | - | - | - |
|  |  |  | all | - | $2j$ | $2j+1$ | - | - | - | - | - |
| TRIANGLE | positive integers | 5 | $j \geqq 2$ | $\lfloor j/2 \rfloor$ | - | - | - | - | - | - | - |
|  |  |  | all | - | $2j$ | $2j+1$ | - | - | - | - | - |
|  |  |  | $j \neq 2^{i}$ | - | - | - |  |  | - | - | - |
|  |  |  | $j \neq 2^{i}-1$ | - | - | - |  |  | - | - | - |
| QUADTREE | $\bigcup_{i=0}^{\infty}\left\{4^{i}, \ldots, 2 \cdot 4^{i}-1\right\}$ | 5 | $j \geqq 4$ | $\lfloor j/4 \rfloor$ | - | - | - | - | - | - | - |
|  |  |  | all | - | $4j$ | $4j+1$ | $4j+2$ | $4j+3$ | - | - | - |
| HEXAGONAL | tuples of integers | 3 | all | $(j, k-1)$ | $(j, k+1)$ | - | - | - | - | - | - |
|  |  |  | $j+k$ even | - | - | $(j-1, k)$ | - | - | - | - | - |
|  |  |  | $j+k$ odd | - | - | $(j+1, k)$ | - | - | - | - | - |
| SQUARE | tuples of integers | 4 | all | $(j, k-1)$ | $(j, k+1)$ | $(j-1, k)$ | $(j+1, k)$ | - | - | - | - |
| TRIAGONAL | tuples of integers | 6 | all | $(j, k-1)$ | $(j, k+1)$ | $(j-1, k)$ | $(j+1, k)$ | $(j-1, k-1)$ | $(j+1, k+1)$ | - | - |
| DIAGONAL | tuples of integers | 8 | all | $(j, k-1)$ | $(j, k+1)$ | $(j-1, k)$ | $(j+1, k)$ | $(j-1, k-1)$ | $(j+1, k+1)$ | $(j-1, k+1)$ | $(j+1, k-1)$ |
| LRUD2I$^{m}$ | tuples of integers | $4m$ | all | $f_{4i}(j, k)=\left(j+2^{i}, k\right)$, $f_{4i+1}(j, k)=\left(j-2^{i}, k\right)$, $f_{4i+2}(j, k)=\left(j, k+2^{i}\right)$, $f_{4i+3}(j, k)=\left(j, k-2^{i}\right)$, for $0 \leqq i<m$ and $m \geqq 2$ |  |  |  |  |  |  |  |
| PS$^{m}$ | $\left\{0,1, \ldots, 2^{m}-1\right\}$ | 3 | all | exch | shuf | shuf$^{-1}$ | - | - | - | - | - |
| ILLIAC$^{m}$ | $\left\{0,1, \ldots, 2^{m}-1\right\}$ | 4 | all | $+1 \bmod 2^{m}$ | $-1 \bmod 2^{m}$ | $+\frac{m}{2} \bmod 2^{m}$ | $-\frac{m}{2} \bmod 2^{m}$ | - | - | - | - |
| CUBE$^{m}$ | $\left\{0,1, \ldots, 2^{m}-1\right\}$ | $m$ | all | $f_{i}(j)=\operatorname{cube}_{i}(j)$, for $0 \leqq i<m$ |  |  |  |  |  |  |  |
| PM2I$^{m}$ | $\left\{0,1, \ldots, 2^{m}-1\right\}$ | $2m$ | all | $f_{2i}(j)=j+2^{i} \bmod 2^{m}$, $f_{2i+1}(j)=j-2^{i} \bmod 2^{m}$, for $0 \leqq i<m$ |  |  |  |  |  |  |  |
| WPM2I$^{m}$ | $\left\{0,1, \ldots, 2^{m}-1\right\}$ | $2m$ | all | $f_{2i}(j)=\mathrm{WPM}_{+i}(j)$, $f_{2i+1}(j)=\mathrm{WPM}_{-i}(j)$, for $0 \leqq i<m$ |  |  |  |  |  |  |  |

Standard PE masking scheme. As standard masks we shall use the simple bit patterns for PE indices as used, for example, in [10]. In the case of integer indices, a standard PE address mask is given by an arbitrary, non-empty word on the alphabet $\{0,1, x\}$ enclosed by brackets, where $x$ represents the "don't care" situation. The only PEs that will be active are those whose address (i.e., index) matches the mask from right to left, where the indices are given in binary representation; 0 matches 0, 1 matches 1, and either 0 or 1 matches $x$. For example, by mask $[x]$ all PEs are activated. For the representation of concrete standard masks within programs, etc. we take liberties such as [all PEs] instead of $[x]$, or [odd PEs] instead of $[1x]$ if the rightmost bit position is assumed to be the sign position. In the case of integer tuple indices, the standard PE address masks are arbitrary tuples of non-empty words on $\{0,1, x\}$ enclosed by brackets. Note that for infinite networks as given in Table 1 any given PE address mask activates infinitely many PEs. For example, the mask $[0xx]$ applied to the bintree network will activate the processing elements $\mathrm{PE}(2)$ and $\mathrm{PE}(3)$ on layer 1 of the bintree, disable layer 2, enable the first four PEs of layer 3, and so on, where the common binary representation of non-negative integers is assumed for the PE indices of the bintree network.
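The masking rule for integer indices can be sketched as follows. The function names `matches_mask` and `active_pes` are hypothetical. The sketch assumes non-negative indices, pads short indices with leading zeros, and ignores index bits to the left of the mask; this reading is inferred from the bintree example with mask $[0xx]$ and is not spelled out in the text.

```python
def matches_mask(index, mask):
    """Match a non-negative PE index against a mask word over {0, 1, x}.

    Matching runs from right to left: the rightmost mask symbol is
    compared with the least significant bit of the index.  Index bits
    beyond the mask length are treated as "don't care" (assumption).
    """
    if index < 0 or not mask or set(mask) - set("01x"):
        raise ValueError("need a non-negative index and a word over {0,1,x}")
    for position, symbol in enumerate(reversed(mask)):
        bit = (index >> position) & 1
        if symbol != "x" and int(symbol) != bit:
            return False
    return True


def active_pes(indices, mask):
    """Return the indices enabled by the mask, in their given order."""
    return [j for j in indices if matches_mask(j, mask)]
```

Under these assumptions, `active_pes(range(1, 16), "0xx")` selects PEs 2 and 3 on layer 1, none of PEs 4 to 7 on layer 2, and PEs 8 to 11 on layer 3, in agreement with the example; the root PE 1 is selected as well.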

Abstract CPU instruction set. For any one of our theoretical SIMD systems, we shall assume that its CPU instruction set may be obtained by special interpretation and selection of the instructions of an abstract CPU instruction reservoir defined as follows. There are two different types of instructions, parallel instructions for activating some of the PEs, and serial instructions where the CPU itself is addressed for certain activity. Any parallel instruction consists of a PE address mask, an operation code (READ, WRITE, LOAD, STORE, $\mathrm{OP}_{1}$, or $\mathrm{OP}_{l+1}$, $l \geqq 1$), and an operation address $\alpha$ where we shall use the standard register enumeration for explaining the meaning of these operation addresses. For the serial instructions, we assume branching instructions JUMP $b$, JGTZ $b$, JZERO $b$, JLTZ $b$ (where $b$ symbolizes an instruction number in a CPU program and the contents of the CPU accumulator are tested), the HALT instruction, and instructions consisting of an operation code (READ, WRITE, LOAD, STORE, $\mathrm{OP}_{1}$, or $\mathrm{OP}_{2}$). See Table 2 for the complete abstract CPU instruction set without jump and stop instructions.

Table 2. Abstract CPU instruction set without test and stop instructions

In the case of a parallel instruction, $\mathrm{OP}_{1}$ denotes a unary operation determining the new accumulator contents of all activated PEs by a certain transformation of the contents of the register addressed by $\alpha$ as well as the old accumulator contents of the activated PEs; and $\mathrm{OP}_{l+1}$ denotes an $(l+1)$-ary operation in the same sense. For the activated $\mathrm{PE}(j)$ the operation address $m$ indicates the contents of register $(j, m)$; ${}^{*}m$ indicates the contents of register $(j, n)$ if the nonnegative integer $n$ is the contents of register $(j, m)$ at that moment (i.e., indirect operand addressing; in any situation of incorrect programming, e.g., in the case that $(j, m)$ does not have a nonnegative integer contents at that moment, an interrupt of the programmed system is assumed); and the operand $: i_{1}, i_{2}, \ldots, i_{l}$ for $l \geqq 1$ indicates the contents of the accumulators of those neighbors of the activated PEs that are encoded by $i_{1}, i_{2}, \ldots, i_{l}$ according to the edge coding scheme of the interconnection network. LOAD and STORE have the obvious meanings that the accumulator contents of the activated PEs are replaced by the addressed value, or copied to the addressed registers, respectively. READ and WRITE denote the necessary operations for communication with the outside world where the source and the destination of the data in the "outside world" remain unspecified (certain places within a computing environment not belonging to the given SIMD system itself).

In the case of a serial instruction, the unary operation $\mathrm{OP}_{1}$ and the binary operation $\mathrm{OP}_{2}$ produce new CPU accumulator contents by a certain transformation of the addressed values, where in the case of $\mathrm{OP}_{2}$ the old CPU accumulator contents is used as the operand in the first position. READ, WRITE, LOAD, and STORE have the obvious fixed meanings. The operands $=x$, $m$, ${}^{*}m$, and $(j)$ indicate the data unit $x$ itself, the contents of CPU register $m$, the contents of CPU register $n$ if register $m$ contains the nonnegative number $n$ at that moment, and the contents of register $(j, 0)$, respectively. Note that with this abstract CPU instruction set data transfer between the CPU and the PEs is possible via the accumulators in serial mode only. Furthermore, for a specialized SIMD model, it is convenient to identify the basic computational power of the PEs and the CPU with that of the RAM as represented by the RAM instruction set [2, Fig. 1.5], roughly speaking. In this way, an interesting point is provided by the description of how the PEs are able to perform local logical decisions in SIMD mode as we shall explain in Example 1 by equation (1) for a special SIMD model.
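The three operand forms of a parallel instruction (direct $m$, indirect ${}^{*}m$, and neighbor operand $: i_{1}, \ldots, i_{l}$) may be illustrated by a small resolver for one activated PE. This is only a sketch under assumptions made here: operands are passed as strings such as `"3"`, `"*3"`, or `":0,2"`, registers are kept in a dictionary whose missing entries read as zero, and the names `resolve_operand` and `ProgramInterrupt` are hypothetical.

```python
class ProgramInterrupt(Exception):
    """Models the interrupt assumed for incorrect indirect addressing."""


def resolve_operand(operand, own_registers, neighbor_accumulators):
    """Return the list of values addressed by one operand of one PE.

    own_registers:          dict m -> contents of register (j, m)
    neighbor_accumulators:  dict edge code i -> accumulator of neighbor i
    """
    if operand.startswith(":"):
        # Neighbor operand ":i1,...,il": accumulators of the coded neighbors.
        codes = [int(code) for code in operand[1:].split(",")]
        return [neighbor_accumulators[code] for code in codes]
    if operand.startswith("*"):
        # Indirect operand "*m": register (j, m) must hold a nonnegative integer n.
        n = own_registers.get(int(operand[1:]), 0)
        if n < 0 or not float(n).is_integer():
            raise ProgramInterrupt("indirect address is not a nonnegative integer")
        return [own_registers.get(int(n), 0)]
    # Direct operand "m": contents of register (j, m).
    return [own_registers.get(int(operand), 0)]
```

A unary or $(l+1)$-ary operation would then combine the returned values with the old accumulator contents; that step depends on the concrete instruction set and is left out here.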

Off-line I/O convention. For the off-line communication of an SIMD system with the outside world we assume that a special set of input registers of the system is fixed such that all other registers of the system contain value zero at the beginning of any computation (moment $t=0$ ) as it is assumed for those input registers not actually needed for the placement of input data. Each of the input registers may contain at most one data unit of the input data. Thus, for concrete problem solutions, it is necessary to specify

- what data structure is assumed for the given input data, and
- how the data are placed in the given input register set.

Also, a set of output registers of the system must be fixed. In this sense, for concrete problem solutions it has to be clear

- what is the desired data structure for the output data, and
- how this data structure has to be stored, or computed in the predetermined output register set.

As off-line I/O convention we declare that for a certain $L$, $1 \leqq L \leqq D_{\text{CPU}}$, the CPU registers $0,1, \ldots, L-1$ are fixed to be input and output registers, and for any $\mathrm{PE}(j)$, if there exists a certain $m \geqq 0$ such that register $(j, m)$ is fixed to be an input register (output register) then register $(j, 0)$ is an input register (output register) as well. What is true for the register holds for the accumulator, too.

On-line I/O convention. For the on-line communication of an SIMD system with the outside world some registers are predetermined to act as input and/or output registers. As on-line I/O convention we adopt the same rules as in the off-line case. But, at the beginning of any on-line computation (moment $t=0$), all registers of the system are assumed to hold value zero. Input data or output data may enter or leave the system at a moment as specified by the CPU program according to READ or WRITE instructions. In any correct program these input (output) instructions have to be addressed to a proper subset of all registers specified as input (output) registers. For the input (output) data it is assumed that there exists a memory facility in the outside world from where (to where) the input (output) data are obtained (given) by the system. Thus, for concrete problem solutions it is necessary to specify

- what data structures are assumed for the input and output data, and
- how these data are partitioned into waves of information such that one wave may enter (leave) the system per input (output) operation as performed according to the CPU program.

The size of these waves of information, i.e., the number of data units forming those waves, may alter during a computation process, and just one data unit, for example by LOAD $=x$, will be considered to be the simplest case of a wave of information.

Uniform cost criterion. For measuring the time complexity of computations, we assume that any (basic) instruction of the SIMD system needs one unit of time for performance on this system.

Definition 1. A model of computation SYS is called a standard off-line network system (SYS $\in$ OFF-NET) iff SYS is defined by

- a CPU and a fixed set of indexed PEs, with concrete values for $D_{\text {CPU }}$ and $D_{\text {PE }}$,
- abstract registers if not otherwise specified, and the standard register enumeration,
- a uniform interconnection network with $0 \leqq N_{\text {IN }}<\infty$,
- the standard PE masking scheme,
- a special interpretation and selection of instructions of the abstract CPU instruction set where
  - (OFF.1) no READ and WRITE instructions are contained in the instruction set of SYS,
  - (OFF.2) for the CPU all RAM instructions [2, Fig. 1.5] except READ and WRITE are available,
  - (OFF.3) for $N_{\mathrm{IN}}=p \geqq 1$ at least one instruction of the type [all PEs] $\mathrm{OP}_{p+1}: 0,1, \ldots, p-1$ is available, and
  - (OFF.4) for any output register $(j, 0)$, i.e., accumulator of $\mathrm{PE}(j)$, at least one instruction of the type $\mathrm{OP}_{2}(j)$ is available, i.e., the CPU may have control of any outputting PE,
- the off-line I/O convention, and
- the uniform cost criterion.

For the defined class OFF-NET we may define subclasses - e.g., OFF-NET$_{p}$ to be the set of all SYS $\in$ OFF-NET having the branching degree $p=N_{\mathrm{IN}}$, OFF-SQUARE to be the set of all SYS $\in$ OFF-NET having a square network as defined in Table 1, OFF-BINTREE with the same reference to Table 1, OFF-PS $=\bigcup_{m=1}^{\infty}$ OFF-PS$^{m}$, or just OFF-RAM.

Example 1. Let us consider the following special SIMD system EXAMP1 $\in$ OFF-SQUARE. Let $D_{\mathrm{CPU}}=D_{\mathrm{PE}}=\infty$. In addition to the CPU registers $0,1, \ldots, L-1$ for a certain $L \geqq 1$, all the accumulators $(j, k, 0)$, $0 \leqq j<M$ and $0 \leqq k<N$ for some $M, N \geqq 1$, are fixed as input and output registers of EXAMP1. The system possesses the following instruction set:

- [mask] ADD $\alpha$, $\alpha$ for $m$, ${}^{*}m$, $: i_{1}, \ldots, i_{l}$ for $i_{1}, \ldots, i_{l} \in\{0,1,2,3\}$,
- [mask] $\mathrm{OP}_{l}\ \alpha$, $\alpha$ for $m$, ${}^{*}m$, $: i$ for $i \in\{0,1,2,3\}$, $l=1,2$,
- [mask] LOAD $\alpha$, $\alpha$ for $m$, ${}^{*}m$, $: i$ for $i \in\{0,1,2,3\}$,
- [mask] STORE $\alpha$, $\alpha$ for $m$, ${}^{*}m$, $: i_{1}, \ldots, i_{l}$ for $i_{1}, \ldots, i_{l} \in\{0,1,2,3\}$,
- LOAD $\alpha$, $\alpha$ for $=x$, $m$, ${}^{*}m$, $(j, k)$,
- STORE $\alpha$, $\alpha$ for $m$, ${}^{*}m$, $(j, k)$,
- $\mathrm{OP}_{2}\ \alpha$, $\alpha$ for $=x$, $m$, ${}^{*}m$, $(j, k)$,
- JUMP $b$, JGTZ $b$, JZERO $b$, JLTZ $b$, and HALT.

Here, [mask] represents an arbitrary PE address mask, $\mathrm{OP}_{1}$ is ABS (absolute value) or SIGN (signum function), $\mathrm{OP}_{2}$ is ADD, SUB, MULT, or DIV, for the tuples $(j, k)$ with $0 \leqq j<M$ and $0 \leqq k<N$.

To give a short illustration of the computing power of EXAMP1 let us consider the computation of the parallel Roberts gradient (cp. [9] for its importance to digital image processing), where the input image $A=\left(a_{j k}\right)$ of size $M \times N$ is assumed to be stored in the PE input registers ($a_{j k}$ in register $(j, k, 0)$) at the beginning of the computation. At the end of the computation, value $\max \left\{\left|a_{j k}-a_{j+1, k+1}\right|,\left|a_{j+1, k}-a_{j, k+1}\right|\right\}$ has to be present in register $(j, k, 0)$.

By performing the following sequence of parallel instructions,

1. [all PEs] STORE 1
2. [all PEs] LOAD :2
3. [all PEs] STORE 2
4. [all PEs] LOAD :1
5. [all PEs] SUB 1
6. [all PEs] ABS 0
7. [all PEs] STORE 3
8. [all PEs] LOAD 1
9. [all PEs] LOAD :1
10. [all PEs] SUB 2
11. [all PEs] ABS 0
12. [all PEs] STORE 4

all registers $(j, k, 3)$ contain value $\left|a_{j k}-a_{j+1, k+1}\right|$, and all registers $(j, k, 4)$ contain value $\left|a_{j+1, k}-a_{j, k+1}\right|$, for $0 \leqq j<M$ and $0 \leqq k<N$. These values may be considered as two $M \times N$ matrices $B$ and $C$. For $\max (B, C)=\left(\max \left\{b_{j k}, c_{j k}\right\}\right)$ we have

$$
\begin{equation*}
\max (B, C)=B \times \operatorname{sign}(B-C)+C \times \operatorname{sign}(C-B)+B-B \times \operatorname{sign}|B-C| \tag{1}
\end{equation*}
$$

where $\times$ means the parallel MULT operation (i.e., the elementwise product of two matrices), and sign the parallel SIGN operation. Using this formula, the parallel Roberts gradient may be computed on the defined special OFF-SQUARE system within time 29 or less, independent of the values of $M$ and $N$, as the reader may check easily. Note that formula (1) describes a way in which the PEs are able to perform local logical decisions in SIMD mode.
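The computation of Example 1 can be traced with a small simulation. It is a sketch under several assumptions that are not fixed by the text: the twelve instructions are executed in the order STORE 1, LOAD :2, STORE 2, LOAD :1, SUB 1, ABS 0, STORE 3, LOAD 1, LOAD :1, SUB 2, ABS 0, STORE 4; accumulators outside the $M \times N$ image read as zero; and the SIGN operation in formula (1) is read as the indicator of a positive argument (1 for $x>0$, 0 otherwise), because with a three-valued signum that returns $-1$ for negative arguments the formula does not yield the maximum. With the edge coding of Table 1, direction 2 addresses $\mathrm{PE}(j-1, k)$, so the simulated registers pair row $j$ with row $j-1$; the orientation $j+1$ used in the text would correspond to direction 3. All function names are hypothetical.

```python
# Edge coding of the SQUARE network (Table 1): code -> (dj, dk).
SQUARE = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}

# The twelve parallel instructions, each issued with mask [all PEs].
PROGRAM = [("STORE", 1), ("LOAD", ":2"), ("STORE", 2), ("LOAD", ":1"),
           ("SUB", 1), ("ABS", 0), ("STORE", 3), ("LOAD", 1),
           ("LOAD", ":1"), ("SUB", 2), ("ABS", 0), ("STORE", 4)]


def run(image, program=PROGRAM):
    """Execute the program synchronously; return {(j, k): [r0, ..., r4]}."""
    rows, cols = len(image), len(image[0])
    regs = {(j, k): [image[j][k], 0, 0, 0, 0]
            for j in range(rows) for k in range(cols)}
    for op, operand in program:
        # Snapshot: every PE sees the accumulators as they were before the step.
        acc = {pe: r[0] for pe, r in regs.items()}
        for (j, k), r in regs.items():
            if isinstance(operand, str):      # ":i" = accumulator of neighbor i
                dj, dk = SQUARE[int(operand[1:])]
                value = acc.get((j + dj, k + dk), 0)   # assumption: zero outside
            else:                             # m = own register m
                value = r[operand]
            if op == "STORE":
                r[operand] = r[0]
            elif op == "LOAD":
                r[0] = value
            elif op == "SUB":
                r[0] = r[0] - value           # old accumulator is first operand
            elif op == "ABS":
                r[0] = abs(value)
    return regs


def step(x):
    """Assumed reading of SIGN: 1 for positive arguments, 0 otherwise."""
    return 1 if x > 0 else 0


def signum(x):
    """Three-valued signum, shown for comparison only."""
    return (x > 0) - (x < 0)


def formula_max(b, c, sign):
    """Right-hand side of formula (1) for one matrix position."""
    return b * sign(b - c) + c * sign(c - b) + b - b * sign(abs(b - c))


def roberts(image):
    """Combine registers 3 and 4 of every PE by formula (1)."""
    regs = run(image)
    return [[formula_max(regs[(j, k)][3], regs[(j, k)][4], step)
             for k in range(len(image[0]))] for j in range(len(image))]
```

For the image `[[1, 2], [3, 5]]`, the PE at $(1,0)$ ends with register 3 holding $|2-3|=1$ and register 4 holding $|5-1|=4$, and `roberts` returns 4 at that position.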

Example 2. By some easily described modifications, the system EXAMP1 may be altered dramatically. Replace the square network by LRUD2I$^{m}$, for $m<\max \left\{\log _{2} M, \log _{2} N\right\}$, let $W_{\mathrm{PE}}=1$, and replace the parallel operations ADD, $\mathrm{OP}_{1}$ and $\mathrm{OP}_{2}$ by logical operations AND, NOT, and OR, respectively. What results is a special OFF-LRUD2I$^{m}$ system EXAMP2 which essentially coincides with the PBS (paralleles Binärbildverarbeitungssystem, i.e., parallel binary image processing system). The computational power of the PBS was extensively studied in [4].

Definition 2. A model of computation SYS is called a standard on-line network system (SYS $\in$ ON-NET) iff SYS is defined by

- a CPU and a fixed set of indexed PEs, with concrete values for $D_{\text {CPU }}$ and $D_{\text {PE }}$,
- abstract registers if not otherwise specified, and the standard register enumeration,
- a uniform interconnection network with $0 \leqq N_{\text {IN }}<\infty$,
- the standard PE masking scheme,
- a special interpretation and selection of instructions of the abstract CPU instruction set where, for $N_{\mathrm{IN}} \geqq 2$, an integer tuple $(p, q)$ may be denoted to be the characteristic of SYS in the following sense:
  - (ON.1) $p=N_{\mathrm{IN}}$ and $1 \leqq q<p$,
  - (ON.2) a proper subset $\left\{i_{1}, i_{2}, \ldots, i_{q}\right\}$ of all directions $\{0,1, \ldots, p-1\}$ is specified,
  - (ON.3) at least one instruction of the type [all PEs] $\mathrm{OP}_{q+1}: i_{1}, i_{2}, \ldots, i_{q}$ is available,
  - (ON.4) for any of the instructions [mask] LOAD $: j$ or [mask] $\mathrm{OP}_{k+1}: j_{1}, j_{2}, \ldots, j_{k}$, $k \geqq 1$, it follows that $j, j_{1}, j_{2}, \ldots, j_{k} \in\left\{i_{1}, i_{2}, \ldots, i_{q}\right\}$,
  - (ON.5) for any of the instructions [mask] STORE $: j_{1}, j_{2}, \ldots, j_{k}$, $k \geqq 1$, it follows that $j_{1}, j_{2}, \ldots, j_{k} \in\{0,1, \ldots, p-1\}-\left\{i_{1}, i_{2}, \ldots, i_{q}\right\}$, i.e., the results of consecutive parallel operations may be shifted through the system in directions $\{0,1, \ldots, p-1\}-\left\{i_{1}, i_{2}, \ldots, i_{q}\right\}$ only, and, furthermore,
  - (ON.6) for the CPU all RAM instructions are available including READ and WRITE,
  - (ON.7) for any output register $(j, 0)$, at least one instruction of the type $\mathrm{OP}_{2}(j)$ is available,
- the on-line I/O convention, and
- the uniform cost criterion.
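Conditions (ON.1) to (ON.5) amount to a simple check on the neighbor directions used by the parallel instructions. The following sketch assumes a hypothetical representation in which each parallel instruction with a neighbor operand is a pair of operation code and direction tuple; the function names are chosen here for illustration only.

```python
def direction_violations(p, in_dirs, instructions):
    """List the instructions whose directions contradict (ON.4) or (ON.5).

    p:            branching degree N_IN
    in_dirs:      the specified directions {i_1, ..., i_q}
    instructions: iterable of (opcode, directions) pairs, e.g. ("LOAD", (0,))
    """
    in_dirs = set(in_dirs)
    all_dirs = set(range(p))
    # (ON.1) and (ON.2): a non-empty proper subset of {0, ..., p-1}.
    if not (1 <= len(in_dirs) < p and in_dirs < all_dirs):
        raise ValueError("need a non-empty proper subset of {0, ..., p-1}")
    out_dirs = all_dirs - in_dirs
    violations = []
    for opcode, directions in instructions:
        # STORE may use only the complementary directions (ON.5);
        # LOAD and OP may use only the specified ones (ON.4).
        allowed = out_dirs if opcode == "STORE" else in_dirs
        if not set(directions) <= allowed:
            violations.append((opcode, directions))
    return violations


def has_full_operation(in_dirs, instructions):
    """(ON.3): some non-LOAD, non-STORE operation reads all q directions."""
    return any(opcode not in ("LOAD", "STORE") and set(dirs) == set(in_dirs)
               for opcode, dirs in instructions)
```

For a system with characteristic $(4,2)$ and specified directions $\{0,2\}$, an instruction STORE $:0$ would be reported as a violation, whereas STORE $:1,3$ would not.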

For the defined class ON-NET we may define subclasses - e.g., ON-NET$_{p \wedge q}$ to be the set of all ON-NET systems with characteristic $(p, q)$, ON-LR2I$^{m}$ to be the set of all SYS $\in$ ON-NET having a left-right $2^{i}$ network as defined in Table 1, ON-ILLIAC$^{m}$ with the same reference to Table 1, ON-PM2I $=\bigcup_{m=1}^{\infty}$ ON-PM2I$^{m}$, or just ON-RAM.

Any infinite network class OFF-LINEAR or ON-DIAGONAL may be considered as an abstraction of a finite network system, or as the union of classes of finite network systems in the following way.

Definition 3. Let OFF-IN be the set of all OFF-NET systems which are defined by a special infinite network IN, e.g., IN = LINEAR or IN = LRUD2I ${ }^{m}$. A model of computation SYS is called a finite OFF-IN system (SYS $\in$ FIN-OFF-IN) iff there exists a system SYS $_{0} \in O F F-I N$ such that SYS may be obtained as a restriction of $\mathrm{SYS}_{0}$ in the following sense:

Let ind ${ }_{0}$ and $D_{\text {PE }}^{0}$ be the PE index set and the PE memory depth for SYS $_{0}$, respectively. A finite cut-off of the PE register set of $\mathrm{SYS}_{0}$ is defined by a certain finite subset ind of ind ${ }_{0}$ and a (possibly infinite) memory depth $D_{\mathrm{PE}} \leqq D_{\mathrm{PE}}^{\mathbf{0}}$. The work of SYS may be described as follows. All registers in a certain finite cut-off of SYS ${ }_{0}$ are available in SYS but all registers not in this finite cut-off will be considered to be dummy registers, i.e., they are assumed to store value zero if addressed as an operand, and to "forget" any value handed over to them; this is the only difference between SYS $_{0}$ and SYS.
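The dummy-register convention of this definition can be sketched as a small register file. This is only an illustrative sketch: the class name `FiniteCutOff`, its methods, and the assumption that the registers of a PE are numbered $0, 1, \ldots, D_{\mathrm{PE}}-1$ are hypothetical and not taken from the definition itself.

```python
class FiniteCutOff:
    """Register file of a finite system SYS cut out of an infinite system SYS_0.

    Hypothetical helper: registers are addressed as (j, m), with PE index j
    and register number m, assumed to range over 0 .. depth-1.
    """

    def __init__(self, ind, depth):
        self.ind = frozenset(ind)   # finite PE index set, a subset of ind_0
        self.depth = depth          # PE memory depth D_PE (may be float("inf"))
        self.values = {}            # contents of the real registers, default 0

    def is_real(self, j, m):
        # A register belongs to the finite cut-off iff its PE index is in ind
        # and its register number is below the memory depth.
        return j in self.ind and 0 <= m < self.depth

    def read(self, j, m):
        # A dummy register always delivers the value zero as an operand.
        return self.values.get((j, m), 0) if self.is_real(j, m) else 0

    def write(self, j, m, value):
        # A dummy register "forgets" any value handed over to it.
        if self.is_real(j, m):
            self.values[(j, m)] = value
```

For instance, with `ind = range(1, 8)` and `depth = 2`, a value written to register `(3, 0)` can be read back, whereas a value written to `(9, 0)` is lost and reads as zero.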

Analogously the set FIN-ON-IN may be defined.
Example 3. An example of a FIN-ON-BINTREE system may be specified as follows. Let $D_{\mathrm{CPU}}=\infty$ and $D_{\mathrm{PE}}=m \geqq 2$. The finite cut-off of the bintree network is given by ind $=\left\{1,2, \ldots, 2^{m}-1\right\}$. In addition to the CPU accumulator which acts as an input and output register $(L=1)$, the registers $\left(2^{m-1}, 0\right)$, $\left(2^{m-1}+1,0\right), \ldots,\left(2^{m}-1,0\right)$, i.e., the accumulators of the $2^{m-1}$ leaf node PEs, are fixed as input registers, and register $(1,0)$, i.e., the accumulator of the top node PE, is fixed as an output register. The system possesses the following instruction set:
[mask] ADD $\alpha, \alpha$ for $m,{ }^{*} m,: 1,: 2,: 1,2$,
[mask] $\mathrm{OP}_{l} \alpha, \alpha$ for $m,{ }^{*} m,: 1,: 2$ and $l=1,2$,
[mask] LOAD $\alpha, \alpha$ for $m,{ }^{*} m,: 1,: 2$,
[mask] STORE $\alpha, \alpha$ for $m,{ }^{*} m,: 0$,
[subset leaf nodes] READ 0,
[top node] WRITE 0,
LOAD $\alpha, \alpha$ for $=x, m,{ }^{*} m$, (1), STORE $\alpha, \alpha$ for $m,{ }^{*} m,(1)$, $\mathrm{OP}_{l} \alpha, \alpha$ for $=x, m,{ }^{*} m,(1)$, and $l=1,2$, READ 0, WRITE $\alpha, \alpha$ for $=x, 0$,
JUMP $b$, JGTZ $b$, JZERO $b$, JLTZ $b$, HALT.
Here, [mask] represents an arbitrary PE address, $\mathrm{OP}_{1}$ either ABS or SIGN, $\mathrm{OP}_{2}$ one of the operation codes ADD, SUB, MULT, or DIV. Altogether, a FIN-ON-BINTREE system EXAMP3 is defined which may be obtained by a restriction of an infinite ON-BINTREE model where infinite sets of input and output PE registers are available in the infinite origin.

To give a short illustration of the computational power of the system EXAMP3, let us consider the computation of the arithmetical average $\frac{1}{N} \sum_{i=0}^{N-1} a_{i}, N=2^{n-1}$ and $n$ odd, for $M$ consecutive waves of information $\left(a_{0}, a_{1}, \ldots, a_{N-1}\right)$, where $a_{i}$ is fed to the accumulator of $\operatorname{PE}\left(2^{n-1}+i\right)$, for $i=0,1, \ldots, N-1$. The arithmetical averages have to leave the system via register $(1,0)$ in the order of the $M$ consecutive waves of information.

For initialization of the system, the instructions LOAD $=N$, STORE (1), [top node] STORE 1 are performed first, in this order. For $M \geqq(n-1) / 2$ the following sequence of instructions is executed $(n-1) / 2$ times:
[leaf nodes] READ 0,
[all PEs] ADD : 1, 2,
[leaf nodes] LOAD 1,
[all PEs] ADD : 1, 2,
followed by the following sequence of instructions which is executed $M-[(n-1) / 2]$ times:
[top node] DIV 1,
[top node] WRITE 0,
[leaf nodes] READ 0,
[all PEs] ADD : 1, 2,
[leaf nodes] LOAD 1,
[all PEs] ADD : 1, 2.

Finally, the following sequence of instructions is executed $(n-3) / 2$ times:

$$
\begin{array}{ll}
\text { [top node] } & \text { DIV 1, } \\
\text { [top node] } & \text { WRITE 0, } \\
\text { [all PEs] } & \text { ADD :1,2, } \\
\text { [all PEs] } & \text { ADD :1,2, }
\end{array}
$$

followed by the last two instructions [top node] DIV 1 and [top node] WRITE 0. Thus, altogether, the arithmetic averages of $M \geqq(n-1) / 2$ consecutive waves of information ( $a_{0}, a_{1}, \ldots, a_{N-1}$ ) may be computed within $6 M+n$ basic operations of EXAMP3, instead of $O(N \cdot M)$ basic operations in the serial case using a RAM as model for computation.
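The count of $6 M+n$ basic operations can be retraced by adding up the phases of the program described above. The following sketch only does this bookkeeping; the function name and the phase labels are illustrative, and it assumes that the final loop consists of four instructions per pass, as in the listing above.

```python
def examp3_instruction_count(n, M):
    """Number of basic operations of the averaging program on EXAMP3.

    Assumes n odd, n >= 3, and M >= (n - 1) / 2 as required in the text.
    """
    if n % 2 != 1 or n < 3 or 2 * M < n - 1:
        raise ValueError("need n odd, n >= 3, and M >= (n - 1) / 2")
    half = (n - 1) // 2
    init = 3                        # LOAD =N, STORE (1), [top node] STORE 1
    fill = 4 * half                 # READ, ADD, LOAD, ADD, executed (n-1)/2 times
    steady = 6 * (M - half)         # DIV, WRITE, READ, ADD, LOAD, ADD
    drain = 4 * ((n - 3) // 2)      # four instructions per pass (assumed)
    final = 2                       # [top node] DIV 1, [top node] WRITE 0
    return init + fill + steady + drain + final
```

For $n=5$ and $M=10$ this gives $3+8+48+4+2=65=6 M+n$.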

In conclusion, we point out that SIMD now denotes not a general concept (single-instruction, multiple data) but an exactly defined class of models for computation, namely the union of all system classes given by Definitions 1, 2, and 3.

## 2. Local, global, and total data flow measures

Let $S Y S \in$ SIMD; throughout this paper such a special parallel processing system will be used as a standard system for considerations of data transfer restrictions in computing systems. Any computational process performed on such a model SYS may be uniquely specified by a CPU program $\pi$ and a concrete input situation $I$ characterized by the placement of input values into the set of input registers if off-line mode is used, or by the partition of the input data into consecutive waves of information fed to some of the input registers of the system from the outside world if on-line mode is used.

As suggested by applications to visual perception, the set of input registers of the model SYS may be considered as the retina of the system, and any new wave of information to this set of input registers represents a snapshot of the outside world. In this sense, after $t$ steps of a computational process characterized by a program $\pi$ and an input situation $I$, for any register $r$ of the system we may mark out a certain receptive field $\operatorname{rec}_{\pi}^{I}(r, t)$ containing all the names of those input registers which have had any influence on the contents of register $r$ up to the moment $t$, where new waves of information to the retina of the system create new names of the input registers, formally represented by $r^{(0)}, r^{(1)}, r^{(2)}, \ldots, r^{(i)}, \ldots$ for register $r$.

Standard register names. At time $t=0$ of any computational process, each register $r$ in our standard enumeration possesses the name $r^{(0)}$. At $t=0$ let the wave number $W N=0$ also. At time $t+1$ assume that a serial or parallel READ instruction, or an instruction $\mathrm{LOAD}=x, \mathrm{OP}_{1}=x$, or $\mathrm{OP}_{2}=x$ has to be performed. Then, by this operation we obtain $W N \leftarrow W N+1$ and the new names $r^{(W N)}$ for all registers which were addressed by these instructions. Examples are the name $(j, c(j, m))^{(W N)}$ in the case of an instruction [mask] READ ${ }^{*} m$ for all activated processing elements $\mathrm{PE}(j)$, where $c(j, m)$ denotes the actual contents of register $(j, m)$, or the name $0^{(W N)}$ in the case of an instruction $\mathrm{OP}_{2}=x$.

Definition 4. Let SYS $\in$ SIMD. Standard register names are assumed. For a program $\pi$ of SYS, an input situation $I$ of SYS, a register $r$ of SYS, and an arbitrary moment $t \geqq 0$, the receptive field $\operatorname{rec}_{\pi}^{I}(r, t)$ is recursively defined as follows:
moment $t=0$ :

$$
\operatorname{rec}_{\pi}^{I}(r, 0)= \begin{cases}\left\{r^{(0)}\right\} & \text { if input register } r \text { stores an input value according to } I \text { (off-line mode), } \\ \emptyset & \text { otherwise. }\end{cases}
$$

moment $t+1, t \geqq 0$ :
At moment $t+1$ a certain instruction has to be applied according to $\pi$ and $I$, or the HALT instruction is assumed for this moment.
(i) Depending on this instruction, if it is one of those listed in Table 3, the changes of receptive fields are defined as given in this Table where we omit the indices $\pi$ and $I$ for simplification of the expressions. In the case of parallel instructions, the mentioned changes are valid for all activated PEs PE $(j)$ where $j$ matches [mask].

Table 3. Changes of receptive fields in step $t+1$

| Instructions | Changes of receptive fields |
| :---: | :---: |
| [mask] $\mathrm{OP}_{1}\ m$ | $\operatorname{rec}((j, 0), t+1)=\operatorname{rec}((j, m), t)$ |
| [mask] $\mathrm{OP}_{1}\ {}^{*} m$ | $\operatorname{rec}((j, 0), t+1)=\operatorname{rec}((j, m), t) \cup \operatorname{rec}((j, c(j, m)), t)$ |
| [mask] $\mathrm{OP}_{1}: i$ | $\operatorname{rec}((j, 0), t+1)=\operatorname{rec}\left(\left(f_{i}(j), 0\right), t\right)$ |
| [mask] $\mathrm{OP}_{2}\ m$ | $\operatorname{rec}((j, 0), t+1)=\operatorname{rec}((j, 0), t) \cup \operatorname{rec}((j, m), t)$ |
| [mask] $\mathrm{OP}_{2}\ {}^{*} m$ | $\operatorname{rec}((j, 0), t+1)=\operatorname{rec}((j, 0), t) \cup \operatorname{rec}((j, m), t) \cup \operatorname{rec}((j, c(j, m)), t)$ |
| [mask] $\mathrm{OP}_{l+1}: i_{1}, i_{2}, \ldots, i_{l}$ | $\operatorname{rec}((j, 0), t+1)=\operatorname{rec}((j, 0), t) \cup \operatorname{rec}\left(\left(f_{i_{1}}(j), 0\right), t\right) \cup \operatorname{rec}\left(\left(f_{i_{2}}(j), 0\right), t\right) \cup \ldots \cup \operatorname{rec}\left(\left(f_{i_{l}}(j), 0\right), t\right)$ |
| [mask] STORE $m$ | $\operatorname{rec}((j, m), t+1)=\operatorname{rec}((j, 0), t)$ |
| [mask] STORE ${}^{*} m$ | $\operatorname{rec}((j, c(j, m)), t+1)=\operatorname{rec}((j, 0), t) \cup \operatorname{rec}((j, m), t)$ |
| [mask] STORE : $i_{1}, i_{2}, \ldots, i_{l}$ | $\operatorname{rec}\left(\left(f_{i_{1}}(j), 0\right), t+1\right)=\operatorname{rec}((j, 0), t)$, $\operatorname{rec}\left(\left(f_{i_{2}}(j), 0\right), t+1\right)=\operatorname{rec}((j, 0), t)$, $\ldots$, $\operatorname{rec}\left(\left(f_{i_{l}}(j), 0\right), t+1\right)=\operatorname{rec}((j, 0), t)$ |
| [mask] READ $m$ | $\operatorname{rec}((j, m), t+1)=\left\{(j, m)^{(W N)}\right\}$ |
| [mask] READ ${}^{*} m$ | $\operatorname{rec}((j, c(j, m)), t+1)=\operatorname{rec}((j, m), t) \cup\left\{(j, c(j, m))^{(W N)}\right\}$ |
| $\mathrm{OP}_{1}=x$ | $\operatorname{rec}(0, t+1)=\left\{0^{(W N)}\right\}$ |
| $\mathrm{OP}_{1}\ m$ | $\operatorname{rec}(0, t+1)=\operatorname{rec}(m, t)$ |
| $\mathrm{OP}_{1}\ {}^{*} m$ | $\operatorname{rec}(0, t+1)=\operatorname{rec}(m, t) \cup \operatorname{rec}(c(m), t)$ |
| $\mathrm{OP}_{1}\ (j)$ | $\operatorname{rec}(0, t+1)=\operatorname{rec}((j, 0), t)$ |
| $\mathrm{OP}_{2}=x$ | $\operatorname{rec}(0, t+1)=\operatorname{rec}(0, t) \cup\left\{0^{(W N)}\right\}$ |
| $\mathrm{OP}_{2}\ m$ | $\operatorname{rec}(0, t+1)=\operatorname{rec}(0, t) \cup \operatorname{rec}(m, t)$ |
| $\mathrm{OP}_{2}\ {}^{*} m$ | $\operatorname{rec}(0, t+1)=\operatorname{rec}(0, t) \cup \operatorname{rec}(m, t) \cup \operatorname{rec}(c(m), t)$ |
| $\mathrm{OP}_{2}\ (j)$ | $\operatorname{rec}(0, t+1)=\operatorname{rec}(0, t) \cup \operatorname{rec}((j, 0), t)$ |
| STORE $m$ | $\operatorname{rec}(m, t+1)=\operatorname{rec}(0, t)$ |
| STORE ${}^{*} m$ | $\operatorname{rec}(c(m), t+1)=\operatorname{rec}(0, t) \cup \operatorname{rec}(m, t)$ |
| STORE $(j)$ | $\operatorname{rec}((j, 0), t+1)=\operatorname{rec}(0, t)$ |
| READ $m$ | $\operatorname{rec}(m, t+1)=\left\{m^{(W N)}\right\}$ |
| READ ${}^{*} m$ | $\operatorname{rec}(c(m), t+1)=\operatorname{rec}(m, t) \cup\left\{c(m)^{(W N)}\right\}$ |

(ii) For the parallel or serial LOAD instructions the changes of receptive fields are the same as for the corresponding $\mathrm{OP}_{1}$ instructions.
(iii) In the case of a WRITE, JUMP, or HALT instruction no changes of receptive fields appear.
(iv) In the case of a JGTZ, JZERO, or JLTZ instruction no changes of receptive fields appear in step $t+1$, but the set rec $(0, t)$ will be added at moment $t^{\prime} \geqq t+2$ to any receptive field that alters at moment $t^{\prime}$ according to (i) or (ii), if at moment $t^{\prime}$ an instruction has to be performed covered by cases (i) and (ii). For example, the instruction [mask] $\mathrm{OP}_{2} m$, at moment $t^{\prime} \geqq t+2$, will produce the changes $\operatorname{rec}\left((j, 0), t^{\prime}\right)=\operatorname{rec}\left((j, 0), t^{\prime}-1\right) \cup \operatorname{rec}\left((j, m), t^{\prime}-1\right) \cup \operatorname{rec}(0, t)$ for all activated PEs.

For illustration of this definition, consider the special OFF-SQUARE system as defined in Example 1. Let $I$ be any concrete input situation for computing the parallel Roberts gradient and let $\pi$ be the sequence of the 12 parallel instructions as given there. At moment $t=0$ we have $\operatorname{rec}((j, k, 0), 0)=\left\{(j, k, 0)^{(0)}\right\}$, for $0 \leqq j<M$ and $0 \leqq k<N$, and for any other register $r$ of the system EXAMP1, $\operatorname{rec}(r, 0)$ is the empty set. After performing the 12 instructions of $\pi$ the receptive fields of maximal cardinality 2 belong to the registers $(j, k, 0),(j, k, 3)$ and $(j, k, 4)$, for $0 \leqq j \leqq M-2$ and $0 \leqq k \leqq N-2$, where, e.g., $\operatorname{rec}((j, k, 0), 12)=\left\{(j+1, k, 0)^{(0)},(j, k+1,0)^{(0)}\right\}$. For the system defined in Example 3, and the program and the input situation as described there, after performing the $6 M+n$ instructions the receptive field of maximal cardinality $N M+1$ belongs to the register $(1,0)$, i.e., to the accumulator of the top node PE.
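The recursive definition can also be traced mechanically. The following sketch applies the Table 3 rule for a parallel instruction of the type $\mathrm{OP}_{l+1}: i_{1}, \ldots, i_{l}$ to the accumulators of an $M \times N$ off-line square grid. It rests on assumptions not fixed in this section: the four directions are taken to be the horizontal and vertical neighbours, accumulator names $(j, k, 0)^{(0)}$ are abbreviated to pairs `(j, k)`, and PEs outside the grid are treated as contributing the empty set.

```python
def receptive_fields_after(M, N, t):
    """Receptive fields of all accumulators after t parallel four-neighbour steps."""
    # Off-line start: every accumulator holds one input value, so its
    # receptive field is the singleton containing its own name.
    rec = {(j, k): {(j, k)} for j in range(M) for k in range(N)}
    neighbours = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # assumed directions 0..3
    for _ in range(t):
        new = {}
        for (j, k), field in rec.items():
            merged = set(field)
            for dj, dk in neighbours:
                # Union with the neighbour's field from the previous moment;
                # positions outside the grid contribute nothing.
                merged |= rec.get((j + dj, k + dk), set())
            new[(j, k)] = merged
        rec = new
    return rec
```

Under these assumptions, a PE that is at least $t$ positions away from every border ends up with a receptive field of cardinality $2 t^{2}+2 t+1$; for example, `len(receptive_fields_after(9, 9, 4)[(4, 4)])` evaluates to 41.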

Definition 5. Let SYS $\in$ SIMD. For a set $R$ of registers of SYS and a moment $t \geqq 0$ define the local data transfer function $\lambda_{\mathrm{SYS}}$ by

$$
\lambda_{\mathrm{SYS}}(R, t)=\max _{\pi} \max _{I} \max _{r \in R} \operatorname{card}\left(\operatorname{rec}_{\pi}^{I}(r, t)\right),
$$

the global data transfer function $\gamma_{\mathrm{SYS}}$ by

$$
\gamma_{\mathrm{SYS}}(R, t)=\max _{\pi} \max _{I} \operatorname{card}\left(\bigcup_{r \in R} \operatorname{rec}_{\pi}^{I}(r, t)\right),
$$

the total data transfer function $\tau_{\mathrm{SYS}}$ by

$$
\tau_{\mathrm{SYS}}(R, t)=\max _{\pi} \max _{I} \sum_{r \in R} \operatorname{card}\left(\operatorname{rec}_{\pi}^{I}(r, t)\right) .
$$

By this definition, it follows immediately that the functions $\lambda_{\mathrm{SYS}}, \gamma_{\mathrm{SYS}}$ and $\tau_{\mathrm{SYS}}$ are monotonically increasing in $t$ for any set $R$ of registers of SYS. Furthermore,

$$
\begin{equation*}
\lambda_{\mathrm{SYS}}(R, t) \leqq \gamma_{\mathrm{SYS}}(R, t) \leqq \tau_{\mathrm{SYS}}(R, t) \tag{2}
\end{equation*}
$$

for all models SYS $\in$ SIMD, sets $R$ of registers and moments $t \geqq 0$. Also note that for any model SYS, if within $t$ steps of an arbitrary program $\pi$ for SYS starting with an arbitrary input situation I for SYS at most $\omega_{\mathrm{SYS}}(t)$ input data may be fed to the system, then

$$
\begin{gather*}
\gamma_{\mathrm{SYS}}(R, t) \leqq \omega_{\mathrm{SYS}}(t), \quad \text { and }  \tag{3.1}\\
\tau_{\mathrm{SYS}}(R, t) \leqq \lambda_{\mathrm{SYS}}(R, t) \cdot \operatorname{card}(R) \tag{3.2}
\end{gather*}
$$

for any set $R$ of registers of SYS and $t \geqq 0$.
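For one fixed program $\pi$ and one fixed input situation $I$, the three inner expressions of Definition 5 are simply the maximum, the cardinality of the union, and the sum of cardinalities of the receptive fields. The sketch below evaluates them for a given family of receptive fields; it does not perform the maximization over $\pi$ and $I$, and the function name and the dictionary representation are illustrative choices.

```python
def transfer_measures(fields):
    """Local, global, and total measure for one fixed pair (pi, I).

    `fields` maps every register r of the set R to its receptive field,
    given as a set of input register names.
    """
    local = max((len(f) for f in fields.values()), default=0)
    glob = len(set().union(*fields.values()))
    total = sum(len(f) for f in fields.values())
    return local, glob, total
```

With the fields `{"r1": {"a", "b"}, "r2": {"b", "c"}, "r3": set()}` the result is `(2, 3, 4)`, and the chain local $\leqq$ global $\leqq$ total $\leqq$ local $\cdot \operatorname{card}(R)$ reads $2 \leqq 3 \leqq 4 \leqq 6$, in line with (2) and (3.2).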

Example 4. In Section 4 we shall characterize the way to use these data transfer functions for obtaining lower time bounds for concrete computational problems. For serial data processing we shall apply the system $\mathrm{RAM}_{\mathrm{L}}$, cp. [2, Fig. 1.5], as model for computation, where $R_{\mathrm{L}}=\{0,1,2, \ldots, L-1\}, L \geqq 1$, is assumed to be the set of all input/output registers of such a machine ( $D_{\mathrm{CPU}}=\infty, N_{\mathrm{PE}}=0, W_{\mathrm{CPU}}=\infty$ ). For $t \geqq 0$, we have $\omega_{\mathrm{OFF}-\mathrm{RAM}_{\mathrm{L}}}(t)=L+t$ and $\omega_{\mathrm{ON}-\mathrm{RAM}_{\mathrm{L}}}(t)=t$. For OFF-RAM $=\bigcup_{L=1}^{\infty}$ OFF-RAM$_{L}$, note that $\omega_{\mathrm{OFF}-\mathrm{RAM}}(t)=\max _{L} \omega_{\mathrm{OFF}-\mathrm{RAM}_{\mathrm{L}}}(t)$ is not defined. Furthermore, we have

$$
\begin{align*}
& \lambda_{\mathrm{OFF}-\mathrm{RAM}_{\mathrm{L}}}\left(R_{\mathrm{L}}, t\right)= \begin{cases}2 t+1 \text { for } 0 \leqq t \leqq\lfloor(L-1) / 2\rfloor \\
\lfloor(L+1) / 2\rfloor+t, & \text { otherwise, }\end{cases}  \tag{4.1}\\
& \gamma_{\mathrm{OFF}-\mathrm{RAM}_{\mathrm{L}}}\left(R_{\mathrm{L}}, t\right)=L+t \text {, and }  \tag{4.2}\\
& \tau_{\mathrm{OFF}-\mathrm{RAM}_{\mathrm{L}}}\left(R_{L}, t\right)=L(t-\lfloor L / 2\rfloor+1) \text { for } t \geqq\lfloor L / 2\rfloor \text {, } \tag{4.3}
\end{align*}
$$

in the case of using the $\mathrm{RAM}_{\mathrm{L}}$ in off-line mode, and

$$
\begin{align*}
& \lambda_{\mathrm{ON}-\mathrm{RAM}_{L}}\left(R_{L}, t\right)=\gamma_{\mathrm{ON}-\mathrm{RAM}_{L}}\left(R_{L}, t\right)=t, \tag{4.4}\\
& \tau_{\mathrm{ON}-\mathrm{RAM}_{L}}\left(R_{L}, t\right)= \begin{cases}t(t+1) / 2 & \text { for } t \leqq L, \\
L(t-(L / 2)+1 / 2) & \text { for } t \geqq L,\end{cases} \tag{4.5}
\end{align*}
$$

in the case of using the $\mathrm{RAM}_{L}$ in on-line mode. The maximal data flow for obtaining equation (4.1) is possible by indirect addressing $\mathrm{OP}_{2}{ }^{*} m$, followed by $\mathrm{OP}_{2}=x$ operations. For (4.3), the same sequence of operations is extended by $L-1$ instructions STORE $m$. For (4.4), $t$ operations of the type $\mathrm{OP}_{2}=x$ may be considered. Even for this quite simple model of serial computation, the exact derivation of the function $\tau_{\mathrm{OFF}-\mathrm{RAM}_{L}}$ for small $t$ is already a difficult problem.
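The piecewise formulas (4.1) and (4.5) can be checked for plausibility by evaluating them at the points where their branches meet. The sketch below does so under the assumption that the first branch of (4.5) reads $t(t+1) / 2$; the function names are illustrative.

```python
def lam_off_ram(L, t):
    """Local data transfer of the off-line RAM_L according to (4.1)."""
    if t <= (L - 1) // 2:
        return 2 * t + 1
    return (L + 1) // 2 + t


def tau_on_ram(L, t):
    """Total data transfer of the on-line RAM_L according to (4.5)."""
    if t <= L:
        return t * (t + 1) // 2          # assumed reading of the first branch
    # L * (t - L/2 + 1/2), written with integers; L * (2t - L + 1) is even.
    return L * (2 * t - L + 1) // 2
```

Both branches of (4.5) give $L(L+1) / 2$ at $t=L$, the values of (4.1) never exceed the global bound $L+t$ of (4.2), and the values of (4.5) never exceed $t \cdot L$, as required by (3.2) together with (4.4).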

Example 5. For further illustration of the concrete derivation of these data transfer functions, let us consider both systems EXAMP1 and EXAMP3 as defined above.

For the system EXAMP1, first we see that $\omega_{\text {EXAMP1 }}(t)=M N+L+t$, for $t \geqq 0$. Let $R_{M, N}$ be the set $\{(j, k, 0): 0 \leqq j<M$ and $0 \leqq k<N\}$ of all PE input/output registers of the system. By using $t$ operations of the type

$$
\text { [all PE's] ADD :0, 1, 2, } 3
$$

we obtain the maximal local and total data transfer within the field of PE accumulators, where

$$
\begin{gather*}
\lambda_{\text {EXAMP1 }}\left(R_{M, N}, t\right)=2 t^{2}+2 t+1  \tag{5.1}\\
\left(2 t^{2}+2 t+1\right) M N-\left(\frac{t+1}{3}-(t+1)^{2}+\frac{2(t+1)^{3}}{3}\right)(M+N) \leqq \\
\leqq \tau_{\mathrm{EXAMP1}}\left(R_{M, N}, t\right) \leqq\left(2 t^{2}+2 t+1\right) M N \tag{5.2}
\end{gather*}
$$

for $2 t+1 \leqq \min \{M, N\}$, by elementary combinatorial considerations and (3.2). For $t \geqq t_{0}=\lfloor M / 2\rfloor \cdot\lfloor N / 2\rfloor$ we have

$$
\begin{equation*}
M N+\left(t-t_{0}\right) \leqq \lambda_{\mathrm{EXAMP1}}\left(R_{M, N}, t\right) \leqq M N+L+t \tag{5.3}
\end{equation*}
$$

For $t \geqq t_{0}=M+N-2$ we can easily see that

$$
\begin{equation*}
M^{2} N^{2}+\left(t-t_{0}\right) \leqq \tau_{\mathrm{EXAMP1}}\left(R_{M, N}, t\right) \leqq M N(M N+L+t) \tag{5.4}
\end{equation*}
$$

Finally, for the case of global data transfer we obtain

$$
\gamma_{\mathrm{EXAMP1}}\left(R_{M, N}, t\right)= \begin{cases}M N & \text { for } t=0, \\ M N+2 t+1 & \text { for } 2 t+1 \leqq L \text { and } t>0, \\ M N+\lfloor(L-1) / 2\rfloor+t & \text { for } 2 t+1>L,\end{cases} \tag{5.5}
$$

where, for $2 t+1 \leqq L$, the maximal global data transfer is possible by $t$ operations of the type ADD ${ }^{*} m_{t}$ and one operation $\operatorname{STORE}(j, k)$, e.g.

For the system EXAMP3, at first we have $\omega_{\text {EXAMP3 }}(t)=t \cdot N$, for $N=2^{n-1}$ and $t \geqq 0$ by using $t$ operations of the type
[leaf nodes] READ 0.
Let $R_{0}=\{0,(1,0)\}$ be the set of the two distinguished output registers of this system EXAMP3. By using the instruction pair
[leaf nodes] READ 0,
[all PEs] ADD :1,2
repeated ( $m-1$ ) times, $m \geqq 1$; the single instruction
[leaf nodes] READ 0
again; and finally ( $n-1$ ) instructions
[all PEs] ADD : 1, 2,
we obtain the maximal local data transfer for register $(1,0)$ in any case $t \geqq m$. We have

$$
\lambda_{\mathrm{EXAMP3}}\left(R_{0}, t\right)= \begin{cases}0 & \text { for } t=0, \\ 2^{t-1} & \text { for } 1 \leqq t \leqq n-1, \\ m \cdot N & \text { for } t=n+2 m-l, \quad m \geqq 1, \\ & \text { and } l=1 \text { or } l=2,\end{cases}
$$

for all $t \geqq 0$. Analogously, for the same set $R_{0}$ and $t \geqq 0$

$$
\begin{aligned}
& \gamma_{\mathrm{EXAMP3}}\left(R_{0}, t\right)= \begin{cases}0 & \text { for } t=0, \\
2^{t-1} & \text { for } 1 \leqq t \leqq n-1, \\
m \cdot N & \text { for } t=n+2 m-2, \quad m \geqq 1, \\
m \cdot N+1 & \text { for } t=n+2 m-1, \quad m \geqq 1,\end{cases} \\
& \tau_{\mathrm{EXAMP3}}\left(R_{0}, t\right)= \begin{cases}0 & \text { for } t=0, \\
2^{t-1} & \text { for } 1 \leqq t \leqq n+1, \\
2 m \cdot N & \text { for } t=n+2 m-1, \quad m \geqq 1, \\
2 m \cdot N+1 & \text { for } t=n+2 m, \quad m \geqq 1 .\end{cases}
\end{aligned}
$$

Of course, the values of $\lambda_{\mathrm{EXAMP3}}, \gamma_{\mathrm{EXAMP3}}$, and $\tau_{\mathrm{EXAMP3}}$ depend on the choice of the set $R_{0}$, and may be quite different for some other sets of registers.

Definition 6. Let CLASS $\subseteq$ SIMD. For such a set CLASS of models of computation and for $t, n \geqq 0$, the general data transfer functions are defined as follows:
$\Lambda_{\text {CLASS }}(t)$ denotes the maximal value of all $\lambda_{\mathrm{SYS}}(R, t)$,
$\Gamma_{\text {CLASS }}(n, t)$ denotes the maximal value of all $\gamma_{\mathrm{SYS}}(R, t)$ with $\operatorname{card}(R)=n$, and
$T_{\text {CLASS }}(n, t)$ denotes the maximal value of all $\tau_{\mathrm{SYS}}(R, t)$ with $\operatorname{card}(R)=n$,
where SYS is an arbitrary element of CLASS, and $R$ denotes a set of registers of SYS.
Interesting examples of CLASS are sets like OFF-NET$_{p}$, ON-NET$_{p, q}$, OFF-SQUARE, OFF-BINTREE, or ON-HEXAGONAL, where these general data transfer functions are fully defined.

Theorem 1. For standard off-line network systems and $2 \leqq p<\infty$ we have

$$
\Lambda_{\mathrm{OFF}-\mathrm{NET}_{p}}(t)= \begin{cases}2 t+1 & \text { for } \quad p=2 \\ p\left(\frac{(p-1)^{t}-1}{p-2}\right)+1 & \text { for } p \geqq 3\end{cases}
$$

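and

$$
\Gamma_{\mathrm{OFF}-\mathrm{NET}_{p}}(n, t)=T_{\mathrm{OFF}-\mathrm{NET}_{p}}(n, t)=n \cdot \Lambda_{\mathrm{OFF}-\mathrm{NET}_{p}}(t), \quad \text { for } n, t \geqq 0 .
$$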

Proof. First, let us consider the local situation. For $p=2$, the maximal transfer of data units is possible by indirect addressing to the CPU accumulator, e.g. For $p \geqq 3$, there exist special OFF-NET ${ }_{p}$ models SYS ${ }_{t}$ such that, according to (OFF.3), at any moment $1 \leqq s \leqq t$ the maximal possible number of $p(p-1)^{s-1}$ new names of input registers may enter the receptive field of a certain register $r$, for $t \geqq 0$. Thus,

$$
\lambda_{\mathrm{SYS}_{\mathrm{t}}}(\{r\}, t)=1+\sum_{s=0}^{t-1} p(p-1)^{s}=p\left(\frac{(p-1)^{t}-1}{p-2}\right)+1
$$

For the total and global situation note that by choosing sufficiently complex SYS$_{n, t}$, for $n, t \geqq 0$, the maximal local situations of data transfer characterized by receptive fields of cardinality $\Lambda_{\mathrm{OFF}-\mathrm{NET}_{p}}(t)$ at moment $t$ may appear in $n$ different registers at time $t$ such that these registers are far enough from one another so that their receptive fields are pairwise disjoint.
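The closed form for $p \geqq 3$ is the usual geometric sum, which can be confirmed numerically. In the sketch below the function names are illustrative; the summation counts the register's own name plus at most $p(p-1)^{s-1}$ new names at each moment $s$, as in the proof.

```python
def big_lambda_off_net(p, t):
    """Closed form of Theorem 1 for the general local data transfer function."""
    if p == 2:
        return 2 * t + 1
    # (p-1)^t - 1 is divisible by p - 2, so integer division is exact.
    return p * ((p - 1) ** t - 1) // (p - 2) + 1


def by_summation(p, t):
    """Own name plus p(p-1)^(s-1) new names at each moment s = 1, ..., t."""
    return 1 + sum(p * (p - 1) ** s for s in range(t))
```

Both functions agree for all tested $p$ and $t$; for instance, $p=3, t=4$ gives 46 and $p=5, t=8$ gives 109,226.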

Example 6. By (4.1) and Theorem 1, it follows that $\Lambda_{\mathrm{OFF}-\mathrm{RAM}}(t)=\Lambda_{\mathrm{OFF}-\mathrm{NET}_{2}}(t)=2 t+1$, for $t \geqq 0$. Of course, this coincidence does not hold in the total and global cases. According to Theorem 1 we have $\Gamma_{\mathrm{OFF}-\mathrm{NET}_{2}}(n, t)=T_{\mathrm{OFF}-\mathrm{NET}_{2}}(n, t)=n(2 t+1)$, for $n, t \geqq 0$, but by elementary considerations $\Gamma_{\mathrm{OFF}-\mathrm{RAM}}(n, t)=2 t+n$, for $n \geqq 1$, and $T_{\mathrm{OFF}-\mathrm{RAM}}(n, t)=2 n(t-n+2)-2$, for $t \geqq n \geqq 2$.

In Table 4 the general local data transfer functions are collected for some classes of off-line systems as defined in Section 1. For these classes, the functions $\Lambda_{\mathrm{OFF}-\mathrm{NET}_{p}}$ as given in Theorem 1 act as upper bounds, where the proper value of $p$ has to be specified. The classes OFF-LINEAR, OFF-PS, OFF-BINTREE and OFF-QUADTREE represent examples for the maximal transfer situations as characterized by Theorem 1, for $p=2,3,3,5$, respectively.

Some remarks about Table 4 and about the other networks which were defined in Table 1.

1. For the bintree, triangle and quadtree network note that the maximal receptive fields may be obtained for central nodes of these tree structures only, and not at the top node. The maximal possible cardinalities of receptive fields of top node accumulators are given for illustration of this fact.

Table 4. General local data transfer functions for offline systems

| CLASS | $p$ | $\Lambda_{\text {OFF-CLASS }}(t)$ | $t=4$ | $t=8$ |
| :--- | ---: | :--- | ---: | ---: |
| LINEAR | 2 | $2 t+1$ | 9 | 17 |
| HEXAGONAL | 3 | $\frac{3}{2} t^{2}+\frac{3}{2} t+1$ | 31 | 109 |
| SQUARE or ILLIAC | 4 | $2 t^{2}+2 t+1$ | 41 | 145 |
| TRIAGONAL | 6 | $3 t^{2}+3 t+1$ | 61 | 217 |
| DIAGONAL | 8 | $4 t^{2}+4 t+1$ | 81 | 289 |
| PS | 3 | $3 \cdot 2^{t}-2$ | 46 | 766 |
| BINTREE | 3 | $3 \cdot 2^{t}-2$ | 46 | 766 |
| top node |  | $2^{t+1}-1$ | 31 | 511 |
| TRIANGLE | 5 | $3 \cdot 2^{t+1}+t^{2}-2 t-5$ | 99 | 1,579 |
| top node |  | $2^{t+1}-1$ | 31 | 511 |
| QUADTREE | 5 | $\left(5 \cdot 4^{t}-2\right) / 3$ | 426 | 109,226 |
| top node |  | $\left(4^{t+1}-1\right) / 3$ | 341 | 87,381 |

2. For all examples of CLASS given in Table 4, we have $\Gamma_{\mathrm{OFF}-\mathrm{CLASS}}(n, t)=T_{\mathrm{OFF}-\mathrm{CLASS}}(n, t)=n \cdot \Lambda_{\mathrm{OFF}-\mathrm{CLASS}}(t)$, for $n, t \geqq 0$.
3. The hexagonal, square, triagonal, and diagonal networks are special examples of infinite graphs of constant degree $p$ such that the general local data transfer function is equal to $\frac{p}{2} t^{2}+\frac{p}{2} t+1$. Such networks correspond to usual digital metrics for the orthogonal grid in a natural way, e.g., the metrics $d_{4}$ or $d_{8}$ as used in digital image processing, cp. [9], to the square or diagonal network, respectively.
4. For the networks CUBE$^{m}$, PM2I$^{m}$, WPM2I$^{m}$, LR2I$^{m}$, or LRUD2I$^{m}$, the derivation of the three general data transfer functions represents a very sophisticated problem. Of course, the values of these functions depend on the value of $m$, and the consideration of classes like

$$
\mathrm{CUBE}=\bigcup_{m \geqq 2} \mathrm{CUBE}^{m}
$$

would lead to undefined general data transfer functions. In [4] the general local data transfer functions were analyzed for some concrete SIMD systems similar to FIN-OFF-LR2I$^{m}$ or FIN-OFF-LRUD2I$^{m}$ systems like EXAMP2 which was defined above. For the present paper, however, we leave the data transfer analysis for specialized (finite) SIMD systems to the interested reader and confine ourselves to a few hints:
CUBE$^{m}$: For this system, the exact derivation of the local transfer function should be a solvable task. We have

$$
\Lambda_{\mathrm{OFF}-\mathrm{CUBE}^{m}}(t) \begin{cases}=\sum_{i=0}^{t}\binom{m}{i} & \text { for } t<m \\ \geqq 2^{m} & \text { for } t=m \\ \geqq 2^{m+1}(t-m)^{m} & \text { for } t>m\end{cases}
$$

For example, we have $\Lambda_{\mathrm{OFF}-\mathrm{CUBE}^{256}}(4)=177{,}589{,}057$ and $\Lambda_{\mathrm{OFF}-\mathrm{CUBE}^{256}}(8)$ is about $4 \cdot 10^{14}$.
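The two numerical values for $m=256$ follow directly from the binomial sum for $t<m$. A minimal sketch, with an illustrative function name and using Python's standard `math.comb`:

```python
from math import comb


def lam_off_cube(m, t):
    """Binomial sum for the cube network, valid for 0 <= t < m only."""
    if not 0 <= t < m:
        raise ValueError("the exact formula is only stated for 0 <= t < m")
    return sum(comb(m, i) for i in range(t + 1))
```

Here `lam_off_cube(256, 4)` returns 177589057, and `lam_off_cube(256, 8)` lies between $4 \cdot 10^{14}$ and $4.5 \cdot 10^{14}$.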

PM2I$^{m}$: For this, as for the other "power-of-two systems", the analysis of data flow represents quite a hard problem, cp. [4]. But, to give the reader some feeling for the complexity of the data transfer functions for these systems, some values are collected:

$$
\Lambda_{\mathrm{OFF}-\mathrm{PM2I}^{m}}(t) \begin{cases}=1 & \text { for } \quad t=0 \\ =2 & \text { for } \quad t=1 \\ =2(m-1)(m-2)+4 & \text { for } \quad t=2 \\ \vdots & \\ \geqq 2^{m} & \text { for } \quad t=\lceil m / 2\rceil \\ \geqq 2^{m+1}(t-\lceil m / 2\rceil) & \text { for } \quad t>\lceil m / 2\rceil\end{cases}
$$

Note that exponential increase changes to linear increase at $t=\lceil m / 2\rceil$.
WPM2I$^{m}$: It may be that this is the most complicated situation of any network; we have

$$
\Lambda_{\mathrm{OFF}-\mathrm{WPM2I}^{m}}(t) \begin{cases}=1 & \text { for } \quad t=0 \\ =2 & \text { for } \quad t=1 \\ \vdots & \\ \vdots & \text { for } \quad t=\lceil m / 2\rceil \\ \geqq 2^{m} & \text { for } \quad t \geqq\lceil m / 2\rceil\end{cases}
$$

This great difficulty in analyzing data paths should be a hint to the limited practical importance of this network.
LR2I$^{m}$: For brevity we shall use the function $\sigma(i)=\sum_{j=1}^{i} j^{2}=\frac{1}{6}(i+1)-\frac{1}{2}(i+1)^{2}+\frac{1}{3}(i+1)^{3}$. We found the following interesting values:

$$
\begin{array}{ll}
1 & \text { for } t=0 \\
2 m+1 & \text { for } t=1 \\
2(m-2)^{2}+4 m+1 & \text { for } t=2 \\
1+6 m+4(m-2)^{2}+2 \cdot \sigma(m-4) & \text { for } t=3 \\
1+8 m+6(m-2)^{2}+4 \cdot \sigma(m-4)+ & \\
+4 \cdot \sum_{i=1}^{m-6} \sigma(i) & \text { for } t=4 \\
+8 \cdot \sum_{i=1}^{m-6} \sigma(i)+ & \text { for } t=5 \\
+8 \sum_{i=1}^{m-8} \sum_{j=1}^{i} \sigma(j) & \text { for } t \geqq\lfloor(m-1) / 2\rfloor
\end{array}
$$

The constants $c_{m}$ depend on the value of $m$ only, for example $c_{2}=-1, c_{3}=1, c_{4}=7$, $c_{5}=25, c_{6}=71, c_{7}=185, c_{8}=455, c_{9}=1081$, and $c_{10}=2503$. Because LR2I$^{m}$ is an infinite network, we have $\Gamma_{\mathrm{OFF}-\mathrm{LR2I}^{m}}(n, t)=T_{\mathrm{OFF}-\mathrm{LR2I}^{m}}(n, t)=n \cdot \Lambda_{\mathrm{OFF}-\mathrm{LR2I}^{m}}(t)$, for $n, t \geqq 0$.

LRUD2I$^{m}$: Of course, we have
$\Lambda_{\mathrm{OFF}-\mathrm{LRUD2I}^{m}}(t) \geqq 2 \cdot \Lambda_{\mathrm{OFF}-\mathrm{LR2I}^{m}}(t)-1$, for $t \geqq 0$, and, because LRUD2I$^{m}$ is an infinite network, we have $\Gamma_{\mathrm{OFF}-\mathrm{LRUD2I}^{m}}(n, t)=T_{\mathrm{OFF}-\mathrm{LRUD2I}^{m}}(n, t)=n \cdot \Lambda_{\mathrm{OFF}-\mathrm{LRUD2I}^{m}}(t)$, for $n, t \geqq 0$.

Theorem 2. For standard on-line network systems and $2 \leqq p<\infty, 1 \leqq q \leqq p-1$,

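$$
\Lambda_{\mathrm{ON}-\mathrm{NET}_{p, q}}(t)= \begin{cases}0 & \text { for } t=0, \\ 2 t-1 & \text { for } t \geqq 1 \text { and } q=1, \\ \left(q^{t}-1\right) /(q-1) & \text { for } t \geqq 1 \text { and } q \geqq 2,\end{cases}
$$

and

$$
\Gamma_{\mathrm{ON}-\mathrm{NET}_{p, q}}(n, t)=T_{\mathrm{ON}-\mathrm{NET}_{p, q}}(n, t)=n \cdot \Lambda_{\mathrm{ON}-\mathrm{NET}_{p, q}}(t), \quad \text { for } n, t \geqq 0 .
$$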

Proof. Consider the local data transfer situation first. At $t=1$ assume that a sufficiently large set of input registers obtains input data in parallel by a READ instruction. Then both expressions of the theorem yield the value 1 at $t=1$, namely $\left(q^{1}-1\right) /(q-1)=1$ for $q \geqq 2$ and $2 t-1=1$ for $q=1$. For $q=1$, the maximal local transfer situation, i.e., the maximal transfer of data units to a given register, is possible by indirect addressing. Thus, $\Lambda_{\mathrm{ON}-\mathrm{NET}_{p, 1}}(t)=2 t-1$ for $t \geqq 1$. For $q \geqq 2$, according to (ON.3) it follows that

$$
\Lambda_{\mathrm{ON}-\mathrm{NET}_{p, q}}(t)=\sum_{i=0}^{t-1} q^{i}=\left(q^{t}-1\right) /(q-1)
$$

where these maximal cardinalities of receptive fields may be obtained in certain PE accumulators. For given $n, t \geqq 0$, by choosing a sufficiently large field of PEs obtaining input data in their accumulators at the first instruction ($t=1$), $n$ receptive fields of maximal cardinality $\Lambda_{\mathrm{ON}-\mathrm{NET}_{p, q}}(t)$ may be pairwise disjoint.

Example 7. By (4.4) we know that $\Lambda_{\mathrm{ON}-\mathrm{RAM}}(t)=\Gamma_{\mathrm{ON}-\mathrm{RAM}}(n, t)=t$, for $t \geqq 0$ and $n \geqq 1$, and thus $\Lambda_{\mathrm{ON}-\mathrm{RAM}}(t)<\Lambda_{\mathrm{ON}-\mathrm{NET}_{p, 1}}(t)$ as well as $\Gamma_{\mathrm{ON}-\mathrm{RAM}}(n, t)<\Gamma_{\mathrm{ON}-\mathrm{NET}_{p, 1}}(n, t)$ for $t \geqq 2$ and $n \geqq 1$. Furthermore, $T_{\mathrm{ON}-\mathrm{RAM}}(n, t)=n\left(t-\frac{n}{2}+\frac{1}{2}\right)$, for $t \geqq n \geqq 1$, and thus $T_{\mathrm{ON}-\mathrm{RAM}}(n, t)<T_{\mathrm{ON}-\mathrm{NET}_{p, 1}}(n, t)$ for $t \geqq n \geqq 2$.

In Table 5 some results on the analysis of general local data transfer functions are collected for classes of on-line systems mentioned in Section 1. For these classes the functions given in Theorem 2 act as upper bounds where the proper values of $p$ and $q$ have to be correlated. By $\mathrm{ON}-\mathrm{IN}_{\left\{i_{1}, i_{2}, \ldots, i_{q}\right\}}$ we denote a special ON-IN system with fixed set $\left\{i_{1}, i_{2}, \ldots, i_{q}\right\}$ according to (ON.2). The classes ON-LINEAR$_{\{0\}}$, ON-BINTREE$_{\{1,2\}}$, and ON-QUADTREE$_{\{1,2,3,4\}}$ represent examples for maximal transfer situations as characterized by Theorem 2.
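The closed forms of Theorem 2 and the Fibonacci-type entry listed for the PS network in Table 5 can be evaluated as follows. This is a sketch under assumptions: the function names are illustrative, and both exponents in the PS entry are read as $t+3$, so that the entry equals a Fibonacci number minus 2 (with the Fibonacci numbers started as $1, 1, 2, 3, \ldots$).

```python
from math import sqrt


def big_lambda_on_net(q, t):
    """Closed form of Theorem 2 for characteristic (p, q)."""
    if t == 0:
        return 0
    if q == 1:
        return 2 * t - 1
    return (q ** t - 1) // (q - 1)      # exact: geometric sum 1 + q + ... + q^(t-1)


def ps_entry_binet(t):
    """PS entry via the Binet-type expression, exponent t + 3 assumed."""
    k = t + 3
    s5 = sqrt(5.0)
    return round(((1 + s5) ** k - (1 - s5) ** k) / (s5 * 2 ** k)) - 2


def ps_entry_recurrence(t):
    """Same value via the integer recurrence 1, 1, 2, 3, 5, ..."""
    a, b = 1, 1
    for _ in range(t + 1):
        a, b = b, a + b
    return b - 2
```

For $t=4$ and $t=8$ the two PS functions give 11 and 87, and `big_lambda_on_net` gives 15 and 255 for $q=2$ as well as 85 and 21,845 for $q=4$.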

Some remarks about Table 5 and about the other networks which were defined in Table 1:

1. For all examples of CLASS in Table 5 we have $\Gamma_{\mathrm{ON}-\mathrm{CLASS}}(n, t)=T_{\mathrm{ON}-\mathrm{CLASS}}(n, t)=n \cdot \Lambda_{\mathrm{ON}-\mathrm{CLASS}}(t)$, for $n, t \geqq 0$.

Table 5. General local data transfer functions for on-line systems

| CLASS | $p$ | $\left\{i_{1}, i_{2}, \ldots, i_{q}\right\}$ | $\Lambda_{\mathrm{ON}-\mathrm{CLASS}}(t)$ | $t=4$ | $t=8$ |
| :---: | :---: | :---: | :---: | :---: | :---: |
| LINEAR | 2 | \{0\} | $2 t-1$ | 7 | 15 |
| HEXAGONAL | 3 | $\begin{aligned} & \{0,1\} \\ & \{0\} \end{aligned}$ | $\begin{aligned} & t(t+1) / 2 \\ & 2 t-1 \end{aligned}$ | $\begin{array}{r} 10 \\ 7 \end{array}$ | $\begin{aligned} & 36 \\ & 15 \end{aligned}$ |
| SQUARE or ILLIAC | 4 | $\begin{aligned} & \{0,1,2\} \\ & \{0,2\} \\ & \{0,1\},\{0\} \end{aligned}$ | $\begin{aligned} & t^{2} \\ & t(t+1) / 2 \\ & 2 t-1 \end{aligned}$ | $\begin{array}{r} 16 \\ 10 \\ 7 \end{array}$ | $\begin{aligned} & 64 \\ & 36 \\ & 15 \end{aligned}$ |
| TRIAGONAL | 6 | $\begin{aligned} & \{0,1,2,3,4\} \\ & \{0,2,3,4\} \\ & \{0,2,4\} \end{aligned}$ | $\begin{aligned} & \frac{5}{2} t^{2}-\frac{5}{2} t+1 \\ & \frac{3}{2} t^{2}-\frac{1}{2} t \\ & t^{2} \end{aligned}$ | $\begin{aligned} & 31 \\ & 22 \\ & 16 \end{aligned}$ | $\begin{aligned} & 141 \\ & 92 \\ & 64 \end{aligned}$ |
| DIAGONAL | 8 | $\{0,1,2,3,4,6,7\}$ | $\frac{7}{2} t^{2}-\frac{7}{2} t+1$ | 43 | 197 |
| BINTREE | 3 | $\begin{aligned} & \{1,2\} \\ & \{0,1\} \end{aligned}$ | $\begin{aligned} & 2^{t}-1 \\ & t(t+1) / 2 \end{aligned}$ | $\begin{aligned} & 15 \\ & 10 \end{aligned}$ | $\begin{array}{r} 255 \\ 36 \end{array}$ |
| TRIANGLE | 5 | \{1, 2, 3, 4\} | $2^{t}-1$ | 15 | 255 |
| QUADTREE | 5 | $\{1,2,3,4\}$ | $\left(4^{t}-1\right) / 3$ | 85 | 21,845 |
| PS | 3 | $\{0,1\}$ | $\left(\left[(1+\sqrt{5})^{t+3}-(1-\sqrt{5})^{t+3}\right] /\left(\sqrt{5} \cdot 2^{t+3}\right)\right)-2$ | 11 | 87 |

2. The class ON-PS ${ }_{\{0,1\}}$ denotes special SIMD systems using the PS network in its original [10] meaning. Let $f_{0}=1, f_{1}=1, f_{2}=2, \ldots, f_{n+2}=f_{n}+f_{n+1}, \ldots$, where

$$
f_{n}=\left[(1+\sqrt{5})^{n+1}-(1-\sqrt{5})^{n+1}\right] /\left(\sqrt{5} \cdot 2^{n+1}\right)
$$

denotes the $n$th Fibonacci number, $n \geqq 0$. We have $\Lambda_{\mathrm{ON}-\mathrm{PS}_{\{0,1\}}}(t)=\sum_{n=1}^{t} f_{n}=f_{t+2}-2$, for $t \geqq 0$; cp. [3] for a similar result.
3. For the bintree, triangle, and quadtree network note that the maximal receptive fields may be obtained for the top node accumulator, for $\left\{i_{1}, i_{2}, \ldots, i_{q}\right\}$ equal to $\{1,2\},\{1,2,3,4\},\{1,2,3,4\}$, respectively.
4. The analysis of the general data transfer functions for the classes $\mathrm{ON}-\mathrm{CUBE}^{m}$, $\mathrm{ON}-\mathrm{PM2I}^{m}$, $\mathrm{ON}-\mathrm{WPM2I}^{m}$, $\mathrm{ON}-\mathrm{LR2I}^{m}$, and $\mathrm{ON}-\mathrm{LRUD2I}^{m}$ will not be considered in the present paper.
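The Fibonacci identity used in remark 2 for the PS network can be verified for small $t$. The sketch below assumes the indexing $f_{0}=f_{1}=1$ given there; the helper names `fib`, `fib_closed`, and `ps_local_transfer` are illustrative and do not appear in the original.

```python
import math


def fib(n: int) -> int:
    """Fibonacci numbers with f_0 = f_1 = 1 and f_{n+2} = f_n + f_{n+1}."""
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_closed(n: int) -> int:
    """Closed form [(1+sqrt 5)^(n+1) - (1-sqrt 5)^(n+1)] / (sqrt 5 * 2^(n+1)).

    Evaluated in floating point and rounded, so only reliable for small n.
    """
    s5 = math.sqrt(5.0)
    return round(((1 + s5) ** (n + 1) - (1 - s5) ** (n + 1)) / (s5 * 2 ** (n + 1)))


def ps_local_transfer(t: int) -> int:
    """Sum of f_n for n = 1, ..., t, i.e. the local transfer function of ON-PS_{0,1}."""
    return sum(fib(n) for n in range(1, t + 1))


for t in range(0, 15):
    assert ps_local_transfer(t) == fib(t + 2) - 2

print(ps_local_transfer(4), ps_local_transfer(8))
```

The printed values 11 and 87 are the entries of the PS row of Table 5 for $t=4$ and $t=8$.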

## 3. Local, global, and total data dependence measures

For parallel processing systems, the optimal time for the solution of a computational problem depends upon the data transfer abilities of the given system as well as on the principal possibilities of parallelization of a solution process for a given problem. The first may be characterized by the data transfer functions $\Lambda_{\mathrm{SYS}}$, $\Gamma_{\mathrm{SYS}}$, $T_{\mathrm{SYS}}$ obtained by a general system analysis as considered in Section 2. The second property, however, requires individual consideration of the given computational problem.

For example, consider the multiplication of two $N \times N$ real matrices $A \cdot B=C$. For a given system SYS assume that all $N^{2}$ elements of matrix $C$ have to be computed in $N^{2}$ different output registers represented by the set $R_{\mathrm{OUT}}$. Let $r \in R_{\mathrm{OUT}}$, $R_{0} \subseteq R_{\mathrm{OUT}}$, and let $R_{1}$ be the set of $N$ distinct registers for outputting the $N$ diagonal elements of $C$. Then it follows that $\lambda_{\mathrm{SYS}}\left(r, t^{*}\right) \geqq 2 N$, $\gamma_{\mathrm{SYS}}\left(R_{1}, t^{*}\right) \geqq 2 N^{2}$, and $\tau_{\mathrm{SYS}}\left(R_{0}, t^{*}\right) \geqq 2 N \cdot \operatorname{card}\left(R_{0}\right)$ if the product $A \cdot B$ is to be computed on SYS within time $t^{*}$. Thus, if the functions $\Lambda_{\mathrm{SYS}}, \Gamma_{\mathrm{SYS}}$ or $T_{\mathrm{SYS}}$ are known, lower time bounds for the solution time $t^{*}$ are immediately derivable from these inequalities, where the largest of the three resulting bounds is taken as the result. For example, according to our considerations in Section 2 for the system EXAMP1 we have $t^{*} \geqq \sqrt{N}-1$ under the assumption that $M=2 N$. But note that a better lower time bound for this system and the matrix multiplication problem may be obtained by more specialized considerations as demonstrated by Gentleman [3, Theorem 1]. Because each data unit transfer from a certain register $r_{1}$ to a certain register $r_{2}$ of the system EXAMP1 may be performed in the reverse direction, from $r_{2}$ to $r_{1}$, in the same time, the proof of Theorem 1 in [3] matches the situation given by the system EXAMP1, i.e., for $r \in R_{\mathrm{OUT}}$ we have $\lambda_{\mathrm{EXAMP1}}\left(r, 2 t^{*}\right) \geqq N^{2}$, and thus $t^{*} \geqq \frac{1}{4}\left(2 N^{2}-1\right)^{1 / 2}-\frac{1}{4}$.

For a general approach to the derivation of lower time bounds for parallel processing systems we shall use the quantitative description of data dependencies of the desired output data in relation to the input data specification, for computational problems which may be identified with special functions as described later on.

Definition 7. Let $n, m \geqq 1$. Let $f$ be an $n$-ary function defined on a certain set $\operatorname{domain}(f)$ of $n$-tuples of real numbers, and into the set of $m$-tuples of real numbers. For an $n$-tuple $\left(x_{1}, x_{2}, \ldots, x_{n}\right) \in \operatorname{domain}(f)$, define

$$
\begin{aligned}
\operatorname{sub}_{i}\left(x_{1}, x_{2}, \ldots, x_{n}\right)=\{j: {} & 1 \leqq j \leqq n \ \&\ \left(\exists x^{\prime} \neq x_{j}\right)\left(x_{1}, x_{2}, \ldots, x_{j-1}, x^{\prime}, x_{j+1}, \ldots, x_{n}\right) \in \operatorname{domain}(f) \\
& \&\ \operatorname{proj}_{i}\left(f\left(x_{1}, x_{2}, \ldots, x_{n}\right)\right) \neq \operatorname{proj}_{i}\left(f\left(x_{1}, x_{2}, \ldots, x_{j-1}, x^{\prime}, x_{j+1}, \ldots, x_{n}\right)\right)\}
\end{aligned}
$$

to be the set of all positions $j$ such that changes in the $j$th component of $\left(x_{1}, x_{2}, \ldots, x_{n}\right)$ have an effect on the projection $\operatorname{proj}_{i} f$, for $1 \leqq i \leqq m$. Then, define

$$
\begin{aligned}
& \lambda_{f}=\max _{\left(x_{1}, x_{2}, \ldots, x_{n}\right)} \max _{1 \leqq i \leqq m} \operatorname{card}\left(\operatorname{sub}_{i}\left(x_{1}, x_{2}, \ldots, x_{n}\right)\right), \\
& \gamma_{f}=\max _{\left(x_{1}, x_{2}, \ldots, x_{n}\right)} \operatorname{card}\left(\bigcup_{i=1}^{m} \operatorname{sub}_{i}\left(x_{1}, x_{2}, \ldots, x_{n}\right)\right),
\end{aligned}
$$

and

$$
\tau_{f}=\max _{\left(x_{1}, x_{2}, \ldots, x_{n}\right)} \sum_{i=1}^{m} \operatorname{card}\left(\operatorname{sub}_{i}\left(x_{1}, x_{2}, \ldots, x_{n}\right)\right) .
$$

The function $f$ is called locally $d$-dependent iff $d \leqq \lambda_{f}$, globally $d$-dependent iff $d \leqq \gamma_{f}$, and totally $d$-dependent iff $d \leqq \tau_{f}$, for an integer $d \geqq 0$.

By this definition, for arbitrary functions $f$ defined on $n$-tuples of real numbers and into the set of $m$-tuples of real numbers, it follows immediately that $\lambda_{f}=\gamma_{f}=\tau_{f}$ if $m=1$, and for $m \geqq 1$

$$
\begin{align*}
& \lambda_{f} \leqq \gamma_{f} \leqq \tau_{f},  \tag{7.1}\\
& \gamma_{f} \leqq n, \tag{7.2}
\end{align*}
$$

and

$$
\begin{equation*}
\tau_{f} \leqq m \cdot \lambda_{f} \tag{7.3}
\end{equation*}
$$

For example, in the case of the following function $f$,

$$
f\left(x_{1}, x_{2}, x_{3}, x_{4}, x_{5}\right)=\left\{\begin{array}{lll}
x_{1}+x_{2} & \text { if } & x_{5}=0 \\
x_{3}+x_{4} & \text { if } & x_{5} \neq 0,
\end{array}\right.
$$

we have $\operatorname{sub}_{1}\left(x_{1}, x_{2}, x_{3}, x_{4}, 0\right)=\{1,2,5\}$ if $x_{1}+x_{2} \neq x_{3}+x_{4}$, and $\operatorname{sub}_{1}\left(x_{1}, x_{2}, x_{3}, x_{4}, 0\right)=\{1,2\}$ if $x_{1}+x_{2}=x_{3}+x_{4}$. Because of $\lambda_{f}=\gamma_{f}=\tau_{f}=3$, this function is locally, globally, and totally 1-, 2-, and 3-dependent, but not 4- or 5-dependent.
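The sets $\operatorname{sub}_{i}$ of Definition 7 can be explored by brute force on a small finite subdomain. The following sketch is only an illustration under a strong assumption: the real-valued domain is replaced by the grid $\{0,1,2\}^{5}$, and replacement values $x^{\prime}$ are taken from the same three values. In general such a restriction can only underestimate the measures; for the function $f$ above it happens to reproduce them. The names `sub_sets` and `measures` are hypothetical helpers.

```python
from itertools import product


def sub_sets(f, x, candidates, m):
    """Return [sub_1, ..., sub_m] at input x, with 1-based positions j.

    Position j belongs to sub_i if replacing x_j by some candidate value
    different from x_j changes the i-th component of f.
    """
    base = f(x)
    subs = [set() for _ in range(m)]
    for j in range(len(x)):
        for v in candidates:
            if v == x[j]:
                continue
            y = list(x)
            y[j] = v
            out = f(tuple(y))
            for i in range(m):
                if out[i] != base[i]:
                    subs[i].add(j + 1)
    return subs


def measures(f, inputs, candidates, m):
    """Local, global and total dependence measures, maximized over the given inputs."""
    lam = gam = tau = 0
    for x in inputs:
        subs = sub_sets(f, x, candidates, m)
        lam = max(lam, max(len(s) for s in subs))
        gam = max(gam, len(set().union(*subs)))
        tau = max(tau, sum(len(s) for s in subs))
    return lam, gam, tau


def f(x):
    """The 5-ary example function, returned as a 1-tuple (m = 1)."""
    x1, x2, x3, x4, x5 = x
    return (x1 + x2,) if x5 == 0 else (x3 + x4,)


values = (0, 1, 2)  # assumed finite stand-in for the reals
inputs = list(product(values, repeat=5))

print(sub_sets(f, (1, 2, 0, 0, 0), values, m=1))  # [{1, 2, 5}]
print(sub_sets(f, (1, 1, 2, 0, 0), values, m=1))  # [{1, 2}]
print(measures(f, inputs, values, m=1))           # (3, 3, 3)
```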

Now, in a sequence of examples, the data dependence measures as given by Definition 7 will be analyzed for certain computational problems. The results are collected in Table 6, i.e., the following examples may be considered as explanatory remarks to this table.

Example 8. The multiplication of two $N \times N$ real matrices may be considered as a $2 N^{2}$-ary function into the set of $N^{2}$-tuples of real numbers. For this computational problem, it is evident that

$$
\begin{gathered}
\lambda_{\text {MATRIX-MULTIPLICATION }}=2 N, \\
\gamma_{\text {MATRIX-MULTIPLICATION }}=2 N^{2}, \text { and } \tau_{\text {MATRIX-MULTIPLICATION }}=2 N^{3},
\end{gathered}
$$

where these maximal values of data dependence are true for each input vector of length $2 N^{2}$ containing non-zero values in all positions. By this example it follows that the upper bounds (7.2) and (7.3) cannot be reduced in general. The inversion of an $N \times N$ real matrix in place may be considered as an $N^{2}$-ary function into the set of $N^{2}$-tuples of real numbers. We have

$$
\lambda_{\text {MATRIX-INVERSION-IP }}=\gamma_{\text {MATRIX-INVERSION-IP }}=N^{2}
$$

and

$$
\tau_{\text {MATRIX-INVERSION-IP }}=N^{4}
$$

where this maximal case of data dependence appears for any matrix containing non-zero values in all $N^{2}$ positions. These data dependence quantities may be considered as a direct consequence of the data dependence quantities for the determinant of an $N \times N$ real matrix,

$$
\lambda_{\text {DETERMINANT }}=\gamma_{\text {DETERMINANT }}=\tau_{\text {DETERMINANT }}=N^{2} .
$$

The solution of a system of $N$ linear equations in $N$ unknowns may be considered as an $\left(N^{2}+N\right)$-ary function into the set of $N$-tuples of real numbers. We obtain

$$
\lambda_{\text {LINEAR-EQUATIONS }}=\gamma_{\text {LINEAR-EQUATIONS }}=N^{2}+N,
$$

and

$$
\tau_{\text {LINEAR-EQUATIONS }}=N^{3}+N^{2} .
$$

Transposing an $N \times N$ real matrix in place may be considered as an $N^{2}$-ary function into the set of $N^{2}$-tuples of real numbers,

$$
\lambda_{\text {TRANSPOSITION-IP }}=1, \quad \text { and } \quad \gamma_{\text {TRANSPOSITION-IP }}=\tau_{\text {TRANSPOSITION-IP }}=N^{2},
$$

but for binary operations on permuted $N \times N$ real matrices in place,

$$
\left(a_{i j}\right)_{i, j=0,1, \ldots, N-1} \Rightarrow\left(\operatorname{op}_{2}\left(a_{i j}, a_{\pi(i, j)}\right)\right)_{i, j=0,1, \ldots, N-1}
$$

considered as $N^{2}$-ary functions into the set of $N^{2}$-tuples of real numbers,

$$
\begin{gathered}
\lambda_{\mathrm{MATRIX}-\pi-\mathrm{IP}}=2 \text { for } \pi \neq \mathrm{id}, \\
\gamma_{\mathrm{MATRIX}-\pi-\mathrm{IP}}=N^{2}
\end{gathered}
$$

and

$$
\tau_{\mathrm{MATRIX}-\pi-\mathrm{IP}}=2 N^{2}-\operatorname{card}\{(i, j): 0 \leqq i, j \leqq N-1 \& \pi(i, j)=(i, j)\}
$$

The transposition may be considered as a special permutation $\pi^{*}$ with $\tau_{\mathrm{MATRIX}-\pi^{*}-\mathrm{IP}}=2 N^{2}-N$, where $\mathrm{op}_{2}$ is the exchange operation in this case, $\mathrm{op}_{2}\left(a_{i j}, a_{\pi^{*}(i, j)}\right)=\left(a_{\pi^{*}(i, j)}, a_{i j}\right)$, and the second component of these resulting tuples is considered as a dummy result.
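The counting argument behind the measures for binary operations on permuted matrices can be sketched as follows. It is assumed here that $\mathrm{op}_{2}$ genuinely depends on both of its arguments, so that output position $(i, j)$ depends exactly on the input positions $(i, j)$ and $\pi(i, j)$; the function name `matrix_pi_measures` is illustrative only.

```python
def matrix_pi_measures(N, pi):
    """Dependence measures for (a_ij) -> (op2(a_ij, a_pi(i,j))) on an N x N matrix.

    Each output position (i, j) is assumed to depend on the input positions
    (i, j) and pi(i, j); the two coincide at fixed points of pi.
    """
    dep_sets = [{(i, j), pi(i, j)} for i in range(N) for j in range(N)]
    lam = max(len(s) for s in dep_sets)      # local measure
    gam = len(set().union(*dep_sets))        # global measure
    tau = sum(len(s) for s in dep_sets)      # total measure
    return lam, gam, tau


def transpose(i, j):
    """The special permutation pi*: fixed points are the N diagonal positions."""
    return (j, i)


N = 4
print(matrix_pi_measures(N, transpose))  # (2, 16, 28), i.e. (2, N^2, 2N^2 - N)
```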

Example 9. In this example, three two-dimensional transforms of $N \times N$ pictures will be dealt with. First, the Fourier transform of an $N \times N$ complex matrix (2D-DFT, two-dimensional discrete Fourier transform, cp. [9]) may be considered as a $2 N^{2}$-ary function into the set of $2 N^{2}$-tuples of real numbers. In this case, we have

$$
\begin{gathered}
2 N^{2}-4 \leqq \lambda_{2 \mathrm{D}-\mathrm{DFT}} \leqq 2 N^{2}-1, \\
\gamma_{2 \mathrm{D}-\mathrm{DFT}}=2 N^{2}, \quad \text { and } \quad 2 N^{4} \leqq \tau_{2 \mathrm{D}-\mathrm{DFT}} \leqq 4 N^{4}-2 N^{2},
\end{gathered}
$$

where these maximal values of data dependence are true for each input vector of length $2 N^{2}$ containing non-zero values in all positions. For the exact determination of $\lambda_{2 \mathrm{D}-\mathrm{DFT}}$ and $\tau_{2 \mathrm{D}-\mathrm{DFT}}$, the influence of different values of $N$ has to be studied. The Walsh transform of an $N \times N$ real matrix (2D-WT, two-dimensional Walsh transform, cp. [9]) may be considered as an $N^{2}$-ary function into the set of $N^{2}$-tuples of real numbers,

$$
\lambda_{2 \mathrm{D}-\mathrm{WT}}=\gamma_{2 \mathrm{D}-\mathrm{WT}}=N^{2}, \quad \text { and } \quad \tau_{2 \mathrm{D}-\mathrm{WT}}=N^{4},
$$

where these maximal values of data dependence are true for any input vector of length $N^{2}$. The computation of the parallel Roberts gradient (see Example 1) on images of size $M \times N$ may be considered as an $M N$-ary function into the set of $M N$-tuples of real numbers. For this function,

$$
\lambda_{\text {ROBERTS-GRADIENT }}=4,
$$

$$
\gamma_{\text {ROBERTS-GRADIENT }}=M N, \quad \text { and } \quad \tau_{\text {ROBERTS-GRADIENT }}=4 M N-2 M-2 N-2
$$

by considering the case of non-zero values in all $M N$ positions, and by paying attention to border effects.

Example 10. The computation of the convex hull of a simple polygon, cp. [5], where the $N$ extreme points of the polygon are given by coordinate tuples of real numbers starting with the uppermost-leftmost point, may be considered as a $2N$-ary function into the set of $2 N$-tuples of real numbers. In the resulting vector of length $2 N$, there appear all coordinate tuples of the extreme points of the convex hull of the given polygon in order, starting with the uppermost-leftmost point, and with the same run orientation as the given polygon. Positions actually not needed in this resulting $2 N$-tuple contain value zero by assumption. In this case, it follows that

$$
\lambda_{\mathrm{CH}-\mathrm{SIPOL}}=\gamma_{\mathrm{CH}-\mathrm{SIPOL}}=2 N, \quad \text { and } \quad 2 N^{2}-8 N+12 \leqq \tau_{\mathrm{CH}-\mathrm{SIPOL}} \leqq 4 N^{2}
$$

by analyzing the input situation of special convex polygons with $N$ extreme points as illustrated in Fig. 2, for $N \geqq 4$.

Figure 2. Convex polygon for analyzing the maximal possible data dependence situation, for $N \geqq 4$

The computation of the convex hull of $N$ planar points, cp. [5], given by coordinate tuples of real numbers, may be considered as a $2N$-ary function into the set of $2N$-tuples of real numbers as described above, analogously to the simple polygon situation. For this problem,

$$
\lambda_{\mathrm{CH}-\mathrm{POINT}}=\gamma_{\mathrm{CH}-\mathrm{POINT}}=2 N, \quad \text { and } \quad \tau_{\mathrm{CH}-\mathrm{POINT}}=4 N^{2},
$$

where these maximal values are true for any input situation. The computation of the Voronoi diagram of $N$ planar points, cp. [5], given by coordinate tuples of real numbers, may be considered as a $2N$-ary function into the set of $(18N-33)$-tuples of real numbers in the following sense. The Voronoi diagram may have $2 N-5$ vertices at most, and, as a special planar graph, $3 N-6$ edges at most, for $N \geqq 3$. See Fig. 3 for an illustration of the construction of such a "maximal Voronoi diagram", where the number $v(N)$ of vertices, and the number $e(N)$ of edges satisfy the recursive equations

$$
\begin{gathered}
v(3)=1, \quad e(3)=3 \\
v(N+1)=v(N)+2, \quad \text { and } e(N+1)=e(N)+3
\end{gathered}
$$



for $N \geqq 3$.

Figure 3. Voronoi diagrams for $N=3,4,5,6$ with $2 N-5=1,3,5,7$ vertices and $3 N-6=3,6,9,12$ edges, respectively

The $18 N-33=3(2 N-5)+4(3 N-6)$ positions of the resulting vector of a Voronoi diagram computation are considered as a unique characterization of a Voronoi diagram by linearization of adjacency lists for this special graph structure, with three positions for each vertex, where two are reserved for the coordinate values and one for a common pointer, and two times two positions for each edge: for the index of the vertex at the other end of the edge, or for the slope of the edge, and for a common pointer. For concrete inputs of $N$ points, positions actually not needed in the resulting $(18 N-33)$-tuple contain value zero by assumption. Then, we have

$$
\lambda_{\text {VORONOI-DIAGRAM }}=\gamma_{\text {VORONOI-DIAGRAM }}=2 N,
$$

and

$$
12 N-3 \leqq \tau_{\text {VORONOI-DIAGRAM }} \leqq 2 N(18 N-33)
$$

for $N \geqq 3$, where the local and global case may be analyzed by using a regular $N$-gon, and for the total case a Voronoi diagram in the sense of Fig. 3, with $2 N-5$ points, was used where each point of the diagram essentially depends on three input points, i.e., on six coordinate values.

Example 11. Matching of a pattern of length $M$ against a string of length $N$ ($M \leqq N$, and the elements of pattern and string are assumed to be reals) may be considered as an $(N+M)$-ary function into the set of $(N-M+1)$-tuples on $\{0,1\}$ where, for

$$
f_{\text {PATTERN-MATCHING }}\left(p_{1}, p_{2}, \ldots, p_{M} ; s_{1}, s_{2}, \ldots, s_{N}\right)=\left(e_{1}, e_{2}, \ldots, e_{N-M+1}\right)
$$

we have $e_{i}=1$ iff $s_{i+j}=p_{j+1}$, for all $j=0,1, \ldots, M-1$, and $e_{i}=0$ otherwise, for $i=1,2, \ldots, N-M+1$. We have

$$
\lambda_{\text {PATTERN-MATCHING }}=2 M,
$$

$\gamma_{\text {PATTERN-MATCHING }}=M+N$, and $\tau_{\text {PATTERN-MATCHING }}=2 M(N-M+1)$.

In all three cases, the maximal dependence may be analyzed for the trivial input situation $p_{i}=s_{j}=$ const, for $i=1,2, \ldots, M$ and $j=1,2, \ldots, N$. Detection of a pattern of length $M$ within a string of length $N, M \leqq N$, may be considered as an $(N+M)$-ary function into the set $\{0,1\}$ where the output is equal to $\max \left\{e_{i}: i=1,2, \ldots, N-M+1 \ \&\ f_{\text {PATTERN-MATCHING }}\left(p_{1}, p_{2}, \ldots, p_{M} ; s_{1}, s_{2}, \ldots, s_{N}\right)=\left(e_{1}, e_{2}, \ldots, e_{N-M+1}\right)\right\}$ for input $\left(p_{1}, p_{2}, \ldots, p_{M} ; s_{1}, s_{2}, \ldots, s_{N}\right)$. Then,

$$
\max \{2 M, M+[N / M]\} \leqq \lambda_{\text {PATTERN-SIGNALIZATION }} \leqq M+N
$$

Note that this represents the first example of a computational problem where the equality $\gamma_{f}=n$ remains an open problem, for an $n$-ary function $f$ with $n=N+M$ in the case of pattern detection. As a last example, sorting of $N$ real numbers may be considered as an $N$-ary function into the set of $N$-tuples of real numbers. For this very important problem, we have

$$
\lambda_{\text {SORTING }}=\gamma_{\text {SORTING }}=N, \quad \text { and } \quad \tau_{\text {SORTING }}=N^{2}
$$

where these maximal values are true for $N$ pairwise different input values.
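The pattern matching function of Example 11 and its dependence sets at the constant input can be sketched as follows. The sketch perturbs each input position with one alternative value only, which is an assumption that suffices at the constant input $p_{i}=s_{j}=$ const; the function names and the values `c` and `other` are hypothetical, and positions are 0-based in the code.

```python
def pattern_matching(p, s):
    """e_i = 1 iff the pattern p occurs in s starting at position i (0-based)."""
    M, N = len(p), len(s)
    return tuple(
        int(all(s[i + j] == p[j] for j in range(M)))
        for i in range(N - M + 1)
    )


def dependence_at_constant_input(M, N, c=1.0, other=2.0):
    """Cardinalities of the sets sub_i at the input p_i = s_j = c.

    Returns (max_i card(sub_i), card(union of sub_i), sum_i card(sub_i)).
    """
    x = [c] * (M + N)                      # first M entries: pattern, rest: string
    base = pattern_matching(x[:M], x[M:])
    subs = [set() for _ in base]
    for j in range(M + N):
        y = list(x)
        y[j] = other                       # perturb a single input position
        out = pattern_matching(y[:M], y[M:])
        for i, (a, b) in enumerate(zip(base, out)):
            if a != b:
                subs[i].add(j)
    lam = max(len(sub) for sub in subs)
    gam = len(set().union(*subs))
    tau = sum(len(sub) for sub in subs)
    return lam, gam, tau


print(pattern_matching([1, 2], [1, 2, 1, 2, 3]))  # (1, 0, 1, 0)
print(dependence_at_constant_input(M=3, N=7))     # (6, 10, 30)
```

For $M=3$ and $N=7$ the result $(6, 10, 30)$ coincides with $2M$, $M+N$, and $2M(N-M+1)$.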

## 4. Data transfer lemma and applications

Between the quantitative descriptions of data transfer for SIMD systems (Section 2) and of data dependence for computational problems (Section 3), the following direct relation holds.

Lemma 1. (Data Transfer Lemma). Let SYS $\in$ SIMD, and let $\pi$ be an arbitrary program for SYS for the computation of a function $f$ which is $n$-ary and has $m$-tuple values. Let $R$ denote the set of output registers of SYS where the $m$-tuples appear at the end of the computation ($\operatorname{card}(R)=m$, off-line mode), or those output registers of SYS via which the computed values of the $m$-tuples leave SYS in certain waves of information ($\operatorname{card}(R) \leqq m$, on-line mode). Then, the computation of $f\left(x_{1}, x_{2}, \ldots, x_{n}\right)$ on SYS by $\pi$ requires at least $t_{0}$ steps of computation for a given input $\left(x_{1}, x_{2}, \ldots, x_{n}\right) \in \operatorname{domain}(f)$, where $\Lambda_{\mathrm{SYS}}\left(t_{0}\right) \geqq \lambda_{f}$, $\Gamma_{\mathrm{SYS}}\left(\operatorname{card}(R), t_{0}\right) \geqq \gamma_{f}$, and $T_{\mathrm{SYS}}\left(\operatorname{card}(R), t_{0}\right) \geqq \tau_{f}$.

Proof. Let us consider the local off-line or on-line situation. Assume that $\lambda_{f}=\operatorname{card}\left(\operatorname{sub}_{i}\left(x_{1}, x_{2}, \ldots, x_{n}\right)\right)$, for a given input vector $\left(x_{1}, x_{2}, \ldots, x_{n}\right)$, and for a given position $i$, $1 \leqq i \leqq m$. Let $\operatorname{sub}_{i}\left(x_{1}, x_{2}, \ldots, x_{n}\right)=\left\{j_{1}, j_{2}, \ldots, j_{\lambda_{f}}\right\}$. For any position $j_{k}$, $k=1,2, \ldots, \lambda_{f}$, one of two cases holds. Either the name of an input register receiving value $x_{j_{k}}$ at a given moment will be transferred to the receptive field $\operatorname{rec}_{\pi}^{\left(x_{1}, x_{2}, \ldots, x_{n}\right)}\left(r^{(i)}, t^{*}\right)$ by some operational instructions only, if value $\operatorname{proj}_{i}\left(f\left(x_{1}, x_{2}, \ldots, x_{n}\right)\right)$ appears in register $r^{(i)} \in R$ at time $t^{*} \leqq t_{0}$ of computation; or, during the $t^{*}$ steps of computation of $\operatorname{proj}_{i}\left(f\left(x_{1}, x_{2}, \ldots, x_{n}\right)\right)$, at least one test instruction JGTZ, JZERO, or JLTZ must be performed where the contents of the CPU accumulator depend on the input value $x_{j_{k}}$ at the moment of testing. In the second case, if the test instruction is followed by certain operational instructions directed to register $r^{(i)}$, the name of the input register receiving value $x_{j_{k}}$ at a given moment will be transferred to the receptive field $\operatorname{rec}_{\pi}^{\left(x_{1}, x_{2}, \ldots, x_{n}\right)}\left(r^{(i)}, t^{*}\right)$, too; cp. (iv) in Definition 4. Without loss of generality, assume that $j_{1}, j_{2}, \ldots, j_{v}$, $v \leqq \lambda_{f}$, denote all the positions which have produced register names in the receptive field $\operatorname{rec}_{\pi}^{\left(x_{1}, x_{2}, \ldots, x_{n}\right)}\left(r^{(i)}, t^{*}\right)$. If $v=\lambda_{f}$, then $\lambda_{f} \leqq \operatorname{card}\left(\operatorname{rec}_{\pi}^{\left(x_{1}, x_{2}, \ldots, x_{n}\right)}\left(r^{(i)}, t^{*}\right)\right) \leqq \Lambda_{\mathrm{SYS}}\left(t_{0}\right)$ follows immediately.

For $v<\lambda_{f}$, let $t_{1}, t_{2}, \ldots, t_{w}$ be all the moments where test instructions have to be performed according to $\pi$ and input $\left(x_{1}, x_{2}, \ldots, x_{n}\right)$ such that the contents of the CPU accumulator depend on at least one of the input values $x_{j_{v+1}}, \ldots, x_{j_{\lambda_{f}}}$ at the moments of testing. Consider the following program $\pi^{\prime}$, computing something unspecified, produced from $\pi$ and $\left(x_{1}, x_{2}, \ldots, x_{n}\right)$ in the following way:

- all test instructions at moments $t_{1}, t_{2}, \ldots, t_{w}$ will be deleted in $\pi$, and
- all other instructions of $\pi$ will be performed according to $\pi$ and input $\left(x_{1}, x_{2}, \ldots, x_{n}\right)$, in the same order, where all instructions LOAD $\alpha$ or $\mathrm{OP}_{1} \alpha$, for $\alpha$ equal to $=x$, $m$, $*m$, or $(j)$, will be replaced by $\mathrm{OP}_{2} \alpha$, for the same value of $\alpha$, if such instructions appear in $\pi$.

Thus, the receptive field of register 0, i.e., the CPU accumulator, will increase monotonically according to $\pi^{\prime}$ and $\left(x_{1}, x_{2}, \ldots, x_{n}\right)$. After $t^{*}-w$ operations according to $\pi^{\prime}$, $\operatorname{rec}\left(0, t^{*}-w\right)$ contains all input register names for the input data $x_{j_{v+1}}, \ldots, x_{j_{\lambda_{f}}}$. This receptive field will be combined with $\operatorname{rec}_{\pi^{\prime}}^{\left(x_{1}, x_{2}, \ldots, x_{n}\right)}\left(r^{(i)}, t^{*}-w\right) \supseteq \operatorname{rec}_{\pi}^{\left(x_{1}, x_{2}, \ldots, x_{n}\right)}\left(r^{(i)}, t^{*}\right)$ at moment $t^{*}-w+1 \leqq t^{*}$ by adding an instruction $\mathrm{OP}_{2} \alpha$ (see conditions (OFF.2) and (ON.6)) or $\mathrm{OP}_{2}(j)$ (see conditions (OFF.4) and (ON.7)) to $\pi^{\prime}$. Thus, $\lambda_{f} \leqq \operatorname{card}\left(\operatorname{rec}_{\pi^{\prime}}^{\left(x_{1}, x_{2}, \ldots, x_{n}\right)}\left(0, t^{*}-w+1\right)\right) \leqq \Lambda_{\mathrm{SYS}}\left(t^{*}-w+1\right) \leqq \Lambda_{\mathrm{SYS}}\left(t_{0}\right)$. Note that the off-line or on-line I/O convention is necessary to ensure that a non-accumulator PE register $r^{(i)}$ may be replaced by the accumulator of the same PE which is an output register, too. For this replacement, parallel STORE instructions may be replaced by parallel $\mathrm{OP}_{1}$ instructions using the same masks for PE addresses.

What we have explained is one of the possible ways to ensure the necessary data transfer within time limit $t_{0}$, for the local off-line or on-line situation. The essential point in the program transformation from $\pi$ to $\pi^{\prime}$ may be characterized by the word "linearization", because all test instructions could be deleted, in fact. This linearization approach may be used for the local, global and total situation in the following way.

For the given program $\pi$ and an input situation $I$, all the performed instructions will be written as a linear sequence $S_{0}$. We obtain sequence $S_{1}$ by deletion of all instructions JLTZ, JZERO, JGTZ, JUMP, WRITE, and HALT in sequence $S_{0}$. Now, for the special case of an on-line program, if in sequence $S_{0}$ there were some STORE instructions in front of a WRITE instruction directed to certain output registers $r \in R$, then these STORE instructions will be shifted to the end of sequence $S_{1}$. In the resulting sequence $S_{2}$, all serial or parallel $\mathrm{OP}_{1} \alpha$ or LOAD $\alpha$ instructions will formally be replaced by an $\mathrm{OP}_{2} \alpha$ instruction, in the same position and for the same value of $\alpha$. For the resulting sequence $S_{3}$ we have monotonically increasing receptive fields for all accumulators, for the CPU and PEs. Also, by the described step from $S_{1}$ to $S_{2}$, for sequence $S_{3}$ the receptive fields of output registers will be monotonically increasing for consecutive output waves of information. Now, if in the original sequence $S_{0}$ there was no test instruction, our program linearization is finished. In the other case, in $S_{3}$ we shall place an instruction JZERO, e.g., in that position where the last test instruction was located in sequence $S_{0}$. Now consider an arbitrary output register $r \in R$. If there is an operational instruction behind the JZERO instruction directed to $r$, then register $r$ will obtain the receptive field of the CPU accumulator containing all the register names corresponding to tested input values, cp. (iv) in Definition 4. If there is no operational instruction behind the JZERO instruction directed to $r$, then we shift the last instruction directed to $r$ in front of the JZERO instruction to a position behind this instruction. By consideration of all registers $r \in R$, our program linearization is finished. Note that the length of the resulting linear instruction sequence is bounded by the length of the original sequence $S_{0}$.

Now assume that $\lambda_{f}=\operatorname{card}\left(\operatorname{sub}_{i}\left(x_{1}, x_{2}, \ldots, x_{n}\right)\right)$ for a certain $i$, $1 \leqq i \leqq m$, $\gamma_{f}=\operatorname{card}\left(\bigcup_{i=1}^{m} \operatorname{sub}_{i}\left(y_{1}, y_{2}, \ldots, y_{n}\right)\right)$ and $\tau_{f}=\sum_{i=1}^{m} \operatorname{card}\left(\operatorname{sub}_{i}\left(z_{1}, z_{2}, \ldots, z_{n}\right)\right)$, for certain input vectors $\left(x_{1}, x_{2}, \ldots, x_{n}\right),\left(y_{1}, y_{2}, \ldots, y_{n}\right),\left(z_{1}, z_{2}, \ldots, z_{n}\right)$. These input vectors characterize input situations $I_{x}, I_{y}, I_{z}$ for SYS. By linearization of $\pi$ according to these input situations we obtain linear programs $\pi_{x}, \pi_{y}, \pi_{z}$, respectively, all of length $\leqq t_{0}$. Thus, we have

$$
\begin{aligned}
& \lambda_{\pi_{x}}^{\left(x_{1}, x_{2}, \ldots, x_{n}\right)}\left(R, t_{0}\right) \geqq \lambda_{f}, \\
& \gamma_{\pi_{y}}^{\left(y_{1}, y_{2}, \ldots, y_{n}\right)}\left(R, t_{0}\right) \geqq \gamma_{f}, \\
& \tau_{\pi_{z}}^{\left(z_{1}, z_{2}, \ldots, z_{n}\right)}\left(R, t_{0}\right) \geqq \tau_{f},
\end{aligned}
$$

which proves our statements.

Corollary 1. Let CLASS $\subseteq$ SIMD. For any system SYS $\in$ CLASS, the computation of a function $f$ which is into the set of $m$-tuples of real numbers requires at least $t_{0}$ steps of computation in the worst case, where $\Lambda_{\mathrm{CLASS}}\left(t_{0}\right) \geqq \lambda_{f}$, $\Gamma_{\mathrm{CLASS}}\left(m, t_{0}\right) \geqq \gamma_{f}$, and $T_{\mathrm{CLASS}}\left(m, t_{0}\right) \geqq \tau_{f}$.

Proof. Immediately by Lemma 1 where the generalization about all programs computing the function $f$ is used as well as about all systems of CLASS. For the on-line case note that there may already be a certain $m_{0} \leqq m$ such that $\Gamma_{\text {CLASS }}\left(m_{0}, t_{0}\right) \geqq \gamma_{f}$, and $T_{\text {CLASS }}\left(m_{0}, t_{0}\right) \geqq \tau_{f}$.
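Corollary 1 can be read as a search problem: the lower time bound is the smallest $t_{0}$ satisfying all three inequalities. The following sketch illustrates this for one combination chosen here as an example, namely the transfer functions of the SQUARE class with set $\{0,1,2\}$ from Table 5 ($\Lambda(t)=t^{2}$, and $\Gamma=T=n \cdot \Lambda$ by remark 1 on that table) and the matrix multiplication measures of Example 8. The search assumes that the transfer functions are nondecreasing in $t$; the function names and the cutoff `t_max` are hypothetical.

```python
from typing import Callable


def lower_time_bound(
    Lambda: Callable[[int], int],
    Gamma: Callable[[int, int], int],
    T: Callable[[int, int], int],
    m: int,
    lam_f: int,
    gam_f: int,
    tau_f: int,
    t_max: int = 10_000,
) -> int:
    """Smallest t with Lambda(t) >= lam_f, Gamma(m, t) >= gam_f, T(m, t) >= tau_f."""
    for t in range(t_max + 1):
        if Lambda(t) >= lam_f and Gamma(m, t) >= gam_f and T(m, t) >= tau_f:
            return t
    raise ValueError("no t <= t_max satisfies all three inequalities")


def square_Lambda(t: int) -> int:
    """Local transfer function t^2 (Table 5, SQUARE with set {0, 1, 2})."""
    return t * t


def square_Gamma(n: int, t: int) -> int:
    """Global transfer function n * Lambda(t), as in remark 1 on Table 5."""
    return n * square_Lambda(t)


square_T = square_Gamma  # total transfer coincides with the global one here


def matrix_multiplication_bound(N: int) -> int:
    """Lower bound for N x N matrix multiplication under the assumptions above."""
    return lower_time_bound(
        square_Lambda, square_Gamma, square_T,
        m=N * N, lam_f=2 * N, gam_f=2 * N * N, tau_f=2 * N**3,
    )


print(matrix_multiplication_bound(8))  # 4
```

In this instance the local and the total condition both reduce to $t^{2} \geqq 2N$, so the bound grows like $\sqrt{2N}$.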

Example 12. Let CLASS $=\{$ EXAMP1 $\}$ and consider the computation of the parallel Roberts gradient as described in Example 1. In this case we get the trivial lower time bound 1 only; an upper bound was 29. Now, let CLASS = \{EXAMP3\} and consider the computation of the arithmetical averages of $M$ consecutive waves of information of length $N=2^{n-1}$ as described in Example 3. Here, by Corollary 1 we obtain the lower time bound $n+2 M-2=\max \{n-1, n+2 M-2, n+M-1\}$, cp. equations (6.1), (6.2), (6.3), for values $\lambda_{f}=N, \gamma_{f}=N \cdot M$ and $\tau_{f}=N \cdot M$. An upper bound was $6 M+n$.

Using common asymptotic notations, for both examples the optimal times $\Theta(1)$ and $\Theta(M+n)$ are known as a result.

Theorem 3. For any system $\mathrm{SYS} \in \mathrm{OFF}-\mathrm{NET}_{p}, p \geqq 2$, the computation of a function $f$ which is into the set of $m$-tuples of real numbers requires at least $t_{0}$ steps of computation in the worst case, where

$$
t_{0} \geqq \max \left\{\left(d_{1}-1\right) / 2,\left(d_{2}-m\right) / 2 m,\left(d_{3}-m\right) / 2 m\right\}
$$

for $p=2$, and for $p \geqq 3$

$$
\begin{aligned}
t_{0} \geqq \max \{ & \log _{p-1}\left(d_{1}(p-2)+2\right)-1.586, \\
& \log _{p-1}\left(d_{2}(p-2)+2\right)-\log _{p-1} m-1.586, \\
& \left.\log _{p-1}\left(d_{3}(p-2)+2\right)-\log _{p-1} m-1.586\right\},
\end{aligned}
$$

if $f$ is locally $d_{1}$-dependent, globally $d_{2}$-dependent, and totally $d_{3}$-dependent.

Proof. Immediately by Theorem 1, Definition 7 and Corollary 1 where the relation $\log _{p-1} p<1.586, p \geqq 3$, was used.

In Table 7 are collected, for the classes of off-line systems defined in Section 1, the lower time bounds that may be obtained by using Corollary 1. Because the classes OFF-LINEAR, OFF-PS, OFF-BINTREE and OFF-QUADTREE represent examples for the maximal transfer situation as characterized by Theorem 1, for these classes the lower time bounds are as given by Theorem 3. If a function $f$ into the set of $m$-tuples is globally or totally $d^{\prime}$-dependent, then the value $d$ has to be replaced by $d^{\prime} / m$ in the lower time bounds given in Table 7, to obtain the corresponding values for the global or total situation.

Theorem 4. For any system $\mathrm{SYS} \in \mathrm{ON}-\mathrm{NET}_{p, q}, 2 \leqq p<\infty, 1 \leqq q<p$, the computation of a function $f$ which is into the set of $m$-tuples of real numbers requires at least $t_{0}$ steps of computation in the worst case, where

$$
t_{0} \geqq \max \left\{\left(d_{1}+1\right) / 2,\left(d_{2}+m\right) / 2 m,\left(d_{3}+m\right) / 2 m\right\}
$$

for $q=1$, and for $q \geqq 2$

$$
\begin{gathered}
t_{0} \geqq \max \left\{\log _{q}\left(d_{1}(q-1)+1\right), \quad \log _{q}\left(d_{2}(q-1) / m+1\right),\right. \\
\left.\log _{q}\left(d_{3}(q-1) / m+1\right)\right\},
\end{gathered}
$$

if $f$ is locally $d_{1}$-dependent, globally $d_{2}$-dependent, and totally $d_{3}$-dependent.

Proof. Immediately by Theorem 2, Definition 7 and Corollary 1.

Table 8 collects, for the classes of on-line systems defined in Section 1, the lower time bounds that may be obtained by using Corollary 1. Because the [...] present examples for maximal transfer situations as characterized by Theorem 2, for these classes the lower time bounds are as stated by Theorem 4. As in the case of Table 7, if a function $f$ into the set of $m$-tuples is globally or totally $d^{\prime}$-dependent, then the value $d$ has to be replaced by $d^{\prime} / m$ in the lower time bounds given in Table 8 to obtain the corresponding values for the global or total situation. Note that the value $m$ may be replaced by a value $m_{0} \leqq m$ for special ON-NET systems.
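The on-line bound of Theorem 4 can be evaluated in the same way. Again this is only a sketch: `on_net_lower_bound` is a hypothetical name, and the code does nothing more than transcribe the two cases $q=1$ and $q \geqq 2$ of the theorem.

```python
import math


def on_net_lower_bound(q, m, d1, d2, d3):
    """Evaluate the right-hand side of Theorem 4 (hypothetical helper).

    q  : parameter q of the ON-NET_{p,q} class (1 <= q < p)
    m  : number of components of the function value
    d1 : local, d2 : global, d3 : total data dependence of f
    """
    if q < 1 or m < 1:
        raise ValueError("Theorem 4 assumes q >= 1 and m >= 1")
    if q == 1:
        return max((d1 + 1) / 2, (d2 + m) / (2 * m), (d3 + m) / (2 * m))

    def term(d_eff):
        # log_q(d_eff (q-1) + 1), with d_eff = d1, d2/m or d3/m
        return math.log(d_eff * (q - 1) + 1, q)

    return max(term(d1), term(d2 / m), term(d3 / m))


# 2D-WT on an N x N matrix with N = 8: m = N^2 and dependence measures
# N^2, N^2, N^4 (Table 6), evaluated for q = 2.
N = 8
print(on_net_lower_bound(2, N * N, N * N, N * N, N ** 4))  # about 6.02
```

For $q=2$ the local term reduces to $\log _{2}(d_{1}+1)$, the form listed for BINTREE in Table 8.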

## 5. Conclusions

In this paper we have given a general framework for the description of parallel processing systems, and explained how data flow may be used for analyzing lower time bounds in general. Note that this approach may be applied to supercomputers as well as to on-chip realizations. Problems connected with the technical features of architecture elements were bypassed by the selected level of abstract system description. Thus, in the discussion of parallel algorithms for a given model SYS $\in$ SIMD we may have in mind quite different technical implementations, but we may discuss parallel algorithms for all of them at once using the abstract model SYS $\in$ SIMD. For example, an important problem is given by the necessary decision between different structures of parallel processing systems to ensure efficient algorithmic solutions for classes of computational problems such as those mentioned in Examples 8 (matrix-type computations), 9 (two-dimensional transforms), 10 (geometric problems), or 11 (combinatorial problems). According to our considerations in [4], the selection of parallel algorithms crucially depends on the given parallel processing system, and comparing different SIMD systems on the basis of knowledge about optimal algorithms is quite a hard task. Also, there are nearly as many different models for parallel processing as papers on this topic, making comparative studies of different parallel structures nearly impossible. In the present paper an attempt was made to propose a classification of special parallel processing systems which have been of widespread interest in the past. The proof of the practicability of the proposed exact definition of SIMD systems will be the subject of forthcoming papers; the first programs of the PARSIS project fit well into this framework.

Table 6. Local, global and total data dependence measures

| Computational problem $f$ | $n$ | $m$ | $\lambda_{f}$ | $\gamma_{f}$ | $\tau_{f}$ |
| :---: | :---: | :---: | :---: | :---: | :---: |
| MATRIX MULTIPLICATION | $2 N^{2}$ | $N^{2}$ | $2 N$ | $2 N^{2}$ | $2 N^{3}$ |
| MATRIX INVERSION IP | $N^{2}$ | $N^{2}$ | $N^{2}$ | $N^{2}$ | $N^{4}$ |
| DETERMINANT | $N^{2}$ | 1 |  | $N^{2}$ |  |
| LINEAR EQUATIONS | $N^{2}+N$ | $N$ | $N^{2}+N$ | $N^{2}+N$ | $N^{3}+N^{2}$ |
| TRANSPOSITION IP | $N^{2}$ | $N^{2}$ | 1 | $N^{2}$ | $N^{2}$ |
| MATRIX $\pi$ IP | $N^{2}$ | $N^{2}$ | 2 for $\pi \neq \mathrm{id}$ | $N^{2}$ | $2 N^{2}-\#\{(i, j): \pi(i, j)=(i, j)\}$ |
| 2D-DFT | $2 N^{2}$ | $2 N^{2}$ | $\geqq 2 N^{2}-4$ <br> $\leqq 2 N^{2}-1$ | $2 N^{2}$ | $\geqq 2 N^{4}$ <br> $\leqq 4 N^{4}-2 N^{2}$ |
| 2D-WT | $N^{2}$ | $N^{2}$ | $N^{2}$ | $N^{2}$ | $N^{4}$ |
| ROBERTS GRADIENT | $M N$ | $M N$ | 4 | $M N$ | $4 M N-2 M-2 N-2$ |
| CH SIPOL | $2 N$ | $2 N$ | $2 N$ | $2 N$ | $\geqq 2 N^{2}-8 N+12$ <br> $\leqq 4 N^{2}$ |
| VORONOI DIAGRAM | $2 N$ | $18 N-33$ | $2 N$ | $2 N$ | $\geqq 12 N-30$ <br> $\leqq 36 N^{2}-66 N$ |
| PATTERN MATCHING | $N+M$ | $N-M+1$ | $2 N$ | $M+N$ | $2 M(N-M+1)$ |
| PATTERN SIGNALIZATION | $N+M$ | 1 |  | $\geqq \max \{2 M, M+[N / M]\}$, $\leqq M+N$ |  |
| SORTING | $N$ | $N$ | $N$ | $N$ | $N^{2}$ |

Table 7. Lower time bounds for off-line systems in OFF-CLASS for computing a locally $d$-dependent function

| CLASS | $p$ | Lower time bound | $d=128$ | $d=128^{2}$ |
| :--- | :---: | :--- | :---: | :---: |
| LINEAR | 2 | $(d-1) / 2$ | 64 | 8,192 |
| HEXAGONAL | 3 | $\left(\left(\frac{8}{3} d-\frac{5}{3}\right)^{1 / 2}-1\right) / 2$ | 9 | 105 |
| SQUARE or ILLIAC | 4 | $\left((2 d-1)^{1 / 2}-1\right) / 2$ | 8 | 91 |
| TRIAGONAL | 6 | $\left(\left(\frac{4}{3} d-\frac{1}{3}\right)^{1 / 2}-1\right) / 2$ | 7 | 74 |
| DIAGONAL | 8 | $\left(d^{1 / 2}-1\right) / 2$ | 6 | 64 |
| PS | 3 | $\log _{2}(d+2)-1.586$ | 6 | 13 |
| BINTREE | 3 | $\log _{2}(d+2)-1.586$ |  |  |
| BINTREE, top node |  | $\log _{2}(d+1)-1$ |  |  |

Table 8. Lower time bounds for on-line systems in ON-CLASS for computing a locally $d$-dependent function

| CLASS | $p$ | $\left\{i_{1}, \ldots, i_{q}\right\}$ | Lower time bound | $d=128$ | $d=128^{2}$ |
| :--- | :--- | :--- | :--- | :--- | :--- |
| LINEAR | 2 | $\{0\}$ | $(d+1) / 2$ | 65 | 8,193 |
| HEXAGONAL | 3 | $\{0,1\}$ | $\left((8 d+1)^{1 / 2}-1\right) / 2$ | 16 | 181 |
| SQUARE or ILLIAC | 4 | $\{0,1,2\}$ | $d^{1 / 2}$ | 12 | 128 |
| TRIAGONAL | 6 | $\{0,1,2,3,4\}$ | $\left(\left(\frac{8}{5} d-\frac{3}{5}\right)^{1 / 2}-1\right) / 2$ | 7 | 81 |
| DIAGONAL | 8 | $\{0,1,2,3,4,6,7\}$ | $\left(\left(\frac{8}{7} d-\frac{3}{7}\right)^{1 / 2}-1\right) / 2$ | 6 | 64 |
| BINTREE | 3 | $\{1,2\}$ | $\log _{2}(d+1)$ | 8 | 15 |
| TRIANGLE | 5 | $\{1,2,3,4\}$ | $\log _{2}(d+1)$ | 8 | 15 |
| QUADTREE | 5 | $\{1,2,3,4\}$ | $\log _{4}(3 d+1)$ | 8 |  |
| PS | 3 | $\{0,1\}$ | $f_{t_{0}+2} \geqq d+2$ for the <br> $F_{0}$ | 11 | 21 |

By using Tables 6, 7, and 8 the interested reader may obtain lower time bounds for different combinations of SIMD systems and computational problems, e.g., the lower time bound $\log _{2}\left(N^{2}+1\right)$ for the two-dimensional Walsh transform on ON-TRIANGLE systems. The characterization of data dependencies for computational problems as given by Definition 7 may be refined, e.g., by considering changes of function values caused by changing arguments not only in one position but in several positions.
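As a worked instance of this table lookup, the Walsh-transform bound can be traced step by step. The sketch takes $m=N^{2}$, $\lambda_{f}=\gamma_{f}=N^{2}$ and $\tau_{f}=N^{4}$ for 2D-WT from Table 6 and the ON-TRIANGLE bound $\log _{2}(d+1)$ from Table 8, and replaces $d$ by $d^{\prime} / m$ for the global and total measures as described after Theorem 4:

$$
\begin{aligned}
\text{local:} \quad & \log _{2}\left(\lambda_{f}+1\right)=\log _{2}\left(N^{2}+1\right), \\
\text{global:} \quad & \log _{2}\left(\gamma_{f} / m+1\right)=\log _{2} 2=1, \\
\text{total:} \quad & \log _{2}\left(\tau_{f} / m+1\right)=\log _{2}\left(N^{2}+1\right) .
\end{aligned}
$$

The maximum of the three terms is $\log _{2}\left(N^{2}+1\right)$. For $N=128$ this is slightly more than 14, so at least 15 steps are needed, in agreement with the entry for $d=128^{2}$ in the TRIANGLE row of Table 8.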


#### Abstract

Starting with an exact definition of classes of SIMD (single instruction, multiple data) systems, a general approach to obtaining lower time bounds by data flow analysis is presented. Several interconnection schemes, such as the square net, the perfect shuffle, the infinite binary tree, etc. are analyzed with respect to their data transfer possibilities. For some types of computational problems the data dependencies are analyzed in a quantitative way. From both types of analysis, lower time bounds result for many combinations of SIMD systems and computational problems, for example, $O(\log N)$ for on-line quadtree-net systems and the computation of Voronoi diagrams for $N$ planar points, $O(N)$ for off-line diagonal-net systems and the two-dimensional discrete Fourier transform, and $O(\sqrt{N})$ for off- or on-line Illiac-net systems and sorting of $N$ items.

CENTER FOR AUTOMATION RESEARCH UNIVERSITY OF MARYLAND COLLEGE PARK, MD 20742 U.S.A. * PERMANENT ADDRESS: FRIEDRICH SCHILLER UNIVERSITY DEPARTMENT OF MATHEMATICS, UNIVERSITY TOWER 17TH FLOOR, DDR-6900 JENA, GERMAN DEMOCRATIC REPUBLIC


The support of the U. S. Air Force Office of Scientific Research under Grant AFOSR-77-3271 is gratefully acknowledged, as is the help of Janet Salzman in preparing this paper. The author thanks the government of the German Democratic Republic for financial support and Azriel Rosenfeld for his efforts in making the author's stay in College Park possible and effective as well.

## References

[1] H. Abelson, Lower bounds on information transfer in distributed computations, J. ACM 27 (1980), 384-392.
[2] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, MA (1974).
[3] W. M. Gentleman, Some complexity results for matrix computation on parallel processors, J. ACM 25 (1978), 112-115.
[4] R. Klette, Zeitkompliziertheit von Berechnungsproblemen der digitalen Bildverarbeitung Vergleiche zwischen sequentieller und paralleler Datenverarbeitung (in Slovakian, to appear, VEDA Publish. House, Bratislava).
[5] R. Klette, Geometrische Probleme der digitalen Bildverarbeitung, BILD UND TON 35 (1982), 101-110.
[6] R. Klette and R. Lindner, Zweidimensionale Vektormaschinen und ihr Leistungsvermögen bei der Lösung von Entscheidungsproblemen der Aussagenlogik, EIK 15 (1979), 37-46.
[7] T. Legendi, A cellular processor project, International Workshop on Parallel Processing by Cellular Automata, Berlin, GDR, Sept. 15-16, 1982.
[8] V. R. Pratt and L. J. Stockmeyer, A characterization of the power of vector machines, J. Computer System Sciences 12 (1976), 118-121.
[9] A. Rosenfeld and A. C. Kak, Digital Picture Processing (Second Ed.), Academic Press, New York (1982).
[10] H. J. Siegel, A model of SIMD machines and a comparison of various interconnection networks, IEEE Trans. Computers C-28 (1979), 907-917.

Received May 13, 1983.


[^0]:    5 Acta Cybernetica VI/4

