Abstract
Introduction
In multiprocessor architectures, overall system performance is closely related to the structure of the interconnections between processors and memory modules. A number of network configurations have been characterized in survey papers [2, 4, 7, 10, 13].
The most common measures characterizing interconnection networks (INs) are asymptotic hardware complexity, complexity of the setup algorithm, and total delay time. A number of SSI-related assessments of hardware complexity have been verified in a VLSI environment. Franklin [5] showed that for both banyan and crossbar INs the chip area grows as $O(n^2)$. Propagation delay grows as $O(n)$ for the crossbar and approximately $O(n^{0.19} \log_2 n)$ for the banyan.
Chen et al. [3] derived an $O(n^2/\log n)$ estimate of the area required for a VLSI realization of the Omega multistage IN. Lin and Shin [9] proved an $O(n^2/\log^2 n)$ lower bound on wire area for the shuffle-exchange and cube-connected multistage INs. They also found a high occurrence of long paths in both types of networks. Layouts that consume more chip area are more expensive to fabricate and less reliable. Long wires increase propagation delays and hence reduce the throughput of the system. As the scale of integration increases, cellular interconnection networks (CINs) can be a good alternative in the design of multiprocessor architectures (e.g. SIMD computers). CINs have many properties that are attractive for VLSI design: regular form, short local connections between adjacent cells, easy fabrication, and simplified testing and diagnosis. The main families of CINs are triangular, diamond, rectangular, rhomboidal, pruned rectangular, approximately square, cascaded, etc. [1, 6, 8, 11]. $O(n \log n)$ programming algorithms for the triangular and diamond CINs were described in [12]. Similarly, $O(n)$ programming algorithms for triangular and cascaded CINs were developed in [11].
The present paper shows that an $(n/2) \times (n/2)$ cellular array is sufficient to realize an arbitrary permutation when two passes through the network are allowed. The square array provides efficient two-phase interprocessor communication in a model of a SIMD computer. It has neither long connections nor crisscrossing between cells. We give below a simple $O(n)$ setup algorithm for this network.
A model of SIMD computer
One model of SIMD multiprocessor architecture is presented in Fig. [...] composition of the two permutations: $s = s_1 s_2^{-1}$, where $s_1$ is related to the "write" operation and $s_2$ characterizes the state of the IN during the "read" phase.
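The composition $s = s_1 s_2^{-1}$ can be illustrated with a minimal Python sketch. The dictionary representation, the helper names `inverse` and `compose`, the left-to-right order of application, and the 4-element permutations are all assumptions made for illustration; none of them is taken from the paper.

```python
def inverse(p):
    """Return the inverse of a permutation stored as a dict {i: p(i)}."""
    return {image: source for source, image in p.items()}


def compose(first, second):
    """Apply `first`, then `second` (left-to-right order, an assumption)."""
    return {i: second[first[i]] for i in first}


# Hypothetical 4-element permutations, not taken from the paper.
s1 = {1: 2, 2: 3, 3: 1, 4: 4}  # state of the IN during the "write" phase
s2 = {1: 2, 2: 4, 3: 3, 4: 1}  # state of the IN during the "read" phase

# Overall data permutation s = s1 s2^{-1}.
s = compose(s1, inverse(s2))
print(s)
```

Composing `s` with `s2` again recovers `s1`, which is a quick way to check that the inverse was applied on the intended side.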
Architecture of the square IN
The square CIN shown in Fig. 2 is an $(n/2) \times (n/2)$ array whose cells are $(2 \times 2)$-permuters ($n$ is even). Each cell can be set individually to one of two states: the "cross" state and the "interconnection" state [6]. As a mathematical model of this network during the "write" and "read" phases we apply certain coset decompositions of $S_n$ (the symmetric group of all $n$-permutations); see [8, 11] for details.
Programming the square IN
Now we present the algorithm SQUAREFACTOR for programming the square CIN in the first phase of the data transfer. Before the algorithm is executed, all cells of the network are in the "cross" state. No more than $n/2$ individual cells are then set to the "interconnection" state. Before the "read" phase, the square network is decomposed into two triangular subnetworks by setting the cells of the left diagonal to the "interconnection" state. According to this decomposition we have to relabel the square CIN using the formula $R_2 = (n + 1) - R_1$, where $R_2$ is the number of the SM register in Fig. 2 and [...] SQ 2 (n/2).
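The relabeling formula $R_2 = (n + 1) - R_1$ simply reverses a numbering that runs from 1 to $n$. The short sketch below shows this for $n = 8$; the function name `relabel`, the range check, and the reading of $R_1$ as the label before relabeling are assumptions for illustration only.

```python
def relabel(r1, n):
    """Map a label R1 in 1..n to R2 = (n + 1) - R1."""
    if not 1 <= r1 <= n:
        raise ValueError("label out of range 1..n")
    return (n + 1) - r1


n = 8  # hypothetical network size; n must be even for the square CIN
new_labels = [relabel(r, n) for r in range(1, n + 1)]
print(new_labels)
```

Applying the formula twice returns the original label, so the same function converts in both directions.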
Analysis:
It is clear that the algorithm SQUAREFACTOR terminates in $O(n)$ steps: it has four loops with $n/2$ iterations each, i.e. $2n$ iterations in total. Since both algorithms KLWFACTOR and REVKLWFACTOR have been proved to run in $O(n)$ time as well, the total setup time of the square IN is $O(n)$.
Our algorithms developed for the square IN can also be applied to programming a family of rectangular CINs [6].
A network programming example
[...] required to compute the permutations $s_P$ and $s_Q$ in cycle representation: $s_P = (142)(3)$ and $s_Q = (5687)$. After the "write" operation (steps 2 and 3), in step 4 the algorithm KLWFACTOR($s_P$) produces $s_P$ in two-cycle form: $s_P = (12)(24)$. Note that 1-cycles are omitted. Similarly, in the same step the algorithm REVKLWFACTOR($s_Q$) produces $s_Q = (56)(68)(78)$. Before the "read" operation the following cells of the square CIN have to be set to the "interconnection" state: (1,2), (2,4), (5,6), (6,8), (7,8) (see Fig. 3.b). The final permutation of data in our SIMD computer is a composition of $s_1$ and $s_2^{-1} = (s_P s_Q)$.
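The two-cycle factorizations quoted in this example can be checked mechanically. The sketch below is a generic consistency check, not an implementation of KLWFACTOR or REVKLWFACTOR, whose steps are not reproduced here. The helper names and the order in which the transpositions are applied are assumptions inferred from the quoted cycles: $(12)(24)$ matches $s_P$ when read left to right, and $(56)(68)(78)$ matches $s_Q$ when read right to left.

```python
def cycles_to_map(cycles, n):
    """Build the mapping {i: image} on 1..n from a list of disjoint cycles."""
    mapping = {i: i for i in range(1, n + 1)}
    for cycle in cycles:
        for k, elem in enumerate(cycle):
            mapping[elem] = cycle[(k + 1) % len(cycle)]
    return mapping


def apply_transpositions(transpositions, n):
    """Apply the swaps to every point of 1..n, one after another in list order."""
    mapping = {}
    for i in range(1, n + 1):
        x = i
        for a, b in transpositions:
            if x == a:
                x = b
            elif x == b:
                x = a
        mapping[i] = x
    return mapping


n = 8
s_P = cycles_to_map([(1, 4, 2), (3,)], n)
s_Q = cycles_to_map([(5, 6, 8, 7)], n)

# (12)(24), applied left to right, reproduces s_P = (142)(3).
p = apply_transpositions([(1, 2), (2, 4)], n)

# (56)(68)(78), applied right to left, reproduces s_Q = (5687).
q = apply_transpositions([(7, 8), (6, 8), (5, 6)], n)

print(p == s_P, q == s_Q)
```

Read left to right, the second product gives the inverse of $s_Q$ instead, which is consistent with the "reversed" naming of REVKLWFACTOR, although that interpretation is only an inference from the quoted cycles.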

