Abstract-The solution of tridiagonal systems is a topic of great interest in many areas of numerical analysis. Several algorithms have recently been proposed for solving triadiagonal systems based on the Divide and Conquer (DC) strategy. In this work we propose an unified parallel architecture for DC algorithms which present the data flows of the Successive Doubling, Recursive Doubling and Parallel Cyclic Reduction methods. The architecture is based in the perfect unshuffle permutation, which transforms these data flows into a constant geometry one. The partition of the data arises in a natural manner, giving way to a systolic data Row with a wired control section. We conclude that the constant geometry Cyclic Reduction architecture is the most appropriate one for solving tridiagonal systems and, from the point of view of integration in VLSI technology, is the one which uses the least amount of area and the smallest number of pins.
I. INTRODUCTION RIDIAGONAL SYSTEMS (TS) are used in several areas
T of science and engineering. They arise when solving partial differential equation systems using FACR (Fourier Analysis-Cyclic Reduction), Successive Line methods, OverRelaxation methods, etc.
[ 1 ], [9], 161. The most efficient algorithm for solving TS systems in single processor is Gaussian Elimination (GEL Unfortunately, the GE algorithm cannot be parallelized efficiently. The alternative is to define algorithms which exploit the inherent parallelism of the TS systems. The classic parallel algorithms for solving TS systems are Stone's [ 161 [ 101 recursive doubling method (RD) and Hockney and Jesshope's [7] cyclic reduction method (CR). Both algorithms are based on the Divide and Conquer strategy (DC).
The basic idea underlying the DC strategy is to solve a problem of a given size by dividing it into similar problems of smaller sizes, solving them and combine their solutions for obtaining the solution of the original problem. Usually the same technique is applied to each subproblem, making the procedure recursive. The DC strategy has been the most used in the formulation of TS algorithms because the data flow analysis has an implicit parallel architectural model, i.e., distributed memory system with one processor per equation. In general, the algorithms proposed consist of three phases. In Manuscript received February 24, 1993; revised September IO, 1993 Emilio L. Zapata the first phase, (elimination), each processor performs some transformations over the equations it is assigned and obtains one or more equations. In the second phase (reduction), the tridiagonal system made up of these equations is solved by all the processors using an interconnection network. In the third phase (substitution), the solution obtained is sent to each processor which completes the solution of the problem.
There has been much recent effort dedicated to the development of parallel TS algorithms which exploit the inherent parallelism of the DC strategy. We can establish three directions in the design of parallel TS algorithms based on the DC strategy: extension of the parallelism of the RD and CR methods by means of the Parallel Prefix (PPI [5] and Parallel CR (PARACR) [7] algorithms, respectively; design of new DC algorithms [15] , 118); parallelization and or removal of the three phases mentioned above. Algorithms based on the RD method are those proposed by Egecioglu et al. [5] (hypercube network) and Lin and Chung [IO] (exchange-perfect unshuffle network). The TS algorithm proposed by Wang and Mou [ 181 presents the Successive Doubling (SD) data flow, which is similar to that of the radix 2 decimation in time FFT algorithm. They also propose the GEDC algorithm [ 181 implementing the Gaussian Elimination (GE) in phase one, using the DC algorithm in phase two. Recently, Spaletta and Evans [ 15 1 have presented a recursive decoupling method based TS algorithm which uses Sherman-Morrison formula to calculate the inverse of coefficients matrix.
From the point of view of communications, the TS algorithms referenced can be classified into the following groups: all to all broadcast [4] , [17] ; binary tree [SI, [12] , [15] ; RD data flow [ 2 ] , [5] , [lo] ; SD data flow [ I l l , [18] ; CR data flow 161, 181; PARACR data flow 171; ring [3] and linear data flow [ 141, [17] . Krechel et al. [8] use an exchange network (CR in phase one) and a binary tree (phase two) so that phase three is carried out without communications. Reale 114) proposes a variant of the CR algorithm that uses a linear array in phases one and three. Muller and Scheerer [12] propose a method whose phase two is solved in parallel using a binary tree. The algorithm presented by Spaletta and Evans [15] has only two phases, the second one having a binary tree pattern.
Sun et al. [ 171 propose three algorithms: Parallel Partition LU (PPT), which uses the LU decomposition in phases one and two, requiring an all to all broadcast among phases; Parallel Partition Hybrid (PPH), substitutes phases two and three by the RD algorithm; Parallel Diagonal Dominant (PDD), which only needs nearest neighbor communications (linear array RD [5] and PARACR [7] data flows. These algorithms are only, in general, stable (reliable) when applied to systems that can be solved without pivoting. We have unified these three flows of data by means of their transformation into a single constant geometry flow. This constant geometry is obtained by performing a perfect unshuffle permutation of the data generated in each of the log;, N phases of the algorithms ( N = 2" is the number of equations in the TS system). The action carry out by the unshuffling of the data is to transform the original TS algorithm into a new one with the same data flow in all its stages. The resulting architecture is a column of Q processors (0 = 24, with q 5 R -1), where the partition of the data arises in a naturd manner. The rest of the work has been structured as follows. In Section 11, we present three new constant geometry DC algorithms for solving TS systems. The design of the processor for the ekaluation of these algorithms is presented in Section 111. The processor column, based on the perfect unshuffle permutation is introduced in Section IV. Finally, in Section V, we evaluate the unified parallel/pipelined architecture proposed.
TRIDIAGONAL SYSTEM ALGORITHMS
WITH CONSTANT GEOMETRY $4 tridiagonal system is made up of equations of the type for i = O , --. , N -1 , where 2 -1 = XN = 0. We consider systems with N = 2" equations and we will assume that the sequence S of equations is distributed as a matrix D of dimensions 2 x 2"-' so that In this section, we will briefly describe the SD [181, PP [ 5 ] and PARACR [7] algorithms and we will formulate a constant geometry version for each one of them. We now give the definitions of two operators which are particularized for the needs of this work.
Dejinition 1: A butterfly operator B over a column of a matrix of dimension a x b is a function with a inputs and c outputs. In the second phase, we use equation ((s + 1)2(t-1))', whose unknowns are s2t-1, (s + 1)2t-1 and (s + 2)2t-1, in order to eliminate (s + 1)2t-1 in equations e, Y,
1.

271-1 -
( 1 + 2t-1)' and ((s + 2)2t-1 -1)'. This phase produces four equations with identical first and third unknowns, s2t-1 -1 and (s + 2)2t-1 respectively and whose central unknowns are s2t-1, i, i + 2t-1 and (s + 2)2t-1 -1. As s is even, these four equations are, In the last stage of the algorithm, each trio [i]'") contains equation E,(") with the unknown terms (-2". i, 2"). As X -N = .EN = 0, a simple division permits the calculation of each unknown term 2,. Fig. l(a) We will assume that in each stage t of the CGSD algorithm, the trios y"l(i) flow towards the processor as a matrix Dt-l of 2"-l columns with 2 elements each, following the distribution defined in (2):
In Dt-l, we have suppressed the subscripts t -1 of each element. The trio with index i is located in column [in . . .&I, row [ill of the matrix. From the computational viewpoint, we can consider that the column indicates the cycle in which the trio will be processed and the row the bus through which the trio accesses the processing section. We will say that the index admits a (cycle, bus) interpretation.
In the execution of the tth stage of the CGSD algorithm we can differentiate the following two actions.
1) The butterfly operator &d is applied to each column of Dt-1
And thus, matrix Dt-l is transformed into matrix Gt-l:
2) The perfect unshuffle operator is applied to Gt-l in order to obtain matrix Dt.
Thus, the tth stage can be expressed as the application of the operator string Bsdr to matrix Dt-l. Given the fact that all the stages are identical, the algorithm CGSD consists in the application of Bsdr n times to matrix DO, that is, (5) is applied.
Observe that y t ( r t ( i ) ) = [i](t), where I't means which r is applied t times.
In the case (b) all the stages ares similars: B,d is applied to trios y"'(i) at distance of 1 and the perfect unshuffle operator r permutes the resulting sequence to obtain trios yt(Z).
B. Recursive Doubling Algorithm
The recursive doubling method (RD) (1) or, more briefly, Xi+l = P;Xi. Consequently,
POX().
(12) Therefore, solving the system is reduced to finding all the partial products of the matrices Pi, calculating $0 using the last equation and substituting in order to obtain the rest of the unknown quantities. It has been observed that some numerical stability problems can arise in the use of recursive doubling algorithm when the size of the system is large.
The RD method is the most widely used one for performing the parallel calculation of all the partial products or prefixes of a succession of N elements. It permits the calculation of all the prefixes in [log, N1 stages and presents the data flow of Fig. 2(a) . As can be observed, the data flow is very regular. Each node consists in a 3 x 3 matrix and the operation to be carried out in each node is the matrix product. The initial matrices Pi (see (1 1)) have as last row (0, 0, l}, a feature which is maintained throughout the product of them. Thus, in order to obtain the product matrix only six elements must be computed. That is, 20 arithmetic operations are camed out in each node. A node identified as 0 at tth stage means which its matrix is the same that in the previous stage. No computation is carried out in these nodes. In the other hand, a node identified as represents the product of its input matrices. At the tth stage (t = 1 , 2 , 3 ) a total of 8 -2t-1 matrix products are computed and each node participates in two products. Input matrices to each node are at distance of 2*-'.
Constant Geometry: The RD algorithm admits a constant geometry formulation with characteristics similar to those of the CGSD algorithm.
Theorem 2: The RD algorithm is equivalent to the following algorithm with constant geometry and which we will name CGRD:
where 0 5 IC < 2t-1 and y ( i -l)y(i) is the product of two matrices, being y(O)(i) = Pi.
The proof of Theorem 2 is given in the Appendix.
As in the case of the CGSD algorithm, we assume that in the tth stage the data flows towards the processor as a matrix Dt-l (see (7)). The phases we must now consider in the execution of each stage of the CGRD algorithm are the following. 
y(2" -1). . . y(2i f 1 ) . . . y(3) y (1) where 1 is the unit matrix for the matrix product. We will denote as Frd the operator which performs this first phase.
2) The butterfly operator B,d is carried out over each column of Ft-:
where z ( i ) is equal to the right hand side of expression 
Expression (1 3) evidences the geometry of the algorithm. Fig. 2(b) shows the data flow of the CGRD algorithm for A' = 8. In this case all stages have the same data flow, as a consecuence of the perfect unshuffle permutation applied to the results z t -' ( i ) (see (15) ) to obtain yt((i) (see (13)). This permutation forces the nodes to be located at distance one. The nodes of Fig. 2(b) have the same meaning that the nodes of Fig. 2(a) . Note that a node of Fig. 2 (see (4)).
C. Parallel Cyclic Reduction Algorithm
The PARACR algorithm by Hockney and Jesshope [7] consists of two stages: elimination and substitution. The elimination stage consists in obtaining in the tth stage equations As the unknown terms in E~~~~? l have subscripts (i -P , i2t-1 , z), the first term in the right hand side of (18) 
,(").
The right-hand side of (18) will be abbreviated as
The PARACR algorithm presents a regular data flow, Fhown in Fig. 3(a) for N = 8. Note that it is similar to the RD algorithm, but with two address flow. Each node is an equation which has four coefficients. The number of stages is n. There are four classes of nodes in Fig. 3(a) An advantage of the parallel cyclic reduction algorithm is that if the TS system is sufficiently diagonally dominant, the elimination phase can be stopped before completion without loss of accuracy [7] .
Constant Geometry: As in the previous cases, a version with constant geometry can be formulated for the PARACR algorithm.
Theorem 3: The PARACR algorithm is equivalent to the constant geometry algorithm we will name CGCR:
and y(O)(i) is equation E,(').
The proof of Theorem 3 is analogous to that of Theorem 2 (see Appendix).
As in previous cases, we will assume that in the tth stage, the input data is organized as a matrix Dt-l defined in (7). Each stage consists of three actions.
Matrix Dt-l is converted into matrix Ft-1: The result is matrix Gt-l defined in (9). We apply the perfect unshuffle operator to Gt-l in order to obtain D,. Summarizing, each stage can be expressed as the operator string <crBc,17 applied to matrix D t P l and the CGCR algorithm as (<,,B,,I') applied n times to the initial matrix DO.
That is,
The data flow of the CGCR algorithm presents the same characteristics as the previous constant geometry algorithms. Fig. 3(b) shows the algorithm for N = 8. All stages have the same data flow, being the input nodes at distance one. This is a consecuence of the perfect unshuffle permutation, applied each stage. Each node on Fig. 3(b) represents an equation with four coefficients. The operation defined in (1 8) is carried out in each node marked as 0. The operation [ I , y(i) , y(2 + l)](t-l) is carried out in the nodes marked with circles and the operation [y(2-1), y(z), in those marked with shadowed triangles.
DESIGN OF THE PROCESSOR
The processor associated with the CGSD, CGRD and CGCR algorithms will consist of two sections: processing section (PS), hardware implementation of the specific butterfly operator of each algorithm, and routing section (RS), hardware implementation of the operators &d. tCr and I?. We will obtain architectures with constant geometry by adequately designing the RS sections so that the locality of the operations is maintained in all the stages of the algorithm. The PS sections compute the different butterfly operators and will not be studied in detail. In section V we will do a comparison of their arithmetic complexities.
The I' operator is common to the three algorithms (see (IO), Lemma 2:
The proof is straightforward from (4), (24), and (25).
Operator P can be implemented by means of a FIFO queue of length N with two inputs located in cells N -1 and N/2-1, using a numbering scheme from right to left, and an output in cell 0. The decimation operator can be implemented as an N cell FIFO queue with outputs in cells 1 and 0 and one input in cell N -1. The hardware implementation of operator r is immediate from the implementations of p and 5.
A. Design of the Processor for the CGSD Algorithm
The constant geometry architecture for fast orthogonal transforms proposed in [20] and [20] are suitable for the CGSD algorithm (see (10) another one implementing the butterfly operator Bsd. Fig. 4 shows the design of the processor (solid lines). The RS section is made up of two FIFO queues. In each stage, one queue acts as output buffer (writing the data generated in the current stage), whereas the other acts as input buffer where the data generated in the previous stage is read. This function is exchanged in the next stage under the control of several multiplexors (1 to 4). The inputs to the FIFO queue are provided by the concatenation operator. The writing process is carried out in parallel over each one of the segments of the queue. In each operation cycle the data is written in the FIFO queue, although the displacement of the queue is only one position each cycle. The Operation of writing all the data is completed in Y-' cycles.
The decimation operator provides the outputs of the FIFO queue. The reading process consists in directing each of the data items located in the last two cells to one of the output buses (so, s l ) . The FIFO queue has to perform a 2 cell shift operation in order to position the next two data items in the reading location. The reading of all the data produced in the stage also uses 2"-l cycles. Each output bus is connected to the corresponding input bus of the processing section ( S O to b and s1 to c ) to allow for the sequencing of the stages. On the other hand, in order to consider the processing section PS as a segment of the processor, the length of the FIFO queues will be N -1 cells.
B. Design of the Processor for the CGRD Algorithm
Taking expression (16) into account, the design of the processor for the CGRD algorithm is obtained by adding to the previous design the necessary logic to make operator &d provide the first row of matrix Ft-l (see (14)). Observe that the elements of this first row are in the last row, preceding column. Therefore, the element of the third row which is going to be processed in the ith cycle must be directed to a latch in order to be used in the next cycle. Fig. 4 shows the situation of latch L connected to input a of the PS section, which has three inputs. Inputs b and c are provided by the output buses so and s1 respectively.
Operator Brd performs two 3 x 3 matrix products each cycle. According to definition (13), if i = IC . 2n-t+1, (Le. in the cycles /~2 " -~ of the tth stage) the output b' is equal to the input value in 6, so WL: introduce a multiplexor which determines the output b' of the PS section, as Fig. 5(a) shows. 
C. Design of the Processor for the CGCR Algorithm
According to expression (23), the design of the processor for the CGCR algorithm is obtained by adding to the CGSD processor design the necessary logic for the implementation of operator which provides the first and last rows of matrix Ft-l (see expression (21)). The first row is provided by latch L and input a, as in the case of the CGRD processor. The elements of the fourth row are obtained from the second row with unit delay. Therefore, an output in cells 2 of the FIFO queue which provides each cycle the d input for the PS section will be necessary as shown in Fig. 4 (drawn with dashed line) .
In the ith cycle, the PS section must compute equations ~( 2 2 ) and z(2i+l) (see (22) ). According to (20), in cycles i = IC. 2n-t of the tth stage (column i of matrix Ft -1 is processed) y(22 -1) does not participate, it is substituted by equation 1 with coefficients (0: 1.0.0) and in cycles (IC + 1) . 2n-f -1, equation I substitutes y(2i + 2 ) . Consequently, in the design of the PS section we must introduce two multiplexors for making the appropriate selection each cycle. Fig. 5(b) shows the corresponding design.
Su mmarizing, the architecture of the CGCR processor encompasses as particular case those of the CGSD and CGRD processors. Obviously, the type of cells of the routing section and the PS sections of each processor will be different. The duration of the operation cycle can be reduced by pipelining the PS sections. In this case, the architecture must be modified in order to eliminate waiting cycles between stages (see [20] ).
IV. PROCESSOR COLUMN
Processor pipeline or array are the two most widely employed methods for projecting the inherent parallelism of the data flow of the DC algorithms onto an architecture which can be implemented using VLSI or WSI integration. The design of a processor pipeline has the drawback of its limited I/O bandwidth. With a processor column constituting a constant geometry architecture we can exploit the spatial parallelism found in each stage. In this section we will design a parallel architecture for the CGSD, CGRD and CGCR algorithms based on a perfect unshuffle interconnection network which makes the partitioning of the algorithms possible, permits wired control and can be considered systolic.
In order to define the processor column by means of operators, we generalize to three dimensional structures T of dimensions 2" x 2' x 2c the operators used in Section I1
and we define the restrictions of the operators to two of the dimensions. For avoiding redundances, in this section we will only introduce the basic notation for expressing the algorithms.
We note m = b + c and n = a + b + c.
Dejinition 5:
qZ, !/, . ,+li,...il] . (27) We assume that the sequence S of equations at stage t is distributed as a tridimensional matrix Tt-l of dimensions 2 , 2 q , 2""(n = q + 7u) among the Q = 24 processors of the column. We consider two distributions: cyclic ( d c y ) and consecutive (dco)
Dejinition 6:
. , i 2 ] [ i l ] . (29)
Using cyclic distribution makes that consecutive columns in matrix Dt-l (see (7) 
A. Processor Column for the CGSD Algorithm
The expression of the CGSD algorithm as a composition of three dimensional operators is obtained following the same steps as in Section 11-A. The matrix of trios to be computed by processor PE in the tth stage, written as is similar to the one defined in (7) (10). Fig. 6 shows the data trio evolution of the algorithm for each of the distributions aforementioned. The proof is immediate from (10) and (31). According to (32), operator Bsd sequentially computes 2w-1 butterflies in each processor. The operator string ~c y " e~" u s d c~c ' e~b ' l s internally unshuffles the results in each PE. Therefore, each processor is analogous to the one designed in Section 111-A (Fig. 4) , with the only difference that here, the length of the FIFO queues will be 2'-1 cells if we consider the PS section as an additional segment of the internal pipeline of the processor. Operator I'PE.bus performs the external unshuffle and defines the interconnection network of the processors in the column. All the trios computed in processor PE= [i,+l ... Lemma 3, whose proof is analogous to that of Lemma 1, means that it is equivalent to apply operator r to a sequence S or to the three dimensional structure T . From the computational viewpoint, index i of each data item admits the interpretation (cycle, PE, bus) in the case of cyclic distribution (dcy) and (PE, cycle, bus) in the case of
Lemma 4:
The design will be the same as in the d,--case, but now the PS section of each PE must be located behind the FIFO queues.
b) I'(Tt-1) = Py~"6"'"(Tt-~).
(3 1) Summarizing, both distributions require the same external interconnection network between processors. The designs of the processors differ for each type of distribution in the location of the PS section with respect to the FIFO queues. (h) The expression of the CGliD algorithm as a composition of three dimensional operators is obtained in a similar manner to the two dimemional case. The computation of the tth stage consists in three steps which are analogous to those described in Section 11-€3. However, we must extend operator Erd which carries out step 1 of Section 11-B to three-dimensional structures Tt-1. The extension IS immediate if the distribution of the sequence of input matriceti in the f t h stage is consecutive: 7;-1 is transformed into Ft-l so that in processor PE matrix is transformed into the matrix (35) found at the bottom of the page.
for the operator performing this transformation. [f the distribution is cyclic, the transformation is different. Matrix DFPEl ((Icy distribution) is transformed in each PE into the matrix (36) found at the We will maintain the notation named as qrd.
Steps 2 and 3 are analogous to those of the single processor case and the expression of the algorithm analogous to ( I 6).
Theorem 5: The CGRD algorithm can be camed out in a column of Q = 2q processors which perform the following string of operators in each step
The proof is immediate from (31) and (16). If we compare the operator string (37) with (32) (dcy distribution) we observe that the architecture of the column for the CGRD algorithm is obtained by adding to the design of the processor column of the CGSD algorithm the logic necessary for the implementation of B rPE.bus rycle.bus rjcle,bus rd rd P ( 5 ' 1 y(2"PE + 2" -3) . . . y(2"PE + 1) y(2"PE -1) y(2"PE + 2"' -2) . . . y ( 2"PE + 2) y(2"PE + 0) . y( 2" PE + 2"' -1) . . . y(2"PE + 3) y(2'"PE + 1) A similar study can be carried out for the implementation of operator Erd (d,,, distribution) . However, the resulting design does not admit pipelining. Observe in Fig. 6(a) that at any stage, in the first cycle of each PE a data item from the last cycle of the PE -1 processor is needed. For example, at stage 2 the first cycle of PE = 1 needs data 14 which would be into the pipeline yet. Therefore, the consecutive distribution is not adequate if we pipeline the PS section of the CGRD processor. This problem also appears in the CGCR algorithm and, consequently, we will not consider it in the rest of this work.
C. Processor Column for the CGCR Algorithm
The CGCR algorithm can be expressed as a string of three dimensional operators. As in the previous cases, the corresponding expression is obtained by considering three steps in the execution of each stage of the algorithm, analogous to those considered in section 1I.C. In the first step, we must extend operator so that matrix D[-El ((icy distribution) is transformed in each PE into the matrix (39) found at the bottom of the page. We will name as qc, the operator that carries out this transformation.
Steps 2 and 3 are analogous to those of the single processor case and the expression of the algorithm analogous to (23). We will not consider the case of the consecutive distribution for the same reasons as in section B.
Theorem 6: The CGCR algorithm can be carried out in a column with Q = 24 processors which perform the following operator string in each step Therefore, processor 0 of the column must be a complete CGCR processor. The latch located at input a and the logic providing output s2 (MUX5 of Fig. 4 ) can be removed in the rest. Fig. 7 shows in dashed line the connections that must be added in order to complete the design of the CGCR column.
V. EVALUATION
In this article, we have proposed three new constant geometry algorithms for solving TS systems based on the SD [ 181, RD [5] and PARACR [7] , respectively. We have also proved that the data flow of CGSD, CGRD, and CGCR algorithms can be implemented in a unified parallel architecture based on the multiplication, a: addition) by the butterfly operators of each algorithm, as well as an enumeration of these operations. We observe that the CGRD algorithm does not require divisions. The second column specifies the data size (in bytes) required in each algorithm (we consider simple precision 32 bit data, so data size = 4c, being c the number of coefficients). This value will determine, among other things, the size of the memory cells. The number of input ( I ) and output (0) buses of the PE's associated to each algorithm is specified in the third column. In the design of the PE column of the CGCR algorithm only PE= 0 requires three output buses. Observe that even though the CGCR architecture requires four input buses, its bandwidth (4x16 bytes) is smaller than those of the CGRD (3x24 bytes) and CGSD (2x48 bytes) designs. The fourth column shows the additional hardware elements the PE columns of the CGRD and CGCR algorithms need. The last two columns specify the memory and message sizes. The CGCR algorithm saves 50% and 75% memory if compared with CGRD and CGSD algorithms, respectively. The three algorithms produce the same number of messages (2 log, N ) . However, the CGSD algorithm sends double the number of bytes than the CGRD algorithm and triple the number of the CGCR algorithm.
The CGSD algorithm presents a higher hardware cost and a larger number of transmitted data. The most significant advantages of the CGRD algorithm is that it does not include division operations in the evaluation of its associated butterfly operator. Its hardware and communications cost is 50% that of the CGSD architecture. As we saw in the previous section, the CGCR architecture encompasses as particular cases the other two designs. Furthermore, as can be deduced from the data in Table I , the CGCR architecture is the one which presents a lower hardware and transmitted data cost. Consequently, we conclude that the CGCR architecture is the most appropriate for solving TS systems. An additional advantage is that the CGCR algorithm is the most stable of the three [7] , [9] , and, as indicated in section II.C, the algorithm can be stopped if the dominance factor is less than the rounding factor.
The structure of the RS section of the general PE presented in Fig. 4 is based on FIFO queues, permitting the partitioned systolic design of a PE column or array of columns. As the addressing of the data is implicit in the architecture of the RS section and the interconnection network, the design of the control section of the PE is simple and can be wired in. The partitioning of the data arises in a natural way by means of the decomposition of their index into three fields: (cycle, PE, bus), cyclic distribution; (PE, cycle, bus), consecutive distribution. Given the arithmetic complexity of the butterfly operators of the three algorithms, a high speed design in VLSI technology requires the pipelining of the PS section. Pipelining which is only possible if we adopt a cyclic distribution of the data in the CGRD and CGCR architectures (see Section IV). An analysis of the impact of the pipelining of the PS section on the design of the RS section of a constant geometry architecture was carried out in [20] . Finally, we will point out that from the point of view of integration in VLSI technology, the CGCR architecture is the one using the least area and the smallest number of pins.
APPENDIX
Let's 4 be the perfect shuffle permutation. b) The hypothesis implies that some of the n -t + 1 least significants bits of i is 0 and then 4t-1(z) < N -2t-1. During 1978-1981, he was an Assistant Professor in the University of Granada, and dunng 1982-1991, he successively was Assistant. Associate, and Full Professor in the University of Santiago de Compostela. Currently, he is a Professor with the Department of Computer Architecture at the University of Mdaga. His research interests are in the area of parallel computer architecture, parallel algonthms, numerical algorithms for dense and sparse matrices, and VLSI digital signal processing.
In these areas, he has published more than 35 papers in refereed lntemational Journals and about 50 refereed International Conference Proceedings.
