I. INTRODUCTION
There is a class of important orthogonal transforms which can be calculated using a fast algorithm by applying the successive doubling method [4] , the objective of which is to minimize the redundant operations. The successive doubling method consists of performing successive bisections of the data until the original sequence is decomposed into N / r sequences of length r ( N = r"), where r is the length of the minimum sequence to be transformed (radix of the transform). Once the N / r discrete transforms (butterflies) have been calculated, they must be combined to obtain the discrete transform of the original sequence by means of'log, N calculation stages. As a result of the successive bisections of the initial data set it is necessary to carry out a shuffle of the input sequence (decimation in time (DIT) algorithm) or of the transformed sequence (decimation in frequency (DIF) algorithm) in order to obtain an output sequence ( T ( k ) , 0 I k < N ) in its natural order.
The usual way for carrying out this shuffle is by using the bit reversal permutation.
An adequate architecture for integration in VLSI and WSI technologies is the systolic architecture. There are many methods for the systolization and partitioning of algorithms [ 5 ] , but none of them is applicable to the algorithms based on the successive doubling method. This is why many authors consider the successive Manuscript received July 17. 1991; revisedoctober 14, 1991. [3] , [12] . The new architectures proposed in this correspondence are more efficient than the ones described in [3] and [12] . Our objective in this work is to design an application specific architecture which permits the exploitation of the parallelism present in the successive doubling method and which is integrable in VLSI and WSI technologies. The constant geometry architecture we present in the following sections is based on the perfect unshuffle or shuffle interconnection networks, which permit efficient mapping and partitioning of the flowchart generated by the successive doubling method without having to use microprogrammed control.
We have structured the rest of this correspondence in the following way. In Sections I1 and 111 we present in detail the application specific architecture of the DIT algorithm, defined by the perfect unshuffle permutation. More specifically, in Section I1 we approach the design of the appropriate processor for obtaining the constant geometry systolic architecture and in Section 111 we present the application specific parallel architecture. In Section IV we briefly describe the application specific architecture associated with the DIF algorithm, whose constant geometry is determined by the perfect shuffle permutation.
THE CONSTANT GEOMETRY ARCHITECTURE
We consider that the size of the transform is N = r", where r is the radix and we will use a two-dimensional representation [x, y] of the index for each data item (i = 0, I , . . . , N -1) in the input sequence where x, and y, are the digits of the binary representation of x and y , respectively. The union of x and y into one number (x . 2" + y )
will coincide with the binary representation of the index i of the data sequence (U + z, = log,N). Finally, we will suppose that the data sequence flows from left to right.
The operators are defined by their effect on the indices of the data items. The decimation operator 6 converts a row into many by reducing the number of columns:
Each row is broken into 2k rows, and the operator is well defined if k 5 U . Using this notation, we can define the operator concatenation / 3 which reduces the number of rows of an array by increasing the number of columns:
A sequence made of 2" rows with 2" elements (columns) is transformed into another with 2"-k rows of 2 u + k columns and the operator is well defined if k 5 U . Finally, we define the perfect unshuffle operator r of a sequence, which flows from left to right and is organized as a two-dimensional array of 2f' rows and 2" columns:
r(k) performs a rotation to the right of order k of the binary representation of the index of each element of the sequence and is well
We are interested in the efficient hardware implementation of the perfect unshuffle permutation, as it is the base for the design of a constant geometry architecture for radix r DIT successive doubling algorithm. The output sequence of the processor will have to undergo a perfect unshuffle permutation to maintain the constant geometry in all stages of the transform. From this we deduce that out of all the permutations we can implement with (4) only the particular cases ril,) [x, y] and rll,) [x, 01 will be of any interest, being v = log,r. Also, these permutations can be obtained by the combination of the operators 6(,,, and defined previously.
Lemma 1 :
riv) tx, 01 = 6(f,)oir,l tx, 01. (6) Here, the order for the application of the operators is from left to right. The proof of (5) and (6) is immediate.
A. Design of the Processor
The internal structure of the processor will consist of two clearly differentiated sections: Processing (PS) and routing (RS). The PS section will carry out the set of operations associated with the r-point butterfly (discrete transform of a sequence with r-points). These operations will depend on which particular transform we are implementing [2] . As an example, Figs. l(a) and (b) show the PS section corresponding to a 2 and 4 point discrete Walsh transform (DWT), respectively. As we are only interested in the regrouping of the data items and not in the specific computations of each transform, we will consider the PS section as a new operator, the butterfly operator B,,,,, which carries out an arbitrary function with 2f'-inputs and 2"-0utputs.
The equalities (5) and (6) bering scheme as i the previous case; the queue must have an input in cell zero. We can also implement the permutation 6((,) by means of a demultiplexor with an input associated with the sequence we wish to decimate, 2" outputs and z, control inputs used in a cyclic fashion each clock period. Both solutions require the sequence to be decimated to advance 2" positions each cycle.
The hardware implementation of permutation 1'(1,, is immediate using permutation off,) and one of the two alternatives of permutation lii,,). This leads to two different designs for the processor, although the internal parallelism is the same in both cases (2" data items are processed in parallel). If we use the FIFO queue as the implementation for operator 6,,,, (Lemma 1 , equality ( 5 ) ) , the design of the processor for the calculation of a stage of DIT successive doubling algorithm is the hardware translation of the following operator string:
where v = log,r. Fig. 2(a) shows the design of the radix 2 processor. We have included a double FIFO queue in order to be able to implement the whole transform by external recirculation of the data, using only one processor with the ith output connected to the ith input (i = 0, 1, * . . , 2" -1). The n stages ( n = l o g y ) of the transform are identical, as we apply the operator sequence (7) n times. At each stage, one queue acts as the output buffer (writing the data generated in the current stage) whereas the other acts as input buffer (reading the data generated in the previous stage) and this function will be exchanged in the next stage, under control of multiplexors MUXO-MUX2. Observe that in the design of Fig. 2(a) the FIFO queues have a length of N -1 cells (W = N / r ) and the inputs associated with permutation &,) have been conveniently distributed. In order to do this we have considered the PS section as a segment of the pipeline made of sections PS and RS. Fig. 2(b) shows the second alternative for the design of the processor (Lemma 1, equality (6)), we have considered radix 2 again.
This design is the hardware translation of the following operator string where v = log,r. For the same reasons as in the design of Fig. 2(a) we have included two FIFO queues of lengths N -1 with 2" inputs and only one output in its right end (cell (N -2) ). In this case we can also implement all the stages of DIT successive doubling algorithm using only one processor, with each output path feeding back its corresponding input path. The design of the processor according to the operator sequences (7) or (8) permits the interpretation of the two-dimensional representation of (1) in the following way: the y coordinate gives the parallelism for each stage of the transform (a butterfly with 2" data items is processed each cycle), whereas the x coordinate establishes the sequentiality for each stage of the transform (2" butterflies of length 2"). Therefore, the calculation time for a stage will be 2" clock cycles, the time used by the processor to compute the butterflies associated with each input vector. With this interpretation of ( I ) , the binary representation of the data consists of two fields [cycle, path] . The data item (x, y ) will input the processor through its yth input path (path = 0, 1, . . . , 2" -I ) , being a part of the xth butterfly of the stage (cycle = 0, I , . . . , 2" -1). To change from a single PE system to a PE column will force us to modify the notation introduced at the beginning of Section I1 as we need a three-dimensional representation [x, z , y] of the index of each data item in the input sequence:
PARALLEL ARCHITECTURE
where x I , yl, and z, are defined as in (1) and the union of x, z , and y into a single number (x . 2"+" + z . 2" + y) will coincide with the binary representation of the index of the data sequence (i = 0, If we consider v = log, r we can interpret the three-dimensional representation of (9) in the following way: the z and y coordinates determine the parallelism of each stage of the transform, 2" butterflies ( < N / r ) of 2" data items are computed in parallel in one column of 2" PE's with 2*' inputs each, and the x coordinate establishes the sequentiality in each stage of the transform (2" vector of length 2w+*'). The calculation time for a stage will be 2" clock cycles where the duration of the cycle is the time used by the processor in the computation of a butterfly of 2" data items.
With this interpretation of equation (9) we are decomposing the binary representation of the index of each data item into three fields (cycle, nPE, path]. In each stage, cycle indicates the instant it is processed (cycle = 0, I , . . . , 2" -I), nPE indicates the PE where it will be processed (nPE = 0, I , . . . , 2" -1) and path specifies the path through which it will enter the PE (path = 0, 1, . . . , 2 ' -I ) . Thus, for example, let N = 64, Q = 4, and r = 2 ((U, w , v ) = ( 3 , 2, I) ). In the first stage of the transform, data item 35 (10001 1 binary), whose three-dimensional representation is [[loo] , [Ol] , [I] ], will input PE 1 through path 1 in cycle 4 (35 will be a part of the fifth block of butterflies). [nPE, cycle, path] is another possible interpretation with similar characteristics [2] . 1, . . ' , 2 " + " + " -1; U + W + 21 = n). ( 1 1) where k < U + U and k 5 U , respectively. We define the rest of the permutations of two variables rpi;, with a , b = x, y , z ( a # b) in a similar way. We will also extend the meaning of the operators h0, and &), which possess a two-dimensional nature, in order to be able to apply them to the new representation of (9). This is, 6${ and @{ perform the decimation and concatenation permutations, equations (2) and (3), respectively, on the dimensions x and y without modifying dimension z .
Lemma 2:
r(,,) tx, Z, YI = rw;;; tx, z , YI (12) where the application order for the operators is from left to right.
Its proof is immediate.
Lemmas 1 and 2 guarantee the decomposition of permutation r(,,) of a two-or three-dimensional representation of the index of each data item into more elementary permutations, which is the base for the design of a constant geometry architecture. Consequently, we can state the following theorem.
Theorem 1: DIT successive doubling radix r and constant geometry algorithm of a sequence of N data items ( N = r n ) can be camed out in a column of Q PE's (Q = r q , 0 I q 5 n -1) which implements (hardware translation) each stage the following operator string:
where U = log2r, the operators are applied from left to right and we consider the [cycle, nPE, path] interpretation of the three-dimensional representation of the index of each data item.
Proof of (13): We have to demonstrate that the operator sequence /37;;hf;,i I't;; is equivalent to a perfect unshuffle permutation (IO): p.;;,;q;,;rt;; [x, z, y1 = 6;pt;,y[[y,, . . . y l x , . . . xi], Fig. 3 shows the three elements [cycle, nPE, path] for each stage of the radix 2 transform of a sequence of 64 data items in a column with 8 PE's (N = 64, Q = 8, r = 2 ) . We have applied the operator string (13) to each data item each stage. Remember that the sequence to be transformed has been shuffled in accordance with the bit reversal permutation before starting the transform. As was to be expected, the output stage presents the same order as the input sequence, due to the fact that we have carried out six (n = 6) permutations F(l).
The hardware implementation of the perfect unshuffle permutation expressed in (10) is immediate from the operator strings in (12) and (13). fi$$6$? performs the perfect unshuffle permutation of the 2" butterflies of 2L' data items processed sequentially by each PE.
Consequently, we can use the same solution as in the single processor case (see (7) and Fig. 2(a) ), with the only difference being that in this case the length of the FIFO queues will be 2" -2" cells, considering the PS section as the only stage of the internal pipeline of the PE. rf;; performs the perfect unshuffle permutation of the outputs of the PE's. Its hardware implementation will use an external interconnection network determined by operator r applied on the dimensions [ z , y ] : The yth output path of the zth PE will be connected to the y *th input path of the z*th PE, where [x, p , y *~ = r:.s (&,) [ x , z , y]. Fig. 4 shows the connections of the PE's for the example of Fig. 3 . The number of input and output paths of each PE is only a function of the radix of the transform and not of the number of PE's in the column. Summarizing, we have introduced the partition of DIT successive doubling algorithm in a natural way by means of the decomposition of permutation rll,) in a perfect unshuffle which is internal ([cycle, path] ) and another which is external ([nPE, path] ) to the PE's. In the particular case of one column with a single PE the whole unshuffle will be internal, we obtain the solution of the previous section (sequence (7)), and the FIFO queues will have a length of N -2" cells. In the extreme case of a column with r" -I PE's the whole unshuffle will be external and, consequently, the FIFO queues will not be necessary. From the analysis of the data flux shown in Fig. 3 we can conclude that the proposed architecture 
IV. DIF SUCCESSIVE DOUBLING ALGORITHM
The constant geometry architecture of the DIF algorithm is obtained by performing a perfect shuffle of the data generated in each stage of the transform. Considering the three-dimensional representation of the data indices introduced in Section I11 (9) we define the order k perfect shuffle operator as
. . X , -~+~I I (14) where k 5 U + w + U. We can easily generalize p(kl in a similar way to (1 1). We are interested in the decomposition of permutation pl(,), U = log,r, into elementary permutations that we can implement electronically. In order to achieve this we define two new operators: inverse decimation (7 a;) and inverse concatenation ,;P.;;;[x, z , y1 (18) where the application order for the operators is from left to right.
The proof is immediate, using definitions (15) and (16).
Equations (17) and (1 8) guarantee the decomposition of the perfect shuffle permutation of the three dimensional representation of each data item's index. Consequently, we can establish the following result:
f i e o r e m 2: DIF successive doubling radix r and constant geometry algorithm of a sequence of N data items ( N = r n ) can be camed out in a column of Q PE's ( Q = r 4 , 0 5 q < n ) which implement (hardware translation) each stage the following operator string:
where z~ = Iog,r, the operators are applied from left to right and we consider the [cycle, nPE, path] interpretation of the three dimensional representation of the index of each data item.
Its proof is similar to that of Theorem 1. Fig. 5 shows the internal structure of the PE where we have included two FIFO queues to facilitate the overlapping execution of different transform stages.
These queues implement the operator string T~;,P+ ti,;, whereas the partial perfect shuffle operator CL :;; determines the interconnection network of the PE column.
V . FINAL REMARKS
Parallelism has long been considered an important solution to fast transform algorithms [ 11. The solution is in the design of multiprocessor architectures which permit the exploitation of the inherent parallelism of these algorithms, but we must consider three important problems: partitioning of the algorithm, synchronization between PE's, and pipeline design of the PS section.
The successive doubling method is an efficient procedure for the design of fast algorithms for orthogonal transforms, presenting a high spatial parallelism in the calculation stages and an inherent sequentiality between them. Consequently, the appropriate architecture will be a rectangular array of a pipeline of PE columns. The application specific multiprocessor we have proposed in this work efficiently solves the problems mentioned before. The result is a constant geometry systolic architecture. The geometry is determined by the perfect unshuffle (or perfect shuffle) permutation of the data generated in each stage of the DIT (or DIF) successive doubling algorithm.
The constant character of the geometry permits the implementation of the data flux of the successive doubling algorithm by means of hardwired control. That is, the PE's do not need address arith-metic units to locate the data. Addressing is inherent to the evolution of the data in the FIFO queues of the RS section and the external interconnection network. Moreover, the partitioning of the algorithm appears in a natural way when this permutation (perfect unshuffle or shuffle) is decomposed into a string of elementary permutations which can be implemented electronically: decimation, concatenation, and partial perfect unshuffle (DIT) or shuffle (DIF).
Also, we have chosen the systolic operating mode, which is an effective synchronization method. Finally, a pipelined PS section of the PE will improve the performance of the proposed architec-
[51

161
[71 ture (see [2] and [12] ). 181
[91
