The Self-Sorting (SS) algorithm is a highly e cient version of the Fast Fourier Transform (FFT), because, unlike the generally used algorithms, it does not require shu ing the sequence to be transformed (digit reversal). In this work we propose a parallel architecture that implements the SS radix r (r 2) algorithm. The data ow of the algorithm is decomposed, in a natural way, into two sections that are implemented by means of FIFO queues located in the processors and an interprocessor connection network (perfect unshu e). The resulting design is regular and modular, and, whenever possible, presents constant geometry. The total processing time required is nN=rQ cycles for a transform of size N = r n computed using Q = r q processors. Consequently, there are no cycle losses.
INTRODUCTION
The self-sorting (SS) algorithm 1] is a version of the successive doubling (SD) method. This is a general and e cient procedure for the design of fast algorithms. It is based on the divide and conquer strategy and establishes the solution of a problem by working with a group of subproblems of the same type and smaller size. This way, the redundances present in many algorithms are eliminated, resulting in faster algorithms. Another application of the SD method is the extraction of the parallelism present in these algorithms.
Considering the types of data ows generated by the SD method, we can distinguish between in-place, split-radix, constant-geometry and SS algorithms 2]-3]. In-place algorithms 3] minimize the memory requirements as the partial results are written over the data used for their calculation. Split-radix 4] algorithms are characterized by the fact that they require a smaller number of arithmetic operations than other algorithms. However, split-radix algorithms present irregularities in the data ow that hinder the balancing of the computational load in parallel architectures 5]. Constant-geometry algorithms 6] present identical data ows in every stage of the transform. Finally, SS algorithms 1], 7], unlike the other versions of the SD method, are characterized by the fact that they provide an output sequence that is digit reversed with respect to the input sequence.
The parallel implementation of an SS algorithm with respect to an unordered algorithm is advantageous in terms of speed. In the implementation of an FFT algorithm 8] with an unordered output (in-place or constant-geometry, for instance), the ordering of the data sequence can be carried out in two basic ways. First, the ordering can be carried out by means of an interconnection network (such as the one in ref. 9] ). However, this solution cannot be used in partitioned algorithms, as a global data shu e and not a shu e within each partition must be carried out. Secondly, if the FFT device is used as a coprocessor by a host processor, the sequence can be written in an unordered format and the host can perform the ordering operation. Another alternative is to generate an adequate sequencing of the addresses in order to write the data in memory in an ordered format (this solution is used in 10]). In 11] they propose an address generator using application speci c hardware.
Given the importance of the FFT, many e orts have been devoted to the design of systolic architectures that e ciently exploit the parallelism associated with the SD method. Until very recently, no designs contemplating the partitioning of the data have appeared in the literature.
In every case, the authors have considered the constant-geometry FFT algorithm 12]-14]. However, we do not know of any systolic design that e ciently implements the SS FFT algorithm.
In this work we develop a parallel architecture that implements the SS algorithm. The result is a parallel SS algorithm where the partitioning of the data is simple and the computational e ciency is optimal, as there are no cycle losses. The organization of the paper is as follows. The SS FFT algorithm is introduced in section 2. The notation used will be introduced in section 3. In this same section, we present the design of the processor implementing the SS algorithm. In section 4 we extend the architecture to a processor column that is interconnected by means of the perfect unshu e permutation. Finally, in section 5 we evaluate the proposed designs.
SELF-SORTING FFT ALGORITHM
The process leading to the Fast Fourier Transform (FFT) is based on the factoring property of the DFT. If we have a sequence x(m); 0 m < N, the DFT X(k); 0 k < N of this sequence is de ned by means of the following expression:
where W N = exp(?j2 =N). If N is decomposed into a set of n factors N = N n N n?1 N 1 then, the DFT of length N can be computed in n stages through multiple DFTs of length N i (numbering the stages i = 1; : : :; n). Usually, in every stage N i = r, and, consequently, the input and output indices for transform m and k, can be written in base r, for example, m = m n r n?1 + m n?1 r n?2 + + m 2 r + m 1 (m n ; m n?1 ; : : :; m 2 ; m 1 )
Substituting these factorings of m and k in equation (1), we obtain the standard radix r FFT algorithm that was initially derived by Cooley and Tukey 8] . Any FFT algorithm basically implies two processes: a shu ing of the data sequence (digit reversal) and a set of n calculation stages where the \butter ies" are computed. There are, however, di erent FFT algorithms, although their results are the same, the di erence may be found in the computations carried out and in the way in which the intermediate results are stored 2] . Considering the types of data ows, we can distinguish between in-place, constant-geometry and SS algorithms. In-place algorithms 3] minimize the memory requirements as the partial results are written over the data used for their calculation. The constant-geometry algorithms 6] present identical data ows in every stage of the transform. Finally, in SS algorithms 1] it is not necessary to digit reverse the sequence (unlike in-place and constant-geometry algorithms) as it is carried out in situ during the execution of the transform. In table I we summarize the most frequently used algorithms, indicating the positions where the inputs and outputs of each butter y are located. In the i-th stage (i = 1; : : :; n), the butter ies are made up with the set of the r inputs that only di er in digit m n?i+1 , producing the r outputs that di er in digit k i . The last three algorithms of table I are variants of the rst three obtained by reverting the indices, and are, consequently, very similar to them. The algorithms in table I are not the only possible alternatives. Using a di erent approach, Johnson and Burrus 15] showed that the radix 2 algorithm can be restructured so that it is both self-sorting and in-place. This technique was generalized by Temperton 16] for its application to di erent radices. Other fast algorithms with di erent properties are found in the book by Rabiner and Gold 3] . Similar principles to those applied to the Fourier transform can be applied to other transforms. Thus, for example, Buneman 7] has developed an SS algorithm for computing the Fast Hartley Transform.
The data ow of the SS algorithm is based on the cyclic shifts to the right of the most signi cant digits of the indices of the data (partial perfect unshu es (3) Where, for the sake of clarity we have underlined the digits that are shifted. After the 4 stages that make up the SS algorithm, the output index is the digit reversal of the input index. Figure  1 shows the data ow of the SS algorithm for a sequence of sixteen data items.
As shown in table I, the SS algorithm admits a variant that consists in performing the cyclic displacements to the left, instead of performing them to the right. The architecture that results in this case 17] has identical properties to the one we will describe in the following sections.
RADIX r PROCESSOR ARCHITECTURE
In this section we will introduce the notation that will be used together with the design of a single processor architecture that implements the radix r SS algorithm. In the next section this formulation will be generalized by means of the introduction of parallelism. This will permit the design of a recirculating PE column. For both tasks we will base our approach on the partial perfect unshu e permutation.
Notation
We consider a sequence to be transformed fa i g of size N = r n , being r the radix. We will assume that this sequence is arranged as a two dimensional matrix of r v columns of r data items as follows (n = v + 
To indicate a sequence of this type, we will use a two dimensional representation of the index for each data item, written in the radix base. This way, the most signi cant digit of each index, (in base r) will denote the row number, whereas the rest of the digits will denote the index of the column as follows:
x; z] r = x]; z v ; : : :
with x; z k 2 f0; 1; : : :; r ?1g. The index of the i-th data item will be i = x r v + z v r v?1 + + z 2 r + z 1 (n = v + 1). This representation is chosen because it is the most appropriate one for expressing the di erent operations to be carried out. For simplicity in the notation, from now on we will write expression (5) without subindex r.
If we consider matrix (4) as a two dimensional sequence of data owing from left to right, whereas the computation of the butter ies of the transforms is carried out over each of the columns of the matrix, we can interpret the two dimensional representation x; z] as path; cycle]. Coordinate x (which counts the rows from top to bottom) determines the parallelism of each stage (a butter y with r data items is processed each cycle), whereas coordinate z (that counts the columns from right to left) establishes the sequentiality in each stage (r v cycles). Consequently, the calculation time of a stage will be r v clock cycles where the duration of the cycle is the time employed by the processor in the computation of the operations associated with each butter y.
In order to obtain a general design of the architecture for the evaluation of the SS algorithm we will need to de ne a set of algebraic operators which will allow us to describe processor networks in terms of their interconnection rules. The operator strings will represent the data ow through the processor networks. Thus, the notation provided by a generic operator ; x; z] = x`; z`], means that the data item entering a processor through path x in cycle z, will output it through path x`in cycle z`.
For writing the expressions of the operators we will follow the convention of composing operators from left to right. For example, 1 2 ( x; z]) = 2 ( 1 ( x; z])); this way, the ow of the data in the processor networks will take place rst through the network de ned by 1 and then through the one corresponding to 2 . This is a commonly used notation and presents the advantage that the generation of the networks from the corresponding operator strings follows their written order.
The rst operator we are going to de ne is the partial perfect unshu e operator, ? i according to the following expression
that is, a right cyclic shift of the most signi cant digits of the x and z elds has been carried out starting with digit z i . This operator modi es both elds, path and cycle.
With this notation we introduce the symbolic operator butter y B that computes a radix r butter y over each of the columns of the two dimensional data matrix. As we are only interested in regrouping the data and not in the computations, we consider that the butter y operator acts as an arbitrary r-input and r-output function.
In this paper we will use the formulation of the SS algorithm presented by Cochran et al. 1]. The SS algorithm can be expressed as the following operator sequence:
? n B ? 2 B? 1 B (7) where we have used the butter y and partial perfect unshu e that were de ned before. These operators respectively describe the computations and the data ow of the algorithm. They are a set of n = v + 1 stages, although in the rst one ? n = 1 (identity), we have not shown it for reasons of homogeneity in expression (7) . As an example, the data ow in the SS algorithm over a sequence of N = 8 radix 2 (r = 2; v = 2) data items is a 3 
The partial perfect unshu e operator does not have a direct implementation and for this reason we will try to decompose it into a string of elementary operators that do have it. The rst elementary operator we will de ne over sequence (4) is the concatenation operator . The object of this operator is the generation of a one dimensional output sequence by means of the concatenation of the rows of the two dimensional input matrix as follows,
where the empty brackets indicate that the resulting sequence is one dimensional. This is the essence of concatenation. Thus, for instance, applying operator to a sequence made up of two rows with four data items (r = 2; v = 2) yields a 3 
The partial perfect unshu e operator can be implemented from these two elementary operators as established in the following lemma: In gure 2 we display the circuit that implements lemma 1 for r = 2; v = 2, and for i = 1; 2; 3 (expression (8)). It is a FIFO queue whose inputs are given by the concatenation operator and whose outputs (which depend on parameter i) are given by the decimation operator. Operator ? i is implemented over the two dimensional data matrix in two stages. During the rst stage, the two rows of the data matrix enter the FIFO queue, each one of them through its corresponding input. When the FIFOs are full, the concatenation of the two rows of the matrix is carried out (operator , expression 10). In the second stage, the data is present in the two outputs, and the decimation of the sequence stored in the FIFO queue is performed (operator i , expressions 12 to 14). The i-th operator is obtained by selecting the i-th input of the output multiplexor.
The motion in the queue must be adequately programmed in order to perform this operations. Consequently, it must be shifted by one position during the concatenation stages and by one or two positions during the decimation stages. The operation that is globally performed is the perfect unshu e ? i of the two dimensional input matrix.
In the general case, the circuit is a FIFO queue of N = r v+1 cells with inputs in cells 0, r v , 2 r v , ..., and (r ? 1) r v , and r outputs located every r i?1 cells, with the last one in position N ? 1 (for each value of parameter i). The design can be easily adapted in order to work over data sequences of di erent lengths. In this case, the length of the FIFOs and the location of the outputs must be programmed through the corresponding multiplexors. The control must also be modi ed accordingly.
Design of the processor
The internal structure of the processor will consist of two clearly di erent sections: processing (PS) and routing (RS). The PS section will carry out the set of operations associated with an r point butter y (operator B) and the RS section will perform the shu ing of the data according to the partial perfect unshu e (operator ? i ). Lemma 1 permits the e cient implementation of this permutation, decomposing it into the decimation and concatenation elementary permutations that are easy to implement. We can thus enunciate the following theorem:
Theorem 1 The SS algorithm of a sequence of length N = r n can be carried out on a processor that implements (hardware translation) in each stage the operator string i B (16) numbering the stages i = n; : : :; 1 and assuming the path; cycle] interpretation of the two dimensional representation of each data items index.
Proof The proof is immediate using lemma 1. 2
The RS section of the processor is basically given by lemma 1 ( gure 2). However, as the SS algorithm is not in-place, in order to implement all the stages of the algorithm by means of a single processor with feedback we require two identical FIFO queues in the processor. One of them gathers the data coming from the processing section (concatenation) while the other generates the sequence in the appropriate order for processing (decimation). Once a stage has been computed, the resulting sequence is fedback to the processors in order to start computing a new stage of the algorithm. The FIFO queues exchange their functions: the one that carried out the concatenation operation now performs the decimation operation and viceversa. This way, it can be guaranteed that no useful data is overwritten. In gure 3 we display the internal structure of a generic radix 2 processor, where we specify the required connections and multiplexors.
PROCESSOR COLUMN
In this section we generalize the design of the architecture that evaluates the SS algorithm by introducing parallelism. The algorithm will be implemented on a processor column (PEs). In most existing designs, the size of the PE column is conditioned by the number of data items to be processed. We will start from the processor design presented in the previous section, which allows the computation of transforms of di erent sizes by just modifying the length of the FIFO queues.
To go from a system with a single PE to a PE column will force the modi cation of the notation introduced in the previous section as a new dimension will be required in order to represent parallelism. We thus employ a three dimensional representation x,y,z] of the index corresponding to each data item The formulation of the SS algorithm with this representation has the same look as in the single processor case (equation (7)). It is necessary, however, to provide a three dimensional de nition of the operators appearing in this expression. 1 (21) We also de ne the partial digit reversal operator i that will produce the reversal of the most signi cant digits of the three dimensional representation of the indices of the data starting in the i-th digit and that can a ect three, two or one dimensions: 
We can also simplify the notation for those operators that carry out the reversal of all the digits of one dimension, for instance:
x;y;z 1 (23) This lema will permit the decomposition of the data ow of the parallel algorithm into two stages, one carried out within the processors and another one by means of an interconnection network among the processors. In Figure 4 we illustrate it for ? 1 This relationship is obtained in the gure in two ways: directly, using the de nition of operator ? 1 or in two stages trough operators ? x;y and ? 1 . In the latter case ? x;y can be interpreted as an interconnection network among the processors (coordinates x and y represent the parallelism) and ? 1 can be interpreted as the data ow in a pipelined processor (coordinate z represents the sequentiality).
Lemma 3
x;y ? x;y = y (27) This lemma will permit the simpli cation of the design of the PE column. The relationship provided by this lema is illustrated in gure 5. The three operators that participate in it can be interpreted as interconnection networks (they modify the x and y coordinates). The network string de ned by x;y and ? x;y can be substituted by the one corresponding to y .
Lemma 4
? n ? n?1 ? i = i ; 1 i n (n = u + v + 1) (28) Two important particular cases of this lemma are the following:
? n ? n?1 ? 1 = x;y;z (29) ? n ? n?1 ? v+1 = x;y (30) Expression (29) proves that the SS algorithm automatically generates the digit reversal permutation 18]. Expression (30), will be directly used in the design of the processor column; this is illustrated in gure 6. Expression (30) together with lemma 3 will allow the substitution of the multiple interconnection networks (represented by ? i ) by an input network to the processor column ( y ) and a feedback network to column (? x;y ) (this will be established in lemma 5).
The three dimensional concatenation and decimation operators perform the same function as in the two dimensional case. Both modify the x and z dimensions, leaving dimension y unaltered. In order to clearly show this fact, from now on we will denote them as x;z and x;z i , The object of this lema is to substitute the ? i operators (di cult to implement) by the strings y x;z x;z i and ? x;y x;z x;z i (with an easy parallel implementation). This lemma is the base of the design we will now establish.
Theorem 2 The SS algorithm of a sequence of length N = r n can be executed on a column of Q = r u PEs that implements (hardware translation), in the rst stage (i = n), the operator string PE path;cycle path;cycle n B (36) and in the rest of the stages (i = n ? Proof The proof is immediate using lemma 5. 2
The architecture de ned by theorem 2 is made up of two clearly di erent parts: the PEs (which include the processing and memory) and the interconnection network. Furthermore, the PEs we must use are the ones designed for the single processor case. This can be deduced from the following: the operators appearing in equations (36) and (37) are classi ed into two groups depending on the dimensions over which they act: (path, cycle) and (path, PE). The operators that act over the dimensions path and cycle de ne a data ow within a PE (they do not modify the PE dimension), whereas those that act over the dimensions path and PE de ne an interconnection network of memory elements (they do not modify the processing cycle of the data) This decomposition is only obtained if we interpret the three dimensional representation x; y; z] of the indices of the data as path; PE; cycle]. If we consider any other type of interpretation, the classi cation of the operators cannot be carried out this way.
Consequently, the internal structure of the PEs is a translation of the string path;cycle path;cycle i B and, therefore, coincides with the one we had obtained in the single processor case. Two types of interconnections exist depending on the equations that make up theorem 2: in equation (36), PE (digit reversal operator) de nes an input network to the PEs, as it only acts during the rst stage (at the input of the data), whereas in equation (37), ? path;PE (the perfect unshu e operator), de nes an interconnection network between PEs as it acts on the data recirculation between stages. In gure 7 we show a radix 2 design of this type considering a column of 8 PEs.
We can interpret theorem 2 as the decomposition of the three dimensional partial perfect unshu e operator, basis for the design of the SS algorithm, into a string of easily implementable two dimensional operators. As a consequence of this, the data ow of the SS algorithm has been factored into two stages: one carried out within the PEs and the other in an interconnection network. In gure 8.a we show the data ow of a radix 2 transform of length 64 data items executed on a column with 8 PEs. The PE that must be used in this example is the one in gure 3 and the interconnection network the one in gure 7. Observe that the e ciency is maximum as there are no cycle losses. The SS algorithm is completed in 24 cycles.
In gure 8.b we present the generation of the data ow of gure 8.a. The factorization of the data ow into the two sections per stage, one carried out within the PEs (boxes with labels 3, 2, and 1) and the other carried out by means of interconnection networks among the PEs, can be clearly observed. The input interconnection network is PE and the one between stages ? path;PE . The data ow generated within the PEs is path;cycle path;cycle i . We must take into account that parameter i is decreasing (i = n; : : :; 1). In addition, if i > v, path;cycle path;cycle i is the identity operator (the boxes representing this case are labelled as 3, the rest are labelled using parameter i).
FINAL REMARKS
Most of the systolic and semisystolic designs found in the literature for computing the FFT do not permit the generation of an ordered output sequence if the algorithm is partitioned. If the algorithm is not partitioned, the shu ing of the data sequence can be carried out by means of an interconnection network. However, this approach is not valid when the algorithm is partitioned, as the shu ing of the data that must be carried out is global (paths + cycles) and not only of the paths. If a RAM memory is used for storing the data, the shu ing could be carried out by addressing the RAM memory, but this solution is not compatible with a systolic or semisystolic design. Consequently, in most systolic designs, it is assumed that the shu ing is carried out outside the device.
The architecture we have presented for the implementation of the SS algorithm presents the advantage of providing an output sequence that is digit reversed with respect to the input. This is an advantage in applications such as the FFT, in which using in-place, constant-geometry or split-radix algorithms imply an additional data shu ing stage. As in parallel computation a shu ing of the data implies communications, the implementation of an SS algorithm will be more e cient. We will now compare the parallel implementation of the di erent algorithms in detail.
The constant-geometry algorithm is the one with the easiest parallel implementation as it generates an identical data ow in all the stages of the algorithm. This property of the data ow of the constant-geometry algorithm is translated into a process with a simpler structure than for the rest of the algorithms, as proven in 14]. However, the need for the additional shu ing stage signi cantly reduces the e ciency of the parallel architecture.
The split-radix algorithm requires a smaller number of operations than the other algorithms. However, the irregularities in the data ow it presents hinder its parallel implementation. The processor and the interconnection networks needed for its implementation are not excessively complex 5], but a good balance of the computational load among the processors is not achieved. It also requires an additional digit reversal shu ing of the data sequence.
Finally, the in-place algorithm, like the SS, generates a rather complex data ow. The architecture must implement a di erent data ow in each stage. For instance, the PDSP16510 processor, from Plessey Semiconductor 10] implements a radix 4 in-place FFT algorithm. This device has eight 4-Kbit RAMs for storing the data. During the FFT computation, data has to be continuously read from the memory and intermediate results rewritten to it. In order to adequately address the RAMs a complicated address sequence and data movement has to be generated.
The processor we propose for the implementation of the SS FFT algorithm has a semisystolic structure, as the required data ow is achieved through multiplexors and interconnection lines. This leads to an advantage in terms of speed with respect to those designs in which the data ow is generated by means of addressing RAM memories (for instance 10]). In the single processor case, our design requires 2N memory cells connected as 2 FIFO queues with N cells each. In order to achieve the data ow required by the SS algorithm, we need to select one output out of the possible n = log r N. In a design based on the use of RAM memories, it would be necessary to generate N addresses. The complexity of our design is consequently low.
The processor we have designed can be integrated into a column of PEs with feedback without any modi cations. The operation of the PEs within the column is identical to their operation in the single processor case. The initial data sequence is distributed among the di erent PEs. The interconnection network in the feedback of the column is in charge of distributing in an adequate manner the data sequence among the PEs for each one of the stages that make up the algorithm. This way, we take advantage of the high degree of parallelism of the FFT (the use of the PEs is 100%).
Both the single processor and the PE column permit the computation of variable sized FFTs over a xed size design. The points to be considered are the adaptation of the length of the FIFO and the control. It is not necessary to modify the interconnection network or the structure of the PEs. Let us assume that we want to compute an FFT whose size is half of the maximum allowed by the design. In this case, we must provide two new inputs to the FIFO queues in intermediate positions (2 v?1 and 3 2 v?1 ) and the control will have to be adapted in order to consider one stage less and N=2 processing cycles less in each stage. These considerations are similar to those that permit the computation of FFTs of variable size using the PDSP16510 processor from Plessey Semiconductor 10].
Finally, we will point out that the architecture we propose in this work for the SS algorithm and the one developed in 14] for the constant-geometry algorithm (CG) share quite a few properties. In fact, we can consider the CG architecture as a particular case of the SS architecture by making ? i = ? 1 , i = 1; 2; : : :; n in equation (7) . Both architectures have the same recirculation network, de ned by the perfect unshu e permutation, although the SS design requires the digit reversal reordering of the input buses and a greater complexity of the RS section. 
APPENDIX Proofs of the lemmas

