AbstractÐThis paper presents a modular algorithm which is suitable for computing a large class of multidimensional transforms in a general purpose parallel environment without interprocessor communication. Since it is based on matrix-vector multiplication, it does not impose restrictions on the size of the input data as many existing algorithms do. The method is fully general since it does not depend on the specific nature of the transform kernel and, therefore, it may be used for a wide variety of transforms. Moreover, since some one-dimensional Fast Fourier Transform algorithms map the input sequence onto two or more dimensions, the new method also may be employed to efficiently compute the 1D FFT in parallel. In addition, the proposed algorithm is exploited to derive a fully systolic VLSI architecture performing multidimensional transforms, which does not need the transposer required by classical architectures.
INTRODUCTION AND MOTIVATION
T HE usefulness of the 2D Discrete Fourier Transform (DFT) is well-known in a large number of application areas. The multidimensional (3D or 4D) DFT also has been proposed as an effective tool, e.g., in computer vision and pattern recognition, to facilitate object recognition [1] and time dependent analysis [2] , [3] , in video telephony to compute motion from a sequence of images (multiframe detection) [4] , [5] , and in some nuclear magnetic resonance imaging algorithms [6] . The large amount of computation required by these algorithms makes a parallel implementation desirable.
Although other transforms (Hartley Transform [7] , Discrete Cosine Transform [8] , Discrete Sine Transform [9] , Hadamard Transform [10] , etc.) have been developed in order to avoid the operations on complex numbers that are required by the Fourier transform, the bulk of the literature about parallelism in transforms is devoted to the DFT.
In [11] , the one-and two-dimensional Fast Fourier Transform (FFT) implementations are discussed over a kdimensional array of k Processing Elements (PEs): In that model, an interconnection system allows each PE to communicate with its neighbors in any direction. The analysis shows that a minimum number of I/O operations occurs in the hypercube interconnection scheme.
Thus, though architectures with shared hierarchical memory to perform data transposition [12] or schemes requiring high connectivity [13] have been shown, it is important to develop algorithms that reduce the need to communicate among the PEs. If communications between the PEs can be eliminated, two advantages are gained: No time is wasted on interprocessor I/O operations and no synchronization is required among the PEs. In fact, some research has focused on reducing or avoiding the interprocessor communications that are usually required to transpose the data between the 1D DFTs (see Section 2) . These studies essentially exploit the possibility to compute the multidimensional DFT by a certain number of independent 1D DFTs or perform this computation by matrix-vector multiplications.
The first approach is to compute the 1D DFTs by using fast algorithms (many 1D FFT algorithms have been presented, e.g., in [14] ). Most of the known algorithms impose some restrictions on x (the size, along each axis, of the input array), reducing the generality of this approach. For example, [15] presents an algorithm that computes an x k point k-D DFT (where x is a prime number) by evaluating x k À Iax À I independent one-dimensional DFTs.
Other results are shown in [16] : The most interesting are limited to the 2D DFT, and are related to the value of x p P (with p a prime number) and x P n . In the first case, p P p independent x-point 1D DFTs and p I independent ppoint 1D DFTs, in the second case QaPP n independent xpoint 1D DFTs and one P nÀI Â P nÀI two-dimensional DFT are required.
In [17] , the authors present a parallel implementation of the algorithm described in [18] on an AT&T BT100 binary tree computer. This algorithm computes an x Â x twodimensional DFT by v yx independent x-point 1D DFTs whose input vectors are given by Discrete Radon Transforms over v data sets. These data sets are extracted from the input matrix according to criteria based on linear congruences [18] .
Though the algorithm does not impose any restriction on x, its first step is to find the linear congruences which span a quadratic x Â x grid and possess a common solution (0, 0). Unfortunately, this can be performed by means of simple formulas only for some specific values of x [18] . Moreover, it provides a solution only for the 2D case and it relies on the following relationship among the transform coefficients:
This relation is valid for the Fourier coefficients x k 3 k x e ÀjP% x k , but not for other useful Transforms (e.g., DCT, DST, Hartley Transform, etc.).
Property (1) is also exploited in [19] , where the multidimensional case is treated too. The Reduced Transform Algorithm is used to balance the communication and the computation in a parallel machine: Performance for an iPSC/860 is provided. Unfortunately, this algorithm requires that the sizes of the input array are equal to prime numbers.
Such restrictions are not required in approaches based on matrix-vector multiplication. This approach allows a straightforward VLSI implementation and has the same computational complexity as a 1D DFT if the problem size does not allow a fast implementation.
The matrix-vector approach was first explored by Winograd [20] . In [21] , a detailed analysis is performed showing that matrix-vector multiplication approach is attractive when the input data are sequentially available. This is because the matrix-vector multiplication does not need the whole input data set to start its computation. This work is more suitable for fairly small transforms since the amount of numeric computation grows more rapidly than fast algorithms.
The paper is structured as follows: The 2D DFT and its standard parallel algorithm are briefly summarized in Section 2; the parallel approach based on the matrix-vector multiplication is defined in Section 3; Section 4 provides a performance evaluation; Section 5 describes a fully systolic VLSI architecture which exploits the proposed algorithm in order to avoid the transposer, which is needed by classical architectures, and Section 6 is devoted to conclusions.
CLASSICAL PARALLEL IMPLEMENTATION OF THE 2D DFT
Let X x rY s Y H rY s`x f g be an x Â x matrix. Its 2D DFT is the x Â x matrix Y y pY q Y H pY q`x f g , defined as [22] :
where 3 x is:
The classical parallel approach to implementing the distributed computation of the x Â x 2D DFT may be summarized as follows: Because of Step 3, if there is one PE for each row, x À I data need to be sent and received by each PE, which limits the speed of this approach. Thus, for k P, the classical algorithm requires the input of data (i.e., one row for each PE), x independent 1D DFTs (i.e., one by each PE), data transposition, x independent 1D DFTs, and the output of the final result. In the next section, an approach based on matrix-vector multiplication which needs no interprocessor communications is described.
A MATRIX-VECTOR MULTIPLICATION APPROACH TO THE PARALLEL MULTIDIMENSIONAL DFT COMPUTATION
The following definitions are provided for the 2D case. The extension to the multidimensional case is postponed to the end of this section. The use of the method for rectangular matrices is straightforward. Let W x w x iY j Y H iY j`x f g be a symmetric x Â x matrix having:
By applying (4) to (2), we obtain: X SX Equation (5) may be expressed in matrix form: . Â denotes a matrix-vector multiplication and . Á denotes matrix transposition.
Now, if:
. are the qth column of Y and W x , respectively the following compact forms may be provided:
This scenario suggests a parallel approach to computing the 2D DFT: In fact, (6.a) may be computed by x independent computing phases (i.e., one for each PE), one for each value of p (alternatively, if storage by columns is preferred, (6.b) may be computed x times, one for each value of q). In both cases, each independent computing phase needs yx P operations. The external matrix-vector multiplication in (6'.a) and 
where:
. Á d e denotes the smallest integer greater than or equal to the argument, and . Á Ã denotes the complex conjugate.
Consequently, only R xÀI P AE Ç x real multiplications (instead of Rx P ) have to be effectively computed between the x complex points of the vector obtained by the first matrix-vector multiplication X Â w g q and the x complex points of the first xÀI P AE Ç rows, starting from the second one (this rough evaluation also includes the trivial multiplications by 0 and AEI that occur, e.g., for the first point of any row of W).
Multidimensional Case
g be its k-dimensional DFT defined as [22] :
Now, if m k is the preferred direction along which to sort the output data, (9) may be rewritten as:
IH
This equation may be evaluated by x independent and parallel computations, each one associated with a fixed value (i.e., m k ) of the index m k .
Each of these computations has an yx k computational complexity and outputs x kÀI output values
g . An important consideration has to do with reference to the memory requirement. From (9) , it may seem that a memory size of yx k is necessary for storing W and X in any PE, but this is not strictly required. In fact, by means of (7), only one row of W is necessary:
Moreover, any row of X may be overwritten over the previous one because any point of X is processed only once by the inner matrix-vector multiplication of (6).
The possibility of processing the data by every PE in accordance with (6) , where rows are multiplied by vectors w g x q , avoids delaying the computing with respect the input phase. In other words, the processing may start as soon as the first point is received by the system: This is not possible by the classical algorithm, where the PE related to the transformation of the last row has to wait for it, as shown in Fig. 1 .
In conclusion, the new algorithm requires inputting the data to transform into each PE, x independent matrixvector multiplications (one per PE), x independent 1D transforms, and the output of the final result. The input phase and the first computing phase may be interleaved, i.e., any PE may start its processing as it receives its data.
1D FFT Algorithms and Other Multidimensional Transforms
FFT algorithms perform the 1D DFT efficiently. Many algorithms have been introduced [14] , [23] , [24] , [25] , in general, they exploit properties of 3 m x for certain values of x. Since many of them map the 1D input sequence onto two or more dimensions [22] , the proposed algorithm may be easily extended to these cases. In addition, since the proposed algorithm is completely independent of the specific nature of the Fourier coefficients, it may be used to perform a large class of k-D Transforms that may be expressed as:
where k iI i and x m i Y n i are, respectively, the normalizing and the transforming coefficients. For instance, such class includes the k-D Discrete Hartley Transform [7] , the k-D Discrete Cosine Transform [8] , the k-D Discrete Sine Transform [26] and may be implemented in parallel with no interprocessor communications by a simple change in the matrix W x of our algorithm.
The Discrete Wavelet Transform (DWT) [27] , [28] , [29] is a mathematical technique that decomposes a signal in the time domain by using dilated/contracted and translated versions of a single basis function, named the prototype wavelet. Recent research suggests that the DWT is preferable to other transforms, especially for image compression [30] , [31] , [32] , [33] , [34] , [35] . A nontrivial extension of the proposed algorithm performing the multidimensional DWT is presented in [36] .
PERFORMANCE EVALUATION
In order to evaluate the performance of the algorithm described in Section 3, we have compared it to the classic algorithm described in Section 2. Table 1 shows the operations required by both approaches to transform a two-dimensional x Â x matrix, by means of x PEs.
Preliminary Considerations
For both strategies, Steps 3 and 4 are identical.
For suitable values of x, Step 1 in the classic algorithm (1D x-points DFT), may be computed by an FFT in y x log x operations, otherwise it has the same computational complexity as the x Â x matrix times x-points vector multiplication required in the new algorithm yx P . The new algorithm is more attractive because it does not require the data transposition (corner turning, Step 2) needed by the classic algorithm, when the number of PEs used is less than or equal to x. In fact, the time required by this step is strongly dependent on both the architecture and the strategy chosen to perform it, but it mitigates the advantage of a fast computing of Step 1 in the classic algorithm.
Another advantage of the new algorithm results from the nature of Step 1. Processing can begin within each PE as soon as data becomes available. This is denoted by the dotted line between Steps 0 and 1: These steps, in fact, may conveniently be interleaved. On the contrary, the ith PE in the classic algorithm needs the ith row to transform it and start its computation (i.e., the last PE has to wait the last row before starting its task).
Comparison between the new algorithm and other approaches aiming to reduce the communications between PEs may be also performed. As we have previously indicated, the new algorithm imposes no restrictions on the value of x. Moreover, the new algorithm is general so that it may be used also for other transforms having real coefficients (i.e., in case the input data is real, not complex) because it is not based on the separability property of the Fourier Transform from (1).
Experimental Results
The algorithm introduced in Section 3 has been implemented and tested on an AT&T DSP-3 parallel processor. The DSP-3 parallel processor is a machine with 16 to 128 PEs. Each PE has 512K by 32-bit memory and a DSP32C processor operating at a frequency of 50 MHz, performing 25 MFLOPS, and having a multiplier and an adder that operate in parallel [37] , [38] . Table 2 summarizes the experiments performed to compare the new algorithm with the classic algorithm. The data for x PY RY VY IT were obtained directly from C++ programs running on the DSP-3. The data for x b IT was derived by simulating larger machines with our hardware that has 16 PEs.
The first column of the table gives the size of the input data (for this 2D case, x is the number of rows and the number of columns) that coincides with the size of the PE array. The Fourier transforms required in Step 1 of the classic algorithm and in Step 3 of both the algorithms have been computed both by means of fast (second and third columns) and normal (fifth and sixth columns) routines. The fourth and the seventh columns indicate the speed-up achieved by the new algorithm over the classic algorithm. Figs. 2 and 3 graphically show the data of Table 1 . The figures show that, for all cases, the new algorithm performs better than the classic one. This is more evident when the fast computing of the Fourier Transform cannot be performed. The speed-up decreases as x increases because the time consumed by Steps 0 and 4 assumes a growing weight as the number of processors grows.
In these experiments, moreover, the computing is started in both the algorithms at the same time, without exploiting the feature of processing the data as soon as available, as is possible in the new algorithm.
VLSI IMPLEMENTATION
Advances in VLSI technology allow implementation of parallel processing on a single chip. Therefore, the approach proposed in Section 3 (new algorithm) can be exploited to design a fully systolic VLSI architecture avoiding the transposer that is required by classical architectures [39] , [40] . The transposer requires a large area for global interconnection and time for loading and unloading.
Most of the architectures available in the literature restrict the value of x to a prime number or to a power of 2 [41] , [42] : These limitations are not required by the new algorithm. To implement the new algorithm for the twodimensional case, a twice iterated matrix-vector multiplication is required: In the literature, many efficient architectures are available to perform this task.
A Fully Systolic Architecture
In order to better exploit the VLSI technology, we have focused on systolic array implementations. These structures have these essential features:
. synchrony: data rhythmically computed and flowing; . modularity and regularity: modular PEs, simply and regularly connected; . pipelinability: the speed-up may be linearly increased by pipelining; . boundary input and output: the I/O operations with the external world are performed only by the boundary PEs. There are two well-known semisystolic arrays that perform matrix-vector multiplication [43] . Neither of them is fully systolic: The first one generates a single point of the result in each processor ! i (i.e., the output is not available from the last PE), while the second array needs the input vector to be preloaded with one point for each processor i . These features may be exploited by merging the functionality of ! i and of i in one processor i in order to obtain a modular fully systolic array computing a two-dimensional transform by the new algorithm.
Let
i be a processor performing the task of ! i during a first processing phase and the task of i during a second phase. Data computed by the first phase are locally stored and used as input data for the second phase. Both computing phases need x steps. Table 3 describes how the various computing steps are performed by an array constituted by four processors i . A two-dimensional transform may be obtained by an x Â x array, as shown in Fig. 4 . A block-diagram of iYj is shown in Fig. 5 .
This architecture: . has a minimum number of boundary cells;
. uses an input data stream and generates an output data stream; . does not require any data preloading. These features ensure maximum throughput, minimum latency and a minimum number of I/O pins [44] .
Recent advances of the 3D VLSI technology make feasible shorter and more systematic wire routing, as well as higher circuit density [45] . Also, it is important that there are no high fan-out ªglobalº signals other than the clock. Therefore, the new algorithm for the 3D case can be directly realized by a 3D VLSI implementation [46] , [47] .
Complexity Estimation
We estimate the complexity of the 2D array shown in Fig. 4 by using the ere Á ime P e Á P complexity estimation method [45] , [48] .
The area complexity of each of the x P multiply-add cells is y log x ; the area occupied by wires is y x P since a vertical chord or horizontal chord crosses x wires: Therefore, the entire area complexity is y x P log x . The array computes the complete transform (i.e., x P values) in Px clock cycles, each one of y log x units of time; therefore, the entire time complexity is y x log x . Thus, e Á P y x R log Q x À Á . For the 2D FFT, the e Á P complexity of the lower bound is y x R log P x À Á [49] , [50] : Therefore, the architecture implementing new algorithm has an e Á P complexity that is only y log x higher than the theoretical optimum one.
Conclusions
A parallel algorithm based on matrix-vector multiplication has been introduced, which is suitable for computing a large class of k-D transforms on general purpose parallel machines. Our attention is focused on the Fourier Transform, but these results are applicable to other transforms, since the algorithm has no dependency on the nature of the Fourier coefficients.
The main feature of the new algorithm is the absence of interprocessor communications: Some tests on the AT&T DSP-3 parallel machine have shown speed-up of 1.4 to 3.0 with respect to the classical 2D FFT approach.
The proposed parallel approach can be efficiently adapted to a set of 1D FFT algorithms by mapping the 1D input sequence onto a multidimensional array.
For its structure, the new algorithm does not impose any restriction on the size of the array, i.e., it may be performed with any value of x, and this is convenient: In fact, zero padding approaches that are used to increase the input size to a suitable one can enlarge the data size tremendously if performed along multiple dimensions.
The new algorithm does not need the acquisition of the whole set of input data (required by the classic approach) for starting the computation.
A modular and fully systolic VLSI implementation has been derived also. Since it exploits the proposed algorithm in order to avoid the transposer, it has an ere Á ime P complexity that is within a factor of log x of the lower bound for a 2D FFT. This modest increase is largely compensated by the generality of our architecture, which is also able to compute other transforms. Moreover, by needing a minimum number of boundary cells, using data streams as I/O and not requiring preloading of data into cells, it ensures a maximum throughput rate and a minimum number of I/O pins and latency. In 1990, he became a professor of electrical and computer engineering at the University of Texas at Austin, where he holds the Schlumberger Centennial Chair in Engineering. His research interests are in application specific processing, the interaction between computer architecture and VLSI technology, and the history of calculators. He has made significant research contributions in computer arithmetic, VLSI development, and digital signal processor implementation.
Prior to joining the University of Texas at Austin he was with TRW Defense and Space Systems in Redondo Beach, California, from 1975 to 1990, where he held positions ranging from staff engineer to laboratory manager and, most recently, director of independent research and development for the TRW Defense Systems Group. Currently, he is the hardware area editor for ACM Computing Reviews, a subject area editor of the Journal of Systems Architecture, and editor of the Calculators column of the IEEE Annals of the History of Computing. He is a fellow of the IEEE and an inaugural member of the IEEE Computer Society Golden Core. He has been honored as an Outstanding Electrical Engineer and a Distinguished Engineering Alumnus of Purdue University, and as a Distinguished Engineering Alumnus of the University of Colorado.
