In this work, we propose parallel FFT algorithms, for medium-to- 
Introduction
The Fast Fourier Transform (FFT) algorithm formulated by Cooley and Tukey in 1965 [5] , provides an efficient method for the analysis, design and the implementation of Digital Signal Processing (DSP) algorithms.
The extent to which these algorithms can be performed in real time has in many cases limited by the rate at which the FFT algorithm can be executed. The high performance requirement for real time implementation of these algorithms led to the design of special purpose hardwares. An extensive research has been conducted to implement efficient FFT algorithms on vector processors [6, 11] , and general purpose parallel architectures with shared memory [1, 4, 10] and distributed memory [12, 13, 15] .
The purpose of this paper is to investigate the efficient parallelization of one-dimensional FFT algorithm on medium-to-coarse grain, distributed-memory, message-passing architectures (multicomputers) implementing the hypercube interconnection topology. In order to achieve speedup on such architectures, the algorithm must be designed so that both computations and data can be distributed to the processors with local memories in such a way that computational tasks can be run in parallel, balancing the computational loads of the processors.
Communication between processors to exchange data must also be considered as part of the algorithm. One butterfly scheme. His scheme also minimizes both the number and the volume of concurrent communications to d and d(N=2P), respectively. His scheme performs the communications in the first d stages of the FFT computations and overlaps communications with the computation of the complex coefficients (complex exponentiations). However, in most of the DSP applications, N-point FFT is applied successively to N-point input data sets, for a fixed N. In such applications, the computation of complex coefficient values as they are needed, will be extremely inefficient. In general, N=2 complex coefficient values can be computed once and stored in a table (during preprocessing), so that they are accessed by simple table-lookup operations during successive FFT computations.
In this work, we propose and present more efficient and elegant parallel FFT algorithms for medium-to-coarse grain hypercube-connected multicomputers. The proposed algorithms preserve the best features of the existing work in the literature as follows: (i) use simplified-butterfly scheme, (ii) achieve perfect load-balance, (iii) exploit the efficient table-lookup scheme for handling complex coefficients, (iv) allow only nearestneighbor communications, (v) minimize the number of concurrent communications to d by eliminating fragmentary message passing, (vi) minimize the volume of communication in each concurrent exchange step to N=2P, (vii) overlap communications with one half of the addition/subtraction operations during the d concurrent exchange communication steps. The proposed algorithms achieve these nice features by using a simple yet efficient mapping scheme and a restructuring for the FFT algorithm which enables overlapping communication with computation.
Different strategies exist for the computation of FFT [5, 7, 8, 14] . The FFT scheme chosen for parallelization is radix-2, Cooley-Tukey scheme using the decimation-in-time decomposition [5] . Section 2 presents and discusses the computational structure of this FFT scheme. Parallelization of the chosen FFT scheme is discussed in Section 3. The mapping scheme proposed to achieve perfect load-balance is presented in Section 3.1. Section 3.2 presents two different schemes proposed to overlap communication with computation. Section 4 presents the implementation results and the relative experimental performance evaluation of the proposed algorithms on a 32-node iPSC/2 hypercube multicomputer.
The Sequential FFT Algorithm
The computational flow graph of radix-2, Cooley-Tukey FFT scheme using the decimation-in-time decomposition is illustrated in Fig. 2 for a (N =32)-point FFT computation. In this scheme, the input is in bit-reversed order and the output is in normal order. The numbers in the normal decimal order used in Fig. 2 to illustrate the output sequence also denote the indices of the corresponding FFT points used for in-place computations.
As is seen in Fig. 2 , each stage of the computation takes a set of N complex numbers and transforms them into another N complex numbers by performing N=2 simplified-butterfly computations. This process is repeated n =lg 2 N times, resulting in the computation of the desired discrete Fourier transform in normal order. The simplified-butterfly computations required at stage k of an N-point FFT is, temp = W r N X k (q); X k+1 (p) = X k (p) + temp; X k+1 (q) = X k (p) ; temp (1) where q =p+2 k , W r N =e ;j( 2Π N )r , for k = 0 1 : : : n ;1. times to identify and perform the computations associated with the 2 k FFT butterflies in the respective block. 
where t calc is the time taken by the floating point multiplication, addition and subtraction operations. The computations of the complex coefficients (Wfac) are not involved in the given complexity analysis since they are performed only once during the preprocessing.
Parallel FFT Algorithms
Computational structure of the FFT algorithm is very suitable for parallelization on multicomputers implementing the hypercube interconnection topology. The distribution of data and computations is straightforward for medium-to-coarse grain parallelism whenever the number of FFT points, N = 2 n , is greater than or equal to the number of processors, P = 2 d , in the hypercube. Successive processors in the decimal ordering are assigned the consecutive slices of the X-array with each slice containing equal number of, M = N=P, consecutive FFT points. In this scheme, each processor is responsible for carrying out the complete in-place computations required for the FFT points assigned to itself. The horizontal dashed-lines in Fig. 2 illustrate this straightforward mapping for a 32-point FFT data and computations on a 2-dimensional hypercube, with each processor assigned M =8 FFT points.
Computational interdependencies in a particular stage of an FFT algorithm are confined within butterflies belonging to that stage. Furthermore, FFT butterflies are confined within consecutive blocks of size 2 k+1 at stage k. Note that, block size increases as 2 4 8 : : : 2 n;d during the first n;d stages k = 0 1 2 : : : n ;d;1. Thus, FFT blocks and hence FFT butterflies are not fragmented during the first n;d stages since the straightforward mapping scheme assigns consecutive M =N=P =2 n;d FFT points in blocks to consecutive processors in decimal ordering. Hence, no interprocessor communication is required during the first n;d stages. However, in the last d stages, FFT blocks and hence FFT butterflies are fragmented between processors thus necessitating interprocessor communication.
In the (n;d) th stage k = n;d;1, each processor P i computes only a single FFT block B i n;d;1 of size M for i = 0 1 : : : P ;1. During the following d stages k =n;d n;d+1 : : : n ;1, consecutive FFT blocks of size 2 1 M 2 2 M : : : 2 d M, are fragmented across consecutive processor blocks containing 2 1 2 2 : : : 2 d consecutive processors in decimal ordering. Note that, each processor block constitutes a lower dimensional hypercube called here subcube. Hence, during the last d stages, k = n;d n;d+1 : : : n ;1, FFT blocks are fragmented across`= 1 2 : : : ddimensional subcubes, respectively, where`= k ;(n;d)+1. The fragmentation of an individual FFT block of size 2`M across an`-dimensional subcube is such that i th processors (for i=0 1 : : : 2`; 1 ;1) in the first and second halves ((`;1)-dimensional subsubcubes) of that subcube hold M p -points and M q -points, respectively, of the 2`; 1 M butterflies involved in that block. Hence, in the last d stages, p and q points of FFT butterflies separated by 2 n;d 2 n;d+1 : : : 2 n;1 , are assigned Prog. 2, csend and crecv denote synchronous (blocking) communication primitives which achieve the transmission and receive of the communication packets. As is seen in Prog. 2, this scheme introduces a local storage overhead of size M =N=P due to the local receive buffer XRB. Note that, the size of the local X-array is N=P.
The straightforward mapping scheme avoids fragmentary message passing and assigns interacting FFT subblock pairs to the neighbor processor pairs of the hypercube. However, this scheme has two major drawbacks. 
Here, t su denotes the message set-up time overhead, and t tr denotes the time taken for the transmission of a complex floating-point word (2 4 bytes) between two neighbor processors.
Perfect Load Balance
The straightforward mapping scheme used in Prog. 2 already maintains perfect load-balance during the first (n ; d) stages since all processors hold equal number(s) of unfragmented FFT blocks. Hence, this mapping is maintained during the first (n ; d) stages of the parallel algorithm proposed in this subsection. As is indicated earlier, one half of the processors hold only updated values for the p-points and the other half hold only updated values for the q-points of the butterflies during the last d-stages. This static mapping scheme is altered at the beginning of each stage of the last d stages. At the very beginning of each stage, each processor holding updated values for N=P q-points exchanges one half of its q-points with one half of the p-points of its neighbor processor which holds all the p-points of its butterflies at that stage and vice versa. This exchange operation is not only the exchange of data values to be used at that stage. In fact, processors effectively exchange the responsibility of the further FFT computations associated with those exchanged FFT points.
This scheme maintains a single unfragmented FFT subblock of size M = N=P at each processor during the last d stages. This dynamic mapping scheme is illustrated in Fig. 3 for a 32-point FFT on a 4-processor hypercube. The pseudo-code for the node-program of the proposed parallel FFT algorithm is given in Prog. 3. We only present the last d concurrent exchange communication phase of Prog. 3 since the initial mapping scheme and the first for-loop is exactly similar to Prog. 2. As is seen in Fig. 3 and Prog. 3, each processor exchanges either the first half or the second half of its local X-array, in-place, by simply checking the (`;1) th bit of its processor index, where (`;1) denotes the channel across which the exchange operation is to be performed at that stage. Due to the dynamic mapping scheme, each processor performs simplified-butterfly computations on local p and q pairs separated by 2 m;1 = N=2P after the exchange operations at each stage of the last d stages. Hence, in this scheme, each processor holds equal number of p and q points which form unfragmented FFT subblocks of size M =N=P points (consisting of M=2 p-points and M=2 q-points) after the exchange operation. Each processor performs equal number of (N=2P) complex multiplications thus achieving a perfect load-balance. This scheme requires no extra storage for send/receive buffers since send/receive operations are performed in-place from/to the local X-array. Furthermore, the volume of concurrent communication at each exchange communication step is reduced by a factor of two (from N=P to N=2P complex floating-point words). Hence, the parallel complexity of the proposed scheme is,
The set-up times (t su ) of the mutual send operations are overlapped in an exchange operation. The overlap of mutual data transmissions ((N=2P )t tr ) between a pair of processors is feasible only when two physical links are present between neighbor processors as in iPSC/2. However, the internal hardware architecture of an individual iPSC/2 processor is such that internal bus conflicts occur due to the outgoing and incoming long messages during an exchange operation. Hence, a complete overlap cannot be achieved in iPSC/2 during the mutual data transmission phase of the exchange operation. The performance of the exchange operation can be modeled as (t su + (N=2P)t tr ) on iPSC/2, where is measured to vary between 1:3 and 1:6 with varying incoming/outgoing message size [2] . Note that, =1 corresponds to complete overlap.
In Prog. 3, each processor issues a synchronous receive just after the synchronous send operation. Due to the perfect load-balance, communicating processor pairs perform the synchronous send operations concurrently.
A synchronous send operation returns the control back to node program only after the outgoing message leaves the indicated send area in the local X-array. Whenever an incoming message begins to arrive to a destination processor, it does not find a pending receive, hence it is copied to a temporary system buffer by the node operating system NX. Later, whenever the receive operation is issued by the node program, that message is copied from the temporary system buffer to the indicated half of the local X-array. Hence, late issue of the receive operation introduces a block copy overhead (N=2P)t copy where t copy represents the time taken to copy a single complex floating-point word from the system buffer to the indicated half of the X-array.
Note that, such a receive overhead is not included in the parallel time complexity model given in Eq. 3 for Prog. 2. In Prog. 2, due to the lack of load balance, communicating processor pairs do not initiate send operations concurrently. The model in Eq. 3 is given for the bottleneck processors which stay idle, waiting for the multiplication results from their neighbor processors. These processors always issue early synchronous receives thus avoiding the receive overhead.
As is seen in Fig. 3 , the output results are slightly scrambled (in N=2P blocks) in this scheme due to the proposed dynamic mapping scheme. In most DSP applications a sequence of DSP blocks are applied consecutively on a set of input data. A proper output/input interface between successive DSP blocks can always be maintained, if the output data order of a particular DSP block is disturbed for the sake of efficiency.
For example, in most DSP applications, a set of time domain sample data is transformed into the frequency domain by applying an FFT DSP block. Then, a sequence of frequency domain operations are performed.
The results are transformed back to the time domain by applying an inverse FFT DSP block. The sequence of frequency domain computations including the inverse FFT block can easily be modified to operate on the disturbed output order of the previous FFT block. Hence, the order of input and output data of individual DSP blocks does not bring any inefficiency to the overall application.
Overlapping Communication with Computation
There are strong data dependencies in the FFT algorithm. As is seen in Prog. 
Scheme 1: Asynchronous Send and Synchronous Receive
The pseudo-code for the node program of the parallel FFT algorithm which overlaps communication and computation is given in Prog. 4. The initial static mapping for the first (n ; d) stages and the dynamic mapping for the last d-stages are similar to the scheme given in Prog. 3. However, as is seen in Prog. 4, the first for-loop iterates only (n;d;1) times which is one less than the iteration count in Prog. 3. Then, at each iteration (computation stage) of the second for-loop, each processor initiates the send portion of the exchange operation required at the following stage. Each processor classifies its computational tasks at each stage into two categories: those updates to be sent to its neighbor processor in the following stage and other updates to be kept local for the following stage. Then, each processor first performs the computations associated with those points required by the neighbor processor in the next stage. Hence, each processor first performs N=2P complex multiplications associated with its local N=2P q -points. Then, each processor updates either the values of its local p-points or q-points simply by checking the`t h bit of its processor index. Here,`denotes the channel across which the exchange operation is to be performed in the next stage. Upon completion of these N=2P updates, each processor issues an asynchronous (non-blocking) send (isend in Prog. 4) to initiate the transmission of the updated N=2P FFT-point values to the neighbor processor. After initiating the send operation, each processor completes the local computations associated with that stage by updating the other half of its local FFT points that will be kept local for the following stage. Upon completion of the second type updates each processor issues a synchronous receive to complete the already initiated exchange.
/* Computations over the first (n ; In Prog. 4, the first inner for-loop of the second outer for-loop computes the updates to be sent to the neighbor processor while storing the intermediate muliplication results into the second half of the local X-array to be reused in the second inner for-loop. Each processor stores these local update results into a send buffer (XSB array) of size N=2P since it does not need these results in further FFT computations. Note that, updated local p points sent from the local buffer XSB will be the q points of the following stage, and vice versa. After initiating the asynchronous send operation, the second inner for-loop computes the local updates to be kept local for the following iteration and stores them either into the first or the second half of the local X-array depending on the type of updates, p or q point updates, respectively. Hence, N=2P complex addition/subtraction operations in the second inner for-loop are overlapped with communication. Thus, communication is overlapped with one fifth of the computations involved in a stage. The receive portion of the exchange operations can be done in-place into the local X-array since synchronous receive is issued after the completion of the overall computations associated with that stage. Hence, the local storage overhead is only N=2P due to the local send buffer XSB. The only computational overhead is the loop overhead since two for-loops are required instead of one. The number of floating-point computations is exactly equal to that of Prog. 3. Thus, the parallel complexity of the proposed algorithm is, N=2P ) of the local X-array are used as two consecutive receive buffers. As is seen in Prog. 5, incoming messages are received either into the first or the second buffer in a cyclic manner according to`being even or odd, respectively. This switching receive buffer scheme is chosen to avoid the contamination of the message received in the previous iteration by the incoming message expected in the current iteration. In Prog. 5, the local variables pstart and qstart are used to inform the following iteration about the two particular quarters of the local X-array which will contain the updated p and q points, respectively. The local variables pptr and qptr are used to index the local X-array for accessing updated p and q points. In the first inner for-loop of the first outer for-loop, the results of the local updates to be sent to the neighbor processor are stored into the send buffer XSB, whereas the multiplication results are temporarily stored into the second quarter of the local X-array to be reused in the second inner for-loop. In the second inner for-loop, results of the local updates to be kept local for the following iteration are either stored into the first or the second quarter of the local X-array depending on the type of updates, p or q point updates, respectively.
The size of XSB-array is N=2P as in the case of Prog. 4. Hence this scheme introduces an extra local storage overhead of size N=P due to the two receive buffers in the second half of the X-array, compared to Prog. 4.
On the other hand, the computational overhead introduced compared to Prog. 
As is seen in Eq. 7, receive overhead is avoided by the early issue of the asynchronous receive. Table 1 illustrates the relative performance features of the proposed parallel FFT algorithms. The local storage overhead in Table 1 denotes the extra local storage requirement for send/receive operations in addition to the local X-array of size N=P. The parallel performance of these algorithms are expected to increase with increasing program index as is also verified experimentally in the following section.
Experimental Results
All programs presented in this paper have been coded in C language and run on a 32-node iPSC/2 hypercube multicomputer for various data sizes, 256 N = 2 n 64K. Figure 4 , illustrates the variation of percent performance improvement of Prog. 3 compared to Prog. 2 during the d exchange communication phase. As is seen in Fig. 4 , Prog. 3 outperforms Prog. 2 as expected since it achieves perfect load-balance and reduces the volume of communication by a factor of two compared to Prog. 2. As is also seen in Fig. 4 , percent performance improvement of Prog. 3 compared to Prog. 2 increases with increasing data size. respectively, in spite of the for-loop overheads. Also, as is indicated in [3] , complete overlap cannot be achieved due to the internal architecture of an individual iPSC/2 processor. respectively. These overlapped programs do not perform better than Prog. 3 for small granularities due to the for-loop overhead and the insufficient amounts of local computations for overlapping communications (see Eq. 6). The relative performances of the overlapped algorithms are expected to be much higher on larger dimensional hypercubes and architectures which enable complete overlap.
Figures 7(a) and 7(b) show speed-up and efficiency curves for Prog. 5. As is seen in Fig. 7(a) , almost linear speed-up is achieved for N 4K and P 32. As is seen, in Fig. 7(b) , efficiency remains over 94% when N=P 128 FFT points are mapped to an individual processor of the hypercube.
Conclusion
We have proposed more elegant and efficient, three parallel FFT algorithms for medium-to-coarse grain hypercube-connected multicomputers. All the proposed parallel FFT algorithms achieve perfect load-balance 
