Abstract. In this paper, we present a new parallel radix-4 FFT algorithm based on the BSP model. Our parallel algorithm uses the group-cyclic distribution family, which makes it simple to understand and easy to implement. We show how to reduce the communication cost of the algorithm by a factor of three, in the case that the input/output vector is in the cyclic distribution. We also show how to reduce computation time on computers with a cache-based architecture. We present performance results on a Cray T3E with up to 64 processors, obtaining reasonable e ciency levels for local problem sizes as small as 256 and very good e ciency levels for sizes larger than 2048.
Introduction
The discrete Fourier transform (DFT) plays an important role in computational science. DFT applications ranges from solving numerical di erential equations to signal processing. (For an introduction to DFT applications see e.g . 7] .) The widespread use of DFTs in computational science is mainly due to the existence of fast algorithms, known by the general name of fast Fourier transform (FFT), which compute the DFT of an input vector of size N in O(N log N) operations instead of the O(N 2 ) operations needed by a direct approach, i.e., by a matrix-vector multiplication.
In 1965, Cooley and Tukey 9] published a paper describing the FFT idea (giving special attention to the so called radix-2 FFT). Since then, many variants of the algorithm have appeared. For an extensive discussion of the family of FFT algorithms, see Van Loan 27] . In recent years, after the dawn of parallel computing, the originally sequential FFT algorithms have been modi ed and adapted to the needs of parallel computation (see e.g. 1, 2, 3, 8, 11, 12, 14, 15, 21, 22, 25, 27] ).
The lack of a uni ed parallel computing model and the existence of many di erent parallel architectures have made it rather di cult to develop e cient and portable parallel FFTs. Recently, however, as the parallel programming environments have become less machine dependent, examples of such algorithms have appeared. Typical examples are the 6-pass (or 6-step) approach and the related transpose approach (see e.g. 2, 3, 14, 15, 12] ). Those algorithms regard the input vector of size N = N 0 N 1 as an N 0 N 1 matrix, and carry out the computations in a similar way as done for two-dimensional FFTs. Since those algorithms require the number of processors p to be a divisor of N 0 and N 1 , they can only be used if p p N. As Furthermore, to our knowledge none of those algorithms were implemented.
Our main aim in this paper is to present a new parallel FFT algorithm and its implementation. Our parallel algorithm works for any p < N as long as both are powers of two, which is required because of the radix-2 framework. (A mixed-radix framework will be discussed elsewhere 18].) We dedicate the remainder of this section to giving a brief introduction to the basic framework of radix-2 and radix-4 FFTs and to the bulk synchronous parallel (BSP) model. In Section 2, we derive our parallel FFT algorithm by inserting suitable permutation matrices into the basic radix-2 decomposition of the Fourier matrix. This approach leads to a simple and easy to implement distributed memory parallel FFT algorithm. In Section 3, we present a set of templates that are used in the implementation of the algorithm. In Section 4, we present variants of our FFT algorithm. We show how to modify the algorithm to accept vectors that are not in the block distribution. We also show how to obtain a cache-friendly version of our algorithm, that is, an algorithm that takes advantage of the cache memory of a computer by breaking up the computations into small sections in such a way that the data stored in the cache is completely used before new data is brought in. In Section 5, we present results regarding the performance of our implementation and discuss aspects such as the cache e ect. In Section 6, we draw our conclusions and discuss future work. Alternatively, the DFT can be seen as a matrix-vector multiplication: Z = F N z: (1. 3)
The complex matrix F N is known as the N N Fourier matrix; it has elements (F N ) jk = w jk N , where w N = e 2 i N : (1.4) Though it is possible to develop FFT algorithms that compute the DFT of a vector of arbitrary size, the radix-2 FFT algorithm only works for N's that are powers of two. For simplicity, we will restrict our discussion to such values of N. The sequential iterative radix-2 FFT algorithm starts with the so-called bit reversal permutation of the input vector (see Section 3.2), and proceeds in log 2 Following van Loan's matrix approach 27], Algorithm 1.1 can be described as a sequence of sparse matrix-vector multiplications which correspond to the following decomposition of the Fourier matrix 1 F N = A N;N A 8;N A 4;N A 2;N P N ; (1.6) where P N is the N N permutation matrix corresponding to the bit reversal permutation (step 1 of Algorithm 1.1), and the N N matrices A K;N correspond to the butter y stages (step 2 of Algorithm 1.1). The block structure of the butter y stages leads to block-diagonal matrices of the form A K;N = I N=K B K ; (1.7) which is shorthand for a block-diagonal matrix diag(B K ; : : : ; B K ) with N=K copies of the K K matrix B K on the diagonal. The symbol represents the direct (or Kronecker) product of two matrices, which is formally de ned at the end of this subsection. The matrix B K is known as the K K 2-butter y matrix which corresponds to the butter y 1 Actually, the matrix decomposition corresponding to the algorithm of Cooley and Tukey 9] is F N = P NÃN;N Ã 8;NÃ4;NÃ2;N ; whereÃ K;N = P ?1 N A K;N P N . which has the same form as the original B K , but the weights w j K in (1.9) are replaced by w j+ K , where can be any real number. In practice, often a radix-4 FFT is used. A radix-4 algorithm can be derived completely analogously to the radix-2 algorithm, yielding a similar matrix decomposition. The algorithm starts with a reversal of pairs of bits instead of a reversal of single bits, and proceeds in log 4 N 4-butter y stages which involve quadruples of vector components instead of pairs. Since 34 ops are performed per quadruple, this brings the op count down to C (N) = 34 K 4 N K log 4 N = 4:25N log 2 N: The resulting algorithm has the disadvantage that either it must be assumed that N is a power of four, or special precautions must be taken which complicate the algorithm.
We take a slightly di erent approach: wherever possible we take pairs of stages A K;N A K=2;N together and perform them as one operation. Our K K 4- This matrix is a symmetrically permuted version of the radix-4 butter y matrix 27]. 2 This approach gives the e ciency of a radix-4 FFT algorithm, and the exibility of treating a parallel FFT within the radix-2 framework. For example, if we wish to permute the data sometime during the computation, for reasons of data locality, this can happen after any stage, and not only after an even number of stages. An algorithm for the inverse FFT is obtained using the following property:
The backward algorithm is basically the same as the forward one, the only di erence being that the powers of w K are replaced by their conjugates and that the nal result is rescaled. Now, we de ne the direct product of two matrices and give some properties that will be used in the course of the paper.
De nition 1.1 (Direct product). Let A be a q r matrix and B be an m n matrix. As one would expect, the direct product is associative, but it is not commutative. Lemma 1.2 summarizes some direct product properties that follow directly from the definition. (See 24, 27] for other useful properties). gramming model which gives a simple and e ective way to produce portable parallel algorithms. It does not depend on a speci c computer architecture, and it provides a simple cost function that enables us to choose between algorithms without actually having to implement them. In the BSP model, a computer consists of a set of p processors, each with its own memory, connected by a communication network that allows processors to access the private memories of other processors. Accessing local memory (the processor's own memory) is faster than accessing remote memory (memory owned by other processors), but access time is considered to be independent from the computer architecture. In this model, algorithms consist of a sequence of supersteps and synchronization barriers. The use of supersteps and synchronization barriers imposes a sequential structure on parallel algorithms, and this greatly simpli es the design process. The variant of the BSP model that we use is a single program multiple data (SPMD) model, i.e., each one of the p processors executes a copy of the same program, though each has its own data. The program distinguishes between the processors through a parameter s (the processor identi cation number). Special cases are treated using \if" statements. Such methods appeared as a direct consequence of the divide-and-conquer structure of the radix-2 FFT algorithm. The paper 8] by Chu and George discusses several existing parallel algorithms of this type and three variants of their own. Restricting the discussion to vector sizes that are powers of two, they present a common framework in which all the algorithms they discuss are reorderings of one another in the following sense.
Each butter y stage K of an FFT of size N, performs pairwise operations that combine elements j and j + K=2 from the vector being transformed using the weight w jmodK K . Writing j in its binary representation j = (j m?1 ; : : : ; j 0 ) 2 , where m = log 2 N, we observe that elements j and j+K=2 di er only in bit log 2 K?1 and that w jmodK K = w (j log 2 K?1 ;:::;j 0 ) 2 K . If the ordering of the vector is changed, so that original element j is stored as element 8 l, the butter y stages must be modi ed to carry out the same operations. If we can represent the new ordering using a permutation of the original bits, it is easy to know which elements to combine and which weights to use. For example, if N = 16 a possible reordering of the input vector could be l = (j 0 ; j 2 ; j 1 ; j 3 ) 2 , where j = (j 3 ; j 2 ; j 1 ; j 0 ) 2 . The butter y stage corresponding to K = 16 should then combine elements l = (j 0 ; j 2 ; j 1 ; 0) 2 with l + 1 = (j 0 ; j 2 ; j 1 ; 1) 2 using weights w (j 2 ;j 1 ;j 0 ) 2 16 . In the parallel scenario, any group of log 2 p bits can be used to represent the processor number, while the remaining log 2 (N=p) bits are used to represent the local index. If the bit corresponding to the next butter y stage is one of the log 2 (N=p) bits that represent the local index, then that stage is local, otherwise communication is needed.
Swarztrauber 25] carries out a similar discussion. He starts with a more general formulation of the problem, where N is not restricted to powers of two, but when discussing the distributed memory framework, he only considers FFTs on a hypercube, restricting both p and N to powers of two. The problem of the algorithms discussed in 8, 11, 25] is that reorderings are carried out by means of exchanging one bit at a time. Since there are log 2 p bits in the processor part, log 2 p communication supersteps of size O( N p g + l) are needed. A less expensive solution to the problem is to exchange all the processor bits with a group of local bits corresponding to butter y stages that have already been performed. Since the communication cost of the permutation that exchanges many bits is of the same order O( N p g + l) as for exchanging one bit, the reduction in the communication cost is huge.
The basic idea for such algorithms already appears in the original Cooley and Tukey N 0 , the inner sum of (1.15) corresponds to a DFT of size N 0 , which can be computed by the same process as before if N 0 is not prime. They remark that this process can be applied to any possible factorization of N, N = N 0 : : : N H?1 and that, if N is composite enough, real gains (over the O(N 2 ) direct approach) can be achieved. Afterwards they derive the radix-2 algorithm by choosing N to be a power of two. If instead of decomposing N into its prime factors, we stop at a higher level, we obtain a decomposition of the FFT into a sequence of shorter FFTs that, in the parallel case, can be spread out over the processors. This is what happens in our FFT algorithm presented in the next section and in algorithms based on the 6-pass approach and the related transpose approach (see e.g. 2, 3, 14, 15, 12]).
2. The parallel algorithm 2.1. Group-cyclic distribution family and the parallel FFT. Since our parallel FFT algorithm is based on the radix-2 decomposition (1.6) of the Fourier matrix, N must be a power of two. For practical reasons N=p must be integer and therefore p must also be a power of two. Our parallel FFT algorithm makes use of the data distributions de ned below.
De nition 2.1 (Cyclic distribution in r groups, C r (p; N)). Let r, p, and N be integers with 1 r p N, such that r divides p and N. Let f be a vector of size N to be distributed over p processors organized in r groups. De ne M = N=r to be the size of the subvector of a group and u = p=r to be the number of processors in a group. We say that f is cyclically distributed in r groups (or r-cyclically distributed) over p processors if, for all j, the element f j has local index j 0 = (j mod M) div u, and is stored in processor s 0 + s 1 , where s 0 = (j div M) u is the number of the rst processor in the group (i.e., the processor o set) and s 1 = (j mod M) mod u is the processor identi cation within the group.
We use the name group-cyclic distribution family to designate all the r-cyclic distributions generated by the same N and p. This family includes the well-known cyclic distribution (C 1 (p; N)) and block distribution (C p (p; N)) as extreme cases.
The parallel FFT algorithm works as follows. A total of H = dlog 2 N= log 2 (N=p)e = dlogN p Ne phases is performed. (The number of phases is the largest integer H for which (N=p) H?1 N.) In phase 0, the algorithm uses the block distribution to perform the rst log 2 (N=p) butter y stages, i.e., those involving butter ies with K N=p (which we call short distance butter ies). Afterwards, in each intermediate phase 1 J < H ? 1 it uses the r-cyclic distribution, with r = p=(N=p) J , to perform a group of log 2 (N=p) butter y stages with (N=p) J < K (N=p) J+1 (the medium distance butter ies). Finally, in phase H ? 1, it uses the cyclic distribution to perform the remaining butter y stages, i.e., those involving butter ies with K > (N=p) H?1 (the long distance butter ies). Figure 1 illustrates the use of the group-cyclic distribution family in our parallel FFT. The same operations are illustrated in two ways: (A) using the logical view, and (B) using the storage view. The logical view emphasizes the logical sequence of the elements in the vector while the storage view emphasizes the way the elements are actually stored. For the block distribution, both views are the same.
Our algorithm has only O(log N= log ( On the other hand, each permutation ? u is a strict communication superstep. In the next section we give a complete description of the resulting parallel algorithm.
Implementation of the parallel algorithm
We present our algorithm in the form of templates, which are high level of detail algorithms, ready to be implemented, though not necessarily completely optimized. In the list that follows we introduce the terminology used in our parallel algorithms. Indexing. All the indices of vectors or array structures are global. This means that array elements have a unique index which is independent of the processor that owns it. This property enables us to describe variables and gain access to arrays in an unambiguous manner, even though the array is distributed and each processor has only part of it. (In an actual implementation, it is more convenient to convert the indexing scheme to a local one.)
Communication. Communication between processors is indicated using g j Put(pid; n; f i ) This operation puts n elements of array f, starting from element i, into processor pid and stores them there in array g starting from element j. Subscripts are not needed when the rst element of the array is 0 or when communicating scalars. When communicating more than one element, we use boldface to emphasize that we are dealing with a vector and not a scalar. All puts are assumed to be bu ered, so that they are safely carried out, even if f and g happen to be the same. where the second term corresponds to the extra cost we have to pay for performing 2-butter ies. The communication and synchronization costs of our parallel FFT are discussed in Section 3.5 after we discuss the parallel permutation subroutines. Let y be a vector of size N = 2 m block distributed over p = 2 q processors. Suppose that we want to permute it by a bit reversal permutation, i.e., perform y P N y. Applying Corollary 3.2 with u = p, it is possible to split the parallel bit reversal permutation in two parts as follows. Having as basis the block distribution, from now on, we use Proc(k) = k div N p to denote the processor in which element k is stored, and k 0 = k mod N p to denote the local index of the element.
2. y (I p PN p ) y, which is a local bit reversal permutation in the local index t 0 : t 0 ! k 0 = revN p (t 0 ) Algorithm 3.3 carries out a parallel bit reversal using this idea. If we combine the local bit reversal (superstep 2 of Algorithm 3.3) with the short distance butter y phase (superstep 2 of Algorithm 3.1), we obtain a complete local sequential FFT. This means that we can easily replace the two supersteps by any optimized FFT subroutine we can lay our hands on. 3 by sending packets of data. This is done in a similar way as when permuting from block to 21 cyclic distribution (see Section 3.4); the only di erence is in the destination processor, which is rev p (j mod p) instead of j mod p. 3.3. Permutations within the group-cyclic distribution family. Permuting a vector from the C r 1 (p; N) distribution to the C r 2 (p; N) distribution, where r 1 = p=u 1 and r 2 = p=u 2 may be any possible group sizes, not necessarily powers of two, can be done as follows: rst, use ?1 u 1 to permute the vector to the block distribution, and then use u 2 to permute it to the C r 2 (p; N) distribution. This operation is expensive if performed in parallel, because all the data have to be moved twice around the processors. The best approach is to combine both permutations into one: we need to sum the computation, communication, and synchronization costs. The computation costs were already obtained in Section 3.1. To simplify the nal result we only include the higher order term of the total computation cost (3.2), C FFT;par;Comp (N; p) = 17 4 N p log 2 N; (3.7) which is exact when only 4-butter ies are performed. The communication and synchronization costs are the costs involved in performing the bit reversal and the permutations related to the group-cyclic distribution family. The maximum amount of data sent or received during a permutation involving complex numbers is equal to N=p complex values (or 2N=p real values). If the permutation is performed with puts, the number of synchronizations is 1, giving a total cost of C permut (N; p) = In Section 5, we discuss the validity of cost function (3.9) as an accurate estimator of the true cost of the FFT algorithm.
4. Variants of the algorithm 4.1. Parallel FFT using other data distributions. Up to now, we discussed an FFT algorithm where the input and output (I/O) vector must be block distributed. There exist applications of the FFT where it would be better if the I/O vector would be distributed by other distributions or where the distribution can be freely chosen (see e.g. 14, 19, 28] ) . Here, we discuss how to modify our parallel FFT algorithm to accept I/O vectors that are not distributed by the block distribution. 25 The rst and the last supersteps of Algorithm 3.1 are permutations. Because of this, the algorithm can be modi ed to accept any I/O data distribution without any extra communication cost, or even at a smaller communication cost depending on the desired distributions. If the input vector is not in the block distribution, the algorithm is modi ed by combining the redistribution to block distribution with the bit reversal permutation. If the output vector is expected to be in a distribution other than the block distribution, this is done by replacing the permutation from cyclic to block distribution by a permutation from the cyclic to the desired distribution.
If the desired distribution for the output vector is the cyclic distribution, the last communication superstep can be completely skipped. The rst permutation can also be skipped if the input vector is already stored by the distribution associated with the bit reversal permutation. Applications where the input vector is bit reversed and the output vector is cyclically distributed are advantageous, because, in such cases, two complete permutations can be skipped. 4 This saves two thirds of the total communication cost in the common case that p N=p, leaving only one permutation in the middle of the computation.
While the cyclic distribution is simple and widely used, the distribution associated with the bit reversal permutation is awkward. Fortunately, it is possible to modify Algorithm 3.1 so that the cyclic distribution is a natural input distribution, i.e., a distribution that does not involve any communication as the rst superstep. This is done as follows. The rst three supersteps of Algorithm 3.1 are described by the matrix decomposition ? u AN p ;N : : : A 2;N P N ; (4.1) where u = min(p; N=p IN p ) . In the case that the input vector is already cyclically distributed, the rst superstep can be skipped. We can use this exibility to reduce the computation cost of some combinations of p and N by inserting the permutations so that a maximal number of generalized butter y stages are paired o . Another reason to permute the vector at an earlier stage is that the sizes of the butter y phases can be better balanced (so that all factors of N have approximately the same size). This would enhance the performance on a cache-sensitive computer (see the discussion in Section 5). An even a more e ective way of enhancing the performance on a cache-sensitive computer is to reduce the butter y sizes so that they always t completely into the cache. We suggest a method in the following subsection.
4.3. Cache-friendly parallel FFT. Each computation superstep of our parallel FFT algorithm performs a butter y phase which consists of a sequence of generalized butter y stages represented by the operation y R l;n y, where l and n are powers of two with 2 l n, and R l;n = A n;n A 2l;n A l;n ; (4.3) is an n n matrix. Suppose that the cache memory of a computer is such that the data needed by a butter y phase of size n=v, where v < n is a power of two, ts totally in ? ( n v ) j ; (4.4) where the n n matrixÂ k;u is an abbreviation forÂ k;u;v;n = ? u;v;n A ku;n ? ?1 u;v;n . Generalized versions of Theorem 2. The matrix decomposition (4.4) can be used to construct an alternative (cache-friendly) algorithm for the computation of the generalized butter y phases. Note that if = 0 then the resulting algorithm can be used to construct a cache-friendly sequential FFT algorithm.
Experimental results and discussion
In this section, we present results on the performance of our implementation of the FFT. We implemented the FFT algorithm for the block distribution in ANSI C using the BSPlib communications library 16]. Our programs are completely self-contained, and we did not rely on any system-provided numerical software such as BLAS, FFTs, etc.
We tested our implementation on a Cray T3E with up to 64 processors, each having a theoretical peak speed of 600 M op/s. The accuracy of double precision (64-bit) arithmetic is 1:0 10 ?15 . We also give accuracy results from calculations on a Sun Workstation using IEEE 754 oating point arithmetic, which has a double precision accuracy of 2:2 10 ?16 , and which is the standard used in many computers such as workstations. To make a consistent comparison of the results, we compiled all test programs using the bspfront driver with options -O3 -flibrary-level 2 -fcombine-puts and measured the elapsed execution times on exclusively dedicated CPUs using the system clock. The times given correspond to an average of the execution times of a forward FFT and a normalized backward FFT.
5.1. Accuracy. We tested the overall accuracy of our implementation by measuring the error obtained when transforming a random complex vector f with values Re(f j ) and Im(f j ) uniformly distributed between 0 and 1. The relative error is de ned as jjF ? Fjj 2 =jjFjj 2 ; where F is the vector obtained by transforming the original vector f by a forward (or backward) FFT, and F is the exact transform, which we computed using the same algorithm but using quadruple precision. Here, jj jj 2 indicates the L 2 -norm. Table 1 shows the relative errors of the sequential algorithm for various problem sizes. Since the error for the forward and backward FFT are approximately the same, we present only the results for the forward transform. The errors of the parallel implementation are of the same order as in the sequential case. In fact, the error of the parallel implementation only di ers from the error of the sequential one if the butter y stages are not paired in the same way. This validates the parallel algorithm. The results also indicate that IEEE arithmetic is superior to the CRAY-speci c arithmetic. Table 1 . Relative errors for the sequential FFT algorithm.
where Time(seq; N) is the execution time of the sequential implementation. Analyzing the performance of an FFT algorithm by using the number of ops of the radix-2 FFT as basis is a standard and useful procedure. By doing so, it is possible to compare di erent algorithms with di erent cost functions and also evaluate the overall performance of the algorithm as a function of N. Table 2 gives timing results and FFT op rates for various problem sizes. The op rates show that the performance of the algorithm increases until N = 4096, when it suddenly drops. This sudden decrease in performance happens because the data space allocated by the program becomes too large to t completely in the cache memory of the CRAY T3E, 5 which means that the computation becomes more expensive, because more accesses to the main memory are needed.
5.3. Scalability of the parallel implementation. The timing results obtained by our parallel algorithm are summarized in Table 3 . We also present the theoretical predictions using the cost function (3.9) and the values of the BSP parameters v, g, and l listed in Table 4 which were obtained using a modi ed version of the benchmark program from BSPpack 6 17] . Except for the out-of-cache computations (boldface entries in the table), the timings show that the BSP cost function predicts the behavior of the parallel implementation well. The discrepancy between predicted and measured results for out-of-cache computations is to be expected, since the computation speed, which we assumed to be constant, suddenly drops when the computations cannot be done completely in cache. These results show that the BSP model is a valid tool for analyzing and predicting parallel performance. as done in Figure 4 . In theory, E abs (p; N) 1, and our goal is to achieve e ciencies as close to 1 as possible. The gure shows moderate e ciencies for small problem sizes (N 4096). For N 8192, e ciencies of up to 1:7 are achieved. Such amazing e ciencies are possible because of the so called cache e ect: when N 8192 the total amount of memory needed by the FFT is too large to t in the cache memory of one processor, but, if the problem is executed using a su ciently large number of processors, the memory required by each processor becomes small enough to t in the cache. This e ect is welcome, but it masks the real scalability of the algorithm. Note that there is a sudden rise in the op rate when the local problem size becomes small enough to t in the cache. In this way the cache e ect can be easily spotted and the scalability of the algorithm better judged. FFT sizes that t completely in the cache (N 4096) have a completely di erent behavior than larger problems. For small sizes (N 4096) the e ciency decreases notably in going from one to two processors, then it is more or less constant up to 8{16 processors and after that it decreases steadily. For large sizes (N > 16384) the op rate is nearly constant, both before and after the transition out-of-cache/in-cache, indicating a good scalability. The cases 8192 N 16384 are intermediate cases where there is an increase in the e ciency in going from one to two processors, but a deterioration of performance when the number of processors becomes too large.
We can also examine the scalability of our parallel algorithm by increasing the problem size together with the number of processors 13, 21] , for instance by maintaining the local problem size N=p constant and increasing p. In doing so, we can learn about the asymptotic behavior of our algorithm. Figure 5 shows the predicted and measured e ciencies as a function of p for various values of N=p. The predicted values converge to a horizontal line as N=p increases which means that asymptotically the e ciency can be maintained at a constant level if N=p is maintained constant. The experimental results must be analyzed keeping in mind the cache e ect, which causes the sudden increase in the e ciency. It is clear that e ciency can be maintained at reasonable levels for N=p as small as 256, and at very good levels for N=p = 4096. 
Conclusions and future work
In Section 2, we presented a new parallel FFT algorithm, Algorithm 3.1. This algorithm is a mixed radix-2 and radix-4 FFT. It was derived from the matrix decomposition corresponding to the radix-2 algorithm by inserting suitable permutation matrices corresponding to the group-cyclic distribution family. The use of the group-cyclic distribution family gives a parallel algorithm which is simple to understand and easy to implement.
The use of matrix notation proved to be a powerful tool for deriving and adapting the parallel FFT algorithms to our needs. With the help of matrix notation, we showed how to modify our original algorithm to accept I/O vectors that are not in block distribution, without incurring extra communication cost. Indeed, if the vector is cyclically distributed, we showed how to eliminate the rst and the last permutation altogether, reducing the communication to one third of the original cost. Since the cyclic distribution is simple and widely used, this property can be exploited to obtain faster applications. A prime application of the cyclic distribution would be in the eld of quantum molecular dynamics where a potential energy operator is applied to the input vector representing a wave packet and a kinetic energy operator is applied to the output vector 20]. Since both operations 34 are componentwise, any distribution can be chosen and hence the cyclic distribution is preferred.
We presented results concerning the performance of our implementation of Algorithm 3.1. The tests were carried out on a Cray T3E with up to 64 processors. Our implementation proved to scale reasonably well for small problem sizes (N 4096) with up to 8 processors, and to scale very well for larger problem sizes (N 16384). In part, the very favorable results obtained for larger N are due to the cache e ect. Because of this e ect, we analyzed our results in terms of FFT op rate per processor, and also by using the theoretical cost function. Both analyses con rmed the previous results.
We also analyzed the asymptotic behavior of our algorithm by maintaining the local problem size N=p constant and increasing p. We concluded that the e ciency level is maintained as long as N=p is large enough. A study of the experimental data obtained on the Cray T3E indicates that reasonable e ciency levels (E(p; N) 0:5) are already maintained for N=p as small as 256, and good e ciency levels are maintained for N=p > 4096.
Because the cache-based architecture of the Cray T3E in uences our results so much, and there are many other computers with a similar architecture, we proposed the use of cache-friendly FFT algorithms. A cache-friendly sequential algorithm can be derived from our parallel algorithm by substituting the processors by virtual processors. It is also possible to derive a cache-friendly parallel algorithm by writing each generalized butter y phase as a sequence of smaller generalized butter y phases. We expect the scalability of such an algorithm to be similar to the theoretical scalability of our algorithm.
