Multi-dimensional discrete Fourier transforms (DFT) are typically decomposed into multiple 1D transforms. Hence, parallel implementations of any multi-dimensional DFT focus on parallelizing within or across the 1D DFT. Existing DFT packages exploit the inherent parallelism across the 1D DFTs and offer rigid frameworks, that cannot be extended to incorporate both forms of parallelism and various data layouts to enable some of the parallelism. However, in the era of exascale, where systems have thousand of nodes and intricate network topologies, flexibility and parallel efficiency are key aspects all multi-dimensional DFT frameworks need to have in order to map and scale the computation appropriately. In this work, we present a flexible framework, built on the Redistribution Operations and Tensor Expressions (ROTE) framework, that facilitates the development of a family of parallel multi-dimensional DFT algorithms by 1) unifying the two parallelization schemes within a single framework, 2) exploiting the two different parallelization schemes to different degrees and 3) using different data layouts to distribute the data across the compute nodes. We demonstrate the need of a versatile framework and thus a need for a family of parallel multi-dimensional DFT algorithms on the K-Computer, where we show almost linear strong scaling results for problem sizes of 1024 3 on 32k compute nodes.
INTRODUCTION
The multi-dimensional discrete Fourier transform (DFT) has proven to be an ubiquitous mathematical kernel, that is widely used in a Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. multitude of applications from different scientific fields like molecular dynamics [1] [2] [3] , material sciences [4] [5] [6] [7] , quantum mechanics [8] [9] [10] [11] . As most of these applications spend a significant amount of their total execution time on computing multi-dimensional DFTs, it is vital that the parallel multi-dimensional DFT be efficient as we move into the exa-scale era. As multi-dimensional DFTs are defined in terms of multiple 1D DFTs, parallel implementations of the multi-dimensional DFTs can be classified into two distinct classes. The first class focuses on exploiting parallelism within the 1D DFTs, while the second class exploits the inherent parallelism across multiple distinct 1D DFTs. Most state-of-the-art frameworks for computing 3D DFTs like FFTW [12] , P3DFFT [13] , FFTE [14] opt to parallelize across the multiple 1D DFTs.
We illustrate the limitations of the conventional approach of parallelizing the multi-dimensional DFT in Figure 1 , where we report the performance of three different parallelization schemes for computing the 3D DFT of 64 3 and 1024 3 on the K-Computer. Notice that for the data size of 64 3 , parallelizing the 3D-DFT on a 1D grid of processor such that each processors computes a 2D plane yielded the shortest execution time. However, for the 1024 3 input, the same parallelization scheme scales only to 32 processors, while better performance can be attained by switching to a parallelization scheme comprising of a 2D grid of processors where each processor computes a small batch of 1D DFTs. Just as with the 1D parallelization scheme, the parallelism with a 2D grid of processors is restricted to at most 1024 processors. By parallelizing within the 1D DFT for the last dimension, we obtain an even shorter execution time using a much greater number of processors. These observations highlight two major limitations of existing approaches: 1) there is a need for a framework that allows one to seamlessly switch between parallel algorithms for computing the multi-dimensional DFT as different parallel algorithms are superior for different problem sizes, and 2) there is a need to parallelize both across and within the 1D DFTs to scale to larger number of compute nodes.
Increasingly there is interest in scaling the parallel implementations of the multi-dimensional DFT to higher number of nodes using higher dimensional grids [15] . These newer algorithms are a subset of the parallel DFT algorithms highlighted in Johnson et. al. [16] , that showed that a significant number of algorithms can arise from the parallelization both within and across multiple 1D DFTs. However, these algorithms are largely unexplored in practice as the existing parallel DFT frameworks make implementing these algorithms difficult. In addition, due to the rigidity of the existing frameworks, the resulting performance is still unsatisfactory.
In this work, we present a flexible framework for employing different parallelization schemes to compute the multi-dimensional DFT on a multi-dimensional computation grid of processors. By recognizing that the inputs and outputs of a multi-dimensional DFT are multi-dimensional arrays of data, we leverage insights from the multi-linear algebra (tensor) community to simplify the management of communication and data layout of the data on a multi-dimensional grid.
Contributions. The paper makes the following contributions:
• A flexible framework that allows the unification of two different parallelization schemes for multi-dimensional DFTs on multi-dimensional grids • Demonstrate the need for a family of parallel DFT algorithms in order to attain high performance for any given problem size and network configuration parameters.
• Detailed analysis of when to chose between algorithms given the problem size and the network topology configuration parameters.
THE DISCRETE FOURIER TRANSFORM
In this section, we briefly present the decomposition of both the 1D and multi dimensional DFTs, expressing the algorithms in terms of linear algebra operations and outlining the possibilities for parallelizing the computation.
The 1D Discrete Fourier Transform
The 1D DFT is a matrix-vector multiplication, where given the input x, the output y is obtained as
The DFT n is the n × n DFT matrix, defined as
with ω n = e −k 2π n being the complex root of unity. The DFT matrix is a dense symmetric matrix such that DFT T n = DFT n , where (·) T represents the transposition operation. The typical computation of the DFT is implemented using the Fast Fourier Transforms (FFT), where instead of performing O(n 2 ) complex arithmetic operations by doing the matrix-vector multiplication, a recursive decomposition of Figure 2 : The decomposition of the 1D DFT using the CooleyTukey algorithm for a given problem size of 16 = 4 · 4. The algorithm requires four steps, namely two batch DFTs (I) and (IV) of size 4, a point-wise operation (II) and a transposition (III). Data is stored in column major order.
the DFT matrix is performed to obtain an O(n log(n)) algorithm. The most widely-know of these algorithms is the Cooley-Tukey algorithm [17] . The Cooley-Tukey algorithm is described as a factorization of the DFT matrix of size n when n is a composite number such as n = n 0 n 1 .
An alternative view of the mathematical description of the CooleyTukey algorithm defined in [17] is the following:
wherex andỹ are two matrices of size n 0 × n 1 and n 1 × n 0 , that represent the input x and output y, respectively. The columns of matrixx (ỹ) are n 1 (n 0 ) contiguous sub-vectors (i.e. x i , 0 ≤ i < n 0 or y i , 0 ≤ i < n 1 ) of the input (output) vector x (y), and are obtained by splitting x (y) into sub-vectors of size n 0 (n 1 ) as follows
The matrixỹ is similarly constructed. The reason for the formulation of the Cooley-Tukey algorithm as described by Equation 3 is that the four stages of the algorithm (as depicted in Figure 2 ) are made explicit. The first step applies the DFT n 1 on the rows of the matrixx (I), assuming data is stored in column major order. The elements within each column of either matrixx orỹ are in consecutive locations in memory. This means that in order to compute the first stage of the DFT, elements are accessed at a stride of n 0 . The result of the first compute stage is 
where ω n represents the complex roots of unity. The resultant matrix is then transposed (III) and finally the second DFT of size n 0 is applied in the rows (IV). Again, elements required for the fourth stage are read at a stride. In addition, due to the matrix transposition, the second DFT cannot start computation until all previous stages have been completed.
Parallelizing the 1D DFT
Parallelizing the 1D DFT on p processors implies all four compute stages to be parallelized. Traditionally, the input matrixx (stored in column major order) is split such that each processor receives n 1 /p columns of size n 0 . Distributing the data using this data layout implies that computation cannot start since each processor does not have the necessary data points. As such, an all-to-all communication step is needed to re-distributes the data such that each processor receives n 0 /p rows of size n 1 . The first and second stage of the Cooley-Tukey algorithm can then be applied locally. A second allto-all communication is performed to transpose (stage III) the global data across the different processors. After this communication step, each processor owns n 1 /p rows of size n 0 and thus can locally compute the last stage of the algorithm. Storing the data in the correct order requires a third communication step. The described computation gives the so-called six step algorithm [18] defined as
The parallel implementation of the six step algorithm is depicted in Figure 3 , at requires a total of three communication steps. The astute reader will recognize that some of the communication steps can be avoided by storing the initial data in a different layout. This has been observed in literature, but seldom exploited in practice.
n-Dimensional DFTs
Multi-dimensional DFTs can be defined in terms of multiple 1D DFTs and multi-dimensional DFTs of lower dimensions. For example, Figure 4 shows two variants for computing the 3D DFT. The first algorithm represents the so-called slab pencil decomposition [19] [20] [21] , where the 3D DFT is decomposed into a 2D DFT followed by a batch of multiple 1D DFTs. The second algorithm represents the pencil-pencil-pencil decomposition [19] [20] [21] , where the 3D DFT is decomposed into three batches of 1D DFTs, where each 1D DFT is applied in the three dimensions. The slab-pencil decomposition views the input (output) column vectors x (y) as 2D matricesx (ỹ) of size (n 0 n 1 ) ×n 2 . Mathematically the decomposition is expressed as
Data is stored in column major, hence the 2D DFT is applied on the columns, while the 1D DFT is applied on the rows. The pencil-pencil-pencil algorithm views the input (output) vectors are reshaped into 3D cubesx (ŷ) of size n 0 × n 1 × n 2 . The input (output) column vector x (y) is decomposed into n 2 groups of n 1 subgroups of size n 0 . Hence, the 3D cubex can be viewed as matrix of matrices such asx
where eachx i is the 2D matrix of size n 0 ×n 1 for all values 0 ≤ i < n 2 .
Mathematically, the pencil-pencil-pencil algorithm is expressed aŝ
where the DFT n 0 is applied in the depth dimension, the DFT n 1 is applied in the column dimension and finally the DFT n 2 is applied in the row dimension. Data is store in column major and therefore the dimension corresponding to n 0 is laid out in the fastest dimension in memory, while the dimension corresponding to n 2 is in the slowest dimension in memory. 
Parallelizing the n-Dimensional DFTs
Since the multi-dimensional DFT can be decomposed into multiple lower-dimensional DFTs, most prevalent FFT frameworks exploit this feature for parallelization. We illustrate this parallelization scheme on the two algorithms for computing the 3D DFT that were presented previously. Parallelizing the slab-pencil algorithm is performed by viewing the processors as a 1D grid and distributing the data cube such that each processor receives a slab of n 2 /p 2D data of size n 0 × n 1 . Each processor then locally computes multiple 2D DFTs on the slabs. Data is then exchanged across processors so that each processor can compute (n 0 × n 1 )/p 1D DFT of size n 2 . Finally, another communication step is required to store the data in the correct format as seen in Figure 5 .
The pencil-pencil-pencil algorithm is parallelized in the two dimensions corresponding to n 1 and n 2 across a mesh of processors of size p y × p z . Each processor receives a batch of 1D pencils as seen in Figure 6 . Each processor can apply the local 1D DFTs. In order to apply the other two 1D DFTs data is first exchanged between the processors in the p y dimension of the mesh, followed by the communication between the processors on the p z dimension. Each communication stage rotates the data cube from xyz to yzx and zxy as seen in Figure 6 . Each communication step requires data to be packed and un-packed before and after the all-to-all communication. Finally, in order to store the data in the correct format a global communication between all processors is required. However, the DFT is typically used in larger applications like convolutions or PDE solvers, therefore the last stage of communication for both the slab-pencil and pencil-pencil-pencil can be dropped. This improves reduces execution time in favor of having the data in shuffled format.
Limitations of existing parallelization schemes
Notice that in both cases, the amount of parallelism is limited by the size of a particular dimension of the data. For the slab-pencil algorithm, the maximum number of processors that can participate in the computation is max(n 0 , n 1 , n 2 ) as each processor is assigned a 2D slab of data. Similarly for the pencil-pencil-pencil case, the number of participating processors is upper bound by max(n 0 n 1 , n 1 n 2 , n 0 n 2 ) since each processor is assigned at least one 1D DFT to compute locally. The level of parallelism within the communication is also limited by the size of a particular dimension of the data. The slab-pencil distribution spreads the data on a one dimensional grid. Hence, exchanging the data for computing the 1D DFT require all processors to communicate. The pencil-pencil-pencil case parallelizes the communication in two dimensions, which is done between groups of processors. All frameworks that implement the parallelization across the 1D DFT depend on efficient all-to-all communications [22, 23] . However, as we will briefly show in the result section, the all-to-all communication cannot efficiently scale as the number of processors increase. 
n-DIMENSIONAL DFTS IN ROTE
The inputs and outputs of multi-dimensional DFTs are multi-dimensional data arrays. These are commonly referred to as tensors in the multilinear algebra community. As such, we leverage the insights from that community for distributed data management on a distributed grid through the use of the Redistribution Operations and Tensor Expressions (ROTE) framework [24] .
The Redistribution Operations and Tensor Expressions Framework
The Redistribution Operations and Tensor Expressions (ROTE) framework provides a formal notation for describing the data layout of tensors that are distributed on multi-dimensional meshes. By describing the data layouts before and after a data redistribution operation, ROTE provides the infrastructure to map the necessary data movement into a sequence of one or more collective communication subroutines to achieve the desired data movement. In order to make the paper self-contain, we present a high-level overview of the notation of ROTE and refer the reader to existing documentations on ROTE for more details [24] . 
is an order-2 tensor. For consistency within the paper, we use the terms "dimensions", and "size" to refer to the order of the tensor, and dimension of the mode respectively. In addition, we annotate the matrix A with two superscripts (ι and η) to denote the size of the two modes such that A is ι × η. Multi-dimensional processing grids are defined within ROTE as multi-dimensional arrays of processing units. Let G be the ddimensional grid size. Similar to the data tensors, an order-d G grid will have d dimensions and each dimension have a specific number of processors. For example, the 1D grid defined as
is a 1 × 2 processor mesh on which data can be distributed. ρ represents the column dimension of size one, while ν represents the row dimension of size two.
Describing Global Data Distributions.
A key aspect of ROTE is the description of multi-dimensional data that is distributed across multi-dimensional meshes. This description is based on the idea of distributing each dimension of the data over specific dimensions of the computational grid. We illustrate with the following example. Consider distributing our two dimensional tensor A ιη across the 1D processor grid, G ρν such that columns of A ιη are distributed in a cyclic fashion across the processors (i.e. Figure 7a ). This is described in ROTE with as A ιη having the distribution
To understand the expression, we note that there are two pairs of round parenthesis within the square brackets. The pair of round brackets tells us that A ιη is a two-dimensional tensor. In addition, there is only a single number, 0, within the round brackets. This describes that the processing grid has only one dimension. Finally, the zero within the second parenthesis tells us that the second dimension of A, i.e. the columns of A ιη , are distributed across the grid of processors. The distribution is an elemental-cyclic distribution of the columns of A ιη across the grid. The notation can easily be extended to describe the blocked distribution of data over the processing grid, an important data layout for the computation of the multi-dimensional DFT (See Figure 7b) . The key difference between this distribution and the elementalcyclic distribution is that η 0 consecutive columns, where η 0 |η, are distributed to each processors as opposed to assigning one column at a time to each processor in a cyclic fashion. Expressing the blocked distribution of the matrix A in ROTE notation can be done as follows
where column dimension of size η = η 0 η 1 is blocked or tiled by η 0 . Blocking or tiling the data with a block of size b can be expressed as adding an another dimension to the data set [25] . In the previous example, the dimension described by η is split into two, namely η 0 and η 1 and η 0 is set to the block size (in this case, 4). The dimension corresponding to η 1 is distributed across the two processors on the grid as denoted by having the zero within the third pair of parenthesis. Each processor receiving a block of η 0 elements in a cyclic fashion.
From Distribution to
Communication. ROTE implements data communication over the computation grid by specifying the original and final data layout. For example, the following expression
describes the original tensor A, where the columns of the matrix A are distributed element cyclic across the processor grid, being mapped to the new tensor B, where the rows of the tensor are distributed in an elemental-cyclic fashion. ROTE maps the above description to the all-to-all collective communication. Given the before and after data distributions, the infrastructure within ROTE determines the necessary packing, collective communication and un-packing required.The detailed steps that ROTE takes in order to perform the redistribution from tensor A to tensor B are captured in Figure 8 . The parallel implementation of the 3D DFT using the volumetric decomposition. The data is distributed elemental cyclic in all three dimensions across a 3D mesh of processors of size 2×2×2. The implementation requires three communication steps and the elemental cyclic distribution is preserved between the computed stages.
Parallel 1D DFT using ROTE
In this section, we describe how ROTE can be used as our framework for computing the 1D DFT using the Cooley-Tukey algorithm. Recall that the Cooley-Tukey algorithm decompose the computation of the 1D DFT of size n = n 0 n 1 into four stages involving computations with two dimensional tensors (i.e. matrices). Using the notation from ROTE, we can re-express Equation 3 as
ι 0 ι 1 and η 1 η 0 represent the size of the two-dimensional tensors of the input and output respectively. In the following paragraphs, we derive the 1D DFT parallelization across p processors using ROTE's notation. Recall that the input matrix is distributed across the processors (in a one dimensional grid) such that each processor has η 1 /p columns. However under this data layout, the computation requires an initial redistribution of the input data such that each processor holds the row of the input matrixX ι 0 ι 1 . This naturally means that the input data (to avoid the redistribution at the start) has to be distributed as
where the rows are distributed in elemental cyclic fashion. Notice that while ι 0 × ι 1 must be equal to the input size, the Cooley-Tukey algorithm places no other constraints on what ι 0 and ι 1 must be. However, it is natural to make ι 1 = p since that will ensure that the rows are cyclic-ly distributed across all p processors. The computation stage of the parallel 6-steps algorithm is then able to perform the local DFT computation, followed by the local point-wise multiplication with the twiddle matrix Twid 0 . Since the computation is an element-wise complex multiplication it follows that the distribution of the twiddle matrix Twid 0 must be the same with that of the inputX , i.e.
After the first two steps, the intermediate result in the temporary array T is
where T Figure 8 shows the steps required to perform the transposition. First the block transposition is applied such that
This operation is mapped to the all-to-all collective communication.
Each processor then transposes the data locally such that
Finally, each processor applies its local computation of the last stage of the Cooley Tukey algorithm. The output Y is computed as
The computation between the two matrices is paired on the dimension corresponding to ι 0 . It can be seen thatX ι 0 ι 1 [(0)()] and Y η 1 η 0 [(0)()] have the same distribution. The first local computation does the DFT of size n 1 followed by the twiddle scaling, while the second local computation does a transposition and the DFT of size n 0 . An instantiation of the parallel algorithm in the ROTE framework is shown in Figure 10 . Notice that the besides the advantage of removing redundant communication with the elemental cyclic distribution, the computation of the DFT using ROTE has the property that the data remains in elemental cyclic distribution after the computation of the 1D DFT [26] . In addition, for a problem size n = n 0 n 1 , where p|n 0 and p|n 1 , the 1D DFT algorithm requires only one communication stage and not three as in the case of the blocked distribution. If one of the conditions is not satisfied then either the mesh is modified or extra communication are required as shown in Inda et. al [26] . We leave the scenario when the above conditions are not satisfied as future work.
EXPERIMENTAL RESULTS
In this section, we describe the experimental setup and the results obtained on the K-Computer, outlining the importance of having a flexible framework that allows one to chose the appropriate algorithm for a given problem size. We first present the characteristics of the system. We then briefly present some of the implementation details for parallelizing the computation using MPI and OpenMP, emphasizing the simplicity of expressing the computation within the framework. We then present strong scaling results for the 3D DFT for 64 3 , 256 3 and 1024 3 using all three decomposition (slabpencil, pencil-pencil-pencil and volumetric). We give a breakdown of the performance and emphasize that depending on the problem size a family of algorithms is need. There is no one decomposition that gives the shortest time to solution.
System Configuration
The K-Computer at Riken AICS consists of approximately 80, 000 SPARC64 VIIIfx processors. Each compute node contains a single CPU with 16 GB of main memory and a total bandwidth of 64 GB/s. Each SPARC64 VIIIfx has 8 cores with one thread per core running at 2.0 Ghz and 6 MB of L2 cache. Each core can do 128-bit single instruction multiple data (SIMD) instructions, giving a peak performance for double precision fused multiply-add instructions of 128 GFlops. Each compute node is connected with a 6D mesh/torus network (TOFU network). The TOFU network provides a logical 3D torus network for each job. Each node has 10 links with a bandwidth of 10GB/s full-duplex and the network allows for multiple pathways to communicate data. The infrastructure allows programs to be adapted to one, two and three dimensional torus networks. The maximum number of processors in one dimension is 54, while the maximum torus size is 48 × 54 × 32 [27] . The DFT framework is written in C/C++. The current implementation uses FFTW for the local computation, OpenMP to parallelize the computation, ROTE to describe the data distribution and the Figure 11 : Example of applying thread level parallelism on the local computation using OpenMP. Each processor receives a cube of data points, that is divided between the threads. Each thread takes ownership on that chunk of data and applies its computation and data transformations. In this example, each thread applies computation in two dimensions followed by a rotation across the main diagonal of the data cube.
communication between the processors and MPI to parallelize the computation across the compute nodes. The entire framework is compiled with the custom compiler provided on the K-Computer. All optimization are set to -O2. In addition, the -fopenmp and -lfftw3 flags are enabled for compilation.
OpenMP + MPI Parallelism
The framework uses OpenMP [28] parallelism for the local computation and MPI [29] for the distributed computation. The MPI is hidden away by the ROTE infrastructure. As seen in Figure 10 , where we parallelize the 1D DFT of size 256 on four processors, ROTE allows users to specify the grid, the tensors, the distribution and the communication. Writing the communication is straightforward. The construction of the processor grid g requires the MPI_COMM_WORLD global communicator and the shape of the grip gridShape. Based on the shape of the grid, the global communicator is split accordingly. The construction of the data tensors A and B requires the shape of the tensor, the distribution and the grid on which the data is distributed on. For example, order-2 tensor A is is distributed on the first dimension [(0)()], while order-2 tensor B is distributed on the second dimension [()(0)]. Each tensor object provides functions to specify the communication. For the 1D DFT the AlltoAllRedistFrom is used to specify the all-to-all communication described in Figure 8 .
Parallelizing the computation using OpenMP is made explicit. We uses #pragma omp parallel region to specify and create the pool of threads. Based on the thread id (omp_get_thread_num), each thread will compute on its own chunk of data as shown in Figure 11 . We use the clause #pragma omp barrier to synchronize the threads and the clause #pragma omp master to specify that only the master thread can perform the computation. Previously this claused was used around the redistribution function AlltoAllRedistFrom. However, modifications to the original framework were done in order to move the #pragma omp_master within ROTE's redistribution function, closely around the actual MPI all-to-all collective communication. The overall benefit is that a single pool of threads is created up front and those threads are used both for performing the computation and packing the data. There is no need to create additional threads within the packing routine, since thread creation is expensive. The K-Computer provides eight threads in total, and all eight are used for both the local computation and packing.
One Large All-to-All vs. Multiple Small All-to-Alls
Parallelizing the multi-dimensional DFTs across and within the 1D DFTs requires multiple all-to-all communication steps. For example, the slab-pencil approach requires one all-to-all, while the pencilpencil approach requires two all-to-alls. The decision of which approach to use given the size of the DFT and the total number of processors depends on the total time spent in data communication.
The overall time spend in data movement can be determined as the sum of each communication step. In turn, each communication step can be estimated using the simple model
where α represents the time to establish the communication between the processors (latency), β represents the time required to send one byte of data between two processors (inverse bandwidth) and n represents the total amount of data. Following the analysis provided in [30] , similar lower bounds can be derived for the all-to-all collective such as MST: log(p)α + log(p)β n 2p (19) Bucket:
Equation 19 represents the lower bound of the all-to-all collective if a minimum spanning tree (MST) implementation is used. Under this assumption, the p processors are connected via a hyper-cube topology and the communication requires loд(p) stages, where each two processors exchange n 2p data points. Equation 20 represents the lower bound of the all-to-all collective if a bucket algorithm is used. Given this implementation, the p processors are connected via ring topology, where each processor has a left and a right neighbor. The communication requires (p − 1), where each processor receives a chunk of data from the processor on the left, keeps part of the data for itself and the remaining data is sent to the right processor. In both equations, as the number of processors p increases, the β part decreases (MST) or remains constant (bucket), while the startup cost α gradually increases. The start-up cost can be reduced by splitting the all-to-all into multiple stages. For example, the lower bounds for the bucket algorithm using one stage across all p x p y processors and the bucket algorithm using two stages across groups of p x and p y processors can be expressed as
To make the analysis simpler, we assume that α and β are the same for both implementations. Note that splitting the bucket implementation of the all-to-all between groups of processors improves the overall all-to-all communication time. The same cannot be said when the all-to-all collective is implemented using the MST algorithm. A similar analysis shows that the split version over groups of processors has the same execution as the non-split version. However, for real implementations of the all-to-all the performance of splitting and non-splitting the communication stages are different. We further show empirical results for the single versus multiple all-to-alls on a torus network as seen in Figure 12 . In this experiment, we perform multiple all-to-alls on 1D, 2D and 3D grids for data cubes of size 64 3 , 256 3 and 1024 3 on the torus network on the K-Computer. We carve out 1D, 2D and 3D grids from the entire network and pin the MPI ranks appropriately using the features provided by the TOFU infrastructure. We use the MPI function MPI_Comm_Splitt to split the MPI communicator. For each problem size, we increase the number of processors as shown in Table 1 . In the plots, we compare the communication of using one, two and three all-to-alls in different dimensions.
Communicating in the dimensions of a torus network is similar to communicating between processors laid out on a ring topology. However, the results shown Figure 12 outline that the all-to-all implementations does not follow the exact bucket algorithm analysis. On the contrary, it shows that for small number of processors irrespective of problem size, one single all-to-all provides faster time to solution. This is due to the fact that the network is not fully utilized, and other torus links between the processors can be used to move data. As the number of processors is increased the 1D communication scheme tappers off, since the network links start being used. For example for a cube of 1024 3 data points, one all-to-all works better when the number of processors is smaller or equal to 32, two all-to-alls gives better results for a number of processors smaller than 128, while for large number of processors three all-to-all thrive.
Discussion
We provide an analysis of the performance obtained on the KComputer for the 3D DFT using all three implementations, namely Figure 12 : The execution of one, two and three all-to-alls applied on 1D, 2D and 3D grid of processors. Each all-to-all is applied in one dimension. The left, middle and right plots show the execution time for the three scenarios for a cube of size 64 3 , 256 3 and 1024 3 , respectively. The data cube is distributed between the processors. the slab-pencil, the pencil-pencil-pencil and the volumetric decomposition. We run cubic 3D DFTs of size 64 3 , 256 3 and 1024 3 . For each configuration, we report strong scaling results, where we keep the problem size fixed and increase the number of processor in each dimension as shown in Table 1 . We measure the computation of the 3D FFT computation. Before the actual measurement, each MPI processor sleeps for one minute. Then the computation is executed for 10 runs without recording the execution time. The sleep time and the dry runs are meant to reduce noise on the network. The actual measurement executes the computation within a loop for 10 minutes. We take the average execution time for all experiments. For all experiments, we impose that the input and output to remain in elemental cyclic fashion.
A -64 3 3D DFT. Figure 1 shows the execution time for the 3D DFT of size 64 3 using all three implementations. Since data is required in elemental cyclic, the three implementations can only be scaled up to eight, 64 and 512 processors, respectively. The slabpencil decomposition outperforms the other two algorithms. In addition, the execution scales almost linearly as the number of processors is increased, whilst the other two taper off. The 64 3 is a small problem that requires in total 4MB of data. As the number of processors increases the start-up cost α becomes the bottleneck. In addition, fewer all-to-alls may provide better time to solution. This can also be seen in the left plot in Figure 12 . The execution of one all-to-all yields shorter execution time compared to execution two or three all-to-alls. Further analysis is required to understand the implementation of the all-to-all on the torus network.
B -256 3 3D DFT. The top-left plot in Figure 13 shows the execution time of the 3D DFT of size 256 3 using all three parallel algorithms. The slab-pencil decomposition performs better when the number of processors is smaller than 16. However, as the number of processors is increased, the volumetric decomposition yields faster execution time. In addition, the volumetric decomposition is able to scale the computation to 4096 compute nodes. The same trend can be seen in the middle plot in Figure 12 , when executing the multiple all-to-alls in the different dimensions of the 3D torus. There is a gap between the 3D FFT implementation and simply running the all-to-all functions. This is mainly due to the packing routines required before and after the all-to-all functions. The top-right, bottom-left and bottom-right plots in Figure 13 show the breakdown of the execution time for all three algorithms. The slabpencil and pencil-pencil-pencil decomposition are dominated by computation. Local computation can be improved using techniques like the ones showed in [31] . Moving the other to approaches, it can be seen that the percentage spent in computation decreases, while the percentage spent in network communication increases as expected. However, the percentage spent in packing the data also increases significantly. This suggests that there is still room for improvement by optimizing the packing routines within the ROTE framework or exposing the ROTE packing routines in order to fuse the packing with the computation.
C -1024 3 3D DFT. Figure 1 shows the execution time of the 3D DFT for 1024 3 data points, using all three implementations. Similar to the 256 3 case, the slab-pencil decomposition performs better for small number of processors of up to 32. However, as the number of processors increases, the volumetric decomposition yields better results. In addition, the volumetric implementation almost linearly scales to 32k compute nodes, and outperforms for 16k and 32k compute nodes both algorithms presented in [15] . There is still room for improvement, since computation and the packing routines are not fully optimized. Similar to the 256 3 case, as the number of processors is increased the percentage of the total execution time spent in packing and un-packing the data before and after the all-to-all also increases. Improving the computation and the packing routines will give better results, however they are left for future work.
CONCLUSION
In this paper, we presented a flexible framework for implementing parallel multi-dimensional DFTs on multi-dimensional processing grids. Specifically, we show that for different input problem size and different computational resource availability, different parallel multi-dimensional DFT algorithms are necessary for attaining efficient performance. In addition, we show that it is necessary to parallelize within the 1D DFT in order to scale the computation of the multi-dimensional DFT towards higher number of processing units. Despite incurring more rounds of communication, we show that for large enough data sizes, parallelizing with the 1D DFTs can improve the overall execution time over the conventional approach of simply parallelizing across multiple 1D DFTs. The top-left plot shows the scaling results for the 3D DFT using all three implementations (blue line for the slab-pencil decomposition, yellow line for the pencil-pencil-pencil decomposition and red line for the volumetric decomposition). The top-right, bottom-left and bottom-right show the execution breakdown for the three decomposition as the number of processors is increased. The blue portion represents the time spent in communicating over the network using the all-to-all collective. The yellow part represents the time spent in packing the data before and after the all-to-all. The red portion represents the time spent in the local computation.
While we showed improved performance as we scale to larger number of nodes, further improvements to performance can be attained. As shown in Figure 13 , a drawback of the current ROTE framework is that a local packing step is required to repack the data back into elemental-cyclic form after the collective communication.
The time for the repacking is significant as it could be as much as 50% of the overall execution time. One possibility for reducing this packing time is to expose the packing performed by ROTE or allow ROTE to accept data in packed form. This is something we are currently exploring. We also believe that the ROTE notation for describing data layout and processing grid can be extended to other architectures such as GPUs and multi-GPU systems. This can be achieved by replacing the MPI routines with the appropriate data movement primitives for GPUs and multi-GPUs. As the hierarchy of threads, thread blocks, streaming processors, and GPUs can be described as a multi-dimensional array, this suggest that ROTE can be used as the notation for porting the existing code to GPUs and multi-GPUs systems.
