A Flexible Framework for Parallel Multi-Dimensional DFTs by Popovici, Doru Thom et al.
A Flexible Framework for Parallel Multi-Dimensional DFTs
Doru Thom Popovici
Carnegie Mellon University
dpopovic@alumni.cmu.edu
Martin D. Schatz
Facebook
mschatz@fb.com
Franz Franchetti
Tze Meng Low
Carnegie Mellon University
{franzf,lowt}@cmu.edu
ABSTRACT
Multi-dimensional discrete Fourier transforms (DFT) are typically
decomposed into multiple 1D transforms. Hence, parallel implemen-
tations of any multi-dimensional DFT focus on parallelizing within
or across the 1D DFT. Existing DFT packages exploit the inherent
parallelism across the 1D DFTs and offer rigid frameworks, that
cannot be extended to incorporate both forms of parallelism and
various data layouts to enable some of the parallelism. However,
in the era of exascale, where systems have thousand of nodes and
intricate network topologies, flexibility and parallel efficiency are
key aspects all multi-dimensional DFT frameworks need to have
in order to map and scale the computation appropriately. In this
work, we present a flexible framework, built on the Redistribution
Operations and Tensor Expressions (ROTE) framework, that facili-
tates the development of a family of parallel multi-dimensional DFT
algorithms by 1) unifying the two parallelization schemes within
a single framework, 2) exploiting the two different parallelization
schemes to different degrees and 3) using different data layouts to
distribute the data across the compute nodes. We show the need
of a versatile framework and thus a need for a family of parallel
multi-dimensional DFT algorithms on the K-Computer, where we
show almost linear strong scaling results for problem sizes of 10243
on 32k compute nodes.
CCS CONCEPTS
• Theory of computation → Parallel algorithms; • Software
and its engineering→Application specific development en-
vironments; •Computingmethodologies→ Parallel algorithms.
KEYWORDS
3D FFTs, Distributed Systems, Performance, Scalability
ACM Reference Format:
Doru Thom Popovici, Martin D. Schatz, and Franz Franchetti, Tze Meng
Low. 2019. A Flexible Framework for Parallel Multi-Dimensional DFTs. In
Proceedings of May’19. ACM, New York, NY, USA, 11 pages. https://doi.org/
10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
The multi-dimensional discrete Fourier transform (DFT) has proven
to be an ubiquitous mathematical kernel, that is widely used in a
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
May’19, 2019
© 2019 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn
Figure 1: Strong scaling results for two 3D DFTs of size 643
and 10243 using the slab-pencil, pencil-pencil-pencil and vol-
umetric decomposition on the K-Computer. The execution
time is shown in log scale.
multitude of applications from different scientific fields like molecu-
lar dynamics [1–3], material sciences [4–7], quantummechanics [8–
11]. As most of these applications spend a significant amount of
the total execution on computing multi-dimensional DFTs, it is
vital that parallel multi-dimensional DFT be efficient as we move
into the exa-scale era. As multi-dimensional DFTs are defined in
terms of multiple 1D DFTs and, parallel implementations of the
multi-dimensional DFTs can be classified into two distinct classes.
The first class focuses on exploiting parallelism within the 1D DFTs,
while the second class exploits the inherent parallelism across mul-
tiple distinct 1D DFTs. Most state-of-the-art frameworks for com-
puting 3D DFTs like FFTW [12], P3DFFT [13], FFTE [14] opt to
parallelize across the 1D DFTs.
We illustrate the limitations of the conventional approach of
parallelizing multi-dimensional DFTs in Figure 1, where we report
the performance of three different parallelization schemes for com-
puting the 3D DFT of 643 and 10243 on the K-Computer. Notice
that for data set of size 643, parallelizing the 3D-DFT on a 1D grid
of processor such that each processors computes a 2D plane yielded
the shortest execution time. However, for the 10243 input, the same
parallelization scheme scales only to 32 processors, while better per-
formance can be attained by switching to a parallelization scheme
comprising of a 2D grid of processors where each processor com-
putes a small batch of 1D DFTs. Just as with the 1D parallelization
scheme, the parallelism with a 2D grid of processors is restricted to
at most 1024 processors. By parallelizing within the 1D DFT for the
last dimension, we obtain an even shorter execution time using a
ar
X
iv
:1
90
4.
10
11
9v
2 
 [c
s.M
S]
  2
2 D
ec
 20
19
May’19, 2019 D. T. Popovici et al.
Figure 2: The decomposition of the 1DDFTusing theCooley-
Tukey algorithm for a given problem size of 16 = 4 ·4. The al-
gorithm requires four steps, namely two batch DFTs (I) and
(IV) of size 4, a point-wise operation (II) and a transposition
(III). Data is stored in column major order.
much greater number of processors. These observations highlight
two major limitations of existing approaches: 1) there is a need for
a framework that allows one to seamlessly switch between parallel
algorithms for computing the multi-dimensional DFT as different
parallel algorithms are superior for different problem sizes, and 2)
there is a need to parallelize both across and within the 1D DFTs to
scale to larger number of compute nodes.
There is an increasing interest in scaling the parallel implemen-
tations of the multi-dimensional DFT to higher number of nodes
using higher dimensional grids as shown in Jung et. al. [15]. These
newer algorithms are a subset of the parallel DFT algorithms high-
lighted in Johnson et. al. [16], that showed that a significant number
of algorithms can arise from the parallelization both within and
across multiple 1D DFTs. However, these algorithms are largely un-
explored in practice as the existing parallel DFT frameworks make
implementing them quite difficult. Due to the rigidity of the existing
frameworks, the resulting performance is still unsatisfactory. There-
fore, in this work we present a flexible framework for employing
different parallelization schemes to compute the multi-dimensional
DFT on a multi-dimensional computation grid of processors. By
recognizing that the inputs and outputs of a multi-dimensional DFT
are multi-dimensional arrays of data, we leverage insights from the
multi-linear algebra (tensor) community to simplify the manage-
ment of communication and data layout on a multi-dimensional
grid.
Contributions. The paper makes the following contributions:
• Aflexible framework that unifies both parallelization schemes
for multi-dimensional DFTs onmulti-dimensional grids, built
on top a tensor framework.
• Demonstrate the need for a family of parallel DFT algorithms
in order to attain high performance for any given problem
size and network configurations.
2 THE DISCRETE FOURIER TRANSFORM
In this section, we briefly present the decomposition of both the 1D
and multi dimensional DFTs, expressing the algorithms in terms of
linear algebra operations and outlining the possibilities for paral-
lelization.
2.1 The 1D Discrete Fourier Transform
The 1D DFT is a matrix-vector multiplication, where given the
input x , the output y is obtained as
y = DFTn · x . (1)
The DFTn is the n × n DFT dense matrix, defined as
DFTn =
[
ωlkn
]
0≤l,k<n , (2)
with ωn = e−k
2π
n being the complex root of unity. Typically, the
computation of the DFT is implemented using the Fast Fourier
Transforms (FFT), where instead of performingO(n2) complex arith-
metic operations by doing the matrix-vector multiplication, a re-
cursive decomposition of the DFT matrix is performed to obtain an
O(n log(n)) algorithm. The most widely-know of these algorithms
is the Cooley-Tukey algorithm [17]. The Cooley-Tukey algorithm
is described as a factorization of the DFT matrix of size n when n is
a composite number such as n = n0n1.
An alternative view of themathematical description of the Cooley-
Tukey algorithm defined in [17] is as follows:
y˜ =
(
Twidn0×n1 ⊙
(
x˜ · DFTn1
) )T · DFTn0 , (3)
where x˜ and y˜ are two matrices of size n0 × n1 and n1 × n0, that
represent the input x and output y, respectively. The columns of
matrix x˜ (y˜) are n1 (n0) contiguous sub-vectors (i.e. xi , 0 ≤ i < n0 or
yi , 0 ≤ i < n1) of the input (output) vector x (y), and are obtained
by splitting x (y) into sub-vectors of size n0 (n1) as follows
x˜ =
[
x0 | x1 | . . . | xn1−1
]
.
The matrix y˜ is similarly constructed.
The reason for the formulation of the Cooley-Tukey algorithm
as described by Equation 3 is that the four stages of the algorithm
(as depicted in Figure 2) are made explicit. The first step applies the
DFTn1 on the rows of the matrix x˜ (I), assuming data is stored in
column major order. The elements within each column of either
matrix x˜ or y˜ are in consecutive locations in memory. This means
that in order to compute the first stage of the DFT, elements are
accessed at a stride of n0. The result of the first compute stage is
then point-wise multiplied with the so-called twiddle matrix (II),
which is defined as
Twidn0×n1 =
[
ωkln
]
0≤l<n0 and 0≤k<n1 , (4)
where ωn represents the complex roots of unity. The resultant
matrix is then transposed (III) and finally the second DFT of size
n0 is applied in the rows (IV). Again, the elements required for the
fourth stage of the computation are read at a stride. The second
Figure 3: The parallel implementation of the Cooley-Tukey
algorithm for the DFT16 on p = 2 processors. Data is dis-
tributed between the processors, each processor does its lo-
cal computation and then exchanges the data. This parallel
implementation requires three communication stages.
A Flexible Framework for Parallel Multi-Dimensional DFTs May’19, 2019
Figure 4: Two algorithms for computing the 3D DFT. The
slab-pencil algorithm (a) decomposes the 3D DFT into a 2D
DFT and a 1DDFT.When applied, the 2DDFT is decomposed
into the corresponding 1D transforms. The pencil-pencil-
pencil algorithm (b) decomposes the 3D DFT into three 1D
DFTs applied in the corresponding dimensions.
DFT cannot start computation until all previous stages have been
completed, because of the transposition stage.
2.2 Parallelizing the 1D DFT
Parallelizing the 1DDFT onp processors requires the parallelization
of all four compute stages. Traditionally, the input matrix x˜ (stored
in column major order) is split such that each processor receives
n1/p columns of size n0. Distributing the data using this data layout
implies that computation cannot start since each processor does not
have the necessary data points. As such, an all-to-all communication
step is needed to re-distributes the data such that each processor
receives n0/p rows of size n1. The first and second stage of the
Cooley-Tukey algorithm can then be applied locally. A second all-
to-all communication is performed to transpose (stage III) the global
data across the different processors. After this communication step,
each processor owns n1/p rows of size n0 and thus can locally
compute the last stage of the algorithm. Storing the data in the
correct order requires a third communication step. The described
computation gives the so-called six step algorithm [18] defined as
y˜ =
(
DFTn0 ·
(
Twidn0×n1 ⊙
(
DFTn1 · x˜T
))T )T
. (5)
The parallel implementation of the six step algorithm is depicted in
Figure 3, and it requires a total of three communication steps. The
astute reader will recognize that some of the communication steps
can be avoided by storing the initial data in a different layout. This
has been observed in literature, but seldom exploited in practice.
2.3 n-Dimensional DFTs
Multi-dimensional DFTs can be defined in terms of multiple 1D
DFTs and multi-dimensional DFTs of lower dimensions. For ex-
ample, Figure 4 shows two variants for computing the 3D DFT.
The first algorithm represents the so-called slab pencil decompo-
sition [19–21], where the 3D DFT is decomposed into a 2D DFT
followed by a batch of multiple 1D DFTs. The second algorithm
represents the pencil-pencil-pencil decomposition [19–21], where
the 3D DFT is decomposed into three batches of 1D DFTs, where
each 1D DFT is applied in the three dimensions.
The slab-pencil decomposition views the input (output) column
vectors x (y) as 2D matrices x˜ (y˜) of size (n0n1)×n2. Mathematically
the decomposition is expressed as
y˜ =
(
DFTn0×n1 · x˜
) · DFTn2 (6)
Data is stored in column major, hence the 2D DFT is applied on the
columns, while the 1D DFT is applied on the rows.
The pencil-pencil-pencil algorithm views the input (output) vec-
tors are reshaped into 3D cubes xˆ (yˆ) of size n0 ×n1 ×n2. The input
(output) column vector x (y) is decomposed into n2 groups of n1
subgroups of size n0. Hence, the 3D cube xˆ can be viewed as matrix
of matrices such as
xˆ =
[
x˜0 | x˜1 | . . . | x˜n2
]
, (7)
where each x˜i is the 2Dmatrix of sizen0×n1 for all values 0 ≤ i < n2.
Mathematically, the pencil-pencil-pencil algorithm is expressed as
yˆ =
[ (
DFTn0 · x˜0
) · DFTn1 | . . . | (DFTn0 · x˜n2 ) · DFTn1 ] · DFTn2
(8)
where the DFTn0 is applied in the depth dimension, the DFTn1 is
applied in the column dimension and finally theDFTn2 is applied in
the row dimension. Data is store in column major and therefore the
dimension corresponding ton0 is laid out in the fastest dimension in
memory, while the dimension corresponding to n2 is in the slowest
dimension in memory.
Figure 5: The parallel implementation of the slab-pencil al-
gorithm for computing the 3D DFT on a 1Dmesh of two pro-
cessors. Each processor applies a 2D DFT followed by a 1D.
The implementation requires two communications stages.
2.4 Parallelizing the n-Dimensional DFTs
Since the multi-dimensional DFT can be decomposed into multiple
lower-dimensional DFTs, most prevalent DFT frameworks exploit
this feature for parallelization. We illustrate this parallelization
scheme on the two algorithms for computing the 3D DFT that
were presented previously. Parallelizing the slab-pencil algorithm
is performed by viewing the processors as a 1D grid and distributing
the data cube such that each processor receives a slab of n2/p 2D
data of size n0 × n1. Each processor then locally computes multiple
2D DFTs on the slabs. Data is then exchanged across processors
so that each processor can compute (n0 × n1)/p 1D DFT of size n2.
Finally, another communication step is required to store the data
in the correct format as seen in Figure 5.
The pencil-pencil-pencil algorithm is parallelized in the two di-
mensions corresponding to n1 and n2 across a mesh of processors
of size py × pz . Each processor receives a batch of 1D pencils as
seen in Figure 6. Each processor can apply the local 1D DFTs. In
May’19, 2019 D. T. Popovici et al.
Figure 6: The parallel implementation of the pencil-pencil-pencil algorithm for computing the 3D DFTs on a 2D mesh of size
2 × 2 processors. Each processor 1D DFTs in each dimension. The implementation requires three communications stages.
order to apply the other two 1D DFTs data is first exchanged be-
tween the processors in the py dimension of the mesh, followed by
the communication between the processors on the pz dimension.
Each communication stage rotates the data cube from xyz to yzx
and zxy as seen in Figure 6. Each communication step requires
data to be packed and un-packed before and after the all-to-all
communication. Finally, in order to store the data in the correct
format a global communication between all processors is required.
However, the DFT is typically used in larger applications like con-
volutions or PDE solvers, therefore the last stage of communication
for both the slab-pencil and pencil-pencil-pencil can be dropped.
This improves reduces execution time in favor of having the data
in shuffled format.
2.5 Limitations of existing parallelization
schemes
Notice that in both cases, the amount of parallelism is limited by
the size of a particular dimension of the data. For the slab-pencil
algorithm, the maximum number of processors that can partic-
ipate in the computation is max(n0,n1,n2) as each processor is
assigned a 2D slab of data. Similarly for the pencil-pencil-pencil
case, the number of participating processors is upper bound by
max(n0n1,n1n2,n0n2) since each processor is assigned at least one
1D DFT to compute locally. The level of parallelism within the com-
munication is also limited by the size of a particular dimension of the
data. The slab-pencil distribution spreads the data on a one dimen-
sional grid. Hence, exchanging the data for computing the 1D DFT
require all processors to communicate. The pencil-pencil-pencil
case parallelizes the communication in two dimensions, which is
done between groups of processors. All frameworks that implement
the parallelization across the 1D DFT depend on efficient all-to-all
communications [22, 23]. However, as we will briefly show in the
result section, the all-to-all communication cannot efficiently scale
as the number of processors increase.
3 n-DIMENSIONAL DFTS IN ROTE
The inputs and outputs of any multi-dimensional DFT are multi-
dimensional data arrays. Multi-dimensional arrays are commonly
referred to as tensors in the multi-linear algebra community. As
such, we leverage the insights from that community for distributed
data management on a distributed grid through the use of the
Redistribution Operations and Tensor Expressions (ROTE) frame-
work [24].
3.1 The Redistribution Operations and Tensor
Expressions Framework
The Redistribution Operations and Tensor Expressions (ROTE)
framework provides a formal notation for describing the data lay-
out of tensors that are distributed on multi-dimensional meshes. By
describing the data layouts before and after a data redistribution
operation, ROTE provides the infrastructure to map the necessary
data movement into a sequence of one or more collective communi-
cation subroutines to achieve the desired data movement. In order
to make the paper self-contain, we present a high-level overview
of the notation of ROTE and refer the reader to existing documen-
tations on ROTE [24] for more details.
3.1.1 Describing Data and Processing Grid. Within ROTE, data
and processing grids are viewed as multi-dimensional arrays or
tensors. A tensor is a d-dimensional, or a d-mode, array. The order
of a tensor represents the number of dimensions or the number
of modes represented by the tensor. The dimension of each mode
refers to the length or size of a specific mode. For example, the 2× 8
matrix defined as
Aιη =
[
a0,0 a0,1 a0,2 a0,3 a0,4 a0,5 a0,6 a0,7
a1,1 a1,1 a1,2 a1,3 a1,4 a1,5 a1,6 a1,7
]
(9)
is an order-2 tensor. For consistency within the paper, we use the
terms "dimensions", and "size" to refer to the order of the tensor,
and dimension of the mode respectively. To make things clearer,
we annotate the matrix A with two superscripts (ι and η) to denote
the size of the two modes such that A is ι × η.
Multi-dimensional processing grids are defined within ROTE
as multi-dimensional arrays of processing units. Let G be the d-
dimensional grid size. Similar to the data tensors, an order-dG grid
will have d dimensions and each dimension have a specific number
of processors. For example, the 1D grid defined as
Gρν =
[
p0 p1
]
(10)
is a 1 × 2 processor mesh on which data can be distributed. ρ
represents the column dimension of size one, while ν represents
the row dimension of size two.
3.1.2 Describing Global Data Distributions. One of ROTE’s key
aspects is the way in multi-dimensional data sets are described
and distributed across multi-dimensional meshes. The description
is based on the idea of distributing each dimension of the data
over specific dimensions of the processor grid. We illustrate this
aspect in the following example. Consider distributing the two
A Flexible Framework for Parallel Multi-Dimensional DFTs May’19, 2019
Figure 7: Data layouts for distributing the 2D matrix Aιη on
the 2D grid Gρν . A. represents the elemental cyclic distribu-
tion of the row dimension across the two processors. B. rep-
resents the block distribution for the row dimension.
dimensional tensor Aιη across the 1D processor grid, Gρν such
that columns of Aιη are distributed in a cyclic fashion across the
processors (i.e. Figure 7). This is described in ROTE with as Aιη
having the distribution
Aιη [()(0)].
To understand the expression, note that there are two pairs of
round parenthesis within the square brackets. The pair of round
brackets tells us that Aιη is a two-dimensional tensor. In addition,
note that there is only a single value, 0, within the round brackets.
This number specifies that the specific dimension of the data is
distributed on the dimension of the processor grid specified by that
value. Therefore, the zero within the second parenthesis tells us that
the second dimension of A, i.e. the columns of Aιη , is distributed
across the 0-th dimension of the processor grid. The distribution is
an elemental-cyclic distribution of the columns of Aιη across the
grid.
The notation can easily be extended to describe blocked data
layouts. Blocked distributions are also an important data layout for
the computation of the multi-dimensional DFT. The key difference
between the blocked and the elemental-cyclic distributions is the
number of consecutive elements that are distributed to each of the
processors on the mesh. The elemental-cyclic distribution assumes
the data being distributed at the granularity of one element, while
the blocked distribution assumes that each processor receives b
consecutive elements. Expressing the blocked distribution of the
matrix A in ROTE notation can be done as follows
Aιη0η1 [()()(0)], (11)
where the column dimension of size η = η0η1 is blocked or tiled by
the η0 size.
Blocking or tiling the data with a block of size b can be inter-
preted as adding an another dimension to the data set as shown
in [25]. In the previous example, the dimension described by η is
split into two, namely η0 and η1 and η0 is set to the block size
(in this case, 4). The dimension corresponding to η1 is distributed
across the two processors on the grid as denoted by the zero value
within the third pair of parenthesis. Each processor receives a block
of η0 elements in a cyclic fashion.
Figure 8: Transposing the matrix Aιη [()(0)]. The transpo-
sition is broken into a block transposition and multiple
within-block transpositions. The operations from matrix
Aιη [()(0)] to matrix Bιη [(0)()] represents the block transposi-
tion over the network (packing, all-to-all, un-packing). The
last stage Cηι [()(0)] does the local transpositions.
3.1.3 From Distribution to Communication. ROTE implements data
communication over the computation grid by specifying the origi-
nal and final data layout. For example, the following expression
Bιη [(0)()] = Aιη [()(0)] (12)
describes the original tensor A, where the columns of the matrix
A are distributed element cyclic across the processor grid, being
mapped to the new tensor B, where the rows of the tensor are
distributed in an elemental-cyclic fashion. ROTE maps the above
description to the all-to-all collective communication. Given the
before and after data distributions, the infrastructure within ROTE
determines the necessary packing, collective communication and
un-packing required. The detailed steps that ROTE takes in order
to perform the redistribution of the tensor A from the initial distri-
bution to the final distribution expressed as tensor B are captured
in Figure 8.
3.2 Parallel 1D DFT using ROTE
In this section, we describe howROTE can be used as our framework
for computing the 1DDFT using the Cooley-Tukey algorithm. Recall
that the Cooley-Tukey algorithm decompose the computation of the
May’19, 2019 D. T. Popovici et al.
Figure 9: The parallel implementation of the 3D DFT using the volumetric decomposition. The data is distributed elemental
cyclic in all three dimensions across a 3Dmesh of processors of size 2×2×2. The implementation requires three communication
steps and the elemental cyclic distribution is preserved between the computed stages.
1D DFT of size n = n0n1 into four stages involving computations
with two dimensional tensors (i.e. matrices). Using the notation
from ROTE, we can re-express Equation 3 as
Y˜η1η0 =
((
Twid
ι0η1
n0×n1 ⊙
(
X˜ ι0ι1 · DFT ι1η1n1
))T ) · DFT ι0η0n0 . (13)
ι0ι1 and η1η0 represent the size of the two-dimensional tensors of
the input and output respectively.
In the following paragraphs, we derive the 1DDFT parallelization
across p processors using ROTE’s notation. Recall that the input
matrix is distributed across the processors (in a one dimensional
grid) such that each processor has η1/p columns. However under
this data layout, the computation requires an initial redistribution
of the input data such that each processor holds the row of the
input matrix X˜ ι0ι1 . Naturally, this suggest that the input data (to
avoid the redistribution at the start) has to be distributed as
X˜ ι0ι1 [(0)()],
where the rows are distributed in elemental cyclic fashion. Notice
that while ι0 × ι1 must be equal to the input size, the Cooley-Tukey
algorithm places no other constraints on what ι0 and ι1 must be.
However, it is natural to make ι1 = p since that will ensure that the
rows are cyclic-ly distributed across all p processors.
The computation stage of the parallel 6-steps algorithm is then
able to perform the local DFT computation, followed by the local
point-wise multiplication with the twiddle matrix Twid0. Since the
computation is an element-wise complex multiplication it follows
that the distribution of the twiddle matrix Twid0 must be the same
with that of the input X˜ , i.e.
Twid
ι0η1
0 [(0)()].
After the first two steps, the intermediate result in the temporary
array T is as follows
T
ι0η1
0 [(0)()] =
(
Twid
ι0η1
n0×n1 [(0)()] ⊙
(
X˜ ι0ι1 [(0)()] · DFT ι1η1n1
))
,
(14)
where T ι0η10 [(0)()] has the same distribution as the input X˜ ι0ι1 .
The 1D DFT requires a transposition of the array T ι0η10 . Figure 8
shows the steps required to perform the transposition. First the
block transposition is applied such that
T
ι0η1
1 [()(0)] = T
ι0η1
0 [(0)()]. (15)
This operation is mapped to the all-to-all collective communication.
Each processor then transposes the data locally such that
T
η1ι0
2 [(0)()] = T
ι0η1
1 [()(0)] (16)
Finally, each processor applies its local computation of the last stage
of the Cooley Tukey algorithm. The output Y is computed as
Y˜η1η0 [(0)()] = Tη1ι02 [(0)()] · DFT
ι0η0
n0 . (17)
The computation between the two matrices is paired on the di-
mension corresponding to ι0. Note that X˜ ι0ι1 [(0)()] and Y˜η1η0 [(0)()]
have the same distribution. The first local computation does the
DFT of size n1 followed by the twiddle scaling, while the second
local computation does a transposition and the DFT of size n0. An
instantiation of the parallel algorithm in the ROTE framework is
shown in Figure 10.
Notice that the besides the advantage of removing redundant
communication with the elemental cyclic distribution, the com-
putation of the DFT using ROTE has the property that the data
remains in elemental cyclic distribution after the computation of the
1D DFT [26]. In addition, for a problem size n = n0n1, where p |n0
and p |n1, the 1D DFT algorithm requires only one communication
stage and not three as in the case of the blocked distribution. If one
of the conditions is not satisfied then either the mesh is modified
or extra communication is required as shown in Inda et. al [26]. We
leave the scenario when the above conditions are not satisfied as
future work.
4 EXPERIMENTAL RESULTS
In this section, we describe the experimental setup and the re-
sults obtained on the K-Computer, outlining the importance of
having a flexible framework that allows one to chose the appro-
priate algorithm for a given problem size. We begin by presenting
the characteristics of the system and then briefly outline some of
the implementation details for parallelizing the computation using
MPI and OpenMP. We emphasize the simplicity of expressing the
computation within the framework. Lastly, we analyze the strong
scaling of the 3D DFT for 643, 2563 and 10243 using all three de-
composition (slab-pencil, pencil-pencil-pencil and volumetric). We
give a breakdown of the performance. We show that there is no
one decomposition that gives the shortest time to solution.
A Flexible Framework for Parallel Multi-Dimensional DFTs May’19, 2019
// create the processing grid
ObjShape gridShape (1);
gridShape [0] = 4;
const Grid g(MPI_COMM_WORLD , gridShape);
// create the shape and size of the data tensors
ObjShape shapeX (2);
shapeX [0] = 16;
shapeX [1] = 16;
// input and output tensors
DistTensor <complex <double >> A(shapeX , "[(0) ,()]", g);
DistTensor <complex <double >> B(shapeX , "[() ,(0)]", g);
fftw_complex *ABuff = (fftw_complex *) A.Buffer ();
fftw_complex *BBuff = (fftw_complex *) B.Buffer ();
// batch DFT_16 in the rows
fftw_execute_dft(batch_4_dft_16 , ABuff , ABuff);
// point -wise multiplication with twiddle factors
for(int i = 0; i != 64; ++i) {
ABuff[i] = ABuff[i] * twidd_local[i];
}
// redistribution
B.AllToAllRedistFrom(A, 0, 0);
// transposition and batch DFT_16 in the rows
fftw_execute_dft(transpose_batch_4_dft_16 , BBuff + ,
BBuff);
Figure 10: Code snippet for computing a DFT of size 256. The
1DDFT is parallelized across 4 processors on a 1D grid. Com-
putation is done using FFTW. The grid configuration, tensor
definition and data distribution is done using ROTE.
4.1 System Configuration
The Riken AICS K-Computer consists of approximately 80, 000
SPARC64 VIIIfx processors. Each compute node has a single CPU
with 16 GB of main memory and a total bandwidth of 64 GB/s
(main memory bandwidth). Each SPARC64 VIIIfx has 8 cores with
one thread per core running at 2.0 Ghz and 6 MB of L2 cache.
Each core can do 128-bit single instruction multiple data (SIMD)
instructions, giving a peak performance for double precision fused
multiply-add instructions of 128 GFlops. Each compute node is
connected with a 6Dmesh/torus network (TOFU interconnect). The
TOFU interconnect provides a logical 3D torus network for each
job. Each node has 10 links with a bandwidth of 10GB/s full-duplex
and the network allows for multiple pathways to communicate
data. The infrastructure allows programs to be adapted to one, two
and three dimensional torus networks. The maximum number of
processors in one dimension is 54, while the maximum torus size is
48 × 54 × 32 [27].
The K-Computer provides its own customize software, rang-
ing from the compilers and the libraries for Fourier transforms
and linear algebra operations to the infrastructure for OpenMP
and MPI. The current implementation of the DFT framework pre-
sented in this work uses FFTW for the local DFT computation and
OpenMP to parallelize the computation across the threads. The
DFT framework is built on top of ROTE, which allows one to easily
Figure 11: Example of applying thread level parallelism on
the local computation using OpenMP. Each processor re-
ceives a cube of data points, that is divided between the
threads. Each thread takes ownership on that chunk of data
and applies its computation and data transformations. In
this example, each thread applies computation in two di-
mensions followed by a rotation across the main diagonal
of the data cube.
describe the data distribution and the communication between the
compute nodes. Underneath, ROTE uses MPI to do the collective
communication between the compute nodes. The entire framework
is compiled with the Fujitsu custom compiler. All optimization are
set to -O2. In addition, the -fopenmp and -lfftw3 flags are enabled
for compilation.
4.2 OpenMP + MPI Parallelism
The framework uses OpenMP [28] parallelism for the local com-
putation and MPI [29] for the distributed computation. The MPI
is hidden away by the ROTE infrastructure. As seen in Figure 10,
where we parallelize the 1D DFT of size 256 on four processors,
ROTE allows users to specify the grid, the tensors, the distribution
and the communication. Writing the communication is straight-
forward. The construction of the processor grid g requires the
MPI_COMM_WORLD global communicator and the shape of the grip
gridShape. Based on the shape of the grid, the global communi-
cator is split accordingly. The construction of the tensors A and B
requires the shape of the data tensor, the distribution and the grid
on which the data is distributed on. For example, order-2 tensor A
is is distributed on the first dimension [(0)()], while order-2 tensor
B is distributed on the second dimension [()(0)]. Each tensor object
provides the necessary functionalities to specify the communica-
tion. For the 1D DFT the AlltoAllRedistFrom is used to specify
the all-to-all communication described in Figure 8.
Parallelizing the computation using OpenMP is made explicit.
We uses #pragma omp parallel region to specify and create
a pool of threads. Based on the thread id (omp_get_thread_num),
each thread will compute on its own chunk of data as shown in
Figure 11. The figure outlines that once the computation is complete,
threads rotate the data. This step is required so that the subsequent
MPI all-to-all function call does not incur penalties from accessing
data in local memory. We use the clause #pragma omp barrier to
synchronize the threads and the clause #pragma omp master to
specify that only the master thread can perform the communication.
Previously, this clause was used around the redistribution function
AlltoAllRedistFrom and within the redistribution function other
May’19, 2019 D. T. Popovici et al.
# of processors 1D Grid 2D Grid 3D Grid
2 (2) - -
4 (4) (2,2) -
8 (8) (4,2) (2,2,2)
16 (16) (4,4) (4,2,2)
32 (32) (8,2) (4,4,2)
64 - (8,8) (4,4,4)
128 - (16,8) (8,4,4)
256 - (16,16) (8,8,4)
512 - (32,16) (8,8,8)
1k - (32,32) (16,8,8)
2k - - (16,16,8)
4k - - (16,16,16)
8k - - (32,16,16)
16k - - (32,32,16)
32k - - (32,32,32)
Table 1: Table showing the different configurations for the
1D, 2D and 3D grids. The maximum number for each grid
represents the maximum size that can be created for the
given grid taking into account that the maximum torus size
allowed on the K-Computer is 48 × 54 × 32 [27].
threads were created to perform the packing before and after the
MPI call. However, modifications to the original ROTE framework
were done and the #pragma omp_master was moved around the
MPI_Alltoall, within ROTE’s redistribution function. The overall
benefit is that a single pool of threads is created up front and those
threads are used for both performing the computation and the
packing/un-packing of the data. The K-Computer provides eight
threads in total, and in the current implementation we use all of
them.
4.3 Results and Discussion
In the following paragraphs, we focus on the performance obtained
by our 3D DFT framework on the K-Computer. We compare the
execution time of cubic 3D DFTs of size 643, 2563 and 10243, using
all three algorithms, slab-pencil, pencil-pencil and volumetric de-
composition. For all of the configurations, we report strong scaling
results, where we keep the problem size fixed and increase the
number of processor in each dimension as shown in Table 1. We
focus on power of two-sized processing grids, where the number of
processor does not exceed the number of processor specified in [27]
in each dimension. Wemeasure the entire 3D DFT computation end-
to-end. The actual measurement executes the computation within
a loop for 10 minutes. We then take the average execution time
for each experiment. Before the measurement, each MPI process
sleeps for approximately one minute. Then 10 dry runs of the 3D
DFT are executed without recording the actual execution time. The
two steps are meant to reduce noise on the network.
A - 643 Problem Size. We first present the strong scaling re-
sults of the 3D DFT when the problem size is small, i.e. 643 complex
data points. As seen in Figure 1, the slab-pencil algorithm for this
problem size gives the shortest execution time and it also scales
to eight compute nodes (64 cores). For this specific problem size
Figure 12: The execution of one, two and three all-to-alls ap-
plied on 1D, 2D and 3D grid of processors for the small prob-
lem size of 643. Each all-to-all is applied in one dimension.
The data cube is distributed between the processors.
the slab-pencil algorithm cannot be execute on more than eight
processors, since we impose the input and output to be in elemental
cyclic-ly distributed. Changing the algorithm from slab-pencil to
pencil-pencil or volumetric to expose more parallelism, does not
provide improvements. Execution suffers slowdowns and in addi-
tion the execution of both algorithms tappers off as the number
of processors increases. This behaviour is directly related to the
all-to-all communication and the number of all-to-alls.
Depending on the parallel algorithm (slab-pencil, pencil-pencil
or volumetric), the 3D DFT may require one or more all-to-all
communication steps. Therefore, the scaling behaviour of the 3D
DFT is highly dependent on the scaling behaviour of the all-to-all
communications. Figure 12 reports the strong scaling of the all-to-all
for the same problems size using one/two/three all-to-alls, similar to
the three implementations of the 3D DFT. The all-to-alls codes are
obtained by removing the local computation and packing. Note that
the execution times of the three scenarios exhibit the same trend as
the 3D DFT execution times. The slab-pencil approach achieves the
shortest execution time and scales to more than eight processors.
The pencil-pencil and volumetric variants of the data movement
experience slowdowns for a small number of processors and show
diminishing returns as the number of processors increased.
The slowdowns and the diminishing returns can be understood
by analyzing the lower bounds of the collective communications.
Since the K-Computer has its own all-to-all collectives, we dis-
cuss the key aspects of the communication by analyzing the lower
bounds outlined in Bruck et. al [22]. The paper presents the lower
bounds for the all-to-all communication as follows
logk+1(p)α + logk+1(p)
n
(k + 1)p β , (18)⌈p − 1
k
⌉
α +
⌈p − 1
k
⌉ n
p2
β , (19)
where p represents the number of processors, n represents the total
amount of data across all p and k represents the number of ports
a compute node can send and receive. α represents the start-up
A Flexible Framework for Parallel Multi-Dimensional DFTs May’19, 2019
Figure 13: The execution of one, two and three all-to-alls applied on 1D, 2D and 3D grid of processors. Each all-to-all is applied
in one dimension. The left plot shows the strong scaling of a three dimensional cube of size 2563, while the right plot outlines
the scaling of a cube of size 10243, respectively. The data cube is distributed between the processors.
cost and β represents the inverse of the bandwidth on each of
the links. Equation 18 represents the lower bound of the all-to-all
that minimizes the number of transfers between the nodes, while
Equation 19 minimizes the amount of data that is being transferred
in each step. The first minimizes the α term, while the second
minimizes the β term. No one algorithm minimizes both at the
same time.
The all-to-all communication becomes latency bound as the num-
ber of processors is increased. Not that the latency term logk+1(p)α
(
⌈p−1
k
⌉
α ) increases while the bandwidth term logk+1(p) n(k+1)p β
(
⌈p−1
k
⌉ n
p2 β) decreases as the number of processors p is increased.
Therefore, if p is sufficiently large such that
p ≥ nβ(k + 1)α (20)
for Equation 18 and
p2 ≥ nβ
α
(21)
for Equation 19, the execution time of the all-to-all becomes latency
dominated. In addition, as the number of processors p is increased
above those threshold the execution time of the all-to-all should
increase. This can be observed in the empirical results outlined in
Figure 12. This provides an explanation to why the DFT for the a
size of 643 complex points plateaus as the number of processor is
increased.
The number of ports k lowers the bounds of the all-to-all com-
munication, if the node has sufficient ports/connections to satisfy
the k value. The k value provides a possible reason why multiple
all-to-alls are slower compared to a single all-to-all for the same
number of processors p. For example, the k number of ports can be
equal to one, two or three when the slab-pencil algorithm is applied
on p = 4 compute nodes. However, the pencil-pencil algorithm
with two all-to-alls on the same amount of processors p = 4 has k
equal to one. Evaluating the above equations, the slab-pencil algo-
rithm where one all-to-all is used can provide better execution time
compared to the other two. However, this effect cannot be seen
as p increases, due to the latency aspect that was explained above.
While, empirically results as seen in Figure 12 and Figure 13 seem to
back up this aspect, further investigation is needed, especially since
the all-to-alls are treated as black boxes. We use the MPI collectives
offered by the K-Computer infrastructure.
B - 2563 3D DFT. In Figure 14, we analyze the strong scaling
behaviour of the 3D DFT for a medium size problem of 2563 com-
plex data points. Compared to the small size, the amount of data
increases by 64 times, which suggests the problem size can be scaled
to larger number of compute nodes. The top left plot outlines the
strong scaling of the 3D DFT. Note that when the number of proces-
sors is small (≤ 16), the slab-pencil decomposition outperforms the
other two implementations. However as the number of processors
increases, the other two algorithms thrive. Ultimately, the volu-
metric decomposition offers the best scaling to 4096 cores/32768
threads. Similar behaviour can be seen in the data movement in
the left plot in Figure 13, where we remove the local computation
and packing. Note that as the number of processors is increased,
the scaling of the slab-pencil all-to-all diminishes and catches up
to the pencil-pencil approach where two all-to-alls are used. Subse-
quently, the volumetric all-to-all where three all-to-alls are needed
catches up and scales to the corresponding number of processors.
This emphasizes the need for a more detailed analysis on how the
all-to-all is implemented and when to decide to change algorithms
for both the all-to-all and the 3D DFT, aspect which we leave for
future work.
Data movement across the network is one aspect of the 3D
DFT computation, however the local computation and packing also
plays a role. The top-right, bottom-left and bottom-right plots in
Figure 14 show the breakdown of the execution time for all three
algorithms, slab-pencil, pencil-pencil and volumetric, respectively.
May’19, 2019 D. T. Popovici et al.
Figure 14: Strong scaling results for the 3D DFT of size 2563. The top-left plot shows the scaling results for the 3D DFT using all
three implementations (blue line for the slab-pencil decomposition, yellow line for the pencil-pencil-pencil decomposition
and red line for the volumetric decomposition). The top-right, bottom-left and bottom-right show the execution breakdown
for the three decomposition as the number of processors is increased. The blue portion represents the time spent in network
communication, while the yellow and red represent the packing and computation times, respectively.
The slab-pencil and pencil-pencil-pencil decomposition show a
significant percentage of the execution time to be sent in the local
DFT computation. The local computation is basically batches of
1D and 2D DFTs followed by point-wise multiplications that are
done outside the DFT. The point-wise computation is not merged
since FFTW does not provide such features. For small number of
nodes, the amount of local data may reside in local DRAM. As
shown in [31], the DFT is a latency bound computation, therefore
applying the double buffering technique on the computation may
reduce local computation when the number of processors is small.
However, as the number of processors increases the amount of
computation is significantly reduced. Note that the data packing
becomes a significant part of the overall computation. This suggests
that there is still room for improvement by optimizing the packing
routineswithin the ROTE framework or exposing the ROTE packing
routines in order to fuse the packing with the computation.
C - 10243 3D DFT. Finally, we report strong scaling results for a
large problem size of 10243 complex data points. Figure 1 shows the
strong scaling behaviour of the entire 3D DFT for this problem size.
Similar to the 2563 case, the slab-pencil decomposition performs
better for small number of processors of up to 32. However, as
the number of processors increases, the volumetric decomposition
yields better results. In addition, the volumetric implementation
almost linearly scales to 32, 768 compute nodes (2626, 144 cores),
and outperforms for 16, 768 and 32, 768 compute nodes both algo-
rithms presented in [15]. Note that the pencil-pencil decomposition
follows the volumetric decomposition until scaling the problem to
512 compute nodes. Similar to the 2563 case, computation can still
be improved for the case where the number of processors is small.
The same cannot be said as the number of processors is increased
beyond 4096 compute nodes. Computation and packing becomes
insignificant compare to the data movement. The execution of the
A Flexible Framework for Parallel Multi-Dimensional DFTs May’19, 2019
3D DFT for large number of nodes follows within 10% the execution
of the all-to-all communication depicted in Figure 13.
5 CONCLUSION
In this paper, we presented a flexible framework for implementing
parallel multi-dimensional DFTs on multi-dimensional processing
grids. Specifically, we show that for different input problem size
and different computational resource availability, different paral-
lel multi-dimensional DFT algorithms are necessary for attaining
efficient performance. In addition, we show that it is necessary to
parallelize within the 1D DFT in order to scale the computation of
the multi-dimensional DFT towards higher number of processing
units. Despite incurring more rounds of communication, we show
that for large enough data sizes, parallelizing with the 1D DFTs can
improve the overall execution time over the conventional approach
of simply parallelizing across multiple 1D DFTs.
While we showed improved performance as we scale to larger
number of nodes, further improvements to performance can be
attained. As shown in Figure 14, a drawback of the current ROTE
framework is that a local packing step is required to repack the data
back into elemental-cyclic form after the collective communication.
The time for the repacking is significant as it could be as much
as 50% of the overall execution time. One possibility for reducing
this packing time is to expose the packing performed by ROTE or
allow ROTE to accept data in packed form. This is something we
are currently exploring. We also believe that the ROTE notation for
describing data layout and processing grid can be extended to other
architectures such as GPUs and multi-GPU systems. This can be
achieved by replacing the MPI routines with the appropriate data
movement primitives for GPUs and multi-GPUs. As the hierarchy
of threads, thread blocks, streaming processors, and GPUs can be
described as a multi-dimensional array, this suggest that ROTE can
be used as the notation for porting the existing code to GPUs and
multi-GPUs systems.
ACKNOWLEDGMENTS
The authors would like to thank Dr. Imamura Toshiyuki and the
Advanced Institute for Computational Science at RIKEN for grant-
ing access to the K-computer. This work was sponsored partly
by the DARPA PERFECT and BRASS programs under agreements
HR0011-13-2-0007 and FA8750-16-2-003, and NSF through award
ACI 1550486. The content, views and conclusions presented in this
document are those of the author(s) and do not necessarily reflect
the position or the policy of the sponsoring agencies.
REFERENCES
[1] Steve Plimpton, Roy Pollock, and Mark Stevens. Particle-mesh ewald and rrespa
for parallel molecular dynamics simulations. In In Proceedings of the Eighth SIAM
Conference on Parallel Processing for Scientific Computing, 1997.
[2] Timothy W Sirk, Stan Moore, and Eugene F Brown. Characteristics of thermal
conductivity in classical water models. The Journal of chemical physics, 138(6):064505,
2013.
[3] Steven J Plimpton and Aidan P Thompson. Computational aspects of many-body
potentials. MRS bulletin, 37(5):513–521, 2012.
[4] Choong-Seock Chang, Seunghoe Ku, and H Weitzner. Numerical study of neoclas-
sical plasma pedestal in a tokamak geometry. Physics of Plasmas, 11(5):2649–2667,
2004.
[5] J-L Vay, A Almgren, J Bell, L Ge, DP Grote, M Hogan, O Kononenko, R Lehe,
A Myers, C Ng, et al. Warp-X: A new exascale computing platform for beam–
plasma simulations. Nuclear Instruments and Methods in Physics Research Section A:
Accelerators, Spectrometers, Detectors and Associated Equipment, 2018.
[6] Ann S Almgren, John B Bell, Mike J Lijewski, Zarija Lukić, and Ethan Van Andel.
Nyx: A massively parallel amr code for computational cosmology. The Astrophysical
Journal, 765(1):39, 2013.
[7] Ricardo A Lebensohn, Anand K Kanjarla, and Philip Eisenlohr. An elasto-
viscoplastic formulation based on fast fourier transforms for the prediction of mi-
cromechanical fields in polycrystalline materials. International Journal of Plasticity,
32:59–69, 2012.
[8] Ricky A Kendall, Edoardo Aprà, David E Bernholdt, Eric J Bylaska, Michel Dupuis,
George I Fann, Robert J Harrison, Jialin Ju, Jeffrey A Nichols, Jarek Nieplocha, et al.
High performance computational chemistry: An overview of NWChem a distributed
parallel application. Computer Physics Communications, 128(1-2):260–283, 2000.
[9] Marat Valiev, Eric J Bylaska, Niranjan Govind, Karol Kowalski, Tjerk P Straatsma,
Hubertus JJ Van Dam, Dunyou Wang, Jarek Nieplocha, Edoardo Apra, Theresa L
Windus, et al. NWChem: a comprehensive and scalable open-source solution for large
scale molecular simulations. Computer Physics Communications, 181(9):1477–1489,
2010.
[10] TP Straatsma, EJ Bylaska, HJJ van Dam, N Govind, WA de Jong, K Kowalski,
and M Valiev. Advances in scalable computational chemistry: NWChem. In Annual
Reports in Computational Chemistry, volume 7, pages 151–177. Elsevier, 2011.
[11] A Canning, LW Wang, A Williamson, and A Zunger. Parallel empirical pseu-
dopotential electronic structure calculations for million atom systems. Journal of
Computational Physics, 160(1):29–41, 2000.
[12] Matteo Frigo and Steven G. Johnson. The design and implementation of FFTW3.
Proc. of the IEEE, special issue on "Program Generation, Optimization, and Adaptation",
93(2):216–231, 2005.
[13] Dmitry Pekurovsky. P3dfft: A Framework for Parallel Computations of Fourier
Transforms in Three Dimensions. SIAM Journal on Scientific Computing, 34(4):C192–
C209, 2012.
[14] Daisuke Takahashi. An implementation of parallel 3-d FFT with 2-d decomposi-
tion on a massively parallel cluster of multi-core processors. In Parallel Processing
and Applied Mathematics, 8th International Conference, PPAM 2009, Wroclaw, Poland,
September 13-16, 2009. Revised Selected Papers, Part I, pages 606–614, 2009.
[15] Jaewoon Jung, Chigusa Kobayashi, Toshiyuki Imamura, and Yuji Sugita. Par-
allel implementation of 3d fft with volumetric decomposition schemes for efficient
molecular dynamics simulations. Computer Physics Communications, 200:57–65, 2016.
[16] Jeremy Johnson and Xu Xu. A recursive implementation of the dimensionless fft.
In Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP’03). 2003 IEEE
International Conference on, volume 2, pages II–649. IEEE, 2003.
[17] James W Cooley and John W Tukey. An algorithm for the machine calculation
of complex fourier series. Mathematics of computation, 19(90):297–301, 1965.
[18] Charles Van Loan. Computational Frameworks for the Fast Fourier Transform.
Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992.
[19] Paul N Swarztrauber. Multiprocessor ffts. Parallel computing, 5(1-2):197–210,
1987.
[20] Ian T Foster and Patrick H Worley. Parallel algorithms for the spectral transform
method. SIAM Journal on Scientific Computing, 18(3):806–837, 1997.
[21] Roland A Sweet, William L Briggs, Suely Oliveira, Jules L Porsche, and Tom Turn-
bull. Ffts and three-dimensional poisson solvers for hypercubes. Parallel computing,
17(2-3):121–131, 1991.
[22] Jehoshua Bruck, Ching-Tien Ho, Shlomo Kipnis, Eli Upfal, and Derrick Weath-
ersby. Efficient algorithms for all-to-all communications inmultiport message-passing
systems. IEEE Transactions on parallel and distributed systems, 8(11):1143–1156, 1997.
[23] Bogdan Prisacari, German Rodriguez, Cyriel Minkenberg, and Torsten Hoefler.
Bandwidth-optimal all-to-all exchanges in fat tree networks. In Proceedings of the
27th international ACM conference on International conference on supercomputing,
pages 139–148. ACM, 2013.
[24] Martin Daniel Schatz et al. Distributed tensor computations: formalizing distribu-
tions, redistributions, and algorithm derivation. PhD thesis, 2015.
[25] Uday Bondhugula, Albert Hartono, Jagannathan Ramanujam, and Ponnuswamy
Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. In
Acm Sigplan Notices, volume 43, pages 101–113. ACM, 2008.
[26] Márcia A Inda and Rob H Bisseling. A simple and efficient parallel fft algorithm
using the bsp model. Parallel Computing, 27(14):1847–1878, 2001.
[27] Riken aics. http://www.aics.riken.jp/en/.
[28] OpenMP Architecture Review Board. OpenMP application program interface
version 4.0, May 2018.
[29] Marc Snir, Steve W. Otto, Steven Huss-Lederman, David W. Walker, and Jack
Dongarra. MPI: The Complete Reference. The MIT Press, 1996.
[30] Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert Van De Geijn. Collec-
tive communication: theory, practice, and experience. Concurrency and Computation:
Practice and Experience, 19(13):1749–1783, 2007.
[31] Thom Popovici, Tze-Meng Low, and Franz Franchetti. Large bandwidth-efficient
FFTs on multicore and multi-socket systems. In IEEE International Parallel and
Distributed Processing Symposium (IPDPS). IEEE, 2018.
