P3DFFT: a framework for parallel computations of Fourier transforms in three dimensions
Fourier and related transforms are a family of algorithms widely employed in
diverse areas of computational science, yet notoriously difficult to scale on
high-performance parallel computers with a large number of processing elements
(cores). This paper introduces a popular software package called P3DFFT
implementing Fast Fourier Transforms (FFT) in three dimensions (3D) in a highly
efficient and scalable way. It overcomes a well-known scalability bottleneck of
3D FFT implementations by using two-dimensional domain decomposition. Designed
for portable performance, P3DFFT achieves excellent timings for a number of
systems and problem sizes. On a Cray XT5 system, P3DFFT attains 45% efficiency in
weak scaling from 128 to 65,536 computational cores. Library features include
Fourier and Chebyshev transforms, Fortran and C interfaces, in- and
out-of-place transforms, uneven data grids, single and double precision. P3DFFT
is available as open source at http://code.google.com/p/p3dfft/. This paper
discusses P3DFFT implementation and performance in a way that helps guide the
user in making optimal choices for parameters of their runs.
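
As a rough illustration of why a two-dimensional decomposition helps, the sketch below compares the largest usable process count for a one-dimensional (slab) split with that for a two-dimensional (pencil) split of an N^3 grid. It also computes the local block owned by one process when the grid does not divide evenly. It assumes the bottleneck mentioned above is the usual slab limit of at most N processes. All function names are hypothetical and are not part of the P3DFFT API.

```python
def max_ranks_slab(n: int) -> int:
    """Slab (1D) split of an n^3 grid: each rank needs at least one full plane."""
    return n


def max_ranks_pencil(n: int) -> int:
    """Pencil (2D) split of an n^3 grid: each rank needs at least one full line."""
    return n * n


def split(n: int, parts: int, index: int) -> int:
    """Points owned by block `index` when n points are divided into `parts`
    nearly equal blocks (the first n % parts blocks get one extra point)."""
    base, extra = divmod(n, parts)
    return base + (1 if index < extra else 0)


def pencil_local_shape(n: int, p_rows: int, p_cols: int, row: int, col: int):
    """Local block of the rank at (row, col) in a p_rows x p_cols process grid.
    The first axis is kept whole so 1D transforms along it need no communication."""
    return (n, split(n, p_rows, row), split(n, p_cols, col))


if __name__ == "__main__":
    n = 4096
    print("slab limit:  ", max_ranks_slab(n))
    print("pencil limit:", max_ranks_pencil(n))
    print("local block: ", pencil_local_shape(10, 3, 2, 0, 1))
```

Under these assumptions a 4096^3 grid could use at most 4096 processes with slabs, while pencils raise the ceiling to 4096^2. That is the kind of headroom a run at tens of thousands of cores needs.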
A Flexible Framework for Parallel Multi-Dimensional DFTs
Multi-dimensional discrete Fourier transforms (DFT) are typically decomposed
into multiple 1D transforms. Hence, parallel implementations of any
multi-dimensional DFT focus on parallelizing within or across the 1D DFTs.
Existing DFT packages exploit the inherent parallelism across the 1D DFTs and
offer rigid frameworks that cannot be extended to incorporate both forms of
parallelism and various data layouts to enable some of the parallelism.
However, in the era of exascale, where systems have thousands of nodes and
intricate network topologies, flexibility and parallel efficiency are key
aspects all multi-dimensional DFT frameworks need to have in order to map and
scale the computation appropriately. In this work, we present a flexible
framework, built on the Redistribution Operations and Tensor Expressions (ROTE)
framework, that facilitates the development of a family of parallel
multi-dimensional DFT algorithms by 1) unifying the two parallelization schemes
within a single framework, 2) exploiting the two different parallelization
schemes to different degrees and 3) using different data layouts to distribute
the data across the compute nodes. We demonstrate the need for a versatile
framework and thus a need for a family of parallel multi-dimensional DFT
algorithms on the K-Computer, where we show almost linear strong scaling
results for problem sizes of 1024^3 on 32k compute nodes.
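
The decomposition of a multi-dimensional DFT into 1D transforms, as described in this abstract, can be checked numerically. The following is a minimal serial NumPy sketch. It is not the ROTE-based framework, and the function name `fft3d_by_axes` is a hypothetical helper. It applies 1D FFTs along each axis in turn and compares the result with NumPy's built-in N-dimensional transform.

```python
import numpy as np


def fft3d_by_axes(x: np.ndarray) -> np.ndarray:
    """Compute a 3D DFT as three sweeps of independent 1D DFTs.

    Within each sweep, the 1D transforms along the chosen axis are
    independent of one another, which is the parallelism "across" 1D DFTs.
    """
    y = np.fft.fft(x, axis=2)  # 1D DFTs along the last axis
    y = np.fft.fft(y, axis=1)  # then along the middle axis
    y = np.fft.fft(y, axis=0)  # then along the first axis
    return y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 6, 4)) + 1j * rng.standard_normal((8, 6, 4))
    # The axis-by-axis result should agree with the direct 3D transform.
    print(np.allclose(fft3d_by_axes(x), np.fft.fftn(x)))
```

In a distributed setting, the data redistribution between sweeps is typically where communication cost concentrates. This is why the choice of data layout matters.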
FFT, FMM, or Multigrid? A comparative Study of State-Of-the-Art Poisson Solvers for Uniform and Nonuniform Grids in the Unit Cube
In this work, we benchmark and discuss the performance of scalable
methods for the Poisson problem that are widely used in practice: the fast
Fourier transform (FFT), the fast multipole method (FMM), the geometric
multigrid (GMG), and algebraic multigrid (AMG). In total we compare five
different codes, three of which are developed in our group. Our FFT, GMG, and
FMM are parallel solvers that use high-order approximation schemes for Poisson
problems with continuous forcing functions (the source or right-hand side). We
examine and report results for weak scaling, strong scaling, and time to
solution for uniform and highly refined grids. We present results on the
Stampede system at the Texas Advanced Computing Center and on the Titan system
at the Oak Ridge National Laboratory. In our largest test case, we solved a
problem with 600 billion unknowns on 229,379 cores of Titan. Overall, all
methods scale quite well to these problem sizes. We have tested all of the
methods with different source functions (the right-hand side in the Poisson
problem). Our results indicate that FFT is the method of choice for smooth
source functions that require uniform resolution. However, FFT loses its
performance advantage when the source function has highly localized features
like internal sharp layers. FMM and GMG considerably outperform FFT for those
cases. The distinction between FMM and GMG is less pronounced and is sensitive
to the quality (from a performance point of view) of the underlying
implementations. The high-order accurate versions of GMG and FMM significantly
outperform their low-order accurate counterparts.
Comment: 25 pages; accepted paper in SISC journal
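
To illustrate why an FFT-based solver suits smooth sources on a uniform grid, here is a minimal NumPy sketch of a spectral Poisson solve on the unit cube. It assumes periodic boundary conditions and a zero-mean source, neither of which is stated in the abstract, and it is not the solver benchmarked in the paper. The function name `solve_poisson_periodic` is hypothetical.

```python
import numpy as np


def solve_poisson_periodic(f: np.ndarray) -> np.ndarray:
    """Solve -laplacian(u) = f on the periodic unit cube [0, 1)^3.

    `f` is sampled on a uniform n x n x n grid and is assumed to have zero
    mean. The returned solution is normalized to have zero mean as well.
    """
    n = f.shape[0]
    # Angular wavenumbers 2*pi*m for integer modes m on a domain of length 1.
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=1.0 / n)
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0  # placeholder to avoid dividing by zero at the mean mode

    f_hat = np.fft.fftn(f)
    u_hat = f_hat / k2  # in Fourier space, -laplacian becomes multiplication by |k|^2
    u_hat[0, 0, 0] = 0.0  # pin the mean of the solution to zero
    return np.real(np.fft.ifftn(u_hat))


if __name__ == "__main__":
    n = 16
    g = np.arange(n) / n
    x, y, z = np.meshgrid(g, g, g, indexing="ij")
    u_exact = np.sin(2 * np.pi * x) * np.cos(4 * np.pi * y) * np.sin(2 * np.pi * z)
    f = 24 * np.pi**2 * u_exact  # -laplacian(u_exact) = 4*pi^2*(1 + 4 + 1)*u_exact
    print(np.max(np.abs(solve_poisson_periodic(f) - u_exact)))
```

For a smooth, band-limited source like this one, the uniform grid resolves every mode, so the error is at the level of rounding. A source with sharp internal layers would need a fine uniform grid everywhere. This is consistent with the abstract's finding that adaptive FMM and GMG solvers outperform FFT in such cases.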