Optimizing the adaptive fast multipole method for fractal sets
We have performed a detailed analysis of the fast multipole method (FMM) in
the adaptive case, in which the depth of the FMM tree is non-uniform. Previous
works in this area have focused mostly on special types of adaptive
distributions, for example when points accumulate on a 2D manifold or
accumulate around a few points in space. Instead, we considered a more general
situation in which fractal sets, e.g., Cantor sets and generalizations, are
used to create adaptive sets of points. Such sets are characterized by their
dimension, a number between 0 and 3. We introduced a mathematical framework to
define a converging sequence of octrees, and based on that, demonstrated how to
increase .
A new complexity analysis for the adaptive FMM is introduced. It is shown
that the O(N) complexity is achievable for any distribution of
particles, when a modified adaptive FMM is exploited. We analyzed how the FMM
performs for fractal point distributions, and how optimal parameters can be
picked, e.g., the criterion used to stop the subdivision of an FMM cell. A new
subdividing double-threshold method is introduced, and better performance
demonstrated. Parameters in the FMM are modeled as a function of particle
distribution dimension, and the optimal values are obtained. A three
dimensional kernel independent black box adaptive FMM is implemented and used
for all calculations.
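The subdivision stopping criterion mentioned in this abstract can be illustrated with a minimal sketch. It assumes a single threshold (`max_leaf`, a hypothetical parameter name) on the number of points per cell, not the double-threshold method of the paper, whose details are not given here. The point set is a 3D Cantor dust, whose dimension is 3 log 2 / log 3 ≈ 1.89. All function names are illustrative.

```python
import random

def cantor_dust(n_points, depth=8, seed=0):
    """Sample points whose coordinates lie in the middle-third Cantor set."""
    rng = random.Random(seed)

    def coord():
        # Every ternary digit is 0 or 2, which defines the Cantor set.
        return sum(rng.choice((0, 2)) / 3 ** k for k in range(1, depth + 1))

    return [(coord(), coord(), coord()) for _ in range(n_points)]

def build_octree(points, center=(0.5, 0.5, 0.5), half=0.5,
                 max_leaf=32, level=0, max_level=12):
    """Return (level, point_count) for every non-empty leaf of an adaptive octree."""
    # Stopping criterion (assumed): few enough points, or depth cap reached.
    if len(points) <= max_leaf or level == max_level:
        return [(level, len(points))]
    children = {}
    for p in points:
        # Octant key: one bit per axis, set when the point is in the upper half.
        key = tuple(int(p[i] >= center[i]) for i in range(3))
        children.setdefault(key, []).append(p)
    leaves = []
    for key, pts in children.items():
        child_center = tuple(center[i] + (key[i] - 0.5) * half for i in range(3))
        leaves += build_octree(pts, child_center, half / 2,
                               max_leaf, level + 1, max_level)
    return leaves
```

Calling `build_octree(cantor_dust(2000))` returns the leaf levels and occupancies, which can be inspected to see how the tree depth varies across the domain for different values of `max_leaf`.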
On well-separated sets and fast multipole methods
The notion of well-separated sets is crucial in fast multipole methods as the
main idea is to approximate the interaction between such sets via cluster
expansions. We revisit the one-parameter multipole acceptance criterion in a
general setting and derive a relative error estimate. This analysis benefits
asymmetric versions of the method, where the division of the multipole boxes is
more liberal than in conventional codes. Such variants offer a particularly
elegant implementation with a balanced multipole tree, a feature which might be
very favorable on modern computer architectures.
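As a rough illustration of a one-parameter multipole acceptance criterion, the sketch below tests two bounding spheres with radii r1, r2 and center distance d using the inequality R + θ·r ≤ θ·d, where R is the larger and r the smaller radius. This specific inequality and the names `well_separated` and `theta` are assumptions made for the example; the paper should be consulted for the exact criterion it analyzes.

```python
import math

def well_separated(c1, r1, c2, r2, theta=0.5):
    """Assumed criterion: R + theta * r <= theta * d, with R >= r the two radii."""
    d = math.dist(c1, c2)                      # distance between sphere centers
    big, small = max(r1, r2), min(r1, r2)
    return big + theta * small <= theta * d

# Two unit spheres: accepted at distance 4, rejected at distance 2.5.
far = well_separated((0, 0, 0), 1.0, (4, 0, 0), 1.0)
near = well_separated((0, 0, 0), 1.0, (2.5, 0, 0), 1.0)
```

Under this form the test is asymmetric in the radii: a large box paired with a much smaller one can be accepted at a distance where two large boxes would be rejected.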
Data-Driven Execution of Fast Multipole Methods
Fast multipole methods have O(N) complexity, are compute bound, and require
very little synchronization, which makes them a favorable algorithm on
next-generation supercomputers. Their most common application is to accelerate
N-body problems, but they can also be used to solve boundary integral
equations. When the particle distribution is irregular and the tree structure
is adaptive, load-balancing becomes a non-trivial question. A common strategy
for load-balancing FMMs is to use the work load from the previous step as
weights to statically repartition the next step. The authors discuss in the
paper another approach based on data-driven execution to efficiently tackle
this challenging load-balancing problem. The core idea consists of breaking the
most time-consuming stages of the FMMs into smaller tasks. The algorithm can
then be represented as a Directed Acyclic Graph (DAG) where nodes represent
tasks, and edges represent dependencies among them. The execution of the
algorithm is performed by asynchronously scheduling the tasks using the QUARK
runtime environment, in a way such that data dependencies are not violated for
numerical correctness purposes. This asynchronous scheduling results in an
out-of-order execution. The data-driven FMM execution outperforms the
previous strategy and shows linear speedup on a quad-socket quad-core
Intel Xeon system.
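The DAG idea described in this abstract can be sketched with Python's standard `graphlib` module standing in for a task runtime. This is not QUARK and not the authors' code; the task names below (P2M, M2L, L2P, P2P for two well-separated leaf cells `a` and `b`) are a hypothetical, heavily simplified FMM stage graph.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical example graph).
deps = {
    "P2M_a": set(),
    "P2M_b": set(),
    "M2L_a_from_b": {"P2M_b"},
    "M2L_b_from_a": {"P2M_a"},
    "L2P_a": {"M2L_a_from_b"},
    "L2P_b": {"M2L_b_from_a"},
    "P2P_a": set(),
    "P2P_b": set(),
}

def run_tasks(deps, execute):
    """Run tasks as soon as their dependencies are done; return the order used."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    order = []
    while sorter.is_active():
        # Every task returned here is ready; a real runtime would hand
        # these to worker threads instead of running them one by one.
        for task in sorter.get_ready():
            execute(task)
            order.append(task)
            sorter.done(task)
    return order

order = run_tasks(deps, lambda task: None)
```

The execution order is determined only by the dependency edges, so independent tasks such as the P2P work can be interleaved with the far-field stages.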
An FMM Based on Dual Tree Traversal for Many-core Architectures
The present work attempts to integrate the independent efforts in the fast
N-body community to create the fastest N-body library for many-core and
heterogeneous architectures. Focus is placed on low accuracy optimizations, in
response to the recent interest to use FMM as a preconditioner for sparse
linear solvers. A direct comparison with other state-of-the-art fast N-body
codes demonstrates that orders of magnitude increase in performance can be
achieved by careful selection of the optimal algorithm and low-level
optimization of the code. The current N-body solver uses a fast multipole
method with an efficient strategy for finding the list of cell-cell
interactions by a dual tree traversal. A task-based threading model is used to
maximize thread-level parallelism and intra-node load-balancing. In order to
extract the full potential of the SIMD units on the latest CPUs, the inner
kernels are optimized using AVX instructions. Our code -- exaFMM -- is an order
of magnitude faster than the current state-of-the-art FMM codes, which are
themselves an order of magnitude faster than the average FMM code.
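A dual tree traversal can be sketched as a simultaneous recursion over a target tree and a source tree that splits the larger cell until a pair is either accepted as far-field or reduced to two leaves. The example below is a minimal 1D illustration, not the exaFMM implementation; the `Cell` class, the acceptance test `r_a + r_b < theta * d`, and all names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    lo: float
    hi: float
    children: list = field(default_factory=list)

    @property
    def center(self):
        return 0.5 * (self.lo + self.hi)

    @property
    def radius(self):
        return 0.5 * (self.hi - self.lo)

def build(lo, hi, depth):
    """Uniform binary tree over [lo, hi] with the given depth."""
    cell = Cell(lo, hi)
    if depth > 0:
        mid = 0.5 * (lo + hi)
        cell.children = [build(lo, mid, depth - 1), build(mid, hi, depth - 1)]
    return cell

def dual_traverse(a, b, theta, m2l, p2p):
    """Collect far-field (m2l) and near-field (p2p) cell pairs as (lo, hi) tuples."""
    d = abs(a.center - b.center)
    if a.radius + b.radius < theta * d:        # assumed acceptance criterion
        m2l.append(((a.lo, a.hi), (b.lo, b.hi)))
    elif not a.children and not b.children:    # two leaves: direct summation
        p2p.append(((a.lo, a.hi), (b.lo, b.hi)))
    elif not b.children or (a.children and a.radius >= b.radius):
        for child in a.children:               # split the larger (or only splittable) cell
            dual_traverse(child, b, theta, m2l, p2p)
    else:
        for child in b.children:
            dual_traverse(a, child, theta, m2l, p2p)

root = build(0.0, 1.0, 3)
m2l, p2p = [], []
dual_traverse(root, root, 0.5, m2l, p2p)
```

Each pair of leaves ends up covered by exactly one entry of `m2l` or `p2p`, so the interaction lists are found without precomputing neighbor lists per cell.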
FFT, FMM, or Multigrid? A comparative Study of State-Of-the-Art Poisson Solvers for Uniform and Nonuniform Grids in the Unit Cube
In this work, we benchmark and discuss the performance of the scalable
methods for the Poisson problem which are used widely in practice: the fast
Fourier transform (FFT), the fast multipole method (FMM), the geometric
multigrid (GMG), and algebraic multigrid (AMG). In total we compare five
different codes, three of which are developed in our group. Our FFT, GMG, and
FMM are parallel solvers that use high-order approximation schemes for Poisson
problems with continuous forcing functions (the source or right-hand side). We
examine and report results for weak scaling, strong scaling, and time to
solution for uniform and highly refined grids. We present results on the
Stampede system at the Texas Advanced Computing Center and on the Titan system
at the Oak Ridge National Laboratory. In our largest test case, we solved a
problem with 600 billion unknowns on 229,379 cores of Titan. Overall, all
methods scale quite well to these problem sizes. We have tested all of the
methods with different source functions (the right-hand side in the Poisson
problem). Our results indicate that FFT is the method of choice for smooth
source functions that require uniform resolution. However, FFT loses its
performance advantage when the source function has highly localized features
like internal sharp layers. FMM and GMG considerably outperform FFT for those
cases. The distinction between FMM and GMG is less pronounced and is sensitive
to the quality (from a performance point of view) of the underlying
implementations. The high-order accurate versions of GMG and FMM significantly
outperform their low-order accurate counterparts.
Comment: 25 pages; accepted paper in SISC journal
Dynamic autotuning of adaptive fast multipole methods on hybrid multicore CPU & GPU systems
We discuss an implementation of adaptive fast multipole methods targeting
hybrid multicore CPU- and GPU-systems. From previous experiences with the
computational profile of our version of the fast multipole algorithm, suitable
parts are off-loaded to the GPU, while the remaining parts are threaded and
executed concurrently by the CPU. The parameters defining the algorithm affect
the performance, and by measuring this effect we are able to dynamically balance
the algorithm towards optimal performance. Our setup uses the dynamic nature of
the computations and is therefore of general character.
Fast Multipole Method as a Matrix-Free Hierarchical Low-Rank Approximation
There has been a large increase in the amount of work on hierarchical
low-rank approximation methods, where the interest is shared by multiple
communities that previously did not intersect. The objective of this article
is two-fold: to provide a thorough review of the recent advancements in this
field from both analytical and algebraic perspectives, and to present a
comparative benchmark of two highly optimized implementations of contrasting
methods for some simple yet representative test cases. We categorize the recent
advances in this field from the perspective of compute-memory tradeoff, which
has not been considered in much detail in this area. Benchmark tests reveal
that there is a large difference in the memory consumption and performance
between the different methods.
Comment: 19 pages, 6 figures
Fast Multipole Preconditioners for Sparse Matrices Arising from Elliptic Equations
Among optimal hierarchical algorithms for the computational solution of
elliptic problems, the Fast Multipole Method (FMM) stands out for its
adaptability to emerging architectures, having high arithmetic intensity,
tunable accuracy, and relaxable global synchronization requirements. We
demonstrate that, beyond its traditional use as a solver in problems for which
explicit free-space kernel representations are available, the FMM has
applicability as a preconditioner in finite domain elliptic boundary value
problems, by equipping it with boundary integral capability for satisfying
conditions at finite boundaries and by wrapping it in a Krylov method for
extensibility to more general operators. Here, we do not discuss the well
developed applications of FMM to implement matrix-vector multiplications within
Krylov solvers of boundary element methods. Instead, we propose using FMM for
the volume-to-volume contribution of inhomogeneous Poisson-like problems, where
the boundary integral is a small part of the overall computation. Our method
may be used to precondition sparse matrices arising from finite
difference/element discretizations, and can handle a broader range of
scientific applications. Compared with multigrid methods, it is capable of
comparable algebraic convergence rates down to the truncation error of the
discretized PDE, and it offers potentially superior multicore and distributed
memory scalability properties on commodity architecture supercomputers.
Compared with other methods exploiting the low rank character of off-diagonal
blocks of the dense resolvent operator, FMM-preconditioned Krylov iteration may
reduce the amount of communication because it is matrix-free and exploits the
tree structure of FMM. We describe our tests in reproducible detail with freely
available codes and outline directions for further extensibility.
Comment: 17 pages, 9 figures
A parallel directional Fast Multipole Method
This paper introduces a parallel directional fast multipole method (FMM) for
solving N-body problems with highly oscillatory kernels, with a focus on the
Helmholtz kernel in three dimensions. This class of oscillatory kernels
requires a more restrictive low-rank criterion than that of the low-frequency
regime, and thus effective parallelizations must adapt to the modified data
dependencies. We propose a simple partition at a fixed level of the octree and
show that, if the partitions are properly balanced between p processes, the
overall runtime is essentially O(N log N/p + p). By the structure of the
low-rank criterion, we are able to avoid communication at the top of the
octree. We demonstrate the effectiveness of our parallelization on several
challenging models.
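The runtime estimate O(N log N/p + p) implies that adding processes stops helping once the additive p term dominates. The sketch below treats the bound as an exact cost with all constants set to 1 and uses a base-2 logarithm, both of which are assumptions made only to illustrate the trade-off; the function names are hypothetical.

```python
import math

def runtime_model(n, p):
    """Idealized cost N log N / p + p, constants set to 1 (an assumption)."""
    return n * math.log2(n) / p + p

def balanced_process_count(n):
    """Minimizer of the model: setting d/dp to zero gives p = sqrt(N log N)."""
    return math.sqrt(n * math.log2(n))

n = 2 ** 20
p_star = balanced_process_count(n)
```

At `p_star` the two terms of the model are equal, so the modeled cost is `2 * p_star`; for smaller process counts the N log N/p term dominates and the scaling is close to ideal.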
DASHMM Accelerated Adaptive Fast Multipole Poisson-Boltzmann Solver on Distributed Memory Architecture
We present an updated version of the AFMPB package for fast calculation of
molecular solvation free energy. The main feature of the new version is the
successful adoption of the DASHMM library, which enables AFMPB to operate on
distributed memory computers. As a result, the new version can easily handle
larger molecules or situations with higher accuracy requirements. To
demonstrate the updated code, we applied the new version to a dengue virus
system with more than one million atoms and a mesh with approximately 20
million triangles, and were able to reduce the time-to-solution from 10 hours
reported in the previous release on a shared memory computer to less than 30
seconds on a Cray XC30 cluster using 12,288 cores.