Fast Multipole Method as a Matrix-Free Hierarchical Low-Rank Approximation
There has been a large increase in the amount of work on hierarchical
low-rank approximation methods, where the interest is shared by multiple
communities that previously did not intersect. The objective of this article
is two-fold: to provide a thorough review of the recent advancements in this
field from both analytical and algebraic perspectives, and to present a
comparative benchmark of two highly optimized implementations of contrasting
methods for some simple yet representative test cases. We categorize the recent
advances in this field from the perspective of compute-memory tradeoff, which
has not been considered in much detail in this area. Benchmark tests reveal
that there is a large difference in the memory consumption and performance
between the different methods. Comment: 19 pages, 6 figures
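The matrix-free, low-rank idea behind the FMM can be illustrated with a single truncated multipole expansion. The following is only a minimal sketch, assuming the 2D Laplace kernel log|z - z_i| with points stored as complex numbers; the function names, the cluster geometry, and the expansion order are illustrative assumptions and are not taken from the article or its benchmarked codes.

```python
import cmath
import math


def multipole_coefficients(sources, charges, center, order):
    """Return (Q, [a_1..a_p]) for the 2D log kernel expanded about `center`."""
    total_charge = sum(charges)
    coeffs = []
    for k in range(1, order + 1):
        # a_k = -(1/k) * sum_i q_i (z_i - c)^k
        a_k = -sum(q * (z - center) ** k for z, q in zip(sources, charges)) / k
        coeffs.append(a_k)
    return total_charge, coeffs


def evaluate_multipole(total_charge, coeffs, center, target):
    """Far-field potential Re[Q log(z - c) + sum_k a_k / (z - c)^k]."""
    dz = target - center
    value = total_charge * cmath.log(dz)
    for k, a_k in enumerate(coeffs, start=1):
        value += a_k / dz ** k
    return value.real


def direct_potential(sources, charges, target):
    """Reference: direct sum of q_i log|z - z_i| over all sources."""
    return sum(q * math.log(abs(target - z)) for z, q in zip(sources, charges))


# Illustrative cluster: 50 sources within radius ~0.57 of the origin.
sources = [complex(0.4 * math.cos(i), 0.4 * math.sin(2 * i)) for i in range(50)]
charges = [1.0 if i % 2 == 0 else -0.5 for i in range(50)]
target = complex(2.5, 1.5)  # well separated from the cluster

Q, coeffs = multipole_coefficients(sources, charges, 0j, order=10)
approx = evaluate_multipole(Q, coeffs, 0j, target)
exact = direct_potential(sources, charges, target)
error = abs(approx - exact)
```

Here eleven numbers (Q and ten coefficients) stand in for the 50 sources when the target is well separated, and no interaction matrix is ever formed. The error shrinks geometrically with the expansion order, which is the accuracy-versus-work knob that FMM-type methods tune.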
FFT, FMM, or Multigrid? A comparative Study of State-Of-the-Art Poisson Solvers for Uniform and Nonuniform Grids in the Unit Cube
In this work, we benchmark and discuss the performance of the scalable
methods for the Poisson problem which are used widely in practice: the fast
Fourier transform (FFT), the fast multipole method (FMM), the geometric
multigrid (GMG), and algebraic multigrid (AMG). In total we compare five
different codes, three of which are developed in our group. Our FFT, GMG, and
FMM are parallel solvers that use high-order approximation schemes for Poisson
problems with continuous forcing functions (the source or right-hand side). We
examine and report results for weak scaling, strong scaling, and time to
solution for uniform and highly refined grids. We present results on the
Stampede system at the Texas Advanced Computing Center and on the Titan system
at the Oak Ridge National Laboratory. In our largest test case, we solved a
problem with 600 billion unknowns on 229,379 cores of Titan. Overall, all
methods scale quite well to these problem sizes. We have tested all of the
methods with different source functions (the right-hand side in the Poisson
problem). Our results indicate that FFT is the method of choice for smooth
source functions that require uniform resolution. However, FFT loses its
performance advantage when the source function has highly localized features
like internal sharp layers. FMM and GMG considerably outperform FFT for those
cases. The distinction between FMM and GMG is less pronounced and is sensitive
to the quality (from a performance point of view) of the underlying
implementations. The high-order accurate versions of GMG and FMM significantly
outperform their low-order accurate counterparts. Comment: 25 pages; accepted paper in SISC journal
Fast Multipole Preconditioners for Sparse Matrices Arising from Elliptic Equations
Among optimal hierarchical algorithms for the computational solution of
elliptic problems, the Fast Multipole Method (FMM) stands out for its
adaptability to emerging architectures, having high arithmetic intensity,
tunable accuracy, and relaxable global synchronization requirements. We
demonstrate that, beyond its traditional use as a solver in problems for which
explicit free-space kernel representations are available, the FMM has
applicability as a preconditioner in finite domain elliptic boundary value
problems, by equipping it with boundary integral capability for satisfying
conditions at finite boundaries and by wrapping it in a Krylov method for
extensibility to more general operators. Here, we do not discuss the well
developed applications of FMM to implement matrix-vector multiplications within
Krylov solvers of boundary element methods. Instead, we propose using FMM for
the volume-to-volume contribution of inhomogeneous Poisson-like problems, where
the boundary integral is a small part of the overall computation. Our method
may be used to precondition sparse matrices arising from finite
difference/element discretizations, and can handle a broader range of
scientific applications. Compared with multigrid methods, it is capable of
comparable algebraic convergence rates down to the truncation error of the
discretized PDE, and it offers potentially superior multicore and distributed
memory scalability properties on commodity architecture supercomputers.
Compared with other methods exploiting the low rank character of off-diagonal
blocks of the dense resolvent operator, FMM-preconditioned Krylov iteration may
reduce the amount of communication because it is matrix-free and exploits the
tree structure of FMM. We describe our tests in reproducible detail with freely
available codes and outline directions for further extensibility. Comment: 17 pages, 9 figures
A Finite Element Based P3M Method for N-body Problems
We introduce a fast mesh-based method for computing N-body interactions that
is both scalable and accurate. The method is founded on a
particle-particle particle-mesh (P3M) approach, which decomposes a potential
into rapidly decaying short-range interactions and smooth, mesh-resolvable
long-range interactions. However, in contrast to the traditional approach of
using Gaussian screen functions to accomplish this decomposition, our method
employs specially designed polynomial bases to construct the screened
potentials. Because of this form of the screen, the long-range component of the
potential is then solved exactly with a finite element method, leading
ultimately to a sparse matrix problem that is solved efficiently with standard
multigrid methods. Moreover, since this system represents an exact
discretization, the optimal resolution properties of the FFT are unnecessary,
though the short-range calculation is now more involved than P3M/PME methods.
We introduce the method, analyze its key properties, and demonstrate the
accuracy of the algorithm. Comment: 20 pages, submitted to SIS
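The short-range/long-range decomposition mentioned in this abstract can be sketched with the traditional Gaussian (Ewald-type) screen, which the abstract contrasts with its own approach. This is not the polynomial screen proposed in the paper; the splitting parameter `alpha` and the function names are assumptions made for illustration.

```python
import math


def short_range(r, alpha):
    """Rapidly decaying part of 1/r: erfc(alpha * r) / r."""
    return math.erfc(alpha * r) / r


def long_range(r, alpha):
    """Smooth, mesh-resolvable part of 1/r: erf(alpha * r) / r."""
    if r == 0.0:
        # Finite limit of erf(alpha * r) / r as r -> 0.
        return 2.0 * alpha / math.sqrt(math.pi)
    return math.erf(alpha * r) / r


alpha = 1.0
# The two parts always add up to the bare Coulomb potential 1/r.
split_errors = [
    abs(short_range(r, alpha) + long_range(r, alpha) - 1.0 / r)
    for r in (0.1, 0.5, 1.0, 2.0, 5.0)
]
```

The short-range part is negligible beyond a few multiples of 1/alpha, so it can be summed over near neighbors only, while the long-range part stays finite at r = 0 and can be resolved on a mesh.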
Learning with Analytical Models
To understand and predict the performance of scientific applications, several
analytical and machine learning approaches have been proposed, each having its
advantages and disadvantages. In this paper, we propose and validate a hybrid
approach for performance modeling and prediction, which combines analytical and
machine learning models. The proposed hybrid model aims to minimize prediction
cost while providing reasonable prediction accuracy. Our validation results
show that the hybrid model is able to learn and correct the analytical models
to better match the actual performance. Furthermore, the proposed hybrid model
improves the prediction accuracy in comparison to pure machine learning
techniques while using small training datasets, thus making it suitable for
hardware and workload changes.
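One simple way to picture a hybrid of analytical and learned models is to fit a small correction on top of an analytical prediction. The sketch below is a hypothetical illustration only and is not the model from the paper: the analytical cost formula, the assumed flop rate, the linear form of the correction, and the synthetic "measurements" are all assumptions.

```python
def analytical_time(n, flop_rate=1.0e9):
    """Hypothetical analytical model: 2 n^3 flops at a fixed, assumed flop rate."""
    return 2.0 * n ** 3 / flop_rate


def fit_correction(sizes, measured):
    """Least-squares fit of measured ~ a * analytical_time(n) + b."""
    xs = [analytical_time(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(measured) / len(measured)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, measured))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def hybrid_predict(n, a, b):
    """Analytical prediction corrected by the learned scale and offset."""
    return a * analytical_time(n) + b


# Synthetic training data (not real measurements): the "machine" is assumed to
# run 30% slower than the analytical model predicts, plus a fixed 20 ms overhead.
train_sizes = [200, 400, 600, 800, 1000]
train_times = [1.3 * analytical_time(n) + 0.02 for n in train_sizes]

a, b = fit_correction(train_sizes, train_times)
prediction = hybrid_predict(1200, a, b)
```

Because the analytical model already carries the n^3 scaling, only two parameters have to be learned, which is why a handful of training points is enough in this toy setting.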
Optimal, scalable forward models for computing gravity anomalies
We describe three approaches for computing a gravity signal from a density
anomaly. The first approach consists of the classical "summation" technique,
whilst the remaining two methods solve the Poisson problem for the
gravitational potential using either a Finite Element (FE) discretization
employing a multilevel preconditioner, or a Green's function evaluated with the
Fast Multipole Method (FMM). The methods utilizing the PDE formulation
described here differ from previously published approaches used in gravity
modeling in that they are optimal, implying that both the memory and
computational time required scale linearly with respect to the number of
unknowns in the potential field. Additionally, all of the implementations
presented here are developed such that the computations can be performed in a
massively parallel, distributed memory computing environment. Through numerical
experiments, we compare the methods on the basis of their discretization error,
CPU time and parallel scalability. We demonstrate the parallel scalability of
all these techniques by running forward models with up to voxels on
1000's of cores. Comment: 38 pages, 13 figures; accepted by Geophysical Journal International
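The classical "summation" technique named in this abstract can be sketched as a direct sum over density voxels. This is a minimal sketch under simplifying assumptions that are not taken from the paper: each voxel is treated as a point mass at its center, only the vertical component is computed, and the function name and data layout are hypothetical.

```python
import math

G = 6.674e-11  # gravitational constant in m^3 kg^-1 s^-2


def gravity_by_summation(voxels, observation_points):
    """Vertical attraction g_z at each observation point.

    voxels: iterable of (x, y, z, density, volume), each treated as a point mass.
    observation_points: iterable of (x, y, z).
    """
    results = []
    for ox, oy, oz in observation_points:
        gz = 0.0
        for x, y, z, density, volume in voxels:
            dx, dy, dz = x - ox, y - oy, z - oz
            r = math.sqrt(dx * dx + dy * dy + dz * dz)
            # z-component of G * m * (r_voxel - r_obs) / |r|^3
            gz += G * density * volume * dz / r ** 3
        results.append(gz)
    return results


# One anomalous voxel 100 m directly below a single observation point.
voxels = [(0.0, 0.0, -100.0, 500.0, 1000.0)]
stations = [(0.0, 0.0, 0.0)]
gz_values = gravity_by_summation(voxels, stations)
```

The nested loops make the cost proportional to the number of voxels times the number of observation points, which is the scaling that the PDE-based and FMM-based approaches in the abstract are designed to avoid.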
Flexibly imposing periodicity in kernel independent FMM: A Multipole-To-Local operator approach
An important but missing component in the application of the kernel
independent fast multipole method (KIFMM) is the capability for flexibly and
efficiently imposing singly, doubly, and triply periodic boundary conditions.
In most popular packages such periodicities are imposed with the hierarchical
repetition of periodic boxes, which may give an incorrect answer due to the
conditional convergence of some kernel sums. Here we present an efficient
method to properly impose periodic boundary conditions using a near-far
splitting scheme. The near-field contribution is directly calculated with the
KIFMM method, while the far-field contribution is calculated with a
multipole-to-local (M2L) operator which is independent of the source and target
point distribution. The M2L operator is constructed with the far-field portion
of the kernel function to generate the far-field contribution with the downward
equivalent source points in KIFMM. This method guarantees that the sum of the
near-field and far-field contributions converges pointwise to results satisfying periodicity
and compatibility conditions. The computational cost of the far-field
calculation observes the same complexity as FMM and is
designed to be small by reusing the data computed by KIFMM for the near-field.
The far-field calculations require no additional control parameters and
observe the same theoretical error bound as KIFMM. We present accuracy and
timing test results for the Laplace kernel in singly periodic domains and the
Stokes velocity kernel in doubly and triply periodic domains.
BAGEL: Brilliantly Advanced General Electronic-structure Library
On behalf of the development team, I review the capabilities of the BAGEL
program package in this article. BAGEL is a newly-developed full-fledged
program package for electronic-structure computation in quantum chemistry,
which is released under the GNU General Public License with many contributions
from the developers. The unique features include analytical CASPT2 nuclear
energy gradients and derivative couplings, relativistic multireference wave
functions based on the Dirac equation, and implementations of novel electronic
structure theories. All of the programs are efficiently parallelized using both
threads and MPI processes. We also discuss the code generator SMITH3, which has
been used to implement some of the programs in BAGEL. The developers'
contributions are listed at the end of the main text. Comment: Software Focus article, WIREs: Computational Molecular Science
Hierarchical Matrix Operations on GPUs: Matrix-Vector Multiplication and Compression
Hierarchical matrices are space and time efficient representations of dense
matrices that exploit the low rank structure of matrix blocks at different
levels of granularity. The hierarchically low rank block partitioning produces
representations that can be stored and operated on in near-linear complexity
instead of the usual polynomial complexity of dense matrices. In this paper, we
present high performance implementations of matrix vector multiplication and
compression operations for the $\mathcal{H}^2$ variant of hierarchical matrices
on GPUs. This variant exploits, in addition to the hierarchical block
partitioning, hierarchical bases for the block representations and results in a
scheme that requires only $O(n)$ storage and $O(n)$ complexity for the mat-vec
and compression kernels. These two operations are at the core of algebraic
operations for hierarchical matrices, the mat-vec being a ubiquitous operation
in numerical algorithms while compression/recompression represents a key
building block for other algebraic operations, which require periodic
recompression during execution. The difficulties in developing efficient GPU
algorithms come primarily from the irregular tree data structures that underlie
the hierarchical representations, and the key to performance is to recast the
computations on flattened trees in ways that allow batched linear algebra
operations to be performed. This requires marshaling the irregularly laid out
data in a way that allows them to be used by the batched routines. Marshaling
operations only involve pointer arithmetic with no data movement and as a
result have minimal overhead. Our numerical results on covariance matrices from
2D and 3D problems from spatial statistics show the high efficiency our
routines achieve: over 550 GB/s for the bandwidth-limited mat-vec and over
850 GFLOP/s in sustained performance for the compression on the P100 Pascal
GPU.
A Study of Three Dimensional Edge and Corner Problems using the neBEM Solver
The previously reported neBEM solver has been used to solve electrostatic
problems having three-dimensional edges and corners in the physical domain.
Both rectangular and triangular elements have been used to discretize the
geometries under study. In order to maintain a very high level of precision, a
library of C functions yielding exact values of potential and flux influences
due to uniform surface distribution of singularities on flat triangular and
rectangular elements has been developed and used. Here we present the exact
expressions proposed for computing the influence of uniform singularity
distributions on triangular elements and illustrate their accuracy. We then
consider several problems of electrostatics containing edges and singularities
of various orders including plates and cubes, and L-shaped conductors. We have
tried to show that using the approach proposed in the earlier paper on neBEM
and its present enhanced (through the inclusion of triangular elements) form,
it is possible to obtain accurate estimates of integral features such as the
capacitance of a given conductor and detailed ones such as the charge density
distribution at the edges / corners without resorting to any new or special
formulation. Results obtained using neBEM have been compared extensively with
both existing analytical and numerical results. The comparisons illustrate the
accuracy, flexibility and robustness of the new approach quite comprehensively. Comment: Submitted to Elsevier