Spherical harmonic transform with GPUs
We describe an algorithm for computing an inverse spherical harmonic
transform suitable for graphic processing units (GPU). We use CUDA and base our
implementation on a Fortran90 routine included in a publicly available parallel
package, S2HAT. We focus our attention on the two major sequential steps
involved in the transform computation, retaining the efficient parallel
framework of the original code. We detail optimization techniques used to
enhance the performance of the CUDA-based code and contrast them with those
implemented in the Fortran90 version. We also present performance comparisons
of a single CPU plus GPU unit with the S2HAT code running on either a single or
4 processors. In particular we find that use of the latest generation of GPUs,
such as NVIDIA GF100 (Fermi), can accelerate the spherical harmonic transforms
by as much as 18 times with respect to S2HAT executed on one core, and by as
much as 5.5 times with respect to S2HAT on 4 cores, with the overall performance
being limited by the fast Fourier transforms. The work presented here has been
performed in the context of the Cosmic Microwave Background simulations and
analysis. However, we expect that the developed software will be of more
general interest and applicability.
Performance Analysis of a Novel GPU Computation-to-core Mapping Scheme for Robust Facet Image Modeling
Though the GPGPU concept is well known
in image processing, much more work remains to be done
to fully exploit GPUs as an alternative computation
engine. This paper investigates the computation-to-core
mapping strategies to probe the efficiency and scalability
of the robust facet image modeling algorithm on GPUs.
Our fine-grained computation-to-core mapping scheme
shows a significant performance gain over the standard
pixel-wise mapping scheme. With in-depth performance
comparisons across the two different mapping schemes,
we analyze the impact of the level of parallelism on
the GPU computation and suggest two principles for
optimizing future image processing applications on the
GPU platform.
Accelerating moderately stiff chemical kinetics in reactive-flow simulations using GPUs
The chemical kinetics ODEs arising from operator-split reactive-flow
simulations were solved on GPUs using explicit integration algorithms. Nonstiff
chemical kinetics of a hydrogen oxidation mechanism (9 species and 38
irreversible reactions) were computed using the explicit fifth-order
Runge-Kutta-Cash-Karp method, and the GPU-accelerated version performed faster
than single- and six-core CPU versions by factors of 126 and 25, respectively,
for 524,288 ODEs. Moderately stiff kinetics, represented with mechanisms for
hydrogen/carbon-monoxide (13 species and 54 irreversible reactions) and methane
(53 species and 634 irreversible reactions) oxidation, were computed using the
stabilized explicit second-order Runge-Kutta-Chebyshev (RKC) algorithm. With
the hydrogen/carbon-monoxide mechanism, the GPU-based RKC implementation ran
nearly 59 and 10 times faster than the single- and six-core CPU-based RKC
algorithms, respectively, for problem sizes of 262,144 ODEs and larger. With
the methane mechanism, RKC-GPU
performed more than 65 and 11 times faster, for problem sizes consisting of
131,072 ODEs and larger, than the single- and six-core RKC-CPU versions, and up
to 57 times faster than the six-core CPU-based implicit VODE algorithm on
65,536 ODEs. In the presence of more severe stiffness, such as ethylene
oxidation (111 species and 1566 irreversible reactions), RKC-GPU performed more
than 17 times faster than RKC-CPU on six cores for 32,768 ODEs and larger, and
at best 4.5 times faster than VODE on six CPU cores for 65,536 ODEs. With a
larger time step size, RKC-GPU performed at best 2.5 times slower than six-core
VODE for 8192 ODEs and larger. Therefore, the need for developing new
strategies for integrating stiff chemistry on GPUs was discussed.
Comment: 27 pages, LaTeX; corrected typos in Appendix equations A.10 and A.1
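As a minimal sketch of the explicit integration described in the abstract above, the Python code below advances a batch of independent ODE systems with a fixed-step, fifth-order Cash-Karp Runge-Kutta step. The two-species decay system, the names `rkck_step`, `integrate_batch` and `toy_kinetics`, and the fixed step size without error control are illustrative assumptions, not the authors' implementation. The plain loop over systems only stands in for a parallel mapping, which the abstract does not specify.

```python
import math

# Cash-Karp Runge-Kutta tableau: nodes C, stage coefficients A,
# and the fifth-order solution weights B5.
C = [0.0, 1 / 5, 3 / 10, 3 / 5, 1.0, 7 / 8]
A = [
    [],
    [1 / 5],
    [3 / 40, 9 / 40],
    [3 / 10, -9 / 10, 6 / 5],
    [-11 / 54, 5 / 2, -70 / 27, 35 / 27],
    [1631 / 55296, 175 / 512, 575 / 13824, 44275 / 110592, 253 / 4096],
]
B5 = [37 / 378, 0.0, 250 / 621, 125 / 594, 0.0, 512 / 1771]


def rkck_step(f, t, y, h):
    """Advance state y by one fixed step h using the fifth-order weights."""
    k = []
    for i in range(6):
        # Stage state: y plus a weighted sum of the earlier stage derivatives.
        y_stage = [
            y[n] + h * sum(A[i][j] * k[j][n] for j in range(i))
            for n in range(len(y))
        ]
        k.append(f(t + C[i] * h, y_stage))
    return [
        y[n] + h * sum(B5[i] * k[i][n] for i in range(6))
        for n in range(len(y))
    ]


def toy_kinetics(t, y, rate=1.0):
    """Hypothetical first-order reaction A -> B; not a mechanism from the paper."""
    return [-rate * y[0], rate * y[0]]


def integrate_batch(f, initial_states, t_end, n_steps):
    """Integrate every system in the batch from t = 0 to t_end."""
    h = t_end / n_steps
    results = []
    for y in initial_states:  # each system is independent of the others
        t = 0.0
        for _ in range(n_steps):
            y = rkck_step(f, t, y, h)
            t += h
        results.append(y)
    return results
```

Because each system in the batch is integrated without reference to the others, the outer loop is the natural place to parallelize in a sketch like this one.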
On the Efficient Evaluation of the Exchange Correlation Potential on Graphics Processing Unit Clusters
The predominance of Kohn-Sham density functional theory (KS-DFT) for the
theoretical treatment of large experimentally relevant systems in molecular
chemistry and materials science relies primarily on the existence of efficient
software implementations which are capable of leveraging the latest advances in
modern high performance computing (HPC). With recent trends in HPC leading
towards an increasing reliance on heterogeneous accelerator-based architectures
such as graphics processing units (GPU), existing code bases must embrace these
architectural advances to maintain the high levels of performance which have
come to be expected for these methods. In this work, we propose a three-level
parallelism scheme for the distributed numerical integration of the
exchange-correlation (XC) potential in the Gaussian basis set discretization of
the Kohn-Sham equations on large computing clusters consisting of multiple GPUs
per compute node. In addition, we propose and demonstrate the efficacy of the
use of batched kernels, including batched level-3 BLAS operations, in achieving
high levels of performance on the GPU. We demonstrate the performance and
scalability of the implementation of the proposed method in the NWChemEx
software package by comparing to the existing scalable CPU XC integration in
NWChem.
Comment: 26 pages, 9 figures
Three-dimensional shapelets and an automated classification scheme for dark matter haloes
We extend the two-dimensional Cartesian shapelet formalism to d dimensions.
Concentrating on the three-dimensional case, we derive shapelet-based equations
for the mass, centroid, root-mean-square radius, and components of the
quadrupole moment and moment of inertia tensors. Using cosmological N-body
simulations as an application domain, we show that three-dimensional shapelets
can be used to replicate the complex sub-structure of dark matter halos and
demonstrate the basis of an automated classification scheme for halo shapes. We
investigate the shapelet decomposition process from an algorithmic viewpoint,
and consider opportunities for accelerating the computation of shapelet-based
representations using graphics processing units (GPUs).
Comment: 19 pages, 11 figures, accepted for publication in MNRA
Distributed Memory, GPU Accelerated Fock Construction for Hybrid, Gaussian Basis Density Functional Theory
With the growing reliance of modern supercomputers on accelerator-based
architectures such as GPUs, the development and optimization of electronic
structure methods to exploit these massively parallel resources has become a
recent priority. While significant strides have been made in the development of
GPU accelerated, distributed memory algorithms for many-body (e.g.
coupled-cluster) and spectral single-body (e.g. planewave, real-space and
finite-element density functional theory [DFT]) methods, the vast majority of
GPU-accelerated Gaussian atomic orbital methods have focused on shared memory
systems with only a handful of examples pursuing massive parallelism on
distributed memory GPU architectures. In the present work, we present a set of
distributed memory algorithms for the evaluation of the Coulomb and
exact-exchange matrices for hybrid Kohn-Sham DFT with Gaussian basis sets via
direct density-fitted (DF-J-Engine) and seminumerical (sn-K) methods,
respectively. The absolute performance and strong scalability of the developed
methods are demonstrated on systems ranging from a few hundred to over one
thousand atoms using up to 128 NVIDIA A100 GPUs on the Perlmutter
supercomputer.
Comment: 45 pages, 9 figures
QTM: computational package using MPI protocol for quantum trajectories method
The Quantum Trajectories Method (QTM) is one of the frequently used methods
for studying open quantum systems. The main idea of this method is the
evolution of wave functions which describe the system (as functions of time).
Then, so-called quantum jumps are applied at a randomly selected point in
time. The obtained system state is called a trajectory. After averaging
many single trajectories, we obtain the approximation of the behavior of a
quantum system. This fact also allows us to use parallel computation
methods. In the article, we discuss the QTM package which is supported by the
MPI technology. Using MPI allowed utilizing parallel computing for
calculating the trajectories and averaging them; as the effect of these
actions, the time taken by calculations is shorter. In spite of using the
C++ programming language, the presented solution is easy to utilize and does
not need any advanced programming techniques. At the same time, it offers a
higher performance than other packages realizing the QTM. It is especially
important in the case of harder computational tasks, and the use of MPI
allows improving the performance of particular problems which can be solved
in the field of open quantum systems.
Comment: 28 pages, 9 figures
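The idea of averaging independent trajectories, as outlined in the abstract above, can be sketched in a few lines of Python. The example is an assumption-laden toy: a two-level system that starts in its excited state and decays at a hypothetical rate `gamma`, propagated with a first-order time step. The function names `run_trajectory` and `average_trajectories` are invented for illustration and are not part of the QTM package, and the serial loop over trajectories stands in for the MPI distribution the package provides.

```python
import math
import random


def run_trajectory(gamma, dt, n_steps, rng):
    """Return the excited-state population after each step of one trajectory."""
    c_g, c_e = 0j, 1 + 0j  # amplitudes of the ground and excited states
    populations = []
    for _ in range(n_steps):
        # Probability of a quantum jump (decay) during this time step.
        p_jump = gamma * dt * abs(c_e) ** 2
        if rng.random() < p_jump:
            # Jump: the system collapses onto the ground state.
            c_g, c_e = 1 + 0j, 0j
        else:
            # No jump: damp the excited amplitude, then renormalize the state.
            c_e *= 1 - 0.5 * gamma * dt
            norm = math.sqrt(abs(c_g) ** 2 + abs(c_e) ** 2)
            c_g, c_e = c_g / norm, c_e / norm
        populations.append(abs(c_e) ** 2)
    return populations


def average_trajectories(n_traj, gamma, dt, n_steps, seed=0):
    """Average the excited-state population over n_traj independent runs."""
    rng = random.Random(seed)
    totals = [0.0] * n_steps
    for _ in range(n_traj):  # trajectories do not depend on each other
        for k, p in enumerate(run_trajectory(gamma, dt, n_steps, rng)):
            totals[k] += p
    return [total / n_traj for total in totals]
```

For this toy model the averaged population should approach the exponential decay exp(-gamma t) as the number of trajectories grows and the step shrinks. Since no trajectory depends on another, each worker could compute its own share and the partial sums could be combined at the end.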
Parallelization of dynamic programming recurrences in computational biology
The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays (FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15 and 130 times faster than a modern dual-core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 3 GHz Intel Core 2 Duo processors. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors.
Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms.
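For readers unfamiliar with the Nussinov recurrence named in the abstract above, the following is a minimal software sketch in Python of the standard base-pair maximization recurrence. It is not the FPGA design from the work. The pairing rule (Watson-Crick pairs plus the G-U wobble pair), the `min_loop` parameter and the function names are assumptions made for illustration.

```python
# Allowed base pairs (assumed here: Watson-Crick pairs plus G-U wobble).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def can_pair(a, b):
    """Return True if bases a and b may form a pair under the assumed rule."""
    return (a, b) in PAIRS


def nussinov(seq, min_loop=0):
    """Return the maximum number of nested base pairs in seq.

    min_loop is the minimum number of unpaired bases required between
    two paired positions.
    """
    n = len(seq)
    if n == 0:
        return 0
    # table[i][j] holds the best score for the subsequence seq[i..j].
    table = [[0] * n for _ in range(n)]
    # Fill by increasing span; cells that share a span do not depend on
    # each other, so each anti-diagonal could be computed in parallel.
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = max(table[i + 1][j], table[i][j - 1])  # i or j unpaired
            if j - i > min_loop and can_pair(seq[i], seq[j]):
                inner = table[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)  # i pairs with j
            for k in range(i + 1, j):  # split into two independent parts
                best = max(best, table[i][k] + table[k + 1][j])
            table[i][j] = best
    return table[0][n - 1]
```

Each cell depends only on cells with a shorter span, which is the kind of fine-grained, regular dependence structure that the abstract describes exploiting in hardware.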