Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices
Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive-based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem to small kernels (e.g., SpMM, SpMV). Our LOBPCG GPU implementation achieves a 2.8×–4.3× speedup over an optimized CPU implementation when tested with four different input matrices. The evaluated configuration compared one Skylake CPU against one Skylake CPU paired with one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations deliver nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels in LOBPCG (the inner-product and SpMM kernels) and then evaluate performance under two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves 2.9× and 48.2× speedups over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU-to-GPU interconnects, respectively.
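The abstract above centers on LOBPCG's two dominant kernels: the block inner products and the sparse-times-dense product A·X (SpMM). As a frame of reference for what the solver computes, here is a minimal, unpreconditioned LOBPCG-style iteration in dense NumPy. It is a sketch of the algorithmic skeleton only, not the paper's GPU implementation; the function name, loop structure, and dense storage are our own simplifications.

```python
import numpy as np

def lobpcg_sketch(A, X, maxiter=200, tol=1e-10):
    """Unpreconditioned LOBPCG-style iteration for the smallest
    eigenpairs of a symmetric matrix A (dense here for simplicity)."""
    n, k = X.shape
    X, _ = np.linalg.qr(X)                  # orthonormal starting block
    P = None                                # previous search directions
    for it in range(maxiter):
        AX = A @ X                          # the SpMM kernel in a sparse solver
        theta = np.diag(X.T @ AX).copy()    # Rayleigh quotients (inner products)
        R = AX - X * theta                  # block residual
        if np.linalg.norm(R) < tol:
            break
        # Rayleigh-Ritz on the subspace spanned by [X, R, P]
        S = np.hstack([X, R]) if P is None else np.hstack([X, R, P])
        Q, _ = np.linalg.qr(S)
        w, V = np.linalg.eigh(Q.T @ A @ Q)
        Xnew = Q @ V[:, :k]                 # best k Ritz vectors
        P = Xnew - X @ (X.T @ Xnew)         # conjugate direction for next step
        X = Xnew
    theta = np.diag(X.T @ (A @ X)).copy()
    return np.sort(theta), X
```

In an out-of-core variant such as the one the paper benchmarks, the `A @ X` and `X.T @ AX` lines are exactly the operations that must be tiled so that only a slice of the matrices resides in GPU memory at a time.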
A greedy algorithm for computing eigenvalues of a symmetric matrix
We present a greedy algorithm for computing selected eigenpairs of a large sparse matrix that can exploit localization features of the eigenvector. When the eigenvector to be computed is localized, meaning only a small number of its components have large magnitudes, the proposed algorithm identifies the location of these components in a greedy manner, and obtains approximations to the desired eigenpairs by computing eigenpairs of a submatrix extracted from the corresponding rows and columns of the original matrix. Even when the eigenvector is not completely localized, the approximate eigenvectors obtained by the greedy algorithm can be used as good starting guesses to accelerate the convergence of an iterative eigensolver applied to the full matrix. We discuss a few possibilities for selecting important rows and columns of the matrix and techniques for constructing good initial guesses for an iterative eigensolver using the approximate eigenvectors returned from the greedy algorithm. We demonstrate the effectiveness of this approach with examples from nuclear quantum many-body calculations, many-body localization studies of quantum spin chains, and road network analysis.
Comment: 19 pages, 9 figures, 1 table
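To make the greedy idea concrete, here is a small NumPy sketch: grow an index set, solve the eigenproblem of the corresponding submatrix, embed its eigenvector into the full space, and use the largest residual entry to pick the next index. The seeding and selection rules below (smallest diagonal entries, residual-driven index choice) are illustrative assumptions, not necessarily the paper's exact rules.

```python
import numpy as np

def greedy_eigen_guess(A, k0=5, steps=20):
    """Greedy sketch for one approximate smallest eigenpair of a symmetric
    matrix A, intended as a starting guess for an iterative eigensolver."""
    n = A.shape[0]
    S = list(np.argsort(np.diag(A))[:k0])    # seed: smallest diagonal entries
    for _ in range(steps):
        sub = A[np.ix_(S, S)]                # submatrix on selected rows/cols
        w, V = np.linalg.eigh(sub)
        x = np.zeros(n)
        x[S] = V[:, 0]                       # embed submatrix eigenvector
        r = A @ x - w[0] * x                 # residual in the full space
        r[S] = 0.0                           # only consider new indices
        j = int(np.argmax(np.abs(r)))
        if abs(r[j]) < 1e-12:                # residual localized in S: done
            break
        S.append(j)
    return w[0], x
```

For a strongly localized eigenvector the index set stays small, so each submatrix solve is cheap compared to an eigensolve on the full matrix.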
Design Principles for Sparse Matrix Multiplication on the GPU
We implement two novel algorithms for sparse-matrix dense-matrix
multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the
popular compressed-sparse-row (CSR) format and thus do not require expensive
format conversion. While previous SpMM work concentrates on thread-level
parallelism, we additionally focus on latency hiding with instruction-level
parallelism and load-balancing. We show, both theoretically and experimentally,
that the proposed SpMM is a better fit for the GPU than previous approaches. We
identify a key memory access pattern that allows efficient access into both
input and output matrices that is crucial to getting excellent performance on
SpMM. By combining these two ingredients---(i) merge-based load-balancing and
(ii) row-major coalesced memory access---we demonstrate a 4.1x peak speedup and
a 31.7% geomean speedup over state-of-the-art SpMM implementations on
real-world datasets.
Comment: 16 pages, 7 figures, International European Conference on Parallel and Distributed Computing (Euro-Par) 201
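For readers unfamiliar with the kernel being optimized, the following serial NumPy loop shows the CSR traversal that SpMM performs: each nonzero A[i, j] scales row j of the dense matrix B and accumulates into row i of the output. This is only a reference implementation of the computation; the paper's contribution (merge-based load balancing and row-major coalesced access on the GPU) is not reproduced here.

```python
import numpy as np

def spmm_csr(indptr, indices, data, B):
    """Reference CSR SpMM: C = A @ B, with A given by CSR arrays
    (indptr, indices, data) and B a dense matrix."""
    n_rows = len(indptr) - 1
    C = np.zeros((n_rows, B.shape[1]))
    for i in range(n_rows):                      # one output row per sparse row
        for p in range(indptr[i], indptr[i + 1]):
            # accumulate A[i, indices[p]] * B[indices[p], :]
            C[i] += data[p] * B[indices[p]]
    return C
```

The inner accesses `B[indices[p]]` are the irregular reads whose memory-access pattern the paper identifies as crucial to GPU performance.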
ChASE: Chebyshev Accelerated Subspace iteration Eigensolver for sequences of Hermitian eigenvalue problems
Solving dense Hermitian eigenproblems arranged in a sequence with direct
solvers fails to take advantage of those spectral properties which are
pertinent to the entire sequence, and not just to the single problem. When such
features take the form of correlations between the eigenvectors of consecutive
problems, as is the case in many real-world applications, the potential benefit
of exploiting them can be substantial. We present ChASE, a modern algorithm and
library based on subspace iteration with polynomial acceleration. Novel to
ChASE is the computation of the spectral estimates that enter in the filter and
an optimization of the polynomial degree which further reduces the necessary
FLOPs. ChASE is written in C++ using modern software engineering concepts that favor simple integration into application codes and straightforward portability across heterogeneous platforms. When solving sequences of Hermitian eigenproblems for a portion of their extremal spectrum, ChASE greatly benefits from the sequence's spectral properties and outperforms direct solvers in many scenarios. The library ships with two distinct parallelization schemes, supports execution over distributed GPUs, and is easily extensible to other parallel computing architectures.
Comment: 33 pages. Submitted to ACM TOM
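The core mechanism ChASE builds on, subspace iteration with a Chebyshev polynomial filter, can be sketched in a few lines of NumPy. The sketch below omits what is novel in ChASE (spectral estimation of the filter bounds and optimization of the polynomial degree): here the interval [a, b], which must enclose the unwanted part of the spectrum, is supplied by hand, and the degree is fixed.

```python
import numpy as np

def chebyshev_filter(A, X, degree, a, b):
    """Apply the degree-m Chebyshev polynomial of A, mapped so that the
    interval [a, b] is damped; eigencomponents outside [a, b] are amplified."""
    e = (b - a) / 2.0                     # half-width of the damped interval
    c = (b + a) / 2.0                     # center of the damped interval
    Y = (A @ X - c * X) / e               # T_1 of the shifted/scaled operator
    for _ in range(2, degree + 1):        # three-term Chebyshev recurrence
        Ynew = 2.0 * (A @ Y - c * Y) / e - X
        X, Y = Y, Ynew
    return Y

def filtered_subspace_iteration(A, k, a, b, degree=8, iters=40, seed=0):
    """Subspace iteration accelerated by the Chebyshev filter, returning
    Ritz approximations to the k smallest eigenpairs of symmetric A."""
    rng = np.random.default_rng(seed)
    X = np.linalg.qr(rng.standard_normal((A.shape[0], k)))[0]
    for _ in range(iters):
        X = chebyshev_filter(A, X, degree, a, b)
        X, _ = np.linalg.qr(X)            # re-orthonormalize the block
    # Rayleigh-Ritz: extract eigenpairs from the filtered subspace
    w, V = np.linalg.eigh(X.T @ A @ X)
    return w, X @ V
```

Because the filter grows like cosh outside [a, b] while staying bounded inside, a few filtered iterations separate the wanted extremal eigenvectors from the rest; good estimates of a and b, which ChASE computes automatically, are what make the filter effective in practice.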
Split representation of adaptively compressed polarizability operator
The polarizability operator plays a central role in density functional perturbation theory and other perturbative treatments of first-principles electronic structure theories. The cost of computing the polarizability operator generally scales steeply with the number of electrons in the system. The recently developed adaptively compressed polarizability operator (ACP) formulation [L. Lin, Z. Xu and L. Ying, Multiscale Model. Simul. 2017] reduces this complexity, for the first time, in the context of phonon calculations with a large basis set, and demonstrates its effectiveness for model problems. In this paper, we improve the performance of the ACP formulation by splitting the polarizability into a near-singular component that is statically compressed and a smooth component that is adaptively compressed. The new split representation maintains the reduced complexity and accelerates nearly all components of the ACP formulation, including Chebyshev interpolation of energy levels, iterative solution of Sternheimer equations, and convergence of the Dyson equations. For the simulation of real materials, we discuss how to incorporate nonlocal pseudopotentials and finite-temperature effects. We demonstrate the effectiveness of our method using a one-dimensional model problem in insulating and metallic regimes, as well as its accuracy for real molecules and solids.
Comment: 32 pages, 8 figures. arXiv admin note: text overlap with arXiv:1605.0802