42 research outputs found

    Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices

    Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive-based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem into small kernels (e.g., SpMM, SpMV). Our LOBPCG GPU implementation achieves a 2.8×–4.3× speedup over an optimized CPU implementation when tested with four different input matrices. The evaluated configuration compared one Skylake CPU to one Skylake CPU and one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels (inner product and SpMM kernel) in LOBPCG and then evaluate performance when using two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves a 2.9× and 48.2× speedup over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU to GPU interconnects, respectively.
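
The tiling idea mentioned in this abstract can be illustrated with a minimal sketch of the inner-product kernel. It is a sequential, pure-Python illustration and not the authors' OpenMP/OpenACC code: the function name `tiled_inner_product` and the parameter `tile_rows` are hypothetical, and the row-tile loop only stands in for what would be a host-to-device transfer of one tile at a time on a real GPU.

```python
def tiled_inner_product(X, Y, tile_rows):
    """Compute G = X^T Y for tall matrices X (n x p) and Y (n x q),
    given as lists of rows, by accumulating one row tile at a time."""
    n = len(X)
    p, q = len(X[0]), len(Y[0])
    G = [[0.0] * q for _ in range(p)]
    for start in range(0, n, tile_rows):
        stop = min(start + tile_rows, n)
        # Assumption: on a GPU, only rows start..stop-1 of X and Y would be
        # resident in device memory while this tile is processed.
        for r in range(start, stop):
            for i in range(p):
                x_ri = X[r][i]
                for j in range(q):
                    G[i][j] += x_ri * Y[r][j]
    return G


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
G = tiled_inner_product(X, Y, tile_rows=2)
```

Because the small result matrix `G` is the only state carried between tiles, the tile size can be chosen to fit the available device memory without changing the result.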

    A greedy algorithm for computing eigenvalues of a symmetric matrix

    We present a greedy algorithm for computing selected eigenpairs of a large sparse matrix $H$ that can exploit localization features of the eigenvector. When the eigenvector to be computed is localized, meaning only a small number of its components have large magnitudes, the proposed algorithm identifies the location of these components in a greedy manner, and obtains approximations to the desired eigenpairs of $H$ by computing eigenpairs of a submatrix extracted from the corresponding rows and columns of $H$. Even when the eigenvector is not completely localized, the approximate eigenvectors obtained by the greedy algorithm can be used as good starting guesses to accelerate the convergence of an iterative eigensolver applied to $H$. We discuss a few possibilities for selecting important rows and columns of $H$ and techniques for constructing good initial guesses for an iterative eigensolver using the approximate eigenvectors returned from the greedy algorithm. We demonstrate the effectiveness of this approach with examples from nuclear quantum many-body calculations, many-body localization studies of quantum spin chains and road network analysis. Comment: 19 pages, 9 figures, 1 table

    Design Principles for Sparse Matrix Multiplication on the GPU

    We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion. While previous SpMM work concentrates on thread-level parallelism, we additionally focus on latency hiding with instruction-level parallelism and load-balancing. We show, both theoretically and experimentally, that the proposed SpMM is a better fit for the GPU than previous approaches. We identify a key memory access pattern that allows efficient access into both input and output matrices that is crucial to getting excellent performance on SpMM. By combining these two ingredients, (i) merge-based load-balancing and (ii) row-major coalesced memory access, we demonstrate a 4.1x peak speedup and a 31.7% geomean speedup over state-of-the-art SpMM implementations on real-world datasets. Comment: 16 pages, 7 figures, International European Conference on Parallel and Distributed Computing (Euro-Par) 201
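
For readers unfamiliar with the CSR layout this abstract refers to, the following is a minimal sequential reference for SpMM on CSR input. It is only a sketch of the data layout and of row-wise traversal of the dense operands; it does not implement the merge-based load balancing or the GPU memory coalescing described above. The function name `csr_spmm` and the argument names `row_ptr`, `col_idx` and `values` are illustrative choices, not taken from the paper.

```python
def csr_spmm(row_ptr, col_idx, values, B):
    """Multiply a CSR sparse matrix A (n x m) by a dense row-major matrix B (m x k).

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i;
    col_idx and values hold their column indices and values.
    """
    n_rows = len(row_ptr) - 1
    n_cols = len(B[0])
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a = values[k]
            b_row = B[col_idx[k]]  # one contiguous row of B per nonzero
            for j in range(n_cols):
                C[i][j] += a * b_row[j]
    return C


# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]] stored in CSR form
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 2, 0, 1]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
C = csr_spmm(row_ptr, col_idx, values, B)
```

Each nonzero touches a whole row of `B` and a whole row of `C`, which is the access pattern that a row-major layout keeps contiguous.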

    ChASE: Chebyshev Accelerated Subspace iteration Eigensolver for sequences of Hermitian eigenvalue problems

    Solving dense Hermitian eigenproblems arranged in a sequence with direct solvers fails to take advantage of those spectral properties which are pertinent to the entire sequence, and not just to the single problem. When such features take the form of correlations between the eigenvectors of consecutive problems, as is the case in many real-world applications, the potential benefit of exploiting them can be substantial. We present ChASE, a modern algorithm and library based on subspace iteration with polynomial acceleration. Novel to ChASE is the computation of the spectral estimates that enter in the filter and an optimization of the polynomial degree which further reduces the necessary FLOPs. ChASE is written in C++ using modern software engineering concepts which favor simple integration in application codes and straightforward portability over heterogeneous platforms. When solving sequences of Hermitian eigenproblems for a portion of their extremal spectrum, ChASE greatly benefits from the sequence's spectral properties and outperforms direct solvers in many scenarios. The library ships with two distinct parallelization schemes, supports execution over distributed GPUs, and it is easily extensible to other parallel computing architectures. Comment: 33 pages. Submitted to ACM TOM
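
The polynomial acceleration mentioned in this abstract can be sketched with the standard Chebyshev three-term recurrence. This is a textbook illustration on a small real symmetric matrix, not the ChASE C++ API: the names `chebyshev_filter` and `matvec` are hypothetical, and the interval `[a, b]` to be damped is assumed to be given, whereas the abstract states that ChASE computes such spectral estimates itself.

```python
def matvec(A, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def chebyshev_filter(A, x, degree, a, b):
    """Apply T_degree((A - c I) / e) to x, where c and e are the center and
    half-width of [a, b]. Components along eigenvalues inside [a, b] stay
    bounded by 1 in magnitude, while those outside are amplified."""
    c = (a + b) / 2.0
    e = (b - a) / 2.0

    def scaled(v):
        Av = matvec(A, v)
        return [(Av[i] - c * v[i]) / e for i in range(len(v))]

    prev = list(x)  # T_0 applied to x
    if degree == 0:
        return prev
    curr = scaled(prev)  # T_1 applied to x
    for _ in range(2, degree + 1):
        s = scaled(curr)
        nxt = [2.0 * s[i] - prev[i] for i in range(len(x))]  # T_{k+1} = 2 t T_k - T_{k-1}
        prev, curr = curr, nxt
    return curr


# Diagonal test matrix with eigenvalues 0, 1, 2, 3; damp the interval [1, 3].
A = [[0.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 0.0],
     [0.0, 0.0, 0.0, 3.0]]
y = chebyshev_filter(A, [1.0, 1.0, 1.0, 1.0], degree=4, a=1.0, b=3.0)
```

In this example the component belonging to the eigenvalue 0, which lies outside the damped interval, dominates the filtered vector, which is the effect a filtered subspace iteration relies on.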

    Split representation of adaptively compressed polarizability operator

    The polarizability operator plays a central role in density functional perturbation theory and other perturbative treatments of first-principles electronic structure theories. The cost of computing the polarizability operator generally scales as $\mathcal{O}(N_e^4)$, where $N_e$ is the number of electrons in the system. The recently developed adaptively compressed polarizability operator (ACP) formulation [L. Lin, Z. Xu and L. Ying, Multiscale Model. Simul. 2017] reduces such complexity to $\mathcal{O}(N_e^3)$ in the context of phonon calculations with a large basis set for the first time, and demonstrates its effectiveness for model problems. In this paper, we improve the performance of the ACP formulation by splitting the polarizability into a near singular component that is statically compressed, and a smooth component that is adaptively compressed. The new split representation maintains the $\mathcal{O}(N_e^3)$ complexity, and accelerates nearly all components of the ACP formulation, including Chebyshev interpolation of energy levels, iterative solution of Sternheimer equations, and convergence of the Dyson equations. For simulation of real materials, we discuss how to incorporate nonlocal pseudopotentials and finite temperature effects. We demonstrate the effectiveness of our method using a one-dimensional model problem in insulating and metallic regimes, as well as its accuracy for real molecules and solids. Comment: 32 pages, 8 figures. arXiv admin note: text overlap with arXiv:1605.0802