8 research outputs found
Design Principles for Sparse Matrix Multiplication on the GPU
We implement two novel algorithms for sparse-matrix dense-matrix
multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the
popular compressed-sparse-row (CSR) format and thus do not require expensive
format conversion. While previous SpMM work concentrates on thread-level
parallelism, we additionally focus on latency hiding with instruction-level
parallelism and load-balancing. We show, both theoretically and experimentally,
that the proposed SpMM is a better fit for the GPU than previous approaches. We
identify a key memory access pattern that allows efficient access into both
input and output matrices that is crucial to getting excellent performance on
SpMM. By combining these two ingredients---(i) merge-based load-balancing and
(ii) row-major coalesced memory access---we demonstrate a 4.1x peak speedup and
a 31.7% geomean speedup over state-of-the-art SpMM implementations on
real-world datasets.Comment: 16 pages, 7 figures, International European Conference on Parallel
and Distributed Computing (Euro-Par) 201
Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices
Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem into small kernels (e.g., SpMM, SpMV). Our LOPBCG GPU implementation achieves a 2.8–4.3 speedup over an optimized CPU implementation when tested with four different input matrices. The evaluated configuration compared one Skylake CPU to one Skylake CPU and one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels (inner product and SpMM kernel) in LOBPCG and then evaluate performance when using two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves a 2.9 and 48.2 speedup over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU to GPU interconnects, respectively
Steklov Spectral Geometry for Extrinsic Shape Analysis
We propose using the Dirichlet-to-Neumann operator as an extrinsic
alternative to the Laplacian for spectral geometry processing and shape
analysis. Intrinsic approaches, usually based on the Laplace-Beltrami operator,
cannot capture the spatial embedding of a shape up to rigid motion, and many
previous extrinsic methods lack theoretical justification. Instead, we consider
the Steklov eigenvalue problem, computing the spectrum of the
Dirichlet-to-Neumann operator of a surface bounding a volume. A remarkable
property of this operator is that it completely encodes volumetric geometry. We
use the boundary element method (BEM) to discretize the operator, accelerated
by hierarchical numerical schemes and preconditioning; this pipeline allows us
to solve eigenvalue and linear problems on large-scale meshes despite the
density of the Dirichlet-to-Neumann discretization. We further demonstrate that
our operators naturally fit into existing frameworks for geometry processing,
making a shift from intrinsic to extrinsic geometry as simple as substituting
the Laplace-Beltrami operator with the Dirichlet-to-Neumann operator.Comment: Additional experiments adde
Efficient Tiled Sparse Matrix Multiplication through Matrix Signatures
International audienceTiling is a key technique to reduce data movement in matrix computations. While tiling is well understood and widely used for dense matrix/tensor computations, effective tiling of sparse matrix computations remains a challenging problem. This paper proposes a novel method to efficiently summarize the impact of the sparsity structure of a matrix on achievable data reuse as a one-dimensional signature, which is then used to build an analytical cost model for tile size optimization for sparse matrix computations. The proposed model-driven approach to sparse tiling is evaluated on two key sparse matrix kernels: Sparse Matrix-Dense Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM). Experimental results demonstrate that model-based tiled SpMM and SDDMM achieve high performance relative to the current state-of-the-art
Algorithms and data structures for matrix-free finite element operators with MPI-parallel sparse multi-vectors
Traditional solution approaches for problems in quantum mechanics scale as
, where is the number of electrons. Various methods have
been proposed to address this issue and obtain linear scaling .
One promising formulation is the direct minimization of energy. Such methods
take advantage of physical localization of the solution, namely that the
solution can be sought in terms of non-orthogonal orbitals with local support.
In this work a numerically efficient implementation of sparse parallel vectors
within the open-source finite element library deal.II is proposed. The main
algorithmic ingredient is the matrix-free evaluation of the Hamiltonian
operator by cell-wise quadrature. Based on an a-priori chosen support for each
vector we develop algorithms and data structures to perform (i) matrix-free
sparse matrix multivector products (SpMM), (ii) the projection of an operator
onto a sparse sub-space (inner products), and (iii) post-multiplication of a
sparse multivector with a square matrix. The node-level performance is analyzed
using a roofline model. Our matrix-free implementation of finite element
operators with sparse multivectors achieves the performance of 157 GFlop/s on
Intel Cascade Lake architecture. Strong and weak scaling results are reported
for a typical benchmark problem using quadratic and quartic finite element
bases.Comment: 29 pages, 12 figure