8 research outputs found

Design Principles for Sparse Matrix Multiplication on the GPU

Author: A Tiskin
AE Sarıyüce
AV Knyazev
F Vazquez
G Greiner
G Ortega
Ramakrishnan Kannan
S Dalton
S Filippone
TA Davis
V Simoncini
Z Bai
Publication date: 01/01/2018

We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion. While previous SpMM work concentrates on thread-level parallelism, we additionally focus on latency hiding with instruction-level parallelism and load-balancing. We show, both theoretically and experimentally, that the proposed SpMM is a better fit for the GPU than previous approaches. We identify a key memory access pattern that allows efficient access into both input and output matrices that is crucial to getting excellent performance on SpMM. By combining these two ingredients---(i) merge-based load-balancing and (ii) row-major coalesced memory access---we demonstrate a 4.1x peak speedup and a 31.7% geomean speedup over state-of-the-art SpMM implementations on real-world datasets.

Comment: 16 pages, 7 figures, International European Conference on Parallel and Distributed Computing (Euro-Par) 201
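For readers unfamiliar with the operation named in this abstract, the following is a minimal sequential sketch of what SpMM computes when the sparse matrix is stored in CSR and the dense matrices are row-major. It is a plain reference loop, not the merge-based GPU algorithm the paper describes; the function name `spmm_csr` and the argument names `row_ptr`, `col_idx`, and `vals` are illustrative assumptions.

```python
def spmm_csr(row_ptr, col_idx, vals, B):
    """Reference C = A @ B, with A in CSR form and B a dense row-major list of rows.

    row_ptr, col_idx, vals are the usual three CSR arrays (names are assumptions).
    """
    n_rows = len(row_ptr) - 1
    n_cols_b = len(B[0])
    C = [[0.0] * n_cols_b for _ in range(n_rows)]
    for i in range(n_rows):
        c_row = C[i]
        # Stored entries of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a = vals[k]
            b_row = B[col_idx[k]]  # one contiguous row of B per nonzero of A
            for j in range(n_cols_b):
                c_row[j] += a * b_row[j]
    return C


# A = [[1, 0, 2],
#      [0, 0, 0],
#      [0, 3, 0]]
row_ptr = [0, 2, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
B = [[1, 2], [3, 4], [5, 6]]
print(spmm_csr(row_ptr, col_idx, vals, B))  # [[11.0, 14.0], [0.0, 0.0], [9.0, 12.0]]
```

The innermost loop walks along one row of B and one row of C, which is the sequential analogue of the row-major access pattern the abstract refers to. The empty middle row shows why row lengths can be uneven, which is the load-balancing issue the abstract mentions.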

arXiv.org e-Print Archive

Crossref

eScholarship - University of California

Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices

Author: A Dziekonski
AV Knyazev
C Yang
G Ortega
JW Choi
M Knap
M Shao
P Maris
X Yang
Y Wang
Publication venue: eScholarship, University of California
Publication date: 01/01/2020

Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem into small kernels (e.g., SpMM, SpMV). Our LOBPCG GPU implementation achieves a 2.8×–4.3× speedup over an optimized CPU implementation when tested with four different input matrices. The evaluated configuration compared one Skylake CPU to one Skylake CPU and one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels (inner product and SpMM kernel) in LOBPCG and then evaluate performance when using two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves a 2.9× and 48.2× speedup over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU to GPU interconnects, respectively.

Crossref

eScholarship - University of California

Steklov Spectral Geometry for Extrinsic Shape Analysis

Author: Ben-Chen Mirela
Polterovich Iosif
Solomon Justin
Wang Yu
Publication date: 24/04/2018

We propose using the Dirichlet-to-Neumann operator as an extrinsic alternative to the Laplacian for spectral geometry processing and shape analysis. Intrinsic approaches, usually based on the Laplace-Beltrami operator, cannot capture the spatial embedding of a shape up to rigid motion, and many previous extrinsic methods lack theoretical justification. Instead, we consider the Steklov eigenvalue problem, computing the spectrum of the Dirichlet-to-Neumann operator of a surface bounding a volume. A remarkable property of this operator is that it completely encodes volumetric geometry. We use the boundary element method (BEM) to discretize the operator, accelerated by hierarchical numerical schemes and preconditioning; this pipeline allows us to solve eigenvalue and linear problems on large-scale meshes despite the density of the Dirichlet-to-Neumann discretization. We further demonstrate that our operators naturally fit into existing frameworks for geometry processing, making a shift from intrinsic to extrinsic geometry as simple as substituting the Laplace-Beltrami operator with the Dirichlet-to-Neumann operator.

Comment: Additional experiments added

arXiv.org e-Print Archive

DSpace@MIT

ZENODO

Efficient Tiled Sparse Matrix Multiplication through Matrix Signatures

Author: Emre Süreyya
Rastello Fabrice
Sadayappan Ponnuswamy
Sukumaran-Rajam Aravind
Publication venue: HAL CCSD
Publication date: 09/11/2020

Tiling is a key technique to reduce data movement in matrix computations. While tiling is well understood and widely used for dense matrix/tensor computations, effective tiling of sparse matrix computations remains a challenging problem. This paper proposes a novel method to efficiently summarize the impact of the sparsity structure of a matrix on achievable data reuse as a one-dimensional signature, which is then used to build an analytical cost model for tile size optimization for sparse matrix computations. The proposed model-driven approach to sparse tiling is evaluated on two key sparse matrix kernels: Sparse Matrix-Dense Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM). Experimental results demonstrate that model-based tiled SpMM and SDDMM achieve high performance relative to the current state-of-the-art.
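The abstract names SDDMM without defining it. As a rough illustration, the sketch below follows one common convention, in which the output keeps the sparsity pattern of a sparse matrix S and each stored entry (i, j) is scaled by the dot product of row i of one dense matrix and row j of another. This convention, the function name `sddmm_csr`, and the argument names are assumptions, and the code is an untiled reference loop rather than the paper's model-driven tiled kernel.

```python
def sddmm_csr(row_ptr, col_idx, vals, A, B):
    """Reference SDDMM: out[k] = vals[k] * dot(A[i], B[j]) for each stored (i, j) of S.

    S is given by its CSR arrays; A is m x r and B is n x r, both row-major,
    so the dense product being sampled is A @ B^T (an assumed convention).
    """
    out = [0.0] * len(vals)
    for i in range(len(row_ptr) - 1):
        a_row = A[i]
        for k in range(row_ptr[i], row_ptr[i + 1]):
            b_row = B[col_idx[k]]
            # Only positions where S stores an entry are ever computed.
            out[k] = vals[k] * sum(x * y for x, y in zip(a_row, b_row))
    return out


# S has stored entries (0,0)=1, (0,2)=2, (2,1)=3.
row_ptr = [0, 2, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 0], [0, 1], [1, 1]]
print(sddmm_csr(row_ptr, col_idx, vals, A, B))  # [1.0, 6.0, 18.0]
```

The result reuses `row_ptr` and `col_idx` of S unchanged, so only the value array is returned.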

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

HAL-Rennes 1

Algorithms and data structures for matrix-free finite element operators with MPI-parallel sparse multi-vectors

Author: Davydov Denis
Kronbichler Martin
Publication date: 01/07/2019

Traditional solution approaches for problems in quantum mechanics scale as $\mathcal{O}(M^3)$, where $M$ is the number of electrons. Various methods have been proposed to address this issue and obtain linear scaling $\mathcal{O}(M)$. One promising formulation is the direct minimization of energy. Such methods take advantage of physical localization of the solution, namely that the solution can be sought in terms of non-orthogonal orbitals with local support. In this work a numerically efficient implementation of sparse parallel vectors within the open-source finite element library deal.II is proposed. The main algorithmic ingredient is the matrix-free evaluation of the Hamiltonian operator by cell-wise quadrature. Based on an a-priori chosen support for each vector we develop algorithms and data structures to perform (i) matrix-free sparse matrix multivector products (SpMM), (ii) the projection of an operator onto a sparse sub-space (inner products), and (iii) post-multiplication of a sparse multivector with a square matrix. The node-level performance is analyzed using a roofline model. Our matrix-free implementation of finite element operators with sparse multivectors achieves the performance of 157 GFlop/s on Intel Cascade Lake architecture. Strong and weak scaling results are reported for a typical benchmark problem using quadratic and quartic finite element bases.

Comment: 29 pages, 12 figures
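The roofline model mentioned in this abstract is commonly stated as a simple bound on attainable performance. The form below is the general textbook statement, not a formula taken from the paper, and the symbols $P_{\text{peak}}$ (peak floating-point rate), $B_{\text{mem}}$ (sustained memory bandwidth), and $I$ (arithmetic intensity, in flops per byte moved) are assumed notation.

```latex
P_{\text{attainable}} \;=\; \min\bigl(P_{\text{peak}},\; I \cdot B_{\text{mem}}\bigr),
\qquad
I \;=\; \frac{\text{floating-point operations}}{\text{bytes transferred to and from memory}}
```

A kernel with low $I$ sits on the bandwidth-limited slope $I \cdot B_{\text{mem}}$, while a kernel with high $I$ is capped by $P_{\text{peak}}$.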

arXiv.org e-Print Archive

OPUS Augsburg

Equipping Sparse Solvers for Exascale

Author: Alappat Christie Louis
Alvermann Andreas
Basermann Achim
Fehske Holger
Futamura Yasunori
Galgon Martin
Huber Sarah
Imakura Akira
Kawai Masatoshi
Kreutzer Moritz
Lang Bruno
Nakajima Kengo
Röhrig-Zöllner Melven
Sakurai Tetsuya
Shahzad Faisal
Thies Jonas
Wellein Gerhard
Publication venue: Springer Fachmedien Wiesbaden GmbH
Publication date: 01/01/2020

Institute of Transport Research:Publications