4,630 research outputs found
Compression and load balancing for efficient sparse matrix-vector product on multicore processors and graphics processing units
[EN] We contribute to the optimization of the sparse matrix-vector product by introducing a variant of the coordinate sparse matrix format that balances the workload distribution and compresses both the indexing arrays and the numerical information. Our approach is multi-platform, in the sense that the realizations for (general-purpose) multicore processors as well as graphics accelerators (GPUs) are built upon common principles, but differ in the implementation details, which are adapted to avoid thread divergence in the GPU case or maximize compression element-wise (i.e., for each matrix entry) for multicore architectures. Our evaluation on the two last generations of NVIDIA GPUs as well as Intel and AMD processors demonstrate the benefits of the new kernels when compared with the optimized implementations of the sparse matrix-vector product in NVIDIA's cuSPARSE and Intel's MKL, respectively.J. I. Aliaga, E. S. Quintana-OrtĂ, and A. E. Tomás were supported by TIN2017-82972-R of the Spanish MINECO. H. Anzt and T. GrĂĽtzmacher were supported by the Impuls und Vernetzungsfond of the Helmholtz Association under grant VH-NG-1241 and by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. The authors would like to thank the Steinbuch Centre for Computing (SCC) of the Karlsruhe Institute of Technology for providing access to an NVIDIA A100 GPU.Aliaga, JI.; Anzt, H.; GrĂĽtzmacher, T.; Quintana-OrtĂ, ES.; Tomás DomĂnguez, AE. (2022). Compression and load balancing for efficient sparse matrix-vector product on multicore processors and graphics processing units. Concurrency and Computation: Practice and Experience. 34(14):1-13. https://doi.org/10.1002/cpe.6515113341
Speculative Segmented Sum for Sparse Matrix-Vector Multiplication on Heterogeneous Processors
Sparse matrix-vector multiplication (SpMV) is a central building block for
scientific software and graph applications. Recently, heterogeneous processors
composed of different types of cores attracted much attention because of their
flexible core configuration and high energy efficiency. In this paper, we
propose a compressed sparse row (CSR) format based SpMV algorithm utilizing
both types of cores in a CPU-GPU heterogeneous processor. We first
speculatively execute segmented sum operations on the GPU part of a
heterogeneous processor and generate a possibly incorrect results. Then the CPU
part of the same chip is triggered to re-arrange the predicted partial sums for
a correct resulting vector. On three heterogeneous processors from Intel, AMD
and nVidia, using 20 sparse matrices as a benchmark suite, the experimental
results show that our method obtains significant performance improvement over
the best existing CSR-based SpMV algorithms. The source code of this work is
downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSRComment: 22 pages, 8 figures, Published at Parallel Computing (PARCO
Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs
Many problems in geophysical and atmospheric modelling require the fast
solution of elliptic partial differential equations (PDEs) in "flat" three
dimensional geometries. In particular, an anisotropic elliptic PDE for the
pressure correction has to be solved at every time step in the dynamical core
of many numerical weather prediction models, and equations of a very similar
structure arise in global ocean models, subsurface flow simulations and gas and
oil reservoir modelling. The elliptic solve is often the bottleneck of the
forecast, and an algorithmically optimal method has to be used and implemented
efficiently. Graphics Processing Units have been shown to be highly efficient
for a wide range of applications in scientific computing, and recently
iterative solvers have been parallelised on these architectures. We describe
the GPU implementation and optimisation of a Preconditioned Conjugate Gradient
(PCG) algorithm for the solution of a three dimensional anisotropic elliptic
PDE for the pressure correction in NWP. Our implementation exploits the strong
vertical anisotropy of the elliptic operator in the construction of a suitable
preconditioner. As the algorithm is memory bound, performance can be improved
significantly by reducing the amount of global memory access. We achieve this
by using a matrix-free implementation which does not require explicit storage
of the matrix and instead recalculates the local stencil. Global memory access
can also be reduced by rewriting the algorithm using loop fusion and we show
that this further reduces the runtime on the GPU. We demonstrate the
performance of our matrix-free GPU code by comparing it to a sequential CPU
implementation and to a matrix-explicit GPU code which uses existing libraries.
The absolute performance of the algorithm for different problem sizes is
quantified in terms of floating point throughput and global memory bandwidth.Comment: 18 pages, 7 figure
CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication
Sparse matrix-vector multiplication (SpMV) is a fundamental building block
for numerous applications. In this paper, we propose CSR5 (Compressed Sparse
Row 5), a new storage format, which offers high-throughput SpMV on various
platforms including CPUs, GPUs and Xeon Phi. First, the CSR5 format is
insensitive to the sparsity structure of the input matrix. Thus the single
format can support an SpMV algorithm that is efficient both for regular
matrices and for irregular matrices. Furthermore, we show that the overhead of
the format conversion from the CSR to the CSR5 can be as low as the cost of a
few SpMV operations. We compare the CSR5-based SpMV algorithm with 11
state-of-the-art formats and algorithms on four mainstream processors using 14
regular and 10 irregular matrices as a benchmark suite. For the 14 regular
matrices in the suite, we achieve comparable or better performance over the
previous work. For the 10 irregular matrices, the CSR5 obtains average
performance improvement of 17.6\%, 28.5\%, 173.0\% and 293.3\% (up to 213.3\%,
153.6\%, 405.1\% and 943.3\%) over the best existing work on dual-socket Intel
CPUs, an nVidia GPU, an AMD GPU and an Intel Xeon Phi, respectively. For
real-world applications such as a solver with only tens of iterations, the CSR5
format can be more practical because of its low-overhead for format conversion.
The source code of this work is downloadable at
https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR5Comment: 12 pages, 10 figures, In Proceedings of the 29th ACM International
Conference on Supercomputing (ICS '15
Solving Lattice QCD systems of equations using mixed precision solvers on GPUs
Modern graphics hardware is designed for highly parallel numerical tasks and
promises significant cost and performance benefits for many scientific
applications. One such application is lattice quantum chromodyamics (lattice
QCD), where the main computational challenge is to efficiently solve the
discretized Dirac equation in the presence of an SU(3) gauge field. Using
NVIDIA's CUDA platform we have implemented a Wilson-Dirac sparse matrix-vector
product that performs at up to 40 Gflops, 135 Gflops and 212 Gflops for double,
single and half precision respectively on NVIDIA's GeForce GTX 280 GPU. We have
developed a new mixed precision approach for Krylov solvers using reliable
updates which allows for full double precision accuracy while using only single
or half precision arithmetic for the bulk of the computation. The resulting
BiCGstab and CG solvers run in excess of 100 Gflops and, in terms of iterations
until convergence, perform better than the usual defect-correction approach for
mixed precision.Comment: 30 pages, 7 figure
- …