Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments
Generalized sparse matrix-matrix multiplication (or SpGEMM) is a key
primitive for many high performance graph algorithms as well as for some linear
solvers, such as algebraic multigrid. Here we show that SpGEMM also yields
efficient algorithms for general sparse-matrix indexing in distributed memory,
provided that the underlying SpGEMM implementation is sufficiently flexible and
scalable. We demonstrate that our parallel SpGEMM methods, which use
two-dimensional block data distributions with serial hypersparse kernels, are
indeed highly flexible, scalable, and memory-efficient in the general case.
These algorithms are the first to yield increasing speedup on an unbounded
number of processors; our experiments show scaling up to thousands of
processors in a variety of test scenarios.
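To make the indexing-by-SpGEMM idea concrete: extracting a submatrix A(I, J) can be written as two sparse products with Boolean selection matrices, R A Q. The sketch below uses SciPy's serial sparse types purely to illustrate the algebra; the distributed-memory SpGEMM described in the paper is what makes the same formulation scale.

```python
import numpy as np
import scipy.sparse as sp

def spref(A, I, J):
    """Extract A[I, J] as two SpGEMM calls: R @ A @ Q.

    R is |I| x m with R[k, I[k]] = 1 (row selection),
    Q is n x |J| with Q[J[k], k] = 1 (column selection).
    """
    m, n = A.shape
    R = sp.csr_matrix((np.ones(len(I)), (np.arange(len(I)), I)),
                      shape=(len(I), m))
    Q = sp.csr_matrix((np.ones(len(J)), (J, np.arange(len(J)))),
                      shape=(n, len(J)))
    return R @ A @ Q

A = sp.random(8, 8, density=0.3, format="csr", random_state=0)
B = spref(A, [1, 3, 5], [0, 2])
assert np.allclose(B.toarray(), A.toarray()[[1, 3, 5]][:, [0, 2]])
```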
Design of efficient block-sparse data structures and associated tensor multiplications for applications in Cluster in Molecule electronic structure calculations
The cluster-in-molecule (CiM) approach is one of the most popular methods in electronic structure calculations for medium-sized to large molecules and systems. The Nooijen group is currently developing a new CiM approach using the range-separated Coulomb potential developed by M. Lecours; however, progress has fallen short of expectations because of performance bottlenecks in handling the two-electron three-index integrals. To address this problem, we have implemented two block-sparse data structures, named the Tile and the Tile Master, which provide sparse matrix storage formats and efficient matrix multiplication algorithms that benefit from the high sparsity of the data. The Tile structure focuses on efficient sparse matrix-dense matrix multiplication (SpMM), while the Tile Master targets the three-index integral problem using the block-sparse structure together with a compressed three-dimensional array format. Both the Tile structure and the Tile Master are highly efficient and support multi-threading on highly parallel hardware, including the Intel Xeon Phi Knights Landing (KNL) processors and NVIDIA GPUs via CUDA. Benchmarks indicate that the Tile structure is on average 2 to 5 times faster than the NumPy dot routine, and up to 30 times faster than our previous compressed sparse row (CSR) multiplication routine. The Tile Master, in turn, can compress three-index quantities by almost 95% in storage space using the block-sparse structure, and can handle the calculations efficiently by switching between dense and sparse routines depending on the atomic orbital (AO) geometries. In summary, the Tile structure and the Tile Master provide useful tools for the complex three-index integral problem in our CiM approach, as well as for other scientific calculations involving sparse matrices.
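The abstract does not describe the Tile layout in detail, but the essential picture of any block-sparse format is a collection of dense blocks keyed by block coordinates, with SpMM looping only over stored blocks. A minimal sketch of that idea (the class and method names here are illustrative, not the actual Tile API):

```python
import numpy as np

class BlockSparse:
    """Toy block-sparse matrix: only nonzero b x b blocks are stored."""
    def __init__(self, nblock_rows, nblock_cols, b):
        self.shape = (nblock_rows, nblock_cols)
        self.b = b
        self.blocks = {}  # (block_row, block_col) -> dense b x b array

    def set_block(self, i, j, block):
        self.blocks[(i, j)] = np.asarray(block, dtype=float)

    def spmm(self, X):
        """Multiply by a dense matrix X; only stored blocks contribute."""
        b = self.b
        Y = np.zeros((self.shape[0] * b, X.shape[1]))
        for (i, j), blk in self.blocks.items():
            Y[i*b:(i+1)*b] += blk @ X[j*b:(j+1)*b]
        return Y

A = BlockSparse(3, 3, 2)
A.set_block(0, 0, np.eye(2))
A.set_block(2, 1, np.full((2, 2), 0.5))
X = np.arange(12.0).reshape(6, 2)
print(A.spmm(X))
```

Because absent blocks cost neither storage nor flops, the speedup over a dense or scalar-CSR routine grows with the block sparsity of the data.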
Benchmarking mixed-mode PETSc performance on high-performance architectures
The trend towards highly parallel multi-processing is ubiquitous in all modern computer architectures, ranging from handheld devices to large-scale HPC systems; yet many applications struggle to fully utilise the multiple levels of parallelism exposed by modern high-performance platforms. In order to realise the full potential of recent hardware advances, a mixed-mode approach combining shared-memory programming techniques with inter-node message passing can be adopted, providing high levels of parallelism with minimal overhead. For scientific applications this means that not only the simulation code itself but the whole software stack needs to evolve. In this paper, we evaluate the mixed-mode performance of PETSc, a widely used scientific library for the scalable solution of partial differential equations. We describe the addition of OpenMP threaded functionality to the library, focusing on sparse matrix-vector multiplication. We highlight key challenges in achieving good parallel performance, such as explicit communication overlap using task-based parallelism, and show how to further improve performance by explicitly load balancing threads within MPI processes. Using a set of matrices extracted from Fluidity, a CFD application code which uses the library as its linear solver engine, we then benchmark the parallel performance of mixed-mode PETSc across multiple nodes on several modern HPC architectures. We evaluate the parallel scalability on Uniform Memory Access (UMA) systems, such as the Fujitsu PRIMEHPC FX10 and IBM BlueGene/Q, as well as on a Non-Uniform Memory Access (NUMA) Cray XE6 platform. A detailed comparison highlights the characteristics of each architecture before demonstrating efficient strong scalability of sparse matrix-vector multiplication, with significant speedups over the pure-MPI mode.
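One of the techniques mentioned, explicit load balancing of threads within an MPI process, can be illustrated concretely: instead of giving each thread an equal number of CSR rows, give it a row range holding roughly an equal number of nonzeros. A minimal sketch of that partitioning (not PETSc's actual code):

```python
import numpy as np

def balanced_row_ranges(row_ptr, nthreads):
    """Split CSR rows into nthreads contiguous ranges with ~equal nonzeros.

    row_ptr is the usual CSR row-pointer array of length nrows + 1,
    so row_ptr[-1] is the total nonzero count.
    """
    nnz = int(row_ptr[-1])
    targets = [(t * nnz) // nthreads for t in range(nthreads + 1)]
    # For each target nonzero count, find the nearest row boundary at or past it.
    bounds = np.searchsorted(row_ptr, targets, side="left")
    return [(int(bounds[t]), int(bounds[t + 1])) for t in range(nthreads)]

# A matrix whose first rows are dense and last rows nearly empty:
row_ptr = np.array([0, 40, 80, 120, 122, 124, 126, 128])
print(balanced_row_ranges(row_ptr, 4))  # -> [(0, 1), (1, 2), (2, 3), (3, 7)]
# An even split by row count would give the first thread far more work;
# the nnz-based split gets each thread as close to 32 nonzeros as whole
# rows allow.
```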
Increasing the Efficiency of Sparse Matrix-Matrix Multiplication with a 2.5D Algorithm and One-Sided MPI
Matrix-matrix multiplication is a basic operation in linear algebra and an
essential building block for a wide range of algorithms in various scientific
fields. Theory and implementation for the dense, square matrix case are
well-developed. If matrices are sparse, with application-specific sparsity
patterns, the optimal implementation remains an open question. Here, we explore
the performance of communication-reducing 2.5D algorithms and one-sided MPI
communication in the context of linear scaling electronic structure theory. In
particular, we extend the DBCSR sparse matrix library, which is the basic
building block for linear scaling electronic structure theory and low scaling
correlated methods in CP2K. The library is specifically designed to efficiently
perform block-sparse matrix-matrix multiplication of matrices with a relatively
large occupation. Here, we compare the performance of the original
implementation based on Cannon's algorithm and MPI point-to-point
communication, with an implementation based on MPI one-sided communications
(RMA), in both a 2D and a 2.5D approach. The 2.5D approach trades memory and
auxiliary operations for reduced communication, which can lead to a speedup if
communication is dominant. The 2.5D algorithm is somewhat easier to implement
with one-sided communications. A detailed description of the implementation is
provided, also for non-ideal processor topologies, since this is important for
actual applications. Because the precise sparsity pattern, and even the actual
matrix data, determine the effective fill-in upon multiplication, the tests are
performed within the CP2K package with application benchmarks. The results show
a substantial performance boost for the RMA-based 2.5D algorithm, up to 1.80x,
which is observed to increase with the number of processes involved in the
parallelization. (In Proceedings of PASC '17, Lugano, Switzerland, June 26-28,
2017.)
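For intuition about the 2.5D layout: with c replicated layers, P processes form a q x q x c grid with q = sqrt(P/c); each layer computes a disjoint 1/c share of the inner-dimension summation, and a final reduction across layers combines the partial results. The sketch below works out one possible rank-to-coordinate mapping and per-layer split; the mapping convention is an assumption for illustration and does not reproduce DBCSR's communication code.

```python
import math

def grid_2_5d(P, c):
    """Map P ranks onto a q x q x c process grid, q = sqrt(P / c)."""
    q = math.isqrt(P // c)
    assert q * q * c == P, "P must equal q*q*c for a square 2.5D grid"
    coords = {rank: ((rank // (q * c)) % q, (rank // c) % q, rank % c)
              for rank in range(P)}
    return q, coords

def layer_k_range(k, c, nblocks_k):
    """Each replication layer sums a disjoint slice of the inner dimension."""
    return k * nblocks_k // c, (k + 1) * nblocks_k // c

q, coords = grid_2_5d(P=32, c=2)               # a 4 x 4 x 2 grid
for rank in (0, 1, 8, 31):
    i, j, k = coords[rank]
    print(rank, (i, j, k), layer_k_range(k, 2, nblocks_k=16))
# After every layer finishes its partial products, a reduction across the
# c layers combines the results; the extra memory for c replicas buys
# roughly a sqrt(c) reduction in communication volume.
```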
Exact Sparse Matrix-Vector Multiplication on GPUs and Multicore Architectures
We propose different implementations of sparse matrix-dense vector
multiplication (SpMV) for finite fields and rings Z/mZ. We take advantage of
graphics processing units (GPUs) and multi-core architectures. Our aim is to
improve the speed of SpMV in the LinBox library, and hence the speed of its
black-box algorithms. In addition, we combine this with a new parallelization
of the sigma-basis algorithm in a parallel block Wiedemann rank implementation
over finite fields.
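The kernel at the heart of this work, a CSR matrix-vector product with all arithmetic reduced modulo m, is simple to state serially; the contribution lies in mapping it efficiently onto GPUs and multicore CPUs. A minimal serial reference (not the LinBox implementation):

```python
import numpy as np

def spmv_mod(row_ptr, col_idx, values, x, m):
    """y = A @ x over Z/mZ, with A in CSR form and entries already in [0, m)."""
    nrows = len(row_ptr) - 1
    y = np.zeros(nrows, dtype=np.int64)
    for i in range(nrows):
        acc = 0
        for p in range(row_ptr[i], row_ptr[i + 1]):
            # Reduce eagerly to keep the accumulator small; fast exact
            # kernels typically delay the % m over several terms instead.
            acc = (acc + values[p] * x[col_idx[p]]) % m
        y[i] = acc
    return y

# The 2x3 matrix [[1, 0, 2], [0, 3, 0]] over Z/5Z applied to x = (1, 2, 3):
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
values  = [1, 2, 3]
print(spmv_mod(row_ptr, col_idx, values, np.array([1, 2, 3]), 5))  # -> [2 1]
```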