5,597 research outputs found
Increasing the Efficiency of Sparse Matrix-Matrix Multiplication with a 2.5D Algorithm and One-Sided MPI
Matrix-matrix multiplication is a basic operation in linear algebra and an
essential building block for a wide range of algorithms in various scientific
fields. Theory and implementation for the dense, square matrix case are
well-developed. If matrices are sparse, with application-specific sparsity
patterns, the optimal implementation remains an open question. Here, we explore
the performance of communication reducing 2.5D algorithms and one-sided MPI
communication in the context of linear scaling electronic structure theory. In
particular, we extend the DBCSR sparse matrix library, which is the basic
building block for linear scaling electronic structure theory and low scaling
correlated methods in CP2K. The library is specifically designed to efficiently
perform block-sparse matrix-matrix multiplication of matrices with a relatively
large occupation. Here, we compare the performance of the original
implementation based on Cannon's algorithm and MPI point-to-point
communication, with an implementation based on MPI one-sided communications
(RMA), in both a 2D and a 2.5D approach. The 2.5D approach trades memory and
auxiliary operations for reduced communication, which can lead to a speedup if
communication is dominant. The 2.5D algorithm is somewhat easier to implement
with one-sided communications. A detailed description of the implementation is
provided, also for non ideal processor topologies, since this is important for
actual applications. Given the importance of the precise sparsity pattern, and
even the actual matrix data, which decides the effective fill-in upon
multiplication, the tests are performed within the CP2K package with
application benchmarks. Results show a substantial boost in performance for the
RMA based 2.5D algorithm, up to 1.80x, which is observed to increase with the
number of involved processes in the parallelization.Comment: In Proceedings of PASC '17, Lugano, Switzerland, June 26-28, 2017, 10
pages, 4 figure
Achieving Efficient Strong Scaling with PETSc using Hybrid MPI/OpenMP Optimisation
The increasing number of processing elements and decreas- ing memory to core
ratio in modern high-performance platforms makes efficient strong scaling a key
requirement for numerical algorithms. In order to achieve efficient scalability
on massively parallel systems scientific software must evolve across the entire
stack to exploit the multiple levels of parallelism exposed in modern
architectures. In this paper we demonstrate the use of hybrid MPI/OpenMP
parallelisation to optimise parallel sparse matrix-vector multiplication in
PETSc, a widely used scientific library for the scalable solution of partial
differential equations. Using large matrices generated by Fluidity, an open
source CFD application code which uses PETSc as its linear solver engine, we
evaluate the effect of explicit communication overlap using task-based
parallelism and show how to further improve performance by explicitly load
balancing threads within MPI processes. We demonstrate a significant speedup
over the pure-MPI mode and efficient strong scaling of sparse matrix-vector
multiplication on Fujitsu PRIMEHPC FX10 and Cray XE6 systems
Scalable Task-Based Algorithm for Multiplication of Block-Rank-Sparse Matrices
A task-based formulation of Scalable Universal Matrix Multiplication
Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is
applied to the multiplication of hierarchy-free, rank-structured matrices that
appear in the domain of quantum chemistry (QC). The novel features of our
formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and
(2) fine-grained task-based composition. These features make it tolerant of the
load imbalance due to the irregular matrix structure and eliminate all
artifactual sources of global synchronization.Scalability of iterative
computation of square-root inverse of block-rank-sparse QC matrices is
demonstrated; for full-rank (dense) matrices the performance of our SUMMA
formulation usually exceeds that of the state-of-the-art dense MM
implementations (ScaLAPACK and Cyclops Tensor Framework).Comment: 8 pages, 6 figures, accepted to IA3 2015. arXiv admin note: text
overlap with arXiv:1504.0504
A Novel Partitioning Method for Accelerating the Block Cimmino Algorithm
We propose a novel block-row partitioning method in order to improve the
convergence rate of the block Cimmino algorithm for solving general sparse
linear systems of equations. The convergence rate of the block Cimmino
algorithm depends on the orthogonality among the block rows obtained by the
partitioning method. The proposed method takes numerical orthogonality among
block rows into account by proposing a row inner-product graph model of the
coefficient matrix. In the graph partitioning formulation defined on this graph
model, the partitioning objective of minimizing the cutsize directly
corresponds to minimizing the sum of inter-block inner products between block
rows thus leading to an improvement in the eigenvalue spectrum of the iteration
matrix. This in turn leads to a significant reduction in the number of
iterations required for convergence. Extensive experiments conducted on a large
set of matrices confirm the validity of the proposed method against a
state-of-the-art method
The Parallelism Motifs of Genomic Data Analysis
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some of these genomic data analysis problems require large scale computational
platforms to meet both the memory and computational requirements. These
applications differ from scientific simulations that dominate the workload on
high end parallel systems today and place different requirements on programming
support, software libraries, and parallel architectural design. For example,
they involve irregular communication patterns such as asynchronous updates to
shared data structures. We consider several problems in high performance
genomics analysis, including alignment, profiling, clustering, and assembly for
both single genomes and metagenomes. We identify some of the common
computational patterns or motifs that help inform parallelization strategies
and compare our motifs to some of the established lists, arguing that at least
two key patterns, sorting and hashing, are missing
- …