8,195 research outputs found
Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms
Matrix multiplication is a very important computation kernel both in its own
right as a building block of many scientific applications and as a popular
representative for other scientific applications. Cannon algorithm which dates
back to 1969 was the first efficient algorithm for parallel matrix
multiplication providing theoretically optimal communication cost. However this
algorithm requires a square number of processors. In the mid 1990s, the SUMMA
algorithm was introduced. SUMMA overcomes the shortcomings of Cannon algorithm
as it can be used on a non-square number of processors as well. Since then the
number of processors in HPC platforms has increased by two orders of magnitude
making the contribution of communication in the overall execution time more
significant. Therefore, the state of the art parallel matrix multiplication
algorithms should be revisited to reduce the communication cost further. This
paper introduces a new parallel matrix multiplication algorithm, Hierarchical
SUMMA (HSUMMA), which is a redesign of SUMMA. Our algorithm reduces the
communication cost of SUMMA by introducing a two-level virtual hierarchy into
the two-dimensional arrangement of processors. Experiments on an IBM BlueGene-P
demonstrate the reduction of communication cost up to 2.08 times on 2048 cores
and up to 5.89 times on 16384 cores.Comment: 9 page
Recommended from our members
Preparing sparse solvers for exascale computing.
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'
A hierarchic task-based programming model for distributed heterogeneous computing
Distributed computing platforms are evolving to heterogeneous ecosystems with Clusters, Grids and Clouds introducing in its computing nodes, processors with different core architectures, accelerators (i.e. GPUs, FPGAs), as well as different memories and storage devices in order to achieve better performance with lower energy consumption. As a consequence of this heterogeneity, programming applications for these distributed heterogeneous platforms becomes a complex task. Additionally to the complexity of developing an application for distributed platforms, developers must also deal now with the complexity of the different computing devices inside the node. In this article, we present a programming model that aims to facilitate the development and execution of applications in current and future distributed heterogeneous parallel architectures. This programming model is based on the hierarchical composition of the COMP Superscalar and Omp Superscalar programming models that allow developers to implement infrastructure-agnostic applications. The underlying runtime enables applications to adapt to the infrastructure without the need of maintaining different versions of the code. Our programming model proposal has been evaluated on real platforms, in terms of heterogeneous resource usage, performance and adaptation.This work has been supported by the European Commission through the Horizon 2020 Research and Innovation program
under contract 687584 (TANGO project) by the Spanish Government under contract TIN2015-65316 and grant SEV-2015-0493 (Severo Ochoa Program) and by Generalitat de Catalunya under contracts 2014-SGR-1051 and 2014-SGR-1272.Peer ReviewedPostprint (author's final draft
Achieving Efficient Strong Scaling with PETSc using Hybrid MPI/OpenMP Optimisation
The increasing number of processing elements and decreas- ing memory to core
ratio in modern high-performance platforms makes efficient strong scaling a key
requirement for numerical algorithms. In order to achieve efficient scalability
on massively parallel systems scientific software must evolve across the entire
stack to exploit the multiple levels of parallelism exposed in modern
architectures. In this paper we demonstrate the use of hybrid MPI/OpenMP
parallelisation to optimise parallel sparse matrix-vector multiplication in
PETSc, a widely used scientific library for the scalable solution of partial
differential equations. Using large matrices generated by Fluidity, an open
source CFD application code which uses PETSc as its linear solver engine, we
evaluate the effect of explicit communication overlap using task-based
parallelism and show how to further improve performance by explicitly load
balancing threads within MPI processes. We demonstrate a significant speedup
over the pure-MPI mode and efficient strong scaling of sparse matrix-vector
multiplication on Fujitsu PRIMEHPC FX10 and Cray XE6 systems
- …