Search CORE

8,753 research outputs found

Improving MPI Threading Support for Current Hardware Architectures

Author: Patinyasakdikul Thananon
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 15/12/2019
Field of study

Threading support for Message Passing Interface (MPI) has been defined in the MPI standard for more than twenty years. While many standard-compliance MPI implementations fully support multithreading, the threading support in MPI still cannot provide the optimal performance on the same level as the non-threading environment. The performance disparity leads to low adoption rate from applications, and eventually, lesser interest in optimizing MPI threading support. However, with the current advancement in computation hardware, the number of CPU core per packet is growing drastically. Using shared-memory MPI communication has become more costly. MPI threading without local communication is one of the alternatives and the some interests are shifting back toward threading to MPI.In this work, we investigate different approaches to leverage the power of thread parallelism and tools to help us to raise the multi-threaded MPI performance to reasonable level. We propose a novel multi-threaded MPI benchmark with multiple communication patterns to stress multiple points of the MPI implementation, with the ability to switch between using MPI process and threads for quick comparison between two modes. Enabling the us, and the others MPI developers to stress test their implementation design.We address the interoperability between MPI implementation and threading frameworks by introducing the thread-synchronization object, an object that gives the MPI implementation more control over user-level thread, allowing for more thread utilization in MPI. In our implementation, the synchronization object relieves the lock contention on the internal progress engine and able to achieve up to 7x the performance of the original implementation. Moving forward, we explore the possibility of harnessing the true thread concurrency. We proposed several strategies to address the bottlenecks in MPI implementation. From our evaluation, with our novel threading optimization, we can achieve up to 22x the performance comparing to the legacy MPI designs

University of Tennessee, Knoxville: Trace

Achieving Efficient Strong Scaling with PETSc using Hybrid MPI/OpenMP Optimisation

Author: G. Goumas
G. Schubert
G. Wellein
M. Butler
M.D. Piggott
N. Bell
P. Balaji
S. Williams
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013
Field of study

The increasing number of processing elements and decreas- ing memory to core ratio in modern high-performance platforms makes efficient strong scaling a key requirement for numerical algorithms. In order to achieve efficient scalability on massively parallel systems scientific software must evolve across the entire stack to exploit the multiple levels of parallelism exposed in modern architectures. In this paper we demonstrate the use of hybrid MPI/OpenMP parallelisation to optimise parallel sparse matrix-vector multiplication in PETSc, a widely used scientific library for the scalable solution of partial differential equations. Using large matrices generated by Fluidity, an open source CFD application code which uses PETSc as its linear solver engine, we evaluate the effect of explicit communication overlap using task-based parallelism and show how to further improve performance by explicitly load balancing threads within MPI processes. We demonstrate a significant speedup over the pure-MPI mode and efficient strong scaling of sparse matrix-vector multiplication on Fujitsu PRIMEHPC FX10 and Cray XE6 systems

arXiv.org e-Print Archive

CiteSeerX

Crossref

Spiral - Imperial College Digital Repository

A Parallel Mesh-Adaptive Framework for Hyperbolic Conservation Laws

Author: Berger
Friedel
Fryxell
Godunov
Grauer
Grauer
Groth
Hilbert
Jürgen Dreher
Keppens
Kurganov
Lax
MacNeice
Nessyahu
Powell
Rainer Grauer
Roe
Steiner
Toro
Tóth
Woodward
Ziegler
Zumbusch
Zumbusch
Publication venue: 'Elsevier BV'
Publication date: 01/02/2006
Field of study

We report on the development of a computational framework for the parallel, mesh-adaptive solution of systems of hyperbolic conservation laws like the time-dependent Euler equations in compressible gas dynamics or Magneto-Hydrodynamics (MHD) and similar models in plasma physics. Local mesh refinement is realized by the recursive bisection of grid blocks along each spatial dimension, implemented numerical schemes include standard finite-differences as well as shock-capturing central schemes, both in connection with Runge-Kutta type integrators. Parallel execution is achieved through a configurable hybrid of POSIX-multi-threading and MPI-distribution with dynamic load balancing. One- two- and three-dimensional test computations for the Euler equations have been carried out and show good parallel scaling behavior. The Racoon framework is currently used to study the formation of singularities in plasmas and fluids.Comment: late submissio

arXiv.org e-Print Archive

Crossref

CERN Document Server

Experiences with OpenMP in tmLQCD

Author: Deuzeman A.
Jansen K.
Kostrzewa B.
Urbach C.
Publication venue
Publication date: 01/01/2013
Field of study

An overview is given of the lessons learned from the introduction of multi-threading using OpenMP in tmLQCD. In particular, programming style, performance measurements, cache misses, scaling, thread distribution for hybrid codes, race conditions, the overlapping of communication and computation and the measurement and reduction of certain overheads are discussed. Performance measurements and sampling profiles are given for different implementations of the hopping matrix computational kernel.Comment: presented at the 31st International Symposium on Lattice Field Theory (Lattice 2013), 29 July - 3 August 2013, Mainz, German

arXiv.org e-Print Archive

DESY Publication Database

Crossref

DESY

Bern Open Repository and Information System (BORIS)

Parallel sparse matrix-vector multiplication as a test case for hybrid MPI+OpenMP programming

Author: Fehske Holger
Hager Georg
Schubert Gerald
Wellein Gerhard
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 30/12/2010
Field of study

We evaluate optimized parallel sparse matrix-vector operations for two representative application areas on widespread multicore-based cluster configurations. First the single-socket baseline performance is analyzed and modeled with respect to basic architectural properties of standard multicore chips. Going beyond the single node, parallel sparse matrix-vector operations often suffer from an unfavorable communication to computation ratio. Starting from the observation that nonblocking MPI is not able to hide communication cost using standard MPI implementations, we demonstrate that explicit overlap of communication and computation can be achieved by using a dedicated communication thread, which may run on a virtual core. We compare our approach to pure MPI and the widely used "vector-like" hybrid programming strategy.Comment: 12 pages, 6 figure

arXiv.org e-Print Archive

Crossref