48 research outputs found

    MPI Thread-Level Checking for MPI+OpenMP Applications

    MPI is the most widely used parallel programming model, but the decreasing amount of memory per compute core tends to push MPI to be mixed with shared-memory approaches such as OpenMP. In such cases, the interoperability of the two models is challenging. The MPI 2.0 standard defines so-called thread levels to indicate how MPI will interact with threads. Yet even though hybrid programs are becoming more common, there is still a lack of debugging tools, in particular for checking thread-level compliance. To fill this gap, we propose a static analysis that verifies the thread level required by an application. This work extends PARCOACH, a GCC plugin focused on the detection of MPI collective errors in MPI and MPI+OpenMP programs. We validated our analysis on computational benchmarks and applications and measured a low overhead.
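
    For context, the thread level mentioned above is requested at initialization time through MPI_Init_thread. Below is a minimal sketch of the standard MPI-2 idiom (not part of the PARCOACH analysis itself) for requesting and checking a thread level in a hybrid MPI+OpenMP code:

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int provided;

            /* Request MPI_THREAD_MULTIPLE because OpenMP threads may issue
               MPI calls concurrently in the hybrid parts of the code. */
            MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

            if (provided < MPI_THREAD_MULTIPLE) {
                fprintf(stderr, "MPI library only provides thread level %d\n", provided);
                MPI_Abort(MPI_COMM_WORLD, 1);
            }

            /* ... hybrid MPI+OpenMP work ... */

            MPI_Finalize();
            return 0;
        }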

    PARCOACH: Combining static and dynamic validation of MPI collective communications

    Nowadays most scientific applications are parallelized with MPI communications. Collective MPI communications have to be executed in the same order and the same number of times by all processes in their communicator; otherwise the program does not conform to the standard and a deadlock or other undefined behavior can occur. As soon as the control flow involving these collective operations becomes more complex, in particular when it includes conditionals on process ranks, ensuring the correctness of such code becomes error-prone. We propose in this paper a static analysis to detect when such a situation occurs, combined with a code transformation that prevents deadlocks. We focus on blocking MPI collective operations in SPMD applications, assuming MPI calls are not nested in multithreaded regions. We show on several benchmarks the small impact on performance and the ease of integration of our techniques into the development process.
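
    To make the targeted error class concrete, the following hand-written toy program (not taken from the paper or its benchmarks) shows the rank-dependent control flow around collectives that such an analysis looks for:

        #include <mpi.h>

        int main(int argc, char **argv)
        {
            int rank, value = 0;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            /* BUG: the collective is guarded by a condition on the rank, so
               only part of the communicator reaches the barrier -> deadlock. */
            if (rank % 2 == 0)
                MPI_Barrier(MPI_COMM_WORLD);

            /* BUG: the number of calls depends on the rank, so processes do
               not execute the collective the same number of times. */
            for (int i = 0; i < rank; i++)
                MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

            MPI_Finalize();
            return 0;
        }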

    Multigrain Affinity for Heterogeneous Work Stealing

    In a parallel computing context, peak performance is hard to reach with irregular applications such as sparse linear algebra operations. Reaching it requires dynamic adjustments to automatically balance the workload between several processors. The problem becomes even more complicated when an architecture contains processing units with radically different computing capabilities. We present a hierarchical scheduling scheme designed to harness several CPUs and a GPU, built on a two-level work-stealing mechanism tightly coupled to a software-managed cache. We show that our approach is well suited to dynamically controlling heterogeneous architectures while reducing data transfers.
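
    The scheduling idea can be sketched in plain C: each worker owns a deque, pops locally in LIFO order to favor cache reuse, and, when idle, first looks for victims inside its own affinity group before crossing to the other level. The mutex-based sketch below is purely illustrative; names, sizes, and the locking scheme are assumptions, not the paper's implementation. A worker loop would simply alternate deque_pop on its own queue and try_steal when that queue is empty.

        #include <pthread.h>

        #define NWORKERS 4
        #define QCAP     1024

        typedef struct { void (*fn)(void *); void *arg; } task_t;

        typedef struct {
            task_t          buf[QCAP];
            int             head, tail;   /* head = steal end, tail = owner end  */
            pthread_mutex_t lock;
            int             group;        /* level-1 affinity group (CPUs or GPU) */
        } deque_t;

        static deque_t queues[NWORKERS];

        static void init_queues(const int groups[NWORKERS])
        {
            for (int i = 0; i < NWORKERS; i++) {
                queues[i].head = queues[i].tail = 0;
                queues[i].group = groups[i];
                pthread_mutex_init(&queues[i].lock, NULL);
            }
        }

        static int deque_push(deque_t *q, task_t t)   /* owner appends a new task */
        {
            pthread_mutex_lock(&q->lock);
            int ok = q->tail < QCAP;                  /* simplified: slots not recycled */
            if (ok) q->buf[q->tail++] = t;
            pthread_mutex_unlock(&q->lock);
            return ok;
        }

        static int deque_pop(deque_t *q, task_t *t)   /* owner pops newest: cache reuse */
        {
            pthread_mutex_lock(&q->lock);
            int ok = q->tail > q->head;
            if (ok) *t = q->buf[--q->tail];
            pthread_mutex_unlock(&q->lock);
            return ok;
        }

        static int deque_steal(deque_t *q, task_t *t) /* thief takes oldest task */
        {
            pthread_mutex_lock(&q->lock);
            int ok = q->tail > q->head;
            if (ok) *t = q->buf[q->head++];
            pthread_mutex_unlock(&q->lock);
            return ok;
        }

        /* Two-level victim selection: pass 0 tries victims sharing the thief's
           affinity group (data likely still in the software-managed cache),
           pass 1 falls back to the other group. */
        static int try_steal(int self, task_t *t)
        {
            for (int pass = 0; pass < 2; pass++)
                for (int v = 0; v < NWORKERS; v++) {
                    if (v == self) continue;
                    int same_group = (queues[v].group == queues[self].group);
                    if ((pass == 0) == same_group && deque_steal(&queues[v], t))
                        return 1;
                }
            return 0;
        }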

    Relative performance projection on Arm architectures

    With the advent of multi- and many-core processors and hardware accelerators, choosing a specific architecture to renew a supercomputer can become very tedious. This decision process should consider the current and future needs of parallel applications and the design of the target software stack. It should also consider the single-core behavior of the application, as it is one of the performance limitations of today's machines. In such a scheme, performance hints on the impact of hardware and software stack modifications are mandatory to drive this choice. This paper proposes a workflow for performance projection based on execution on an actual processor and on the application's behavior. The projection evaluates the performance variation from an existing processor core to a hypothetical one in order to drive the design choice. For this purpose, we characterize the maximum sustainable performance of the target machine and analyze the application using the software stack of the target machine. To validate this approach, we apply it to three applications of the CORAL benchmark suite, LULESH, MiniFE, and Quicksilver, using a single core of two Arm-based architectures: Marvell ThunderX2 and Arm Neoverse N1. Finally, we follow this validation work with an example of design-space exploration on the A64FX around the SVE vector size, the choice between DDR4 and HBM2, and the software stack, using a pool of three source architectures: Arm Neoverse N1, Marvell ThunderX2, and Fujitsu A64FX.
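
    As a rough illustration of what projecting single-core performance can mean, the snippet below compares a generic roofline-style sustainable-performance estimate of a source core and a hypothetical target core. This simplistic ratio and every number in it are placeholders; it is not the workflow or the model used in the paper:

        #include <stdio.h>

        /* Roofline-style estimate: attainable GFLOP/s is bounded either by the
           core's peak compute or by memory bandwidth times arithmetic intensity.
           All figures below are made-up placeholders, not measured values. */
        static double attainable(double peak_gflops, double bw_gbs, double flops_per_byte)
        {
            double mem_bound = bw_gbs * flops_per_byte;
            return mem_bound < peak_gflops ? mem_bound : peak_gflops;
        }

        int main(void)
        {
            double ai  = 0.1;                           /* memory-bound kernel (placeholder)   */
            double src = attainable(20.0, 120.0, ai);   /* existing source core (placeholder)  */
            double dst = attainable(35.0, 200.0, ai);   /* hypothetical target core (placeholder) */

            printf("projected relative speed-up: %.2fx\n", dst / src);
            return 0;
        }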

    MPC-MPI: An MPI Implementation Reducing the Overall Memory Consumption

    The Message-Passing Interface (MPI) has become a standard for parallel applications in high-performance computing. Within a shared address space, MPI implementations benefit from the global memory to speed up intra-node communications, while the underlying network protocol is exploited to communicate between nodes. But this requires the allocation of additional buffers, leading to a memory-consumption overhead. This may become an issue on future clusters with a reduced amount of memory per core. In this article, we propose an MPI implementation built upon the MPC framework, called MPC-MPI, that reduces the overall memory footprint. We obtained memory gains of up to 47% on benchmarks and a real-world application.
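
    MPC-MPI's internals are not shown here, but the general idea of exploiting a shared address space inside a node instead of duplicating communication buffers can be illustrated with standard MPI-3 shared-memory windows:

        #include <mpi.h>

        /* Not MPC-MPI: a standard MPI-3 sketch of the underlying idea, namely
           that ranks sharing a node can use one memory region directly. */
        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);

            /* Group the ranks that live on the same node. */
            MPI_Comm node;
            MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                                MPI_INFO_NULL, &node);

            int nrank;
            MPI_Comm_rank(node, &nrank);

            /* Rank 0 of the node allocates; the others attach to the same segment. */
            double *base;
            MPI_Win win;
            MPI_Win_allocate_shared(nrank == 0 ? 1024 * sizeof(double) : 0,
                                    sizeof(double), MPI_INFO_NULL, node, &base, &win);

            MPI_Aint size;
            int disp;
            double *shared;
            MPI_Win_shared_query(win, 0, &size, &disp, &shared);

            /* ... intra-node exchanges can now read/write 'shared' directly ... */

            MPI_Win_free(&win);
            MPI_Comm_free(&node);
            MPI_Finalize();
            return 0;
        }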

    Static/Dynamic Validation of MPI Collective Communications in Multi-threaded Context

    Scientific applications mainly rely on the MPI parallel programming model to reach high performance on supercomputers. The advent of manycore architectures (a larger number of cores and a lower amount of memory per core) leads to mixing MPI with a thread-based model like OpenMP. But integrating two different programming models inside the same application can be tricky and can generate complex bugs. Thus, the correctness of hybrid programs requires special care regarding the location of MPI calls. For example, identical MPI collective operations cannot be performed by multiple non-synchronized threads. To tackle this issue, this paper proposes a static analysis and a reduced dynamic instrumentation to detect bugs related to the misuse of MPI collective operations inside or outside threaded regions. This work extends PARCOACH [4], designed for MPI-only applications, and keeps compatibility with its algorithms. We validated our method on multiple hybrid benchmarks and applications with a low overhead.
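
    A toy example (again not taken from the paper) of the hybrid misuse described above, where a collective ends up being called once per OpenMP thread instead of once per MPI process:

        #include <mpi.h>
        #include <omp.h>

        int main(int argc, char **argv)
        {
            int provided, sum = 0;

            MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
            if (provided < MPI_THREAD_MULTIPLE)
                MPI_Abort(MPI_COMM_WORLD, 1);

            #pragma omp parallel reduction(+:sum)
            {
                sum += omp_get_thread_num();
                /* BUG: a collective inside a parallel region is executed once per
                   thread; ranks with different thread counts will mismatch, and
                   concurrent calls on the same communicator are unsynchronized. */
                MPI_Barrier(MPI_COMM_WORLD);
            }

            /* Correct placement: outside the parallel region (or restricted to a
               single thread with #pragma omp master or single). */
            MPI_Barrier(MPI_COMM_WORLD);

            MPI_Finalize();
            return 0;
        }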