51 research outputs found
MPI Thread-Level Checking for MPI+OpenMP Applications
International audienceMPI is the most widely used parallel programming model. But the reducing amount of memory per compute core tends to push MPI to be mixed with shared-memory approaches like OpenMP. In such cases, the interoperability of those two models is challenging. The MPI 2.0 standard defines the so-called thread level to indicate how MPI will interact with threads. But even if hybrid programs are more common, there is still a lack in debugging tools and more precisely in thread level compliance. To fill this gap, we propose a static analysis to verify the thread-level required by an application. This work extends PARCOACH, a GCC plugin focused on the detection of MPI collective errors in MPI and MPI+OpenMP programs. We validated our analysis on computational benchmarks and applications and measured a low overhead
PARCOACH: Combining static and dynamic validation of MPI collective communications
International audienceNowadays most scientific applications are parallelized based on MPI communications. Collective MPI communications have to be executed in the same order by all processes in their communicator and the same number of times, otherwise it is not conforming to the standard and a deadlock or other undefined behavior can occur. As soon as the control-flow involving these collective operations becomes more complex, in particular including conditionals on process ranks, ensuring the correction of such code is error-prone. We propose in this paper a static analysis to detect when such situation occurs, combined with a code transformation that prevents from deadlocking. We focus on blocking MPI collective operations in SPMD applications, assuming MPI calls are not nested in multithreaded regions. We show on several benchmarks the small impact on performance and the ease of integration of our techniques in the development process
Multigrain Affinity for Heterogeneous Work Stealing
International audienceIn a parallel computing context, peak performance is hard to reach with irregular applications such as sparse linear algebra operations. It requires dynamic adjustments to automatically balance the workload between several processors. The problem becomes even more complicated when an architecture contains processing units with radically different computing capabilities. We present a hierarchical scheduling scheme designed to harness several CPUs and a GPU. It is built on a two-level work stealing mechanism tightly coupled to a software-managed cache. We show that our approach is well suited to dynamically control heterogeneous architectures, while taking advantage of a reduction of data transfers
Multigrain Affinity for Heterogeneous Work Stealing
International audienceIn a parallel computing context, peak performance is hard to reach with irregular applications such as sparse linear algebra operations. It requires dynamic adjustments to automatically balance the workload between several processors. The problem becomes even more complicated when an architecture contains processing units with radically different computing capabilities. We present a hierarchical scheduling scheme designed to harness several CPUs and a GPU. It is built on a two-level work stealing mechanism tightly coupled to a software-managed cache. We show that our approach is well suited to dynamically control heterogeneous architectures, while taking advantage of a reduction of data transfers
Relative performance projection on Arm architectures
International audienceWith the advent of multi-many-core processors and hardware accelerators, choosing a specific architecture to renew a supercomputer can become very tedious. This decision process should consider the current and future parallel application needs and the design of the target software stack. It should also consider the single-core behavior of the application as it is one of the performance limitations in today's machines. In such a scheme, performance hints on the impact of some hardware and software stack modifications are mandatory to drive this choice. This paper proposes a workflow for performance projection based on execution on an actual processor and the application's behavior. This projection evaluates the performance variation from an existing core of a processor to a hypothetical one to drive the design choice. For this purpose, we characterize the maximum sustainable performance of the target machine and analyze the application using the software stack of the target machine. To validate this approach, we apply it to three applications of the CORAL benchmark suite: LULESH, MiniFE, and Quicksilver, using a single-core of two Arm-based architectures: Marvell ThunderX2 and Arm Neoverse N1. Finally, we follow this validation work with an example of design-space exploration around the SVE vector size, the choice of DDR4 and HBM2, and the software stack choice on A64FX on our applications with a pool of three source architectures: Arm Neoverse N1, Marvell ThunderX2, and Fujitsu A64FX
MPC-MPI: An MPI Implementation Reducing the Overall Memory Consumption
International audienceMessage-Passing Interface (MPI) has become a standard for parallel applications in high-performance computing. Within a shared address space, MPI implementations benefit from the global memory to speed-up intra-node communications while the underlying network pro- tocol is exploited to communicate between nodes. But, it requires the allocation of additional buffers leading to a memory-consumption over- head. This may become an issue on future clusters with reduced mem- ory amount per core. In this article, we propose an MPI implementation upon the MPC framework called MPC-MPI reducing the overall memory footprint. We obtained up to 47% of memory gain on benchmarks and a real-world application
Static/Dynamic Validation of MPI Collective Communications in Multi-threaded Context
International audienceScientific applications mainly rely on the MPI parallel programming model to reach high performance on supercomputers. The advent of manycore architectures (larger number of cores and lower amount of memory per core) leads to mix MPI with a thread-based model like OpenMP. But integrating two different programming models inside the same application can be tricky and generate complex bugs. Thus, the correctness of hybrid programs requires a special care regarding MPI calls location. For example, identical MPI collective operations cannot be performed by multiple non-synchronized threads. To tackle this issue, this paper proposes a static analysis and a reduced dynamic instrumentation to detect bugs related to misuse of MPI collective operations inside or outside threaded regions. This work extends PARCOACH [4] designed for MPI-only applications and keeps the compatibility with these algorithms. We validated our method on multiple hybrid benchmarks and applications with a low overhead
Application of Storage Mapping Optimization to Register Promotion
International audienceStorage mapping optimization is a flexible approach to folding array dimensions in numerical codes. It is designed to reduce thememory footprint after a wide spectrum of loop transformations,whether based on uniform dependence vectors or more expressivepolyhedral abstractions. Conversely, few loop transformations havebeen proposed to facilitate register promotion, namely loop fusion,unroll-and-jam or tiling. Building on array data-flow analysis andexpansion, we extend storage mapping optimization to improve opportunities for register promotion.Our work is motivated by the empirical study of a computational biology benchmark, the approximate string matching algorithm BPR from NR-grep , on a wide issue micro-architecture. Ourexperiments confirm the major benefit of register tiling (even onnon-numerical benchmarks) but also shed the light on two novel issues: prior array expansion may be necessary to enable loop trans-formations that finally authorize profitable register promotion, andmore advanced scheduling techniques (beyond tiling and unroll-and-jam) may significantly improve performance in fine-tuning reg-ister usage and instruction-level parallelism
- …