
50 research outputs found

Bridging the gap between OpenMP and task-based runtime systems for the fast multipole method

Author: Agullo Emmanuel
Aumage Olivier
Bramas Bérenger
Coulaud Olivier
Pitoiset Samuel
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 28/04/2017

With the advent of complex modern architectures, the low-level paradigms long considered sufficient to build High Performance Computing (HPC) numerical codes have met their limits. Achieving efficiency and ensuring portability while preserving programming tractability on such hardware prompted the HPC community to design new, higher-level paradigms while relying on runtime systems to maintain performance. However, the common weakness of these projects is to deeply tie applications to specific expert-only runtime system APIs. The OpenMP specification, which aims at providing common parallel programming means for shared-memory platforms, appears as a good candidate to address this issue thanks to the latest task-based constructs introduced in its revision 4.0. The goal of this paper is to assess the effectiveness and limits of this support for designing a high-performance numerical library, ScalFMM, implementing the fast multipole method (FMM), that we have deeply redesigned with respect to the most advanced features provided by OpenMP 4. We show that OpenMP 4 allows for significant performance improvements over previous OpenMP revisions on recent multicore processors and that extensions to the 4.0 standard allow for strongly improving the performance, bridging the gap with the very high performance that was so far reserved to expert-only runtime system APIs.
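As a rough illustration of the OpenMP 4.0 task-based constructs this abstract refers to, the sketch below uses the `depend` clause to order two stages of work on a toy two-level tree. It is not taken from ScalFMM: the function name `run_two_level_pass`, the array layout, and the arithmetic are hypothetical stand-ins for real FMM operators. The pragmas are ignored when OpenMP is disabled, in which case the code runs sequentially and gives the same result.

```cpp
#include <vector>

// Hypothetical toy example (not ScalFMM code): leaves are filled by
// independent tasks, then each parent sums its two children. The depend
// clauses let the runtime start a parent task as soon as both of its
// children are ready, without a global barrier between the two stages.
std::vector<double> run_two_level_pass(int n_leaves) {
    std::vector<double> leaf(n_leaves, 0.0);
    std::vector<double> parent(n_leaves / 2, 0.0);
    double* l = leaf.data();    // raw pointers so array elements can be
    double* p = parent.data();  // named directly in the depend clauses

    #pragma omp parallel
    #pragma omp single
    {
        // Stage 1: one task per leaf, each writing its own element.
        for (int i = 0; i < n_leaves; ++i) {
            #pragma omp task depend(out: l[i])
            l[i] = i + 1.0;
        }
        // Stage 2: one task per parent, reading two leaves.
        for (int j = 0; j < n_leaves / 2; ++j) {
            #pragma omp task depend(in: l[2 * j], l[2 * j + 1]) depend(out: p[j])
            p[j] = l[2 * j] + l[2 * j + 1];
        }
    }   // implicit barrier: all tasks have completed here
    return parent;
}
```

Calling `run_two_level_pass(8)` fills the leaves with 1 through 8 and returns the four pairwise sums. Compiling with an OpenMP flag such as `-fopenmp` (GCC or Clang) enables the tasks.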

INRIA a CCSD electronic archive server

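Combler l'écart de performance entre OpenMP 4.0 et les moteurs d'exécution pour la méthode des multipoles rapide [Bridging the performance gap between OpenMP 4.0 and runtime systems for the fast multipole method]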

Author: Agullo Emmanuel
Aumage Olivier
Bramas Bérenger
Coulaud Olivier
Pitoiset Samuel
Publication venue: HAL CCSD
Publication date: 01/03/2016

With the advent of complex modern architectures, the low-level paradigms long considered sufficient to build High Performance Computing (HPC) numerical codes have met their limits. Achieving efficiency and ensuring portability while preserving programming tractability on such hardware prompted the HPC community to design new, higher-level paradigms. The successful ports of fully-featured numerical libraries on several recent runtime system proposals have shown, indeed, the benefit of task-based parallelism models in terms of performance portability on complex platforms. However, the common weakness of these projects is to deeply tie applications to specific expert-only runtime system APIs. The OpenMP specification, which aims at providing a common parallel programming means for shared-memory platforms, appears as a good candidate to address this issue thanks to the latest task-based constructs introduced as part of its revision 4.0. The goal of this paper is to assess the effectiveness and limits of this support for designing a high-performance numerical library. We illustrate our discussion with the ScalFMM library, which implements state-of-the-art fast multipole method (FMM) algorithms, that we have deeply re-designed with respect to the most advanced features provided by OpenMP 4. We show that OpenMP 4 allows for significant performance improvements over previous OpenMP revisions on recent multicore processors. We furthermore propose extensions to the OpenMP 4 standard and show how they can enhance FMM performance. To assess our statement, we have implemented this support within the Klang-Omp source-to-source compiler that translates OpenMP directives into calls to the StarPU task-based runtime system. This study shows that we can take advantage of the advanced capabilities of a fully-featured runtime system without resorting to a specific, native runtime port, hence bridging the gap between the OpenMP standard and the very high performance that was so far reserved to expert-only runtime system APIs.

INRIA a CCSD electronic archive server

TBFMM: A C++ generic and parallel fast multipole method library

Author: Bramas Bérenger
Publication venue: 'The Open Journal'
Publication date: 03/12/2020

TBFMM, for task-based FMM, is a high-performance package that implements the parallel fast multipole method (FMM) in modern C++17. It implements parallel strategies for multicore architectures, i.e. to run on a single computing node. TBFMM was designed to be easily customized thanks to C++ templates and fine control of the C++ class inter-dependencies. Users can implement new FMM kernels, new types of interacting elements or even new parallelization strategies. As such, it can effectively be used as a simulation toolbox for scientists in physics or applied mathematics. It enables users to perform simulations while delegating the data structure, the algorithm and the parallelization to the library. Besides, TBFMM can also provide an interesting use case for the HPC research community regarding parallelization, optimization and scheduling of applications handling irregular data structures.

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Enabling task parallelism for many-core architectures

Author: Atkinson Patrick R
Publication date: 28/09/2021

Explore Bristol Research

On the Performance of Parallel Tasking Runtimes for an Irregular Fast Multipole Method Application

Author: Atkinson Patrick
McIntosh-Smith Simon
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 21/09/2017

Crossref

Explore Bristol Research

Design and Analysis of a Task-based Parallelization over a Runtime System of an Explicit Finite-Volume CFD Code with Adaptive Time Stepping

Author: Brenner Pierre
Carpaye Jean Marie Couteyen
Roman Jean
Publication venue: 'Elsevier BV'
Publication date: 01/01/2017

FLUSEPA (Registered trademark in France No. 134009261) is an advanced simulation tool which performs a wide range of aerodynamic studies. It is the unstructured finite-volume solver developed by the Airbus Safran Launchers company to calculate compressible, multidimensional, unsteady, viscous and reactive flows around bodies in relative motion. The time integration in FLUSEPA is done using an explicit temporal adaptive method. The current production version of the code is based on MPI and OpenMP. This implementation leads to significant synchronizations that must be reduced. To tackle this problem, we present the study of a task-based parallelization of the aerodynamic solver of FLUSEPA using the runtime system StarPU and combining up to three levels of parallelism. We validate our solution by the simulation (using a finite-volume mesh with 80 million cells) of a take-off blast wave propagation for the Ariane 5 launcher.
Comment: Accepted manuscript of a paper in Journal of Computational Science

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Méthode des multipôles rapide à base de tâches pour des clusters de processeurs multicoeurs [Task-based fast multipole method for clusters of multicore processors]

Author: Agullo Emmanuel
Bramas Bérenger
Coulaud Olivier
Khannouz Martin
Stanisic Luka
Publication venue: HAL CCSD
Publication date: 25/10/2016

Most high-performance scientific libraries have adopted hybrid parallelization schemes, such as the popular MPI+OpenMP hybridization, to benefit from the capacities of modern distributed-memory machines. While these approaches have been shown to achieve high performance, they require a lot of effort to design and maintain sophisticated synchronization/communication strategies. On the other hand, task-based programming paradigms aim at delegating this burden to a runtime system for maximizing productivity. In this article, we assess the potential of task-based fast multipole methods (FMM) on clusters of multicore processors. We propose both a hybrid MPI+task FMM parallelization and a pure task-based parallelization where the MPI communications are implicitly handled by the runtime system. The latter approach yields a very compact code following a sequential task-based programming model. We show that task-based approaches can compete with a highly optimized hybrid MPI+OpenMP code and that, furthermore, the compact task-based scheme fully matches the performance of the sophisticated, hybrid MPI+task version, ensuring performance while maximizing productivity. We illustrate our discussion with the ScalFMM FMM library and the StarPU runtime system.

INRIA a CCSD electronic archive server

Improving the scalability of parallel N-body applications with an event driven constraint based execution model

Author: Aarseth SJ
Alfieri RA
Bonachea D
Chandra R
Dekate C
El-Ghazawi T
Hewitt C
Kale L
Message Passing Interface Forum
O’Shea BW
Salmon JK
Singh JP
Publication venue: 'SAGE Publications'
Publication date: 23/09/2011

The scalability and efficiency of graph applications are significantly constrained by conventional systems and their supporting programming models. Technology trends like multicore, manycore, and heterogeneous system architectures are introducing further challenges and possibilities for emerging application domains such as graph applications. This paper explores the space of effective parallel execution of ephemeral graphs that are dynamically generated using the Barnes-Hut algorithm to exemplify dynamic workloads. The workloads are expressed using the semantics of an Exascale computing execution model called ParalleX. For comparison, results using conventional execution model semantics are also presented. We find improved load balancing during runtime and automatic parallelism discovery improving efficiency using the advanced semantics for Exascale computing.
Comment: 11 figures

arXiv.org e-Print Archive

Crossref