50 research outputs found
Bridging the gap between OpenMP and task-based runtime systems for the fast multipole method
International audienceWith the advent of complex modern architectures, the low-level paradigms long considered sufficient to build High Performance Computing (HPC) numerical codes have met their limits. Achieving efficiency, ensuring portability, while preserving programming tractability on such hardware prompted the HPC community to design new, higher level paradigms while relying on runtime systems to maintain performance. However, the common weakness of these projects is to deeply tie applications to specific expert-only runtime system APIs. The OpenMP specification, which aims at providing common parallel programming means for shared-memory platforms, appears as a good candidate to address this issue thanks to the latest task-based constructs introduced in its revision 4.0. The goal of this paper is to assess the effectiveness and limits of this support for designing a high-performance numerical library, ScalFMM, implementing the fast multipole method (FMM) that we have deeply redesigned with respect to the most advanced features provided by OpenMP 4. We show that OpenMP 4 allows for significant performance improvements over previous OpenMP revisions on recent multicore processors and that extensions to the 4.0 standard allow for strongly improving the performance, bridging the gap with the very high performance that was so far reserved to expert-only runtime system APIs
Combler l'écart de performance entre OpenMP 4.0 et les moteurs d'exécution pour la méthode des multipoles rapide
With the advent of complex modern architectures, the low-levelparadigms long considered sufficient to build High Performance Computing (HPC)numerical codes have met their limits. Achieving efficiency, ensuringportability, while preserving programming tractability on such hardwareprompted the HPC community to design new, higher level paradigms.The successful ports of fully-featured numerical libraries on severalrecent runtime system proposals have shown, indeed, the benefit oftask-based parallelism models in terms of performance portability oncomplex platforms. However, the common weakness of these projects is todeeply tie applications to specific expert-only runtime system APIs. The\omp specification, which aims at providing a common parallelprogramming means for shared-memory platforms, appears as a goodcandidate to address this issue thanks to the latest task-basedconstructs introduced as part of its revision 4.0.The goal of this paper is to assess the effectiveness and limits ofthis support for designing a high-performance numerical library. Weillustrate our discussion with the \scalfmm library, which implementsstate-of-the-art fast multipole method (FMM) algorithms, that wehave deeply re-designed with respect to the most advancedfeatures provided by \omp 4. We show that \omp 4 allows forsignificant performance improvements over previous \omp revisions onrecent multicore processors. We furthermore propose extensions to the\omp 4 standard and show how they can enhance FMM performance. Toassess our statement, we have implemented this support within the\klanglong source-to-source compiler that translates \omp directives intocalls to the \starpu task-based runtime system. This study shows thatwe can take advantage of the advanced capabilities of a fully-featuredruntime system without resorting to a specific, native runtime port,hence bridging the gap between the \omp standard and the very highperformance that was so far reserved to expert-only runtime systemAPIs.Avec l'arrivée des architectures modernes complexes, les paradigmes de parallélisation de bas niveau, longtemps considérés comme suffisant pour développer des codes numériques efficaces, ont montré leurs limites. Obtenir de l'efficacité et assurer la portabilité tout en maintenant une bonne flexibilité de programmation sur de telles architectures ont incité la communauté du calcul haute performance (HPC) à concevoir de nouveaux paradigmes de plus haut niveau.Les portages réussis de bibliothèques numériques sur plusieurs moteurs exécution récentos ont montré l'avantage des modèles de parallélisme à base de tâche en ce qui concerne la portabilité et la performance sur ces plateformes complexes. Cependant, la faiblesse de tous ces projets est de fortement coupler les applications aux experts des API des moteurs d'exécution.La spécification d'\omp, qui vise à fournir un modèle de programmation parallèle unique pour les plates-formes à mémoire partagée, semble être un bon candidat pour résoudre ce problème. Notamment, en raison des améliorations apportées à l’expressivité du modèle en tâches présentées dans sa révision 4.0.Le but de ce papier est d'évaluer l'efficacité et les limites de ce modèle pour concevoir une bibliothèque numérique performante. Nous illustrons notre discussion avec la bibliothèque \scalfmm, qui implémente les algorithmes les plus récents de la méthode des multipôles rapide (FMM). Nous avons finement adapté ces derniers pour prendre en compte les caractéristiques les plus avancées fournies par \omp4. Nous montrons qu'\omp4 donne de meilleures performances par rapport aux versions précédentes d'\omp pour les processeurs multi-coeurs récents. De plus, nous proposons des extensions au standard d’\omp4 et nous montrons comment elles peuvent améliorer la performance de la FMM. Pour évaluer notre propos, nous avons mis en oeuvre ces extensions dans le compilateur source-à -source \klanglong qui traduit les directives \omp en des appels au moteur d'exécution à base de tâches \starpu. Cette étude montre que nous pouvons tirer profit des capacités avancées du moteur d'exécution sans devoir recourir à un portage sur l'API spécifique de celui-ci.%d'un moteur d'exécution. %prouve que nous pouvons tirer profit des capacités avancées du moteur d'exécution sans recourir à un portage spécifique dans le moteur d’exécution. Par conséquent, on comble le fossé entre le standard \omp et l’approche très performante par moteur d’exécution qui est de loin réservée au seul expert son API
TBFMM: A C++ generic and parallel fast multipole method library
International audienceTBFMM, for task-based FMM, is a high-performance package that implements the parallel fast multipole method (FMM) in modern C++17. It implements parallel strategies for multicore architectures, i.e. to run on a single computing node. TBFMM was designed to be easily customized thanks to C++ templates and fine control of the C++ classes inter-dependencies. Users can implement new FMM kernels, new types of interacting elements or even new parallelization strategies. As such, it can effectively be used as a simulation toolbox for scientists in physics or applied mathematics. It enables users to perform simulations while delegating the data structure, the algorithm and the parallelization to the library. Besides, TBFMM can also provide an interesting use case for the HPC research community regarding parallelization, optimization and scheduling of applications handling irregular data structures
Design and Analysis of a Task-based Parallelization over a Runtime System of an Explicit Finite-Volume CFD Code with Adaptive Time Stepping
FLUSEPA (Registered trademark in France No. 134009261) is an advanced
simulation tool which performs a large panel of aerodynamic studies. It is the
unstructured finite-volume solver developed by Airbus Safran Launchers company
to calculate compressible, multidimensional, unsteady, viscous and reactive
flows around bodies in relative motion. The time integration in FLUSEPA is done
using an explicit temporal adaptive method. The current production version of
the code is based on MPI and OpenMP. This implementation leads to important
synchronizations that must be reduced. To tackle this problem, we present the
study of a task-based parallelization of the aerodynamic solver of FLUSEPA
using the runtime system StarPU and combining up to three levels of
parallelism. We validate our solution by the simulation (using a finite-volume
mesh with 80 million cells) of a take-off blast wave propagation for Ariane 5
launcher.Comment: Accepted manuscript of a paper in Journal of Computational Scienc
Méthode des multipôles rapide à base de tâches pour des clusters de processeurs multicoeurs
Most high-performance, scientific libraries have adopted hybrid parallelization schemes - such as the popular MPI+OpenMP hybridization - to benefit from the capacities of modern distributed-memory machines. While these approaches have shown to achieve high performance, they require a lot of effort to design and maintain sophisticated synchronization/communication strategies. On the other hand, task-based programming paradigms aim at delegating this burden to a runtime system for maximizing productivity. In this article, we assess the potential of task-based fast multipole methods (FMM) on clusters of multicore processors. We propose both a hybrid MPI+task FMM parallelization and a pure task-based parallelization where the MPI communications are implicitly handled by the runtime system. The latter approach yields a very compact code following a sequential task-based programming model. We show that task-based approaches can compete with a hybrid MPI+OpenMP highly optimized code and that furthermore the compact task-based scheme fully matches the performance of the sophisticated, hybrid MPI+task version, ensuring performance while maximizing productivity. We illustrate our discussion with the ScalFMM FMM library and the StarPU runtime system.La plupart des bibliothèques scientifiques très performantes ont adopté des parallélisations hybrides - comme l’approche MPI+OpenMP - pour profiter des capacités des machines modernes à mémoire distribuée. Ces approches permettent d’obtenir de très hautes performances, mais elles nécessitent beaucoup d’efforts pour concevoir et pour maintenir des stratégies de synchronisation/communication sophistiquées. D’un autre côté, les paradigmes de programmation à base de tâches visent à déléguer ce fardeau à un moteur d'exécution pour maximiser la productivité. Dans cet article, nous évaluons le potentiel de la méthode des multipôles rapide (FMM) à base de tâches sur les clusters de processeurs multic\oe{}urs. Nous proposons deux types de parallélisation, une première approche hybride (MPI+Tâche) à base de tâches et d’appels à MPI pour gérer explicitement les communications et la deuxième uniquement à base de tâches où les communications MPI sont implicitement postées par le moteur d'exécution. Cette dernière approche conduit à un code très compact qui suit le modèle de programmation séquentiel à base de tâches. Nous montrons que cette approche rivalise avec le code hybride MPI+OpenMP fortement optimisé et qu'en outre le code compact atteint les performances de la version hybride MPI+Tâche, assurant une très haute performance tout en maximisant la productivité. Nous illustrons notre propos avec la bibliothèque FMM ScalFMM et le moteur d'exécution StarPU
Improving the scalability of parallel N-body applications with an event driven constraint based execution model
The scalability and efficiency of graph applications are significantly
constrained by conventional systems and their supporting programming models.
Technology trends like multicore, manycore, and heterogeneous system
architectures are introducing further challenges and possibilities for emerging
application domains such as graph applications. This paper explores the space
of effective parallel execution of ephemeral graphs that are dynamically
generated using the Barnes-Hut algorithm to exemplify dynamic workloads. The
workloads are expressed using the semantics of an Exascale computing execution
model called ParalleX. For comparison, results using conventional execution
model semantics are also presented. We find improved load balancing during
runtime and automatic parallelism discovery improving efficiency using the
advanced semantics for Exascale computing.Comment: 11 figure