    A ROSE-Based OpenMP 3.0 Research Compiler Supporting Multiple Runtime Libraries

    OpenMP is a popular and evolving programming model for shared-memory platforms. It relies on compilers to target modern hardware architectures for optimal performance. A variety of extensible and robust research compilers are key to OpenMP's sustainable success in the future. In this paper, we present our efforts to build an OpenMP 3.0 research compiler for C, C++, and Fortran using the ROSE source-to-source compiler framework. Our goal is to support OpenMP research for ourselves and others. We have extended ROSE's internal representation to handle all OpenMP 3.0 constructs, thus facilitating experimentation with them. Since OpenMP research is often complicated by the tight coupling of the compiler translation and the runtime system, we present a set of rules to define a common OpenMP runtime library (XOMP) on top of multiple runtime libraries. These rules additionally define how to build a set of translations targeting XOMP. Our work demonstrates how to reuse OpenMP translations across different runtime libraries. This work simplifies OpenMP research by decoupling the problematic dependence between the compiler translations and the runtime libraries. We present an evaluation of our work by demonstrating an analysis tool for OpenMP correctness. We also show how XOMP can be defined using both GOMP and Omni. Our comparative performance results against other OpenMP compilers demonstrate that our flexible runtime support does not incur additional overhead.
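
    The core idea above, one intermediate layer (XOMP) so that a single set of compiler translations can drive several OpenMP runtimes, can be sketched as follows. This is a minimal illustration, not the actual XOMP interface: the type and function names are hypothetical, and real backends would wrap GOMP or Omni entry points rather than the stand-ins used here.

```cpp
// Hypothetical sketch of a common runtime layer in the spirit of XOMP.
// Compiler-generated code targets one contract; backends plug in underneath.
#include <cstdio>
#include <thread>
#include <vector>

// The contract translated code relies on: run `body(arg)` on `n` workers.
struct XompBackend {
    void (*parallel_run)(void (*body)(void*), void* arg, int n);
};

// Backend A: stands in for a GOMP-style threaded runtime.
static void threads_parallel_run(void (*body)(void*), void* arg, int n) {
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) workers.emplace_back(body, arg);
    for (auto& t : workers) t.join();
}

// Backend B: stands in for another runtime (here, trivially serial).
static void serial_parallel_run(void (*body)(void*), void* arg, int n) {
    for (int i = 0; i < n; ++i) body(arg);
}

// The compiler's outlined body for `#pragma omp parallel { ... }`.
static void outlined_body(void*) { std::printf("hello from a worker\n"); }

int main() {
    // The same translated call site works unchanged against either backend.
    XompBackend backends[] = {{threads_parallel_run}, {serial_parallel_run}};
    for (const auto& b : backends) b.parallel_run(outlined_body, nullptr, 4);
}
```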

    A Functional Safety OpenMP* for Critical Real-Time Embedded Systems

    OpenMP* has recently gained attention in the embedded domain by virtue of the augmentations introduced in the latest specification. Yet, the language has had minimal impact in the embedded real-time domain, mostly due to the lack of reliability and resiliency mechanisms; as a result, functional safety properties cannot be guaranteed. This paper analyses the latest specification in detail to determine whether and how compliant OpenMP implementations can guarantee functional safety. Given the conclusions drawn from the analysis, the paper describes a set of modifications to the specification, and a set of requirements for compiler and runtime systems, to qualify for safety-critical environments. Through the proposed solution, OpenMP can be used in critical real-time embedded systems without compromising functional safety. This work was funded by the EU project P-SOCRATES (FP7-ICT-2013-10) and the Spanish Ministry of Science and Innovation under contract TIN2015-65316-P.

    Source-to-Source Transformations for Parallel Optimizations in STAPL

    Programs that use the STAPL C++ parallel programming library express their control and data flow explicitly through the use of skeletons. Skeletons can be simple parallel operations like map and reduce, or the result of composing several skeletons. Composition is implemented by tracking the dependencies among individual data elements in the STAPL runtime system. However, the operations and dependencies within a compose skeleton can be determined at compile time from the C++ abstract syntax tree. This enables the use of source-to-source transformations to fuse the composed skeletons. Transformations can also be used to replace skeletons entirely with equivalent code. Both transformations greatly reduce STAPL runtime overhead, and zip fusion also allows a compiler to optimize the work functions as a single unit. We present a Clang compiler plugin, with an accompanying wrapper, that automatically performs these transformations, and we demonstrate its ability to improve performance.
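
    The fusion the abstract describes can be pictured with plain C++ in place of STAPL's actual skeleton API (which this sketch does not reproduce): composing two maps normally materializes an intermediate container and leaves dependence tracking to the runtime, while the fused form a source-to-source pass can emit does neither.

```cpp
// Minimal sketch of zip fusion on composed map skeletons, using plain loops
// rather than STAPL's skeleton API.
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> in{1, 2, 3, 4}, tmp(4), out(4);

    auto f = [](int x) { return x * x; };
    auto g = [](int x) { return x + 1; };

    // Composed form: map(g, map(f, in)) materializes `tmp`, and the runtime
    // tracks per-element dependencies between the two skeletons.
    for (std::size_t i = 0; i < in.size(); ++i) tmp[i] = f(in[i]);
    for (std::size_t i = 0; i < tmp.size(); ++i) out[i] = g(tmp[i]);

    // Fused form: one map over g∘f, no intermediate storage, and both work
    // functions visible to the compiler as a single unit.
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = g(f(in[i]));

    for (int v : out) std::printf("%d ", v);  // prints: 2 5 10 17
}
```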

    Domain-Specific Acceleration and Auto-Parallelization of Legacy Scientific Code in FORTRAN 77 using Source-to-Source Compilation

    Massively parallel accelerators such as GPGPUs, manycores and FPGAs represent a powerful and affordable tool for scientists who look to speed up simulations of complex systems. However, porting code to such devices requires a detailed understanding of heterogeneous programming tools and effective strategies for parallelization. In this paper we present a source-to-source compilation approach with whole-program analysis to automatically transform single-threaded FORTRAN 77 legacy code into OpenCL-accelerated programs with parallelized kernels. The main contributions of our work are: (1) whole-source refactoring to allow any subroutine in the code to be offloaded to an accelerator; (2) minimization of the data transfer between the host and the accelerator by eliminating redundant transfers; and (3) pragmatic auto-parallelization of the code to be offloaded to the accelerator by identification of parallelizable maps and reductions. We have validated the code transformation performance of the compiler on the NIST FORTRAN 78 test suite and several real-world codes: the Large Eddy Simulator for Urban Flows, a high-resolution turbulent flow model; the shallow water component of the ocean model Gmodel; the Linear Baroclinic Model, an atmospheric climate model; and Flexpart-WRF, a particle dispersion simulator. The automatic parallelization component has been tested on a 2-D Shallow Water model (2DSW) and on the Large Eddy Simulator for Urban Flows (UFLES) and produces a complete OpenCL-enabled code base. The fully OpenCL-accelerated versions of the 2DSW and the UFLES are respectively 9x and 20x faster on GPU than the original code on CPU; in both cases this matches the performance of manually ported code.
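
    The two loop shapes the auto-parallelization component identifies can be illustrated as follows, in C++ rather than FORTRAN 77 for consistency with the other sketches in this list; the function names are ours, and the comments paraphrase the criteria implied by the abstract.

```cpp
// Illustration of the "map" and "reduction" patterns an auto-paralleliser
// can detect and offload as OpenCL kernels.
#include <cstddef>

// A map: each iteration writes only its own element and reads nothing
// written by another iteration, so iterations can become OpenCL work-items.
void scale(const float* in, float* out, std::size_t n, float a) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a * in[i];  // no loop-carried dependence
}

// A reduction: iterations combine into one value through an associative
// operator (+), so the loop can be rewritten as a parallel tree reduction.
float total(const float* in, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += in[i];  // the only cross-iteration dependence is on `acc`
    return acc;
}
```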

    Bridging the gap between OpenMP and task-based runtime systems for the fast multipole method

    With the advent of complex modern architectures, the low-level paradigms long considered sufficient to build High Performance Computing (HPC) numerical codes have met their limits. Achieving efficiency and ensuring portability while preserving programming tractability on such hardware prompted the HPC community to design new, higher-level paradigms while relying on runtime systems to maintain performance. However, the common weakness of these projects is to deeply tie applications to specific expert-only runtime system APIs. The OpenMP specification, which aims at providing common parallel programming means for shared-memory platforms, appears as a good candidate to address this issue thanks to the task-based constructs introduced in its revision 4.0. The goal of this paper is to assess the effectiveness and limits of this support for designing a high-performance numerical library, ScalFMM, implementing the fast multipole method (FMM), which we have deeply redesigned with respect to the most advanced features provided by OpenMP 4. We show that OpenMP 4 allows for significant performance improvements over previous OpenMP revisions on recent multicore processors, and that extensions to the 4.0 standard allow for further improvements, bridging the gap with the very high performance that was so far reserved to expert-only runtime system APIs.
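
    A minimal illustration of the OpenMP 4.0 task-based constructs the paper builds on (depend clauses that let the runtime schedule tasks as a dataflow graph) is sketched below. This is not ScalFMM code, just the shape of a two-stage pipeline over blocks of data.

```cpp
// OpenMP 4.0 tasks with depend clauses: the runtime derives a task graph
// from declared in/out dependences. Compile with: g++ -fopenmp example.cpp
#include <cstdio>

int main() {
    const int nblocks = 4;
    double block[nblocks] = {1, 2, 3, 4};
    double partial[nblocks];

    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < nblocks; ++i) {
            // Stage 1: independent per-block work (may run in parallel).
            #pragma omp task depend(in: block[i]) depend(out: partial[i])
            partial[i] = block[i] * block[i];

            // Stage 2: consumes stage 1's output; the runtime schedules it
            // only once the matching `out` dependence is satisfied.
            #pragma omp task depend(in: partial[i])
            std::printf("block %d -> %f\n", i, partial[i]);
        }
    }  // implicit barrier: all tasks complete here
}
```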

    Bridging the performance gap between OpenMP 4.0 and runtime systems for the fast multipole method

    With the advent of complex modern architectures, the low-level paradigms long considered sufficient to build High Performance Computing (HPC) numerical codes have met their limits. Achieving efficiency and ensuring portability while preserving programming tractability on such hardware prompted the HPC community to design new, higher-level paradigms. The successful ports of fully-featured numerical libraries on several recent runtime system proposals have indeed shown the benefit of task-based parallelism models in terms of performance portability on complex platforms. However, the common weakness of these projects is to deeply tie applications to specific expert-only runtime system APIs. The OpenMP specification, which aims at providing a common parallel programming means for shared-memory platforms, appears as a good candidate to address this issue thanks to the task-based constructs introduced as part of its revision 4.0. The goal of this paper is to assess the effectiveness and limits of this support for designing a high-performance numerical library. We illustrate our discussion with the ScalFMM library, which implements state-of-the-art fast multipole method (FMM) algorithms and which we have deeply redesigned with respect to the most advanced features provided by OpenMP 4. We show that OpenMP 4 allows for significant performance improvements over previous OpenMP revisions on recent multicore processors. We furthermore propose extensions to the OpenMP 4 standard and show how they can enhance FMM performance. To assess our statement, we have implemented this support within the Klang-Omp source-to-source compiler, which translates OpenMP directives into calls to the StarPU task-based runtime system. This study shows that we can take advantage of the advanced capabilities of a fully-featured runtime system without resorting to a specific, native runtime port, hence bridging the gap between the OpenMP standard and the very high performance that was so far reserved to expert-only runtime system APIs.

    X-Kaapi: a Multi Paradigm Runtime for Multicore Architectures

    The paper presents X-Kaapi, a compact runtime for multicore architectures that brings multiple parallel paradigms (parallel independent loops, fork-join tasks and dataflow tasks) into a unified framework without performance penalty. Comparisons on independent loops with OpenMP and on dense linear algebra with QUARK/PLASMA confirm our design decisions. Applied to EUROPLEXUS, an industrial simulation code for fast transient dynamics, we show that X-Kaapi achieves high speedups on multicore architectures by efficiently parallelizing both independent loops and dataflow tasks.
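
    The three paradigms the abstract says X-Kaapi unifies can be sketched with standard OpenMP constructs (this sketch does not reproduce X-Kaapi's own API): an independent loop, a fork-join recursion, and a pair of dataflow tasks.

```cpp
// The three paradigms, illustrated with OpenMP rather than X-Kaapi's API.
// Compile with: g++ -fopenmp example.cpp
#include <cstdio>

// Fork-join tasks: recursive parallel decomposition with a join point.
long fib(int n) {
    if (n < 2) return n;
    long a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    b = fib(n - 2);
    #pragma omp taskwait  // join: wait for the child task
    return a + b;
}

int main() {
    double x[4] = {1, 2, 3, 4}, y[4];

    // Parallel independent loop: iterations carry no dependences.
    #pragma omp parallel for
    for (int i = 0; i < 4; ++i) y[i] = 2.0 * x[i];

    #pragma omp parallel
    #pragma omp single
    {
        std::printf("fib(20) = %ld\n", fib(20));

        // Dataflow tasks: the consumer starts only once y[0] is produced.
        #pragma omp task depend(out: y[0])
        y[0] = x[0] + x[1];
        #pragma omp task depend(in: y[0])
        std::printf("consumed %f\n", y[0]);
    }
}
```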