
    Beyond shared memory loop parallelism in the polyhedral model

    With the introduction of multi-core processors, motivated by power and energy concerns, parallel processing has become mainstream. Parallel programming is much more difficult than sequential programming because of its non-deterministic nature and the bugs that arise from non-determinacy. One solution is automatic parallelization, where it is entirely up to the compiler to efficiently parallelize sequential programs. However, automatic parallelization is very difficult, and only a handful of successful techniques are available, even after decades of research. Automatic parallelization for distributed memory architectures is even more problematic in that it requires explicit handling of data partitioning and communication. Since data must be partitioned among multiple nodes that do not share memory, the original memory allocation of sequential programs cannot be used directly. One of the main contributions of this dissertation is the development of techniques for generating distributed memory parallel code with parametric tiling. Our approach builds on important contributions to the polyhedral model, a mathematical framework for reasoning about program transformations. We show that many affine control programs can be uniformized using only simple techniques. Being able to assume uniform dependences significantly simplifies distributed memory code generation and also enables parametric tiling. Our approach is implemented in the AlphaZ system, a system for prototyping analyses, transformations, and code generators in the polyhedral model. The key features of AlphaZ are memory re-allocation and explicit representation of reductions. We evaluate our approach on a collection of polyhedral kernels from the PolyBench suite and show that it scales as well as PLuTo, a state-of-the-art shared memory automatic parallelizer based on the polyhedral model. Automatic parallelization is only one approach to dealing with the non-deterministic nature of parallel programming, and it leaves the difficulty entirely to the compiler. Another approach is to develop novel parallel programming languages. These languages, such as X10, aim to provide a highly productive parallel programming environment by including parallelism in the language design. However, even in these languages, parallel bugs remain an important issue that hinders programmer productivity. Another contribution of this dissertation is to extend array dataflow analysis to handle a subset of X10 programs. We apply the results of the dataflow analysis to statically guarantee determinism. Providing static guarantees can significantly increase programmer productivity by catching questionable implementations at compile time, or even while programming.
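    To make the parametric-tiling idea concrete, here is a minimal C sketch (our illustration, not AlphaZ output; the kernel and the name `compute_tiled` are assumptions) of a uniform-dependence loop nest tiled with a tile size `ts` that remains a run-time parameter rather than a compile-time constant:

```c
#include <stdio.h>

#define N 1024
static double A[N][N];

/* Illustrative sketch: a parametrically tiled version of a
 * uniform-dependence loop nest (A[i][j] reads A[i-1][j] and
 * A[i][j-1]). The tile size `ts` is a runtime parameter, which
 * is what distinguishes parametric from fixed-size tiling.
 * Executing the tiles in lexicographic order respects the
 * dependences, since both dependences point "backwards". */
void compute_tiled(int ts)
{
    for (int ti = 1; ti < N; ti += ts)
        for (int tj = 1; tj < N; tj += ts)
            /* Execute one ts x ts tile atomically. */
            for (int i = ti; i < ti + ts && i < N; i++)
                for (int j = tj; j < tj + ts && j < N; j++)
                    A[i][j] = 0.5 * (A[i - 1][j] + A[i][j - 1]);
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = 1.0;
    compute_tiled(64);              /* tile size chosen at run time */
    printf("%f\n", A[N - 1][N - 1]);
    return 0;
}
```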

    Constructive Synthesis of Memory-Intensive Accelerators for FPGA From Nested Loop Kernels

    Automatic translation of non-repetitive OpenMP to MPI

    Cluster platforms with distributed-memory architectures are becoming increasingly available as low-cost solutions for high-performance computing. Delivering a productive programming environment that hides the complexity of clusters and allows writing efficient programs is urgently needed. Despite multiple efforts to provide shared-memory abstraction, message passing (MPI) is still the state-of-the-art programming model for distributed-memory architectures. Writing efficient MPI programs is challenging. In contrast, OpenMP is a shared-memory programming model that is known for its programming productivity. Researchers have introduced automatic source-to-source translation schemes from OpenMP to MPI so that programmers can use OpenMP while targeting clusters. Those schemes limited their focus to OpenMP programs with repetitive communication patterns (where the analysis of communication can be simplified). This dissertation lifts this limitation and presents a novel OpenMP-to-MPI translation scheme that covers OpenMP programs with both repetitive and non-repetitive communication patterns. We target laboratory-size clusters of ten to a hundred nodes (commonly found in research laboratories and small enterprises). With our translation scheme, six non-repetitive and four repetitive OpenMP benchmarks have been efficiently scaled to a cluster of 64 cores. By contrast, the state-of-the-art translator scaled only the four repetitive benchmarks. In addition, our translation scheme was shown to outperform or perform as well as the state-of-the-art translator. We also compare the translation scheme with available hand-coded MPI and Unified Parallel C (UPC) programs.
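    As a rough illustration of what such a translation must accomplish (a hand-written sketch, not the dissertation's generated code), consider an OpenMP reduction loop and an MPI counterpart: the iteration space is block-distributed across ranks, and the implicit shared-memory reduction becomes an explicit MPI_Reduce:

```c
#include <mpi.h>
#include <stdio.h>

#define N 1000000
static double a[N];

/* Conceptual OpenMP-to-MPI translation of:
 *   #pragma omp parallel for reduction(+:sum)
 *   for (i = 0; i < N; i++) sum += a[i];           */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < N; i++)     /* each rank holds its own copy */
        a[i] = 1.0;

    /* Block distribution of the iteration space among ranks. */
    int lo = rank * N / size;
    int hi = (rank + 1) * N / size;

    double local = 0.0, sum = 0.0;
    for (int i = lo; i < hi; i++)
        local += a[i];

    /* Explicit communication replaces the shared-memory view. */
    MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %f\n", sum);

    MPI_Finalize();
    return 0;
}
```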

    LLOV: A Fast Static Data-Race Checker for OpenMP Programs

    In the era of exascale computing, writing efficient parallel programs is indispensable; at the same time, writing sound parallel programs is very difficult. Specifying parallelism with frameworks such as OpenMP is relatively easy, but data races in these programs are an important source of bugs. In this paper, we propose LLOV, a fast, lightweight, language-agnostic, static data race checker for OpenMP programs based on the LLVM compiler framework. We compare LLOV with other state-of-the-art data race checkers on a variety of well-established benchmarks. We show that the precision, accuracy, and F1 score of LLOV are comparable to those of other checkers while being orders of magnitude faster. To the best of our knowledge, LLOV is the only tool among the state-of-the-art data race checkers that can verify a C/C++ or FORTRAN program to be data race free.
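    For intuition, the following self-contained C/OpenMP fragment shows the kind of defect a static checker like LLOV is designed to flag (the example is ours, not taken from the paper): a parallel loop that updates a shared accumulator without a reduction clause, next to the race-free version:

```c
#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N], sum = 0.0;
    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    #pragma omp parallel for        /* racy: missing reduction(+:sum) */
    for (int i = 0; i < N; i++)
        sum += a[i];                /* unsynchronized read-modify-write */

    /* Race-free version: the reduction is declared explicitly. */
    double ok = 0.0;
    #pragma omp parallel for reduction(+:ok)
    for (int i = 0; i < N; i++)
        ok += a[i];

    printf("racy=%f correct=%f\n", sum, ok);
    return 0;
}
```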

    Liveness Analysis in Explicitly-Parallel Programs

    In this paper, we revisit scalar and array element-wise liveness analysis for programs with parallel specifications. In earlier work on memory allocation/contraction (register allocation or intra- and inter-array reuse in the polyhedral model), a notion of "time", a total order among the iteration points, was used to compute the liveness of values. In general, the execution of parallel programs is not a total order, and hence the notion of time is not applicable. We first revise how conflicts are computed, using ideas from liveness analysis for register allocation and studying the structure of the corresponding conflict/interference graphs. Instead of considering the conflict between two live ranges, we only consider the conflict between a live range and a write. This simplifies the formulation from four instances involved in the test down to three, and also improves the precision of the analysis in the general case. We then extend the liveness analysis to work with partial orders so that it can be applied to many different parallel languages/specifications with different forms of parallelism. An important result is that the complement of the conflict graph with partial orders is directly connected to memory reuse, even in the presence of races. However, programs with conditionals do not always define a partial order, and our next step will be to handle such cases more accurately.
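    A small C sketch (our illustration, not from the paper) of why element-wise liveness matters for memory reuse: each A[i] below is live only between its write and its single read in the next iteration, so once the analysis proves the live ranges never conflict, the whole array can be contracted to two scalars:

```c
#include <stdio.h>

#define N 100

int main(void)
{
    /* Original: each A[i] is live only from iteration i to i+1. */
    double A[N];
    A[0] = 1.0;
    for (int i = 1; i < N; i++)
        A[i] = A[i - 1] * 0.5 + 1.0;
    double out1 = A[N - 1];

    /* Contracted: keep only the values that are still live. */
    double prev = 1.0, cur = prev;
    for (int i = 1; i < N; i++) {
        cur = prev * 0.5 + 1.0;
        prev = cur;
    }
    printf("%f %f\n", out1, cur);   /* identical results */
    return 0;
}
```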

    On Polynomial Code Generation

    In static analysis, one often has to deal with polynomials in the program control variables, either native to the source code or created by enabling analyses. We have explained elsewhere how to compute dependences in such situations and use them to build polynomial schedules. It remains to explain how to generate polynomial code. The present proposal is to target new parallel programming languages of the async/finish family, like X10 or Habanero, which are "polynomial friendly" and for which efficient compilers exist. Both languages have barrier-like constructs (clocks for X10 and phasers for Habanero) which may be used to synchronize activities. To understand the behaviour of a clocked program, one has to count the number of clock advance operations since the creation of each activity. Advances with equal counts are synchronized, and these counts may be polynomials. The trick is therefore to ensure that, before executing an operation, its activity has executed as many advances as the current value of its schedule. This can be obtained by inserting auxiliary loops that execute the necessary advances. This scheme fails if the schedule is not monotonically increasing with respect to the execution order in each activity. This problem may be solved by reordering the activities (which is possible, since the real execution order is given by the schedule) or, in extreme cases, by index set splitting.
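    The advance-counting scheme can be mimicked in plain C, with a POSIX barrier standing in for an X10 clock (a hedged sketch; the polynomial schedule i*i, the thread count, and all names are our assumptions): before running its i-th operation, each activity issues auxiliary advances until its advance count reaches the scheduled step:

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 2
#define STEPS    8

static pthread_barrier_t clock_bar;  /* plays the role of the clock */

/* Auxiliary loop: advance the clock until this activity's count
 * equals the scheduled step t. */
static void advance_until(int *count, int t)
{
    while (*count < t) {
        pthread_barrier_wait(&clock_bar);
        (*count)++;
    }
}

static void *activity(void *arg)
{
    int id = *(int *)arg, count = 0;
    for (int i = 0; i < STEPS; i++) {
        int t = i * i;               /* a polynomial schedule, e.g. i^2 */
        advance_until(&count, t);    /* monotone in i, so no deadlock */
        printf("activity %d runs op %d at step %d\n", id, i, count);
    }
    return NULL;
}

int main(void)
{
    pthread_t th[NTHREADS];
    int ids[NTHREADS];
    pthread_barrier_init(&clock_bar, NULL, NTHREADS);
    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&th[i], NULL, activity, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);
    pthread_barrier_destroy(&clock_bar);
    return 0;
}
```

    Note that the sketch works only because every activity has the same, monotonically increasing schedule, so all of them execute the same number of barrier waits; this is exactly the monotonicity condition discussed above.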

    Contributions to Program Optimization and High-Level Circuit Synthesis

    Since the end of Dennard scaling, power efficiency has been the limiting factor for large-scale computing. Hardware accelerators such as reconfigurable circuits (FPGA, CGRA) or graphics processing units (GPUs) were introduced to improve performance under a limited energy budget, resulting in complex heterogeneous platforms. This document presents a synthetic description of my research activities over the last decade on compilers for high-performance computing and on high-level synthesis of circuits (HLS) for FPGA accelerators. Specifically, my contributions cover both theoretical and practical aspects of automatic parallelization and HLS in a general theoretical framework called the polyhedral model.

    A first chapter describes our contributions to loop tiling, a key program transformation for automatic parallelization which splits the computation into atomic blocks called tiles. We rephrase loop tiling in the polyhedral model to enable any polyhedral tile shape whose size depends on a single parameter (monoparametric tiling), and we present a tiling transformation for programs with reductions, i.e., accumulations with respect to an associative and commutative operator. Our results open the way for semantic program transformations: transformations which do not preserve the exact computation but still lead to an equivalent program.

    A second chapter describes our contributions to algorithm recognition. A compiler optimization will never replace a good algorithm, hence the idea of recognizing algorithm instances in a program and substituting them with a call to a high-performance library whenever it is possible and profitable. In our PhD thesis, we addressed the recognition of templates (functions with first-order variables) in programs and its application to program optimization. We propose a complementary algorithm recognition framework which leverages our monoparametric tiling and reduction tiling transformations. This automates semantic tiling, a new semantic program transformation which increases the grain of operators (scalar → matrix).

    A third chapter presents our contributions to the synthesis of communications with an off-chip memory in the context of high-level circuit synthesis (HLS). We propose an execution model based on loop tiling, a pipelined architecture, and a source-level compilation algorithm which, connected to the C2H HLS tool from Altera, produces an FPGA configuration with minimized data transfers. Our compilation algorithm is optimal: data are loaded as late as possible and stored as soon as possible, with maximal reuse.

    A fourth chapter presents our contributions to a unified polyhedral compilation model for high-level circuit synthesis. We present Data-aware Process Networks (DPN), a dataflow intermediate representation which leverages the ideas developed in chapter 3 to make the data transfers with an off-chip memory explicit. We propose an algorithm to compile a DPN from a sequential program, and we present our contributions to the synthesis of a DPN into a circuit. In particular, we present our algorithms to compile the control, the channels, and the synchronizations of a DPN. These results are used in the production compiler of the XtremLogic start-up.
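    A minimal C sketch of the monoparametric idea (illustrative, not from the manuscript; the kernel and names are our assumptions): every tile dimension is a fixed multiple of a single size parameter b, so varying b scales the b x 2b tile shape homothetically:

```c
#include <stdio.h>

#define N 512
static float A[N][N];

/* Monoparametric tiling sketch: all tile dimensions are fixed
 * multiples of one parameter b (here b x 2b), so the tile shape
 * is scaled homothetically as b varies. */
void kernel_monoparametric(int b)
{
    for (int ti = 0; ti < N; ti += b)          /* tile height: b  */
        for (int tj = 0; tj < N; tj += 2 * b)  /* tile width: 2*b */
            for (int i = ti; i < ti + b && i < N; i++)
                for (int j = tj; j < tj + 2 * b && j < N; j++)
                    A[i][j] = A[i][j] * 0.5f + 1.0f;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = 1.0f;
    kernel_monoparametric(32);  /* one parameter fixes all tile sizes */
    printf("%f\n", A[N - 1][N - 1]);
    return 0;
}
```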