167 research outputs found

    Discovery of Potential Parallelism in Sequential Programs

    Get PDF
    In the era of multicore processors, the responsibility for performance gains has been shifted onto software developers. Once improvements of the sequential algorithm have been exhausted, software-managed parallelism is the only option left. However, writing parallel code is still difficult, especially when parallelizing sequential code written by someone else. A key task in this process is the identification of suitable parallelization targets in the source code. Parallelism discovery tools help developers to find such targets automatically. Unfortunately, tools that identify parallelism during compilation are usually conservative due to the lack of runtime information, and tools relying on runtime information primarily suffer from high overhead in terms of both time and memory. This dissertation presents a generic framework for parallelism discovery based on dynamic program analysis, supporting various types of parallelism while incurring practically affordable overhead. The framework contains two main components: an efficient data-dependence profiler and a set of parallelism discovery algorithms based on a language-independent concept called Computational Unit. The data-dependence profiler serves as the foundation of the parallelism discovery framework. Traditional dependence profiling approaches introduce a tremendous amount of time and memory overhead. To lower the overhead, current methods limit their scope to the subset of the dependence information needed for the analysis they have been created for, sacrificing generality and discouraging reuse. In contrast, the profiler shown in this thesis addresses the problem via signature-based memory management and a lock-free parallel design. It produces detailed dependences not only for sequential but also for multi-threaded code without causing prohibitive overhead, allowing it to serve as a generic base for various program analysis techniques. Computational Units (CUs) provide a language-independent foundation for parallelism discovery. CUs are computations that follow the read-compute-write pattern. Unlike other concepts, they are not restricted to predefined language constructs. A program is represented as a CU graph, in which vertexes are CUs and edges are data dependences. This allows parallelism to be detected that spreads across multiple language constructs, taking code refactoring into consideration. The parallelism discovery algorithms cover both loop and task parallelism. Results of our experiments show that 1) the efficient data-dependence profiler has a very competitive average slowdown of around 80Ă— with accuracy higher than 99.6%; 2) the framework discovers parallelism with high accuracy, identifying 92.5% of the parallel loops in NAS benchmarks; 3) when parallelizing well-known open-source software following the outputs of the framework, reasonable speedups are obtained. Finally, use cases beyond parallelism discovery are briefly demonstrated to show the generality of the framework

    Contributions à l'optimisation de programmes et à la synthèse de circuits haut-niveau

    Get PDF
    Since the end of Dennard scaling, power efficiency is the limiting factor for large-scale computing. Hardware accelerators such as reconfigurable circuits (FPGA, CGRA) or Graphics Processing Units (GPUs) were introduced to improve the performance under a limited energy budget, resulting into complex heterogeneous platforms. This document presents a synthetic description of my research activities over the last decade on compilers for high-performance computing and high-level synthesis of circuits (HLS) for FPGA accelerators. Specifically, my contributions covers both theoretical and practical aspects of automatic parallelization and HLS in a general theoretical framework called the polyhedral model.A first chapter describes our contributions to loop tiling, a key program transformation for automatic parallelization which splits the computation atomic blocks called tiles.We rephrase loop tiling in the polyhedral model to enable any polyhedral tile shape whose size depends on a single parameter (monoparametric tiling), and we present a tiling transformation for programs with reductions – accumulations w.r.t. an associative/commutative operator. Our results open the way for semantic program transformations ; program transformations which does not preserve the computation but still lead to an equivalent program.A second chapter describes our contributions to algorithm recognition. A compiler optimization will never replace a good algorithm, hence the idea to recognize algorithm instances in a program and to substitute them by a call to a performance library. In our PhD thesis, we have addressed the recognition of templates – functionswith first-order variables – into programs and its application to program optimization. We propose a complementary algorithm recognition framework which leverages our monoparametric tiling and our reduction tiling transformations. This automates semantic tiling, a new semantic program transformation which increases the grain of operators (scalar → matrix).A third chapter presents our contributions to the synthesis of communications with an off-chip memory in the context of high-level circuit synthesis (HLS). We propose an execution model based on loop tiling, a pipelined architecture and a source-level compilation algorithm which, connected to the C2H HLS tool from Altera, ends up to a FPGA configuration with minimized data transfers. Our compilation algorithm is optimal – the data are loaded as late as possible and stored as soon as possible with a maximal reuse.A fourth chapter presents our contributions to design a unified polyhedral compilation model for high-level circuit synthesis.We present the Data-aware Process Networks (DPN), a dataflow intermediate representation which leverages the ideas developed in chapter 3 to explicit the data transfers with an off-chip memory. We propose an algorithm to compile a DPN from a sequential program, and we present our contribution to the synthesis of DPN to a circuit. In particular, we present our algorithms to compile the control, the channels and the synchronizations of a DPN. These results are used in the production compiler of the Xtremlogic start-up.Depuis la fin du Dennard scaling, l’efficacité énergétique est le facteur limitant pour le calcul haute performance. Les accélérateurs matériels comme les circuits reconfigurables (FPGA, CGRA) ou les accélérateurs graphiques (GPUs) ont été introduits pour améliorer les performances sous un budget énergétique limité, menant à des plateformes hétérogènes complexes.Mes travaux de recherche portent sur les compilateurs et la synthèse de circuits haut-niveau (High-Level Synthesis, HLS) pour le calcul haute-performance. Specifiquement, mes contributions couvrent les aspects théoriques etpratiques de la parallélisation automatique et la HLS dans le cadre général du modèle polyédrique.Un premier chapitre décrit mes contributions au tuilage de boucles, une transformation fondamentale pour la parallélisation automatique, qui découpe le calcul en sous-calculs atomiques appelés tuiles. Nous reformulons le tuilage de boucles dans le modèle polyédrique pour permettre n’importe tuile polytopique dont la taille dépend d’un facteur homothétique (tuilage monoparamétrique), et nous décrivons une transformation de tuilage pour des programmes avec des réductions – une accumulation selon un opérateur associative et commutatif. Nos résultats ouvrent la voie à des transformations de programme sémantiques ; qui ne préservent pas le calcul, mais produisent un programme équivalent.Un second chapitre décrit mes contributions à la reconnaissance d’algorithmes. Une optimisation de compilateur ne remplacera jamais un bon algorithme, d’où l’idée de reconnaître les instances d’un algorithme dans un programme et de les substituer par un appel vers une bibliothèque hauteperformance, chaque fois que c’est possible et utile.Dans notre thèse, nous avons traité la reconnaissance de templates – des fonctions avec des variables d’ordre 1 – dans un programme et son application à l’optimisation de programes. Nous proposons une approche complémentaire qui s’appuie sur notre tuilage monoparamétrique complété par une transformation pour tuiler les réductions. Ceci automatise le tuilage sémantique, une nouvelle transformation sémantique qui augmente le grain des opérateurs (scalaire → matrice).Un troisième chapitre présente mes contributions à la synthèse des communications avec une mémoire off-chip dans le contexte de la synthèse de circuits haut-niveau. Nous proposons un modèle d’exécution basé sur le tuilage de boucles, une architecture pipelinée et un algorithme de compilation source-à-source qui, connecté à l’outil de HLS C2H d’Altera, produit une configuration de circuit FPGA qui réalise un volume minimal de transferts de données. Notre algorithme est optimal – les données sont chargées le plus tard possible et stockées le plus tôt possible, avec une réutilisation maximale et sans redondances.Enfin, un quatrième chapitre présente mes contributions pour construire un modèle de compilation polyédrique unifié pour la synthèse de circuits haut-niveau.Nous présentons les réseaux de processus DPN (Data-aware Process Networks), une représentation intermédiaire dataflow qui s’appuie sur les idées développées au chapitre 3 pour expliciter les transferts de données entre le circuit et la mémoire off-chip. Nous proposons une suite d’algorithmes pour compiler un DPN à partir d’un programme séquentiel et nous présentons nos contributions à la synthèse d’un DPN en circuit. En particulier, nous présentons nos algorithmes pour compiler le contrôle, les canaux et les synchronisations d’un DPN. Ces résultats sont utilisés dans le compilateur de production de la start-up XtremLogic

    Proceedings of the 3rd International Workshop on Polyhedral Compilation Techniques

    Get PDF
    IMPACT 2013 in Berlin, Germany (in conjuction with HiPEAC 2013) is the third workshop in a series of international workshops on polyhedral compilation techniques. The previous workshops were held in Chamonix, France (2011) in conjuction with CGO 2011 and Paris, France (2012) in conjuction with HiPEAC 2012

    Optimizing work stealing algorithms with scheduling constraints

    Get PDF
    The fork-join paradigm of concurrent expression has gained popularity in conjunction with work-stealing schedulers. Random work-stealing schedulers have been shown to effectively perform dynamic load balancing, yielding provably-efficient schedules and space bounds on shared-memory architectures with uniform memory models. However, the advent of hierarchical, non-uniform multicore systems and large-scale distributed-memory architectures has reduced the efficacy of these scheduling policies. Furthermore, random work stealing schedulers do not exploit persistence within iterative, scientific applications. In this thesis, we prove several properties of work-stealing schedulers that enable online tracing of the tasks with very low overhead. We then describe new scheduling policies that use online schedule introspection to understand scheduler placement and thus improve the performance on NUMA and distributed-memory architectures. Finally, by incorporating an inclusive data effect system into fork--join programs with schedule placement knowledge, we show how we can transform a fork-join program to significantly improve locality

    Estimation and Optimization of the Performance of Polyhedral Process Networks

    Get PDF
    A system-level design methodology such as Daedalus provides designers with a forward synthesis flow for automated design, programming, and implementation of multiprocessor systems-on-chip. Daedalus employs the polyhedral process network model of computation to represent applications. These networks are automatically derived from sequential C code. A forward synthesis flow greatly increases designer productivity. Still, the designer needs to perform a time-consuming forward synthesis step to learn if a network satisfies his performance constraints. Furthermore, it is not trivial to select a set of transformations and transformation parameters for a network such that performance requirements are met. A forward synthesis flow thus solves only part of a design problem, as it does not provide fast feedback on the transformations a designer should apply to meet his performance constraints. This dissertation intro duces different performance estimation techniques for polyhedral process networks. The most promising technique is the profiling-based cprof technique that works directly on the sequential application code. This makes cprof an easy-to-use, robust, and fast technique, without the need to derive a polyhedral process network. This dissertation then discusses four transformations and analyzes factors that affect the efficacy of each transformation.Computer Systems, Imagery and Medi

    Heterogeneous parallel virtual machine: A portable program representation and compiler for performance and energy optimizations on heterogeneous parallel systems

    Get PDF
    Programming heterogeneous parallel systems, such as the SoCs (System-on-Chip) on mobile and edge devices is extremely difficult; the diverse parallel hardware they contain exposes vastly different hardware instruction sets, parallelism models and memory systems. Moreover, a wide range of diverse hardware and software approximation techniques are available for applications targeting heterogeneous SoCs, further exacerbating the programmability challenges. In this thesis, we alleviate the programmability challenges of such systems using flexible compiler intermediate representation solutions, in order to benefit from the performance and superior energy efficiency of heterogeneous systems. First, we develop Heterogeneous Parallel Virtual Machine (HPVM), a parallel program representation for heterogeneous systems, designed to enable functional and performance portability across popular parallel hardware. HPVM is based on a hierarchical dataflow graph with side effects. HPVM successfully supports three important capabilities for programming heterogeneous systems: a compiler intermediate representation (IR), a virtual instruction set (ISA), and a basis for runtime scheduling. We use the HPVM representation to implement an HPVM prototype, defining the HPVM IR as an extension of the Low Level Virtual Machine (LLVM) IR. Our results show comparable performance with optimized OpenCL kernels for the target hardware from a single HPVM representation using translators from HPVM virtual ISA to native code, IR optimizations operating directly on the HPVM representation, and the capability for supporting flexible runtime scheduling schemes from a single HPVM representation. We extend HPVM to ApproxHPVM, introducing hardware-independent approximation metrics in the IR to enable maintaining accuracy information at the IR level and mapping of application-level end-to-end quality metrics to system level "knobs". The approximation metrics quantify the acceptable accuracy loss for individual computations. Application programmers only need to specify high-level, and end-to-end, quality metrics, instead of detailed parameters for individual approximation methods. The ApproxHPVM system then automatically tunes the accuracy requirements of individual computations and maps them to approximate hardware when possible. ApproxHPVM results show significant performance and energy improvements for popular deep learning benchmarks. Finally, we extend to ApproxHPVM to ApproxTuner, a compiler and runtime system for approximation. ApproxTuner extends ApproxHPVM with a wide range of hardware and software approximation techniques. It uses a three step approximation tuning strategy, a combination of development-time, install-time, and dynamic tuning. Our strategy ensures software portability, even though approximations have highly hardware-dependent performance, and enables efficient dynamic approximation tuning despite the expensive offline steps. ApproxTuner results show significant performance and energy improvements across 7 Deep Neural Networks and 3 image processing benchmarks, and ensures that high-level end-to-end quality specifications are satisfied during adaptive approximation tuning

    Techniques d'exploration architecturale de design à usage spécifique pour l'accélération de boucles

    Get PDF
    RÉSUMÉ De nos jours, les industriels privilégient les architectures flexibles afin de réduire le temps et les coûts de conception d’un système. Les processeurs à usage spécifique (ASIP) fournissent beaucoup de flexibilité, tout en atteignant des performances élevées. Une tendance qui a de plus en plus de succès dans le processus de conception d’un système sur puce consiste à spécifier le comportement du système en langage évolué tel que le C, SystemC, etc. La spécification est ensuite utilisée durant le partitionement pour déterminer les composantes logicielles et matérielles du système. Avec la maturité des générateurs automatiques de ASIP, les concepteurs peuvent rajouter dans leurs boîtes à outils un nouveau type d’architecture, à savoir les ASIP, en sachant que ces derniers sont conçus à partir d’une spécification décrite en langage évolué. D’un autre côté, dans le monde matériel, et cela depuis très longtemps, les chercheurs ont vu l’avantage de baser le processus de conception sur un langage évolué. Cette recherche a abouti à l’avénement de générateurs automatiques de matériel sur le marché qui sont des outils d’aide à la conception comme CapatultC, Forte’s Cynthetizer, etc. Ainsi, avec tous ces outils basés sur le langage C, les concepteurs ont un choix de types de design élargi mais, d’un autre côté, les options de designs possibles explosent, ce qui peut allonger au lieu de réduire le temps de conception. C’est dans ce cadre que notre thèse doctorale s’inscrit, puisqu’elle présente des méthodologies d’exploration architecturale de design à usage spécifique pour l’accélération de boucles afin de réduire le temps de conception, entre autres. Cette thèse a débuté par l’exploration de designs de ASIP. Les boucles de traitement sont de bonnes candidates à l’accélération, si elles comportent de bonnes possibilités de parallélisme et si ces dernières sont bien exploitées. Le matériel est très efficace à profiter des possibilités de parallélisme au niveau instruction, donc, une méthode de conception a été proposée. Cette dernière extrait le parallélisme d’une boucle afin d’exécuter plus d’opérations concurrentes dans des instructions spécialisées. Notre méthode se base aussi sur l’optimisation des données dans l’architecture du processeur.---------- ABSTRACT Time to market is a very important concern in industry. That is why the industry always looks for new CAD tools that contribute to reducing design time. Application-specific instruction-set processors (ASIPs) provide flexibility and they allow reaching good performance if they are well designed. One trend that gains more and more success is C-based design that uses a high level language such as C, SystemC, etc. The C-based specification is used during the partitionning phase to determine the software and hardware components of the system. Since automatic processor generators are mature now, designers have a new type of tool they can rely on during architecture design. In the hardware world, high level synthesis was and is still a hot research topic. The advances in ESL lead to commercial high-level synthesis tools such as CapatultC, Forte’s Cynthetizer, etc. The designers have more tools in their box but they have more solutions to explore, thus their use can have a reverse effect since the design time can increase instead of being reduced. Our doctoral research tackles this issue by proposing new methodologies for design space exploration of application specific architecture for loop acceleration in order to reduce the design time while reaching some targeted performances. Our thesis starts with the exploration of ASIP design. We propose a method that targets loop acceleration with highly coupled specialized-instructions executing loop operations. Loops are good candidates for acceleration when the parallelism they offer is well exploited (if they have any parallelization opportunities). Hardware components such as specialized-instructions can leverage parallelization opportunities at low level. Thus, we propose to extract loop parallelization opportunities and to execute more concurrent operations in specialized-instructions. The main contribution of this method is a new approach to specialized-instruction (SI) design based on loop acceleration where loop optimization and transformation are done in SIs directly, instead of optimizing the software code. Another contribution is the design of tightly-coupled specialized-instructions associated with loops based on a 5-pattern representation

    Feedback Driven Annotation and Refactoring of Parallel Programs

    Get PDF
    • …
    corecore