61 research outputs found

    Automatic Generation of Efficient Linear Algebra Programs

    Full text link
    The level of abstraction at which application experts reason about linear algebra computations and the level of abstraction used by developers of high-performance numerical linear algebra libraries do not match. The former is conveniently captured by high-level languages and libraries such as Matlab and Eigen, while the latter expresses the kernels included in the BLAS and LAPACK libraries. Unfortunately, the translation from a high-level computation to an efficient sequence of kernels is a task, far from trivial, that requires extensive knowledge of both linear algebra and high-performance computing. Internally, almost all high-level languages and libraries use efficient kernels; however, the translation algorithms are too simplistic and thus lead to a suboptimal use of said kernels, with significant performance losses. In order to both achieve the productivity that comes with high-level languages, and make use of the efficiency of low level kernels, we are developing Linnea, a code generator for linear algebra problems. As input, Linnea takes a high-level description of a linear algebra problem and produces as output an efficient sequence of calls to high-performance kernels. In 25 application problems, the code generated by Linnea always outperforms Matlab, Julia, Eigen and Armadillo, with speedups up to and exceeding 10x

    Contributions à l'optimisation de programmes et à la synthèse de circuits haut-niveau

    Get PDF
    Since the end of Dennard scaling, power efficiency is the limiting factor for large-scale computing. Hardware accelerators such as reconfigurable circuits (FPGA, CGRA) or Graphics Processing Units (GPUs) were introduced to improve the performance under a limited energy budget, resulting into complex heterogeneous platforms. This document presents a synthetic description of my research activities over the last decade on compilers for high-performance computing and high-level synthesis of circuits (HLS) for FPGA accelerators. Specifically, my contributions covers both theoretical and practical aspects of automatic parallelization and HLS in a general theoretical framework called the polyhedral model.A first chapter describes our contributions to loop tiling, a key program transformation for automatic parallelization which splits the computation atomic blocks called tiles.We rephrase loop tiling in the polyhedral model to enable any polyhedral tile shape whose size depends on a single parameter (monoparametric tiling), and we present a tiling transformation for programs with reductions – accumulations w.r.t. an associative/commutative operator. Our results open the way for semantic program transformations ; program transformations which does not preserve the computation but still lead to an equivalent program.A second chapter describes our contributions to algorithm recognition. A compiler optimization will never replace a good algorithm, hence the idea to recognize algorithm instances in a program and to substitute them by a call to a performance library. In our PhD thesis, we have addressed the recognition of templates – functionswith first-order variables – into programs and its application to program optimization. We propose a complementary algorithm recognition framework which leverages our monoparametric tiling and our reduction tiling transformations. This automates semantic tiling, a new semantic program transformation which increases the grain of operators (scalar → matrix).A third chapter presents our contributions to the synthesis of communications with an off-chip memory in the context of high-level circuit synthesis (HLS). We propose an execution model based on loop tiling, a pipelined architecture and a source-level compilation algorithm which, connected to the C2H HLS tool from Altera, ends up to a FPGA configuration with minimized data transfers. Our compilation algorithm is optimal – the data are loaded as late as possible and stored as soon as possible with a maximal reuse.A fourth chapter presents our contributions to design a unified polyhedral compilation model for high-level circuit synthesis.We present the Data-aware Process Networks (DPN), a dataflow intermediate representation which leverages the ideas developed in chapter 3 to explicit the data transfers with an off-chip memory. We propose an algorithm to compile a DPN from a sequential program, and we present our contribution to the synthesis of DPN to a circuit. In particular, we present our algorithms to compile the control, the channels and the synchronizations of a DPN. These results are used in the production compiler of the Xtremlogic start-up.Depuis la fin du Dennard scaling, l’efficacité énergétique est le facteur limitant pour le calcul haute performance. Les accélérateurs matériels comme les circuits reconfigurables (FPGA, CGRA) ou les accélérateurs graphiques (GPUs) ont été introduits pour améliorer les performances sous un budget énergétique limité, menant à des plateformes hétérogènes complexes.Mes travaux de recherche portent sur les compilateurs et la synthèse de circuits haut-niveau (High-Level Synthesis, HLS) pour le calcul haute-performance. Specifiquement, mes contributions couvrent les aspects théoriques etpratiques de la parallélisation automatique et la HLS dans le cadre général du modèle polyédrique.Un premier chapitre décrit mes contributions au tuilage de boucles, une transformation fondamentale pour la parallélisation automatique, qui découpe le calcul en sous-calculs atomiques appelés tuiles. Nous reformulons le tuilage de boucles dans le modèle polyédrique pour permettre n’importe tuile polytopique dont la taille dépend d’un facteur homothétique (tuilage monoparamétrique), et nous décrivons une transformation de tuilage pour des programmes avec des réductions – une accumulation selon un opérateur associative et commutatif. Nos résultats ouvrent la voie à des transformations de programme sémantiques ; qui ne préservent pas le calcul, mais produisent un programme équivalent.Un second chapitre décrit mes contributions à la reconnaissance d’algorithmes. Une optimisation de compilateur ne remplacera jamais un bon algorithme, d’où l’idée de reconnaître les instances d’un algorithme dans un programme et de les substituer par un appel vers une bibliothèque hauteperformance, chaque fois que c’est possible et utile.Dans notre thèse, nous avons traité la reconnaissance de templates – des fonctions avec des variables d’ordre 1 – dans un programme et son application à l’optimisation de programes. Nous proposons une approche complémentaire qui s’appuie sur notre tuilage monoparamétrique complété par une transformation pour tuiler les réductions. Ceci automatise le tuilage sémantique, une nouvelle transformation sémantique qui augmente le grain des opérateurs (scalaire → matrice).Un troisième chapitre présente mes contributions à la synthèse des communications avec une mémoire off-chip dans le contexte de la synthèse de circuits haut-niveau. Nous proposons un modèle d’exécution basé sur le tuilage de boucles, une architecture pipelinée et un algorithme de compilation source-à-source qui, connecté à l’outil de HLS C2H d’Altera, produit une configuration de circuit FPGA qui réalise un volume minimal de transferts de données. Notre algorithme est optimal – les données sont chargées le plus tard possible et stockées le plus tôt possible, avec une réutilisation maximale et sans redondances.Enfin, un quatrième chapitre présente mes contributions pour construire un modèle de compilation polyédrique unifié pour la synthèse de circuits haut-niveau.Nous présentons les réseaux de processus DPN (Data-aware Process Networks), une représentation intermédiaire dataflow qui s’appuie sur les idées développées au chapitre 3 pour expliciter les transferts de données entre le circuit et la mémoire off-chip. Nous proposons une suite d’algorithmes pour compiler un DPN à partir d’un programme séquentiel et nous présentons nos contributions à la synthèse d’un DPN en circuit. En particulier, nous présentons nos algorithmes pour compiler le contrôle, les canaux et les synchronisations d’un DPN. Ces résultats sont utilisés dans le compilateur de production de la start-up XtremLogic

    Area-power-delay trade-off in logic synthesis

    Get PDF
    This thesis introduces new concepts to perform area-power-delay trade-offs in a logic synthesis system. To achieve this, a new delay model is presented, which gives accurate delay estimations for arbitrary sets of Boolean expressions. This allows use of this delay model already during the very first steps of logic synthesis. Furthermore, new algorithms are presented for a number of different optimization tasks within logic synthesis. There are new algorithms to create prime irredundant Boo lean expressions, to perform technology mapping for use with standard cell generators, and to perform gate sizing. To prove the validity of the presented ideas, benchmark results are given throughout the thesis

    Taylor Expansion Diagrams: A Canonical Representation for Verification of Data Flow Designs

    Full text link

    Automated synthesis of delay-insensitive circuits

    Get PDF

    Synthesizing Structured CAD Models with Equality Saturation and Inverse Transformations

    Full text link
    Recent program synthesis techniques help users customize CAD models(e.g., for 3D printing) by decompiling low-level triangle meshes to Constructive Solid Geometry (CSG) expressions. Without loops or functions, editing CSG can require many coordinated changes, and existing mesh decompilers use heuristics that can obfuscate high-level structure. This paper proposes a second decompilation stage to robustly "shrink" unstructured CSG expressions into more editable programs with map and fold operators. We present Szalinski, a tool that uses Equality Saturation with semantics-preserving CAD rewrites to efficiently search for smaller equivalent programs. Szalinski relies on inverse transformations, a novel way for solvers to speculatively add equivalences to an E-graph. We qualitatively evaluate Szalinski in case studies, show how it composes with an existing mesh decompiler, and demonstrate that Szalinski can shrink large models in seconds.Comment: 14 page

    Multilayered abstractions for partial differential equations

    Get PDF
    How do we build maintainable, robust, and performance-portable scientific applications? This thesis argues that the answer to this software engineering question in the context of the finite element method is through the use of layers of Domain-Specific Languages (DSLs) to separate the various concerns in the engineering of such codes. Performance-portable software achieves high performance on multiple diverse hardware platforms without source code changes. We demonstrate that finite element solvers written in a low-level language are not performance-portable, and therefore code must be specialised to the target architecture by a code generation framework. A prototype compiler for finite element variational forms that generates CUDA code is presented, and is used to explore how good performance on many-core platforms in automatically-generated finite element applications can be achieved. The differing code generation requirements for multi- and many-core platforms motivates the design of an additional abstraction, called PyOP2, that enables unstructured mesh applications to be performance-portable. We present a runtime code generation framework comprised of the Unified Form Language (UFL), the FEniCS Form Compiler, and PyOP2. This toolchain separates the succinct expression of a numerical method from the selection and generation of efficient code for local assembly. This is further decoupled from the selection of data formats and algorithms for efficient parallel implementation on a specific target architecture. We establish the successful separation of these concerns by demonstrating the performance-portability of code generated from a single high-level source code written in UFL across sequential C, CUDA, MPI and OpenMP targets. The performance of the generated code exceeds the performance of comparable alternative toolchains on multi-core architectures.Open Acces

    Compilation and Code Optimization for Data Analytics

    Get PDF
    The trade-offs between the use of modern high-level and low-level programming languages in constructing complex software artifacts are well known. High-level languages allow for greater programmer productivity: abstraction and genericity allow for the same functionality to be implemented with significantly less code compared to low-level languages. Modularity, object-orientation, functional programming, and powerful type systems allow programmers not only to create clean abstractions and protect them from leaking, but also to define code units that are reusable and easily composable, and software architectures that are adaptable and extensible. The abstraction, succinctness, and modularity of high-level code help to avoid software bugs and facilitate debugging and maintenance. The use of high-level languages comes at a performance cost: increased indirection due to abstraction, virtualization, and interpretation, and superfluous work, particularly in the form of tempory memory allocation and deallocation to support objects and encapsulation. As a result of this, the cost of high-level languages for performance-critical systems may seem prohibitive. The vision of abstraction without regret argues that it is possible to use high-level languages for building performance-critical systems that allow for both productivity and high performance, instead of trading off the former for the latter. In this thesis, we realize this vision for building different types of data analytics systems. Our means of achieving this is by employing compilation. The goal is to compile away expensive language features -- to compile high-level code down to efficient low-level code
    • …
    corecore