160 research outputs found

    Compiler Support for Operator Overloading and Algorithmic Differentiation in C++

    Multiphysics software needs derivatives for, e.g., solving a system of non-linear equations, conducting model verification, or sensitivity studies. In C++, algorithmic differentiation (AD) based on operator overloading can be used to calculate derivatives up to machine precision. To that end, the built-in floating-point type is replaced by a user-defined AD type. It overloads all required operators and calculates the original value and the corresponding derivative based on the chain rule of calculus. While changing the underlying type seems straightforward, several complications arise concerning software and performance engineering. These include (1) fundamental language restrictions of C++ w.r.t. user-defined types, (2) type correctness of distributed computations with the Message Passing Interface (MPI) library, and (3) identification and mitigation of AD-induced overheads. To handle these issues, AD experts may spend a significant amount of time to enhance a code with AD, verify the derivatives, and ensure optimal application performance. Hence, in this thesis, we propose a modern compiler-based tooling approach to support and accelerate the AD-enhancement process of C++ target codes. In particular, we make contributions to three aspects of AD. The initial type change - While the change to the AD type in a target code is conceptually straightforward, the type change often leads to a multitude of compiler error messages. This is due to the different treatment of built-in floating-point types and user-defined types by the C++ language standard. Previously legal code constructs in the target code subsequently violate the language standard when the built-in floating-point type is replaced with a user-defined AD type. We identify and classify these problematic code constructs, show their root causes, and propose solutions based on localized source transformation. 
To automate this rather mechanical process, we develop a static code analyser and source transformation tool, called OO-Lint, based on the Clang compiler framework. It flags instances of these problematic code constructs and applies source transformations to make the code compliant with the requirements of the language standard. To show the overall relevance of complications with user-defined types, OO-Lint is applied to several well-known scientific codes, some of which have already been AD-enhanced by others. In all of these applications, except the ones manually treated for AD overloading, problematic code constructs are detected. Type correctness of MPI communication - MPI is the de facto standard for programming high-performance distributed applications. At the same time, MPI has a complex interface whose usage can be error-prone. For instance, MPI derived data types require manual construction by specifying memory locations of the underlying data. Specifying wrong offsets can lead to subtle bugs that are hard to detect. In the context of AD, special libraries exist that handle the required derivative book-keeping by replacing the MPI communication calls with overloaded variants. However, on top of the AD type change, the MPI communication routines have to be changed manually. In addition, the AD type fundamentally changes memory layout assumptions, as it has a different extent than the built-in types. Previously legal layout assumptions thus have to be reverified. As a remedy, to detect any type-related errors, we developed a memory sanitizer tool, called TypeART, based on the LLVM compiler framework and the MPI correctness checker MUST. It tracks all memory allocations relevant to MPI communication to allow for checking the underlying type and extent of the typeless memory buffer address passed to any MPI routine. The overhead induced by TypeART w.r.t. several target applications is manageable. 
AD domain-specific profiling - Applying AD in a black-box manner, without consideration of the target code structure, can have a significant impact on both runtime and memory consumption. An AD expert is usually required to apply further AD-related optimizations to reduce these induced overheads. Traditional profiling techniques are, however, insufficient, as they do not reveal any AD domain-specific metrics. Of interest for AD code optimization are, e.g., specific code patterns, especially on a function level, that can be treated efficiently with AD. To that end, we developed a static profiling tool, called ProAD, based on the LLVM compiler framework. For each function, it generates the computational graph based on the static data flow of the floating-point variables. The framework supports pattern analysis on the computational graph to identify the optimal application of the chain rule. We show the potential of the optimal application of AD with two case studies. In both cases, significant runtime improvements can be achieved when the knowledge of the code structure, provided by our tool, is exploited. For instance, with a stencil code, a speedup factor of about 13 is achieved compared to a naive application of AD, and a factor of 1.2 compared to hand-written derivative code.
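
The type replacement described in this abstract can be illustrated with a minimal forward-mode sketch. It is written in Python for brevity, whereas the thesis targets C++, and it is not taken from the thesis: the names `Dual`, `lift`, `sin`, `f`, and `derivative` are hypothetical. The AD type carries a value and a derivative, and each overloaded operator applies the chain rule.

```python
import math


class Dual:
    """Hypothetical forward-mode AD type: a value paired with its derivative."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(x):
        # Promote plain numbers to constants (derivative zero).
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def sin(x):
    # Overloaded elementary function: d/dx sin(u) = cos(u) * u'
    x = Dual.lift(x)
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)


def f(x):
    # Example target function, written once and usable with float-like types.
    return x * x + 3.0 * sin(x)


def derivative(func, x0):
    # Seed the input derivative with 1 and read off the output derivative.
    return func(Dual(x0, 1.0)).deriv


if __name__ == "__main__":
    print(derivative(f, 2.0))
```

Here `f` itself is unchanged between a plain and a differentiated evaluation; only the type of its argument differs. In a statically typed language such as C++, that same swap is where the language-level complications named in the abstract arise.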

    Discrete adjoints on many cores: Algorithmic differentiation of accelerated fluid simulations

    Simulations are used in science and industry to predict the performance of technical systems. Adjoint derivatives of these simulations can reveal the sensitivity of the system performance to changes in design or operating conditions, and are increasingly used in shape optimisation and uncertainty quantification. Algorithmic differentiation (AD) by source transformation is an efficient method to compute such derivatives. AD requires an analysis of the computation and its data flow to produce efficient adjoint code. One important step is the activity analysis, which detects operations that need to be differentiated. This thesis investigates an improved activity analysis that simplifies build procedures for certain adjoint programs and is demonstrated to improve the speed of an adjoint fluid dynamics solver. The method works by allowing a context-dependent analysis of routines. The ongoing trend towards multi- and many-core architectures such as the Intel Xeon Phi is creating challenges for AD. Two novel approaches are presented that replicate the parallelisation of a program in its corresponding adjoint program. The first approach detects loops that naturally result in a parallelisable adjoint loop, while the second approach uses loop transformation and the aforementioned context-dependent analysis to enforce parallelisable data access in the adjoint loop. A case study shows that both approaches yield adjoints that are as scalable as their underlying primal programs. Adjoint computations are limited by their memory footprint, particularly in unsteady simulations, for which this work presents incomplete checkpointing as a method to reduce memory usage at the cost of a slight reduction in accuracy. Finally, convergence of iterative linear solvers is discussed, which is especially relevant on accelerator cards, where single-precision floating-point numbers are frequently used and the choice of solvers is limited by the small memory size. 
Some problems that are particular to adjoint computations are discussed.
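
The memory pressure of unsteady adjoints mentioned in this abstract can be made concrete with a small sketch. Note that it shows standard, exact uniform checkpointing, not the incomplete checkpointing proposed in the thesis, and that the model problem (explicit Euler for du/dt = -u²) and all function names are assumptions chosen for illustration. The reverse sweep needs the primal states in reverse order; instead of storing all of them, only every `stride`-th state is kept and each segment is recomputed when needed.

```python
def step(u, dt):
    """One explicit Euler step of du/dt = -u**2 (a stand-in model problem)."""
    return u + dt * (-u * u)


def step_adjoint(u, ubar_next, dt):
    """Adjoint of step: multiplies by d(step)/du = 1 - 2*dt*u."""
    return ubar_next * (1.0 - 2.0 * dt * u)


def final_state(u0, n_steps, dt):
    u = u0
    for _ in range(n_steps):
        u = step(u, dt)
    return u


def adjoint_full_storage(u0, n_steps, dt):
    """d(u_N)/d(u_0), storing every primal state (memory grows with n_steps)."""
    states = [u0]
    for _ in range(n_steps):
        states.append(step(states[-1], dt))
    ubar = 1.0
    for t in range(n_steps - 1, -1, -1):
        ubar = step_adjoint(states[t], ubar, dt)
    return ubar


def adjoint_checkpointed(u0, n_steps, dt, stride):
    """Same derivative, storing only every stride-th state."""
    checkpoints = {}
    u = u0
    for t in range(n_steps):
        if t % stride == 0:
            checkpoints[t] = u
        u = step(u, dt)
    ubar = 1.0
    # Walk the segments backwards, recomputing the states inside each one.
    for start in sorted(checkpoints, reverse=True):
        end = min(start + stride, n_steps)
        segment = [checkpoints[start]]
        for _ in range(end - start - 1):
            segment.append(step(segment[-1], dt))
        for u_t in reversed(segment):
            ubar = step_adjoint(u_t, ubar, dt)
    return ubar
```

With this scheme, peak storage drops from `n_steps` states to roughly `n_steps / stride + stride` states, at the price of about one extra forward sweep of recomputation.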


    A long-standing challenge in High-Performance Computing (HPC) is the simultaneous achievement of programmer productivity and hardware computational efficiency. The challenge has been exacerbated by the onset of multi- and many-core CPUs and accelerators. Only a few expert programmers have been able to hand-code domain-specific data transformations and vectorization schemes needed to extract the best possible performance on such architectures. In this research, we examined the possibility of automating these methods by developing a Domain-Specific Language (DSL) framework. Our DSL approach extends C++14 by embedding into it a high-level data-parallel array language, and by using a domain-specific compiler to compile to hybrid-parallel code. We also implemented an array index-space transformation algebra within this high-level array language to manipulate array data-layouts and data-distributions. The compiler introduces a novel method for SIMD auto-vectorization based on array data-layouts. Our new auto-vectorization technique is shown to outperform the default auto-vectorization strategy by up to 40% for stencil computations. The compiler also automates distributed data movement with overlapping of local compute with remote data movement using polyhedral integer set analysis. Along with these main innovations, we developed a new technique using C++ template metaprogramming for developing embedded DSLs using C++. We also proposed a domain-specific compiler intermediate representation that simplifies data flow analysis of abstract DSL constructs. We evaluated our framework by constructing a DSL for the HPC grand-challenge domain of lattice quantum chromodynamics. Our DSL yielded performance gains of up to twice the flop rate over existing production C code for selected kernels. This gain in performance was obtained while using less than one-tenth the lines of code. 
The performance of this DSL was also competitive with the best hand-optimized and hand-vectorized code, and was an order of magnitude better than existing production DSLs.

    A high-performance open-source framework for multiphysics simulation and adjoint-based shape and topology optimization

    The first part of this thesis presents the advances made in the open-source software SU2 towards transforming it into a high-performance framework for design and optimization of multiphysics problems. Through this work, and in collaboration with other authors, a tenfold performance improvement was achieved for some problems. More importantly, problems that had previously been impossible to solve in SU2 can now be used in numerical optimization with shape or topology variables. Furthermore, it is now considerably simpler to study new multiphysics applications and to develop new numerical schemes that take advantage of modern high-performance computing systems. In the second part of this thesis, these capabilities allowed the application of topology optimization to medium-scale fluid-structure interaction problems using high-fidelity models (nonlinear elasticity and Reynolds-averaged Navier-Stokes equations), which had not been done before in the literature. This showed that topology optimization can be used to target aerodynamic objectives by tailoring the interaction between fluid and structure. However, it also made evident the limitations of density-based methods for this type of problem, in particular in reliably converging to discrete solutions. This was overcome with new strategies to both guarantee and accelerate (i.e., reduce the overall computational cost of) the convergence to discrete solutions in fluid-structure interaction problems.

    Advanced Concepts for Automatic Differentiation based on Operator Overloading

    Using the technique of Automatic Differentiation (AD), derivative information can be computed efficiently, and with little effort for the user, for any function that is given as source code in a supported programming language. One basic implementation strategy is based on the concept of operator and function overloading, which is available in many modern programming languages. Through the overloading of operators, an internal representation of the function can be generated at runtime. This so-called tape can then be used for computing derivatives. In the thesis, new techniques are introduced that allow a more efficient tape creation and the parallel evaluation of tapes. Advantages of the new techniques are demonstrated by means of runtime analyses for numerical examples.
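
The tape mechanism described in this abstract can be sketched in a few lines. This is a generic reverse-mode illustration in Python, not the tape format or the techniques of the thesis; `Tape`, `Var`, `sin`, `gradient`, and `example` are hypothetical names. Each overloaded operation appends its local partial derivatives to the tape, and a single backward pass over the tape accumulates the gradient.

```python
import math


class Tape:
    """Records, per operation, a list of (input_index, local_partial) pairs."""

    def __init__(self):
        self.entries = []

    def record(self, partials):
        self.entries.append(partials)
        return len(self.entries) - 1


class Var:
    """Hypothetical active variable; every new Var gets one tape entry."""

    def __init__(self, tape, value, partials=()):
        self.tape = tape
        self.value = value
        self.index = tape.record(list(partials))

    def __add__(self, other):
        return Var(self.tape, self.value + other.value,
                   [(self.index, 1.0), (other.index, 1.0)])

    def __mul__(self, other):
        return Var(self.tape, self.value * other.value,
                   [(self.index, other.value), (other.index, self.value)])


def sin(x):
    return Var(x.tape, math.sin(x.value), [(x.index, math.cos(x.value))])


def gradient(tape, output):
    """Reverse sweep: propagate adjoints from the output back to the inputs."""
    adjoint = [0.0] * len(tape.entries)
    adjoint[output.index] = 1.0
    for i in range(len(tape.entries) - 1, -1, -1):
        for j, partial in tape.entries[i]:
            adjoint[j] += partial * adjoint[i]
    return adjoint


def example(x0, y0):
    """Value and gradient of z = x*y + sin(x)."""
    tape = Tape()
    x, y = Var(tape, x0), Var(tape, y0)
    z = x * y + sin(x)
    adj = gradient(tape, z)
    return z.value, adj[x.index], adj[y.index]
```

For brevity the sketch only handles operations between `Var` objects; mixing in plain floats would need the same kind of promotion step an AD library performs. The tape grows with the number of executed operations, which is why tape creation and evaluation costs matter.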

    Survey of Template-Based Code Generation

    A critical step in model-driven engineering (MDE) is the automatic synthesis of textual artifacts from models. This model transformation is useful for generating application code, serializing models to persistent storage, and generating documentation or reports. Among the various model-to-text transformation paradigms, Template-Based Code Generation (TBCG) is the most popular in MDE. 
TBCG is a synthesis technique that produces code from high-level specifications, called templates. It is a popular technique in MDE given that they both emphasize abstraction and automation. Given the diversity of tools and approaches, it is necessary to classify and compare existing TBCG techniques to provide appropriate support to developers. The goal of this thesis is to better understand the characteristics of TBCG techniques, identify research trends, and assess the importance of the role of MDE in this code synthesis approach. We also evaluate the expressiveness, performance, and scalability of the associated tools based on a range of models that implement critical patterns. To this end, we conduct a systematic mapping study of the literature that provides an overview of TBCG, and a comparative study of TBCG tools to better guide developers in their choices. This study shows that model-based tools offer more expressiveness, whereas code-based tools perform much faster. Xtend2 offers the best compromise between expressiveness and performance.

    Sequence-to-sequence learning for machine translation and automatic differentiation for machine learning software tools

    This thesis consists of a series of articles that contribute to the field of machine learning. In particular, it covers two distinct and loosely related fields. The first three articles consider the use of neural network models for problems in natural language processing (NLP). 
The first article introduces the use of an encoder-decoder structure involving recurrent neural networks (RNNs) to translate from and to variable-length phrases and sentences. The second article contains a quantitative and qualitative analysis of the performance of these "neural machine translation" models, laying bare the difficulties posed by long sentences and rare words. The third article deals with handling rare and out-of-vocabulary words in neural network models by using dictionary coder compression algorithms and multi-scale RNN models. The second half of this thesis does not deal with specific neural network models, but with the software tools and frameworks that can be used to define and train them. Modern deep learning frameworks need to be able to efficiently execute programs involving linear algebra and array programming, while also being able to employ automatic differentiation (AD) in order to calculate a variety of derivatives. The first article of this part provides an overview of the difficulties posed in reconciling these two objectives, and introduces a graph-based intermediate representation that aims to tackle these difficulties. The second article considers a different approach to the same problem, implementing a tape-based source-code transformation approach to AD on a dynamically typed array programming language (Python and NumPy).

    Achieving Highly Reliable Embedded Software: An Empirical Evaluation of Different Approaches
