278 research outputs found
Building Efficient Query Engines in a High-Level Language
Abstraction without regret refers to the vision of using high-level
programming languages for systems development without experiencing a negative
impact on performance. A database system designed according to this vision
offers both increased productivity and high performance, instead of sacrificing
the former for the latter as is the case with existing, monolithic
implementations that are hard to maintain and extend. In this article, we
realize this vision in the domain of analytical query processing. We present
LegoBase, a query engine written in the high-level language Scala. The key
technique to regain efficiency is to apply generative programming: LegoBase
performs source-to-source compilation and optimizes the entire query engine by
converting the high-level Scala code to specialized, low-level C code. We show
how generative programming allows to easily implement a wide spectrum of
optimizations, such as introducing data partitioning or switching from a row to
a column data layout, which are difficult to achieve with existing low-level
query compilers that handle only queries. We demonstrate that sufficiently
powerful abstractions are essential for dealing with the complexity of the
optimization effort, shielding developers from compiler internals and
decoupling individual optimizations from each other. We evaluate our approach
with the TPC-H benchmark and show that: (a) With all optimizations enabled,
LegoBase significantly outperforms a commercial database and an existing query
compiler. (b) Programmers need to provide just a few hundred lines of
high-level code for implementing the optimizations, instead of complicated
low-level code that is required by existing query compilation approaches. (c)
The compilation overhead is low compared to the overall execution time, thus
making our approach usable in practice for compiling query engines
Efficient Implementation of Parametric Polymorphism using Reified Types
Parametric polymorphism is a language feature that lets programmers define code that behaves independently of the types of values it operates on.
Using parametric polymorphism enables code reuse and improves the maintainability of software projects.
The approach that a language implementation uses to support parametric polymorphism can have important performance implications.
One such approach, erasure, converts generic code to non-generic code that uses a uniform representation for generic data.
Erasure is notorious for introducing primitive boxing and other indirections that harm the performance of generic code.
More generally, erasure destroys type information that could be used by the language implementation to optimize generic code.
This thesis presents TASTyTruffle, a new interpreter for the Scala language.
Whereas the standard Scala implementation executes erased Java Virtual Machine (JVM) bytecode, TASTyTruffle interprets TASTy, a different representation that has precise type information.
This thesis explores how the type information present in TASTy empowers TASTyTruffle to implement generic code more effectively.
In particular, TASTy's type information allows TASTyTruffle to reify types as objects that can be passed around the interpreter.
These reified types are used to support heterogeneous box-free representations of generic values.
Reified types also enable TASTyTruffle to create specialized, monomorphic copies of generic code that can be easily and reliably optimized by its just-in-time (JIT) compiler.
Empirically, TASTyTruffle is competitive with the standard JVM implementation.
Both implementations perform similarly on monomorphic workloads, but when generic code is used with multiple types, TASTyTruffle consistently outperforms the JVM.
TASTy's type information enables TASTyTruffle to find additional optimization opportunities that could not be uncovered with erased JVM bytecode alone
Reducing Library Overheads through Source-to-Source Translation
AbstractObject oriented application libraries targeted to a specific application domain are an attractive means of reducing the software development time for sophisticated high performance applications. However, libraries can have the drawback of high abstraction penalties. We describe a domain specific, source-to-source translator that eliminates abstraction penalties in an array class library used to analyze turbulent flow simulation data. Our translator effectively flattens the abstractions, yielding performance within 75% of C code that uses primitive C arrays and no user-defined abstractions
Synthesis of Recursive ADT Transformations from Reusable Templates
Recent work has proposed a promising approach to improving scalability of
program synthesis by allowing the user to supply a syntactic template that
constrains the space of potential programs. Unfortunately, creating templates
often requires nontrivial effort from the user, which impedes the usability of
the synthesizer. We present a solution to this problem in the context of
recursive transformations on algebraic data-types. Our approach relies on
polymorphic synthesis constructs: a small but powerful extension to the
language of syntactic templates, which makes it possible to define a program
space in a concise and highly reusable manner, while at the same time retains
the scalability benefits of conventional templates. This approach enables
end-users to reuse predefined templates from a library for a wide variety of
problems with little effort. The paper also describes a novel optimization that
further improves the performance and scalability of the system. We evaluated
the approach on a set of benchmarks that most notably includes desugaring
functions for lambda calculus, which force the synthesizer to discover Church
encodings for pairs and boolean operations
Simple and Effective Type Check Removal through Lazy Basic Block Versioning
Dynamically typed programming languages such as JavaScript and Python defer
type checking to run time. In order to maximize performance, dynamic language
VM implementations must attempt to eliminate redundant dynamic type checks.
However, type inference analyses are often costly and involve tradeoffs between
compilation time and resulting precision. This has lead to the creation of
increasingly complex multi-tiered VM architectures.
This paper introduces lazy basic block versioning, a simple JIT compilation
technique which effectively removes redundant type checks from critical code
paths. This novel approach lazily generates type-specialized versions of basic
blocks on-the-fly while propagating context-dependent type information. This
does not require the use of costly program analyses, is not restricted by the
precision limitations of traditional type analyses and avoids the
implementation complexity of speculative optimization techniques.
We have implemented intraprocedural lazy basic block versioning in a
JavaScript JIT compiler. This approach is compared with a classical flow-based
type analysis. Lazy basic block versioning performs as well or better on all
benchmarks. On average, 71% of type tests are eliminated, yielding speedups of
up to 50%. We also show that our implementation generates more efficient
machine code than TraceMonkey, a tracing JIT compiler for JavaScript, on
several benchmarks. The combination of implementation simplicity, low
algorithmic complexity and good run time performance makes basic block
versioning attractive for baseline JIT compilers
On the fly type specialization without type analysis
Les langages de programmation typés dynamiquement tels que JavaScript et Python repoussent la vérification de typage jusqu’au moment de l’exécution. Afin d’optimiser la performance de ces langages, les implémentations de machines virtuelles pour langages dynamiques doivent tenter d’éliminer les tests de typage dynamiques redondants. Cela se fait habituellement en utilisant une analyse d’inférence de types. Cependant, les analyses de ce genre sont souvent coûteuses et impliquent des compromis entre le temps de compilation et la précision des résultats obtenus. Ceci a conduit à la conception d’architectures de VM de plus en plus complexes.
Nous proposons le versionnement paresseux de blocs de base, une technique de compilation à la volée simple qui élimine efficacement les tests de typage dynamiques redondants sur les chemins d’exécution critiques. Cette nouvelle approche génère paresseusement des versions spécialisées des blocs de base tout en propageant de l’information de typage contextualisée. Notre technique ne nécessite pas l’utilisation d’analyses de programme coûteuses, n’est pas contrainte par les limitations de précision des analyses d’inférence de types traditionnelles et évite la complexité des techniques d’optimisation spéculatives.
Trois extensions sont apportées au versionnement de blocs de base afin de lui donner des capacités d’optimisation interprocédurale. Une première extension lui donne la possibilité de joindre des informations de typage aux propriétés des objets et aux variables globales. Puis, la spécialisation de points d’entrée lui permet de passer de l’information de typage des fonctions appellantes aux fonctions appellées. Finalement, la spécialisation des continuations d’appels permet de transmettre le type des valeurs de retour des fonctions appellées aux appellants sans coût dynamique. Nous démontrons empiriquement que ces extensions permettent au versionnement de blocs de base d’éliminer plus de tests de typage dynamiques que toute analyse d’inférence de typage statique.Dynamically typed programming languages such as JavaScript and Python defer type checking to run time. In order to maximize performance, dynamic language virtual
machine implementations must attempt to eliminate redundant dynamic type checks. This is typically done using type inference analysis. However, type inference analyses
are often costly and involve tradeoffs between compilation time and resulting precision. This has lead to the creation of increasingly complex multi-tiered VM architectures.
We introduce lazy basic block versioning, a simple just-in-time compilation technique which effectively removes redundant type checks from critical code paths. This
novel approach lazily generates type-specialized versions of basic blocks on the fly while propagating context-dependent type information. This does not require the use of costly
program analyses, is not restricted by the precision limitations of traditional type analyses and avoids the implementation complexity of speculative optimization techniques.
Three extensions are made to the basic block versioning technique in order to give it interprocedural optimization capabilities. Typed object shapes give it the ability to
attach type information to object properties and global variables. Entry point specialization allows it to pass type information from callers to callees, and call continuation
specialization makes it possible to pass return value type information back to callers without dynamic overhead. We empirically demonstrate that these extensions enable
basic block versioning to exceed the capabilities of static whole-program type analyses
AUTOMATING DATA-LAYOUT DECISIONS IN DOMAIN-SPECIFIC LANGUAGES
A long-standing challenge in High-Performance Computing (HPC) is the simultaneous achievement of programmer productivity and hardware computational efficiency. The challenge has been exacerbated by the onset of multi- and many-core CPUs and accelerators. Only a few expert programmers have been able to hand-code domain-specific data transformations and vectorization schemes needed to extract the best possible performance on such architectures. In this research, we examined the possibility of automating these methods by developing a Domain-Specific Language (DSL) framework. Our DSL approach extends C++14 by embedding into it a high-level data-parallel array language, and by using a domain-specific compiler to compile to hybrid-parallel code. We also implemented an array index-space transformation algebra within this high-level array language to manipulate array data-layouts and data-distributions. The compiler introduces a novel method for SIMD auto-vectorization based on array data-layouts. Our new auto-vectorization technique is shown to outperform the default auto-vectorization strategy by up to 40% for stencil computations. The compiler also automates distributed data movement with overlapping of local compute with remote data movement using polyhedral integer set analysis. Along with these main innovations, we developed a new technique using C++ template metaprogramming for developing embedded DSLs using C++. We also proposed a domain-specific compiler intermediate representation that simplifies data flow analysis of abstract DSL constructs. We evaluated our framework by constructing a DSL for the HPC grand-challenge domain of lattice quantum chromodynamics. Our DSL yielded performance gains of up to twice the flop rate over existing production C code for selected kernels. This gain in performance was obtained while using less than one-tenth the lines of code. The performance of this DSL was also competitive with the best hand-optimized and hand-vectorized code, and is an order of magnitude better than existing production DSLs.Doctor of Philosoph
- …