    Query Flattening and the Nested Data Parallelism Paradigm

    This work is based on the observation that languages for two seemingly distant domains are closely related. Orthogonal query languages based on comprehension syntax admit various forms of query nesting to construct nested query results and express complex predicates. Languages for nested data parallelism allow parallel iterators to be nested and thereby admit the parallel evaluation of computations that are themselves parallel. Both kinds of languages center around the application of side-effect-free functions to each element of a collection. The motivation for this work is the seamless integration of relational database queries with programming languages. In frameworks for language-integrated database queries, a host language's native collection-programming API is used to express queries. To mediate between native collection programming and relational queries, we define an expressive, orthogonal query calculus that supports nesting and order. The challenge of query flattening is to translate this calculus to bundles of efficient relational queries restricted to flat, unordered multisets. Prior approaches to query flattening either support only query languages that lack expressiveness or employ a complex, monolithic translation that is hard to comprehend and generates inefficient code that is hard to optimize. To improve on those approaches, we draw on the similarity to nested data parallelism. Blelloch's flattening transformation is a static program transformation that translates nested data-parallel programs to flat data-parallel programs over flat arrays. Based on the flattening transformation, we describe a pipeline of small, comprehensible lowering steps that translates our nested query calculus to a bundle of relational queries. The pipeline is based on a number of well-defined intermediate languages. Our translation adopts the key concepts of the flattening transformation but is designed with the specifics of relational query processing in mind.
Based on this translation, we revisit all aspects of query flattening. Our translation is fully compositional and can translate any term of the input language. As in prior work, the translation by itself, because it is compositional, produces inefficient code that is not fit for execution without optimization. In contrast to prior work, we show that query optimization is orthogonal to flattening and can be performed before flattening. We employ well-known work on logical query optimization for nested query languages and demonstrate that this body of work integrates well with our approach. Furthermore, we describe an improved encoding of ordered and nested collections in terms of flat, unordered multisets. Our approach emits idiomatic relational queries in which the effort required to maintain the non-relational semantics of the source language (order and nesting) is minimized. A set of experiments provides evidence that our approach to query flattening can handle complex, list-based queries with nested results and nested intermediate data well. We apply our approach to a number of flat and nested benchmark queries and compare their runtime with hand-written SQL queries. In these experiments, our SQL code generated from a list-based nested query language usually performs as well as hand-written queries.
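
The idea of encoding ordered, nested collections as flat, unordered multisets can be illustrated with a minimal Python sketch. It is not the encoding developed in the work above: the function names `shred` and `stitch`, the row layouts, and the use of the outer position as a surrogate key are all assumptions made for illustration.

```python
from typing import List, Tuple

def shred(nested: List[List[str]]) -> Tuple[list, list]:
    """Encode an ordered, nested list as two flat relations (hypothetical layout)."""
    outer = []  # rows: (outer_pos, key)
    inner = []  # rows: (key_ref, inner_pos, value)
    for pos, items in enumerate(nested):
        key = pos  # assumption: the outer position doubles as surrogate key
        outer.append((pos, key))
        for ipos, value in enumerate(items):
            inner.append((key, ipos, value))
    return outer, inner

def stitch(outer: list, inner: list) -> List[List[str]]:
    """Rebuild the nested list; the row order of both inputs is irrelevant."""
    result = []
    for _, key in sorted(outer):
        rows = sorted((ipos, v) for (k, ipos, v) in inner if k == key)
        result.append([v for _, v in rows])
    return result
```

Because order and nesting are carried by explicit position and key columns, both relations can be stored and processed as unordered multisets, and the nested value is only reassembled at the very end.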

    Query Lifting: Language-integrated query for heterogeneous nested collections

    Language-integrated query based on comprehension syntax is a powerful technique for safe database programming, and provides a basis for advanced techniques such as query shredding or query flattening that allow efficient programming with complex nested collections. However, the foundations of these techniques are lacking: although SQL, the most widely-used database query language, supports heterogeneous queries that mix set and multiset semantics, these important capabilities are not supported by known correctness results or implementations that assume homogeneous collections. In this paper we study language-integrated query for a heterogeneous query language NRCλ(Set,Bag) that combines set and multiset constructs. We show how to normalize and translate queries to SQL, and develop a novel approach to querying heterogeneous nested collections, based on the insight that "local" query subexpressions that calculate nested subcollections can be "lifted" to the top level analogously to lambda-lifting for local function definitions. Comment: Full version of ESOP 2021 conference paper.

    Domain-specific languages for modeling and simulation

    Simulation models and simulation experiments are increasingly complex. One way to handle this complexity is to develop software languages tailored to specific application domains, so-called domain-specific languages (DSLs). This thesis explores the potential of employing DSLs in modeling and simulation. We study different DSL design and implementation techniques and illustrate, with several examples, their benefits for expressing simulation models as well as simulation experiments.

    Scalable Automated Incrementalization for Real-Time Static Analyses

    This thesis proposes a framework for the easy development of static analyses whose results are incrementalized to provide instantaneous feedback in an integrated development environment (IDE). Today, IDEs feature many tools that have static analyses as their foundation to assess software quality and catch correctness problems. Yet, these tools often fail to provide instantaneous feedback and are thus restricted to nightly build processes. This precludes developers from fixing issues at their inception time, i.e., when the problem and the developed solution are both still fresh in mind. Incrementalization is a well-known technique for providing instantaneous feedback: it exploits the fact that developers make only small changes to the code, so analysis results can be re-computed quickly based on these changes. Yet, incrementalization requires carefully crafted static analyses, which makes a manual approach to incrementalization unattractive. Automated incrementalization can alleviate these problems and allows analysis writers to formulate their analyses as queries with the full data set in mind, without worrying about the semantics of incremental changes. Existing approaches to automated incrementalization utilize standard technologies, such as deductive databases, that provide declarative query languages, yet also require the full dataset to be materialized in main memory, i.e., the memory is permanently blocked by the data required for the analyses. Other standard technologies such as relational databases offer better scalability due to persistence, yet require large transaction times for data. Neither technology is a perfect match for integrating static analyses into an IDE, since the underlying data, i.e., the code base, is already persisted and managed by the IDE. Hence, transitioning the data into a database is redundant work.
In this thesis a novel approach is proposed that provides a declarative query language and automated incrementalization, yet retains in memory only a necessary minimum of data, i.e., only the data that is required for the incrementalization. The approach allows static analyses to be declared as incrementally maintained views, where the underlying formalism for incrementalization is the relational algebra with extensions for object-orientation and recursion. The algebra makes it possible to deduce which data is the necessary minimum for incremental maintenance and indeed shows that many views are self-maintainable, i.e., do not require any data to be materialized in memory at all. In addition, an optimization for the algebra is proposed that widens the range of self-maintainable views, based on domain knowledge of the underlying data. The optimization works similarly to declaring primary keys for databases, i.e., the optimization is declared on the schema of the data, and defines which data is incrementally maintained in the same scope. The scope makes all analyses (views) that correlate only data within the boundaries of the scope self-maintainable. The approach is implemented as an embedded domain-specific language in a general-purpose programming language. The implementation can be understood as a database-like engine with an SQL-style query language and the execution semantics of the relational algebra. As such, the system is a general-purpose database-like query engine and can be used for incrementalization in domains other than static analysis. To evaluate the approach, a large variety of static analyses were sampled from real-world tools and formulated as incrementally maintained views in the implemented engine.
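
As a rough illustration of what "self-maintainable" means, the following Python sketch shows a selection view that forwards change events to its observers without storing any data itself. It is not the engine described in the thesis: the class names `SelectionView` and `CountingObserver`, the `added`/`removed` event interface, and the example predicate are hypothetical.

```python
class SelectionView:
    """A filter view that keeps no data: each change event is decided in isolation."""

    def __init__(self, predicate):
        self.predicate = predicate
        self.observers = []  # downstream views or analyses (hypothetical interface)

    def added(self, element):
        # An insertion into the base data affects the view only if it passes the filter.
        if self.predicate(element):
            for observer in self.observers:
                observer.added(element)

    def removed(self, element):
        # A deletion is handled symmetrically; no lookup in stored data is needed.
        if self.predicate(element):
            for observer in self.observers:
                observer.removed(element)


class CountingObserver:
    """Example consumer that retains a single integer instead of the result set."""

    def __init__(self):
        self.count = 0

    def added(self, element):
        self.count += 1

    def removed(self, element):
        self.count -= 1
```

A view such as a join, by contrast, generally has to remember one side in order to react to changes on the other side, which is the kind of retained data the approach aims to keep to a minimum.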

    Verified Code Generation for the Polyhedral Model

    The polyhedral model is a high-level intermediate representation for loop nests that elegantly supports a great many loop optimizations. In a compiler, after polyhedral loop optimizations have been performed, it is necessary and difficult to regenerate sequential or parallel loop nests before continuing compilation. This paper reports on the formalization and proof of semantic preservation of such a code generator that produces sequential code from a polyhedral representation. The formalization and proofs are mechanized using the Coq proof assistant.

    Optimizing and Incrementalizing Higher-order Collection Queries by AST Transformation

    In modern programming languages, queries on in-memory collections are often more expensive than needed. While database queries can be readily optimized, it is often not trivial to use them to express collection queries which employ nested data and first-class functions, as enabled by functional programming languages. Collection queries can be optimized and incrementalized by hand, but this reduces modularity, and is often too error-prone to be feasible or to enable maintenance of the resulting programs. To free programmers from such burdens, in this thesis we study how to optimize and incrementalize such collection queries.
Resulting programs are expressed in the same core language, so that they can be subjected to other standard optimizations. In Part I, to enable optimizing collection queries which occur inside programs, we develop a staged variant of the Scala collection API that reifies queries as ASTs. On top of this interface, we adapt domain-specific optimizations from the fields of programming languages and databases; among others, we rewrite queries to use indexes chosen by programmers. Thanks to the use of indexes we show significant speedups in our experimental evaluation, with an average of 12x and a maximum of 12800x. In Part II, to incrementalize higher-order programs by program transformation, we extend finite differencing [Paige and Koenig, 1982; Blakeley et al., 1986; Gupta and Mumick, 1999] and develop the first approach to incrementalization by program transformation for higher-order programs. Base programs are transformed to derivatives, programs that transform input changes to output changes. We prove that our incrementalization approach is correct (Chapters 12–13): we develop the theory underlying incrementalization for simply-typed and untyped λ-calculus, and discuss extensions to System F. Derivatives often need to reuse results produced by base programs: to enable such reuse, in Chapter 17 we extend work by Liu and Teitelbaum [1995] to higher-order programs, and develop and prove correct a program transformation that converts higher-order programs to cache-transfer style. For efficient incrementalization, it is necessary to choose appropriate primitive operations and incrementalize them by hand. We incrementalize a significant subset of collection operations and perform case studies, showing order-of-magnitude speedups both in practice and in asymptotic complexity.
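
The notion of a derivative, a program that maps input changes to output changes, can be sketched in Python for one hand-incrementalized primitive. This is a minimal illustration and not the thesis's transformation: the names `total`, `d_total`, and `apply_change`, and the representation of bags and bag changes as `Counter` objects with signed multiplicities, are assumptions.

```python
from collections import Counter

def total(f, bag: Counter) -> int:
    """Base program: sum of f(x) over a bag, given as element -> multiplicity."""
    return sum(f(x) * n for x, n in bag.items())

def d_total(f, dbag: Counter) -> int:
    """Derivative of total with respect to the bag.

    dbag holds signed multiplicities (positive = insert, negative = delete).
    Because total is linear in the bag, the output change depends only on
    the input change and not on the base input.
    """
    return sum(f(x) * n for x, n in dbag.items())

def apply_change(bag: Counter, dbag: Counter) -> Counter:
    """Apply a bag change, dropping elements whose multiplicity reaches zero."""
    out = Counter(bag)
    for x, n in dbag.items():
        out[x] += n
        if out[x] == 0:
            del out[x]
    return out
```

The correctness condition for this sketch is that `total(f, bag) + d_total(f, dbag)` equals `total(f, apply_change(bag, dbag))`. The work of `d_total` is proportional to the size of the change rather than the size of the bag.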

    Building Efficient Query Engines using High-Level Languages

    We are currently witnessing a shift towards the use of high-level programming languages for systems development. This shift collides with the traditional wisdom which calls for using low-level languages for building efficient software systems. It is nevertheless necessary, as billions of dollars are spent annually on the maintenance and debugging of performance-critical software. High-level languages promise faster development of higher-quality software; by offering advanced software features, they help to reduce the number of software errors in systems and facilitate their verification. Despite these benefits, database systems development seems to be lagging behind, as DBMSes are still written in low-level languages. The reason is that the increased productivity offered by high-level languages comes at the cost of a pronounced negative performance impact. In this thesis, we argue that it is now time for a radical rethinking of how database systems are designed. We show that, by using high-level languages, it is indeed possible to build databases that allow for both productivity and high performance. More concretely, in this thesis we follow this "abstraction without regret" vision and use high-level languages to address the following two problems of database development. First, the introduction of a new storage or memory technology typically requires the development of new versions of most out-of-core algorithms employed by the database system. Given the increasing popularity of hardware specialization, this leads to an arms race for the developers. To make things even worse, there exists no clear methodology for creating such algorithms, and we must rely on significant creative effort to serve our need for out-of-core algorithms. To address this issue, we present the OCAS framework for the automatic synthesis of efficient out-of-core algorithms.
The developer provides two independent inputs: 1) a memory-hierarchy-oblivious algorithm, expressed using a high-level specification language; and 2) a description of the target memory hierarchy. Using these specifications, our system is then able to automatically synthesize memory-hierarchy and storage-device-aware algorithms for tasks such as joins and sorting. The framework is extensible and quickly synthesizes custom out-of-core algorithms as new storage technologies become available. Second, from a software engineering point of view, years of performance-driven DBMS development have led to complicated, monolithic, low-level code bases, which are hard to maintain and extend. In particular, the introduction of innovative new approaches can be a very time-consuming task. To overcome such limitations, we present LegoBase, a query engine written in the high-level language Scala. LegoBase realizes the "abstraction without regret" vision in the domain of analytical query processing. We show how, by offering sufficiently powerful abstractions, our system makes it easy to implement a broad spectrum of optimizations which are difficult to achieve with existing approaches. The key technique to regain efficiency is then to apply generative programming and source-to-source compile the entire high-level Scala code to specialized, low-level C code. Our architecture significantly outperforms a commercial in-memory database system and an existing query compiler. LegoBase is the first step towards providing a full DBMS written in a high-level language.

    Compilation Techniques for Incremental Collection Processing

    Many map-reduce frameworks as well as NoSQL systems rely on collection programming as their interface of choice due to its rich semantics along with an easily parallelizable set of primitives. Unfortunately, the potential of collection programming is not entirely fulfilled by current systems as they lack efficient incremental view maintenance (IVM) techniques for queries producing large nested results. This is a consequence of the fact that the nesting of collections does not enjoy the same algebraic properties that underpin the optimization potential of typical collection processing constructs. We propose the first solution for the efficient incrementalization of collection programming in terms of its core constructs as captured by the positive nested relational calculus (NRC+) on bags (with integer multiplicities). We take an approach based on delta query derivation, whose goal is to generate delta queries which, given a small change in the input, can update the materialized view more efficiently than via recomputation. More precisely, we model the cost of NRC+ operators and classify queries as efficiently incrementalizable if their delta has a strictly lower cost than full re-evaluation. Then, we identify IncNRC+, a large fragment of NRC+ that is efficiently incrementalizable, and we provide a semantics-preserving translation that takes any NRC+ query to a collection of IncNRC+ queries. Furthermore, we prove that incremental maintenance for NRC+ is within the complexity class NC0, and we showcase how Recursive IVM, a technique that has provided significant speedups over traditional IVM in the case of flat queries, can also be applied to IncNRC+. Existing systems are also limited with respect to the size of inner collections that they can effectively handle before running into severe performance bottlenecks.
In particular, in the face of nested collections with skewed cardinalities, developers typically have to undergo a painful process of manual query rewrites in order to ensure that the largest inner collections in their workloads are not impacted by these limitations. To address these issues we developed SLeNDer, a compilation framework that, given a nested query, generates a set of semantically equivalent (partially) shredded queries that can be efficiently evaluated and incrementalized using state-of-the-art techniques for handling skew and applying delta changes, respectively. The derived queries expose nested collections to the same opportunities for distributing their processing and incrementally updating their contents as those enjoyed by top-level collections, leading on our benchmark to speedups of up to 16.8x and 21.9x in terms of offline and online processing, respectively. In order to enable efficient IVM for the increasingly common case of collection programming with functional values, as in Links, we also discuss the efficient incrementalization of simply-typed lambda calculi, under the constraint that their primitives are themselves efficiently incrementalizable.
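
To make delta query derivation on bags with integer multiplicities concrete, here is a small Python sketch for a single flat operator, the bag product. It uses the standard delta rule Δ(R × S) = ΔR × S + R × ΔS + ΔR × ΔS and does not reproduce the NRC+ derivation from the thesis; the function names and the `Counter`-based representation are assumptions.

```python
from collections import Counter

def product(r: Counter, s: Counter) -> Counter:
    """Bag product: the multiplicity of a pair is the product of multiplicities."""
    out = Counter()
    for a, m in r.items():
        for b, n in s.items():
            out[(a, b)] += m * n
    return out

def plus(r: Counter, s: Counter) -> Counter:
    """Bag union with signed multiplicities; zero entries are dropped."""
    out = Counter(r)
    for x, n in s.items():
        out[x] += n
    return Counter({x: n for x, n in out.items() if n != 0})

def delta_product(r: Counter, dr: Counter, s: Counter, ds: Counter) -> Counter:
    """Delta of the product for input changes dr and ds (negative = deletion)."""
    return plus(plus(product(dr, s), product(r, ds)), product(dr, ds))
```

Adding `delta_product(r, dr, s, ds)` to the materialized `product(r, s)` yields the same bag as recomputing the product over the updated inputs. When `dr` and `ds` are small, each term of the delta touches far fewer pairs than the full product.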