Building Efficient Query Engines in a High-Level Language
Abstraction without regret refers to the vision of using high-level
programming languages for systems development without experiencing a negative
impact on performance. A database system designed according to this vision
offers both increased productivity and high performance, instead of sacrificing
the former for the latter as is the case with existing, monolithic
implementations that are hard to maintain and extend. In this article, we
realize this vision in the domain of analytical query processing. We present
LegoBase, a query engine written in the high-level language Scala. The key
technique to regain efficiency is to apply generative programming: LegoBase
performs source-to-source compilation and optimizes the entire query engine by
converting the high-level Scala code to specialized, low-level C code. We show
how generative programming makes it easy to implement a wide spectrum of
optimizations, such as introducing data partitioning or switching from a row to
a column data layout, which are difficult to achieve with existing low-level
query compilers that handle only queries. We demonstrate that sufficiently
powerful abstractions are essential for dealing with the complexity of the
optimization effort, shielding developers from compiler internals and
decoupling individual optimizations from each other. We evaluate our approach
with the TPC-H benchmark and show that: (a) With all optimizations enabled,
LegoBase significantly outperforms a commercial database and an existing query
compiler. (b) Programmers need to provide just a few hundred lines of
high-level code for implementing the optimizations, instead of complicated
low-level code that is required by existing query compilation approaches. (c)
The compilation overhead is low compared to the overall execution time, thus
making our approach usable in practice for compiling query engines.
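The source-to-source specialization described above can be sketched in miniature. The snippet below is illustrative only (LegoBase itself is written in Scala and emits C, and none of these names are its API): a query description is turned into specialized source code that is generated and compiled once, instead of being interpreted row by row.

```python
# Generative programming in miniature: emit specialized source for a
# filter-and-project query, then compile it into a Python function.
# All names here are invented for the example, not LegoBase's API.

def compile_filter_project(pred_src, cols):
    """Generate a specialized function for SELECT cols WHERE pred."""
    proj = ", ".join(f"row[{c!r}]" for c in cols)
    src = (
        "def q(table):\n"
        "    out = []\n"
        "    for row in table:\n"
        f"        if {pred_src}:\n"
        f"            out.append(({proj},))\n"
        "    return out\n"
    )
    env = {}
    exec(src, env)   # the "compile" step: build the specialized function
    return env["q"]

q = compile_filter_project("row['qty'] > 10", ["id"])
table = [{"id": 1, "qty": 5}, {"id": 2, "qty": 20}]
print(q(table))  # [(2,)]
```

The payoff of this style is that the generated function contains no interpretation overhead for the predicate or the projection; a real system like LegoBase applies the same idea to the whole engine, emitting low-level C rather than Python.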
New Directions in Cloud Programming
Nearly twenty years after the launch of AWS, it remains difficult for most
developers to harness the enormous potential of the cloud. In this paper we lay
out an agenda for a new generation of cloud programming research aimed at
bringing research ideas to programmers in an evolutionary fashion. Key to our
approach is a separation of distributed programs into a PACT of four facets:
Program semantics, Availability, Consistency and Targets of optimization. We
propose to migrate developers gradually to PACT programming by lifting familiar
code into our more declarative level of abstraction. We then propose a
multi-stage compiler that emits human-readable code at each stage that can be
hand-tuned by developers seeking more control. Our agenda raises numerous
research challenges across multiple areas including language design, query
optimization, transactions, distributed consistency, compilers and program
synthesis.
Parallel programming paradigms and frameworks in big data era
With Cloud Computing emerging as a promising new approach for ad-hoc parallel data processing, major companies have started to integrate frameworks for parallel data processing in their product portfolios, making it easy for customers to access these services and to deploy their programs. We have entered the era of Big Data. The explosion and profusion of available data in a wide range of application domains raise new challenges and opportunities in a plethora of disciplines, ranging from science and engineering to biology and business. One major challenge is how to take advantage of the unprecedented scale of data, typically heterogeneous in nature, in order to acquire further insights and knowledge for improving the quality of the offered services. To exploit this new resource, we need to scale up and scale out both our infrastructures and standard techniques. Our society is already data-rich, but the question remains whether or not we have the conceptual tools to handle it. In this paper we discuss and analyze opportunities and challenges for efficient parallel data processing. Big Data is the next frontier for innovation, competition, and productivity, and many solutions continue to appear, partly supported by the considerable enthusiasm around the MapReduce paradigm for large-scale data analysis. We review various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, and present modern emerging paradigms and frameworks. To better support practitioners interested in this domain, we end with an analysis of ongoing research challenges towards a true fourth generation of data-intensive science.
How to Architect a Query Compiler
This paper studies architecting query compilers. The state of the art in query compiler construction is lagging behind that in the compilers field. We attempt to remedy this by exploring the key causes of technical challenges in need of well-founded solutions, and by gathering the most relevant ideas and approaches from the PL and compilers communities for easy digestion by database researchers. All query compilers known to us are more or less monolithic template expanders that do the bulk of the compilation task in one large leap. Such systems are hard to build and maintain. We propose to use a stack of multiple DSLs on different levels of abstraction, with lowering in multiple steps, to make query compilers easier to build and extend, ultimately allowing us to create more convincing and sustainable compiler-based data management systems. We attempt to derive our advice for creating such DSL stacks from widely acceptable principles. We have also re-created a well-known query compiler following these ideas and report on this effort.
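The proposed DSL stack can be illustrated with a toy two-step lowering, sketched here in Python. The level names and operators below are invented for the example, not taken from the paper; the point is only that each lowering step is a small, separately inspectable rewrite rather than one large template-expansion leap.

```python
# A toy DSL stack: relational DSL -> loop-level DSL -> C-like target code.
# Each lowering is a small, testable pass.

# Level 1: relational DSL (plans as tagged tuples)
Scan   = lambda t: ("scan", t)
Filter = lambda p, c: ("filter", p, c)

# Lowering 1: relational -> loop-level DSL
def to_loops(op):
    kind = op[0]
    if kind == "scan":
        return [("for_each", op[1])]
    if kind == "filter":
        return to_loops(op[2]) + [("if", op[1])]
    raise ValueError(kind)

# Lowering 2: loop-level DSL -> C-like target code
def to_c(loops, body="emit(row);"):
    code = body
    for node in reversed(loops):
        if node[0] == "if":
            code = f"if ({node[1]}) {{ {code} }}"
        else:
            code = f"for (row : {node[1]}) {{ {code} }}"
    return code

plan = Filter("row.qty > 10", Scan("lineitem"))
print(to_c(to_loops(plan)))
# for (row : lineitem) { if (row.qty > 10) { emit(row); } }
```

Because each level is a self-contained representation, optimizations (and new operators) can be added at the level where they are most naturally expressed, which is the maintainability argument the paper makes against monolithic template expanders.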
Efficient query processing in managed runtimes
This thesis presents strategies to improve the query evaluation performance over
huge volumes of relational-like data that is stored in the memory space of managed
applications. Storing and processing application data in the memory space of managed
applications is motivated by the convergence of two recent trends in data management.
First, dropping DRAM prices have led to memory capacities that allow the entire working
set of an application to fit into main memory and to the emergence of in-memory
database systems (IMDBs). Second, language-integrated query transparently integrates
query processing syntax into programming languages and, therefore, allows complex
queries to be composed in the application. IMDBs typically serve as data stores to applications
written in an object-oriented language running on a managed runtime. In
this thesis, we propose a deeper integration of the two by storing all application data in
the memory space of the application and using language-integrated query, combined
with query compilation techniques, to provide fast query processing.
As a starting point, we look into storing data as runtime-managed objects in collection
types provided by the programming language. Queries are formulated using
language-integrated query and dynamically compiled to specialized functions that produce
the result of the query in a more efficient way by leveraging query compilation
techniques similar to those used in modern database systems. We show that the generated
query functions significantly improve query processing performance compared to
the default execution model for language-integrated query. However, we also identify
additional inefficiencies that can only be addressed by processing queries using low-level
techniques which cannot be applied to runtime-managed objects. To address this,
we introduce a staging phase in the generated code that makes query-relevant managed
data accessible to low-level query code. Our experiments in .NET show an improvement
in query evaluation performance of up to an order of magnitude over the default
language-integrated query implementation.
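The staging phase described above can be sketched as follows, using Python in place of .NET; the mechanism differs, but the idea is the same: query-relevant fields of runtime-managed objects are copied into flat, typed buffers before the tight query loop runs, so the hot loop touches no objects at all. All names are illustrative.

```python
# Staging managed objects into flat buffers before running generated
# query code. Illustrative sketch, not the thesis's actual system.

from array import array

class Order:                      # a runtime-managed object
    def __init__(self, id, qty):
        self.id, self.qty = id, qty

def stage(orders):
    """Copy the query-relevant fields into contiguous typed buffers."""
    ids, qtys = array("q"), array("q")
    for o in orders:
        ids.append(o.id)
        qtys.append(o.qty)
    return ids, qtys

def query(ids, qtys, threshold):
    """Low-level-style loop over flat buffers: no attribute lookups."""
    return [ids[i] for i in range(len(ids)) if qtys[i] > threshold]

orders = [Order(1, 5), Order(2, 20), Order(3, 15)]
print(query(*stage(orders), 10))  # [2, 3]
```

The staging copy costs one pass over the data, but it lets the generated query code use cache-friendly sequential access to primitive values, which is the kind of low-level technique the thesis reports cannot be applied directly to runtime-managed objects.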
Motivated by additional inefficiencies caused by automatic garbage collection, we
introduce a new collection type, the black-box collection. Black-box collections integrate
the in-memory storage layer of a relational database system to store data and hide
the internal storage layout from the application by employing existing object-relational
mapping techniques (hence, the name black-box). Our experiments show that black-box
collections provide better query performance than runtime-managed collections
by allowing the generated query code to directly access the underlying relational in-memory
data store using low-level techniques. Black-box collections also outperform
a modern commercial database system. By removing huge volumes of collection data
from the managed heap, black-box collections further improve the overall performance
and response time of the application, as well as its scalability under such data
volumes.
To enable a deeper integration of the data store with the application, we introduce
self-managed collections. Self-managed collections are a new type of collection for
managed applications that, in contrast to black-box collections, store objects. As the
data elements stored in the collection are objects, they are directly accessible from the
application using references which allows for better integration of the data store with
the application. Self-managed collections manually manage the memory of objects
stored within them in a private heap that is excluded from garbage collection. We introduce
a special collection syntax and a novel type-safe manual memory management
system for this purpose. As was the case for black-box collections, self-managed collections
improve query performance by utilizing a database-inspired data layout and
allowing the use of low-level techniques. By also supporting references between collection
objects, they outperform black-box collections.
Abstraction without regret in database systems building: a manifesto
It has been said that all problems in computer science can be solved by adding another level of indirection, except for performance problems, which are solved by removing levels of indirection. Compilers are our tools for removing levels of indirection automatically. However, we do not trust them when it comes to systems building. Most performance-critical systems are built in low-level programming languages such as C. Some of the downsides of this compared to using modern high-level programming languages are very well known: bugs, poor programmer productivity, a talent bottleneck, and cruelty to programming language researchers. In the future we might even add suboptimal performance to this list. In this article, I argue that compilers can be competitive with and outperform human experts at low-level database systems programming. Performance-critical database systems are a limited-enough domain for us to encode systems programming skills as compiler optimizations. In a large system, a human expert's occasional stroke of creativity producing an original and very specific coding trick is outweighed by a compiler's superior stamina, optimizing code at a level of consistency that is absent even in very mature codebases. However, mainstream compilers cannot do this: we need to work on optimizing compilers specialized for the systems programming domain. Recent progress makes their creation eminently feasible.
Compilation and Code Optimization for Data Analytics
The trade-offs between the use of modern high-level and low-level programming languages in constructing complex software artifacts are well known. High-level languages allow for greater programmer productivity: abstraction and genericity allow for the same functionality to be implemented with significantly less code compared to low-level languages. Modularity, object-orientation, functional programming, and powerful type systems allow programmers not only to create clean abstractions and protect them from leaking, but also to define code units that are reusable and easily composable, and software architectures that are adaptable and extensible. The abstraction, succinctness, and modularity of high-level code help to avoid software bugs and facilitate debugging and maintenance.
The use of high-level languages comes at a performance cost: increased indirection due to abstraction, virtualization, and interpretation, and superfluous work, particularly in the form of temporary memory allocation and deallocation to support objects and encapsulation.
As a result, the cost of high-level languages may seem prohibitive for performance-critical systems.
The vision of abstraction without regret argues that it is possible to use high-level languages for building performance-critical systems that allow for both productivity and high performance, instead of trading off the former for the latter. In this thesis, we realize this vision for building different types of data analytics systems. Our means of achieving this is by employing compilation. The goal is to compile away expensive language features -- to compile high-level code down to efficient low-level code.
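The idea of compiling away expensive language features can be illustrated with a toy specializer, sketched here in Python and purely illustrative of the technique, not of the thesis's actual system: a pipeline of generic map/filter stages is fused into one generated loop, removing closure indirection and intermediate collections.

```python
# Compiling away abstraction: fuse a pipeline of generic stages into a
# single specialized loop. Illustrative sketch; all names are invented.

def fuse(pipeline):
    """pipeline: list of ('map', expr) / ('filter', expr), expr over 'x'."""
    lines = ["def f(xs):", "    out = []", "    for x in xs:"]
    for kind, expr in pipeline:
        if kind == "map":
            lines.append(f"        x = {expr}")
        else:  # 'filter'
            lines.append(f"        if not ({expr}): continue")
    lines += ["        out.append(x)", "    return out"]
    env = {}
    exec("\n".join(lines), env)   # no closures or temporaries survive in f
    return env["f"]

f = fuse([("map", "x * 2"), ("filter", "x > 4")])
print(f([1, 2, 3]))  # [6]
```

The fused function does in one pass, with no intermediate lists, what a chain of generic higher-order functions would do in several; the same principle, applied systematically by a compiler, is what turns clean high-level code into efficient low-level code.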