9,371 research outputs found
Building Efficient Query Engines in a High-Level Language
Abstraction without regret refers to the vision of using high-level
programming languages for systems development without experiencing a negative
impact on performance. A database system designed according to this vision
offers both increased productivity and high performance, instead of sacrificing
the former for the latter as is the case with existing, monolithic
implementations that are hard to maintain and extend. In this article, we
realize this vision in the domain of analytical query processing. We present
LegoBase, a query engine written in the high-level language Scala. The key
technique to regain efficiency is to apply generative programming: LegoBase
performs source-to-source compilation and optimizes the entire query engine by
converting the high-level Scala code to specialized, low-level C code. We show
how generative programming allows us to easily implement a wide spectrum of
optimizations, such as introducing data partitioning or switching from a row to
a column data layout, which are difficult to achieve with existing low-level
query compilers that handle only queries. We demonstrate that sufficiently
powerful abstractions are essential for dealing with the complexity of the
optimization effort, shielding developers from compiler internals and
decoupling individual optimizations from each other. We evaluate our approach
with the TPC-H benchmark and show that: (a) With all optimizations enabled,
LegoBase significantly outperforms a commercial database and an existing query
compiler. (b) Programmers need to provide just a few hundred lines of
high-level code for implementing the optimizations, instead of complicated
low-level code that is required by existing query compilation approaches. (c)
The compilation overhead is low compared to the overall execution time, thus
making our approach usable in practice for compiling query engines.
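The generative approach described above can be illustrated with a minimal sketch. The following Python code (Python standing in for LegoBase's Scala-to-C pipeline; the function `compile_filter_sum` and the query are hypothetical, not LegoBase's actual API) shows the core idea of source-to-source compilation: instead of interpreting a query plan, we emit source code specialized to one query and compile it, so the hot loop contains no interpretation overhead.

```python
# Sketch of generative query compilation: emit specialized source for
# one query (SELECT SUM(col) FROM rows WHERE col > threshold) and
# compile it. Column name and constant are baked into the generated code.

def compile_filter_sum(column, threshold):
    src = (
        "def q(rows):\n"
        "    total = 0\n"
        "    for row in rows:\n"
        f"        v = row[{column!r}]\n"
        f"        if v > {threshold}:\n"
        "            total += v\n"
        "    return total\n"
    )
    env = {}
    exec(src, env)  # stand-in for LegoBase's compilation to low-level C
    return env["q"]

q = compile_filter_sum("price", 10)
rows = [{"price": 5}, {"price": 20}, {"price": 15}]
print(q(rows))  # 35
```

Because the generated function is ordinary source code, further optimizations (e.g. changing the data layout it reads from) can be applied by changing the code generator rather than the engine's interpreter, which is the flexibility the abstract emphasizes.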
Toward Entity-Aware Search
As the Web has evolved into a data-rich repository, current search engines, with their standard "page view," are becoming increasingly inadequate for a wide range of query tasks. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. In my Ph.D. study, we focus on a novel type of Web search that is aware of data entities inside pages, a significant departure from traditional document retrieval. We study the various essential aspects of supporting entity-aware Web search. To begin with, we tackle the core challenge of ranking entities by distilling its underlying conceptual model, the Impression Model, and developing a probabilistic ranking framework, EntityRank, that is able to seamlessly integrate both local and global information in ranking. We also report a prototype system built to show the initial promise of the proposal. Then, we aim at distilling and abstracting the essential computation requirements of entity search. From the dual views of reasoning--entity as input and entity as output--we propose a dual-inversion framework, with two indexing and partition schemes, towards efficient and scalable query processing. Further, to recognize more entity instances, we study the problem of entity synonym discovery through mining query log data. The results we obtained so far have shown clear promise of entity-aware search, in its usefulness, effectiveness, efficiency and scalability.
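The idea of integrating local and global evidence in entity ranking can be sketched in a few lines. This toy score is not the actual EntityRank formula; the function name, the distance-decay term, and the IDF-like global weight are all illustrative assumptions, meant only to show how within-page proximity evidence and corpus-wide co-occurrence evidence can be combined multiplicatively.

```python
import math

# Toy entity score (NOT the real EntityRank formula):
#  - local evidence: keyword-entity co-occurrences within pages,
#    weighted so that closer occurrences contribute more;
#  - global evidence: an IDF-like weight over the whole corpus.

def score_entity(local_distances, pages_with_entity, total_pages):
    local = sum(1.0 / (1 + d) for d in local_distances)
    global_w = math.log(1 + pages_with_entity) / math.log(1 + total_pages)
    return local * global_w
```

Under this sketch, an entity that appears right next to the query keywords on a few pages can outrank one that co-occurs only loosely on many pages, which is the kind of trade-off a probabilistic ranking framework must balance.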
TID Hash Joins
TID hash joins are a simple and memory-efficient method for processing large join queries. They are based on standard hash join algorithms but store only TID/key pairs in the hash table instead of entire tuples. This typically reduces memory requirements by more than an order of magnitude, bringing substantial benefits. In particular, performance for joins on gigabyte-sized relations can be substantially improved by greatly reducing the amount of disk I/O. Furthermore, efficient processing of mixed multi-user workloads consisting of both join queries and OLTP transactions is supported. We present a detailed simulation study to analyze the performance of TID hash joins. In particular, we identify the conditions under which TID hash joins are most beneficial. Furthermore, we compare TID hash joins with adaptive hash join algorithms that have been proposed to deal with mixed workloads.
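The core data-structure idea is small enough to sketch directly. In this Python sketch (an illustration of the technique, not the paper's simulation code), a tuple's TID is simply its position in the relation; the build phase stores only key-to-TID mappings, and matching inner tuples are fetched by TID during the probe.

```python
# TID hash join sketch: the hash table maps join keys to lists of TIDs
# of the inner relation, not to whole tuples. This is what makes the
# table roughly an order of magnitude smaller than a standard hash join's.

def tid_hash_join(inner, outer, inner_key, outer_key):
    """inner/outer: lists of dict tuples; a tuple's TID is its index."""
    # Build phase: store TID/key pairs only.
    table = {}
    for tid, row in enumerate(inner):
        table.setdefault(row[inner_key], []).append(tid)
    # Probe phase: look up each outer tuple's key, then fetch matching
    # inner tuples by TID (in a real system this fetch may be a disk
    # access, which is why reducing the number of fetches matters).
    result = []
    for orow in outer:
        for tid in table.get(orow[outer_key], []):
            result.append((inner[tid], orow))
    return result
```

The trade-off the paper's simulation study examines follows directly from this structure: the small table makes the build phase fit in memory far more often, but each match costs an extra TID-based fetch of the inner tuple.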
MxTasks: a novel processing model to support data processing on modern hardware
The hardware landscape has changed rapidly in recent years. Modern hardware in today's servers is characterized by many CPU cores, multiple sockets, and vast amounts of main memory structured in NUMA hierarchies.
In order to benefit from these highly parallel systems, the software has to adapt and actively engage with newly available features.
However, the processing models forming the foundation for many performance-oriented applications have remained essentially unchanged.
Threads, which serve as the central processing abstractions, can be considered a "black box" that hardly allows any transparency between the application and the system underneath.
On the one hand, applications possess knowledge that could assist the system in optimizing the execution, such as accessed data objects and access patterns.
On the other hand, the limited opportunities for information exchange cause operating systems to make assumptions about the applications' intentions to optimize their execution, e.g., for local data access.
Applications, on the contrary, implement optimizations tailored to specific situations, such as sophisticated synchronization mechanisms and hardware-conscious data structures.
This work presents MxTasking, a task-based runtime environment that assists the design of data structures and applications for contemporary hardware.
MxTasking rethinks the interfaces between performance-oriented applications and the execution substrate, streamlining the information exchange between both layers.
By breaking patterns of processing models designed with past generations of hardware in mind, MxTasking creates novel opportunities to manage resources in a hardware- and application-conscious way.
Accordingly, we question the granularity of "conventional" threads and show that fine-granular MxTasks are a viable abstraction unit for characterizing and optimizing the execution in a general way.
Using various demonstrators in the context of database management systems, we illustrate the practical benefits and explore how challenges such as memory access latencies and error-prone concurrency synchronization can be addressed straightforwardly and effectively.
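The information exchange between application and runtime that MxTasking advocates can be sketched conceptually. In the following Python sketch, all names (`TaskRuntime`, `spawn`, the routing policy) are hypothetical and not MxTasking's real API; the point is only that a task annotated with the data object it accesses lets the runtime route all tasks on one object to the same worker, so they never run concurrently and need no explicit locking.

```python
from collections import defaultdict

# Conceptual task runtime: tasks declare the data object they access,
# and the runtime routes tasks on the same object to the same worker.

class TaskRuntime:
    def __init__(self, num_workers=4):
        self.num_workers = num_workers
        self.queues = defaultdict(list)  # one logical queue per worker

    def spawn(self, task, data_object):
        # The annotation at work: routing by accessed object guarantees
        # that tasks touching the same object are serialized.
        worker = hash(data_object) % self.num_workers
        self.queues[worker].append((task, data_object))

    def run(self):
        # Sequential stand-in for parallel workers: each worker drains
        # its own queue; per-object ordering is preserved by routing.
        for worker in range(self.num_workers):
            for task, obj in self.queues[worker]:
                task(obj)

counter = {"n": 0}
rt = TaskRuntime()

def increment(tag):
    counter["n"] += 1  # safe without a lock: all "counter" tasks share a queue

for _ in range(5):
    rt.spawn(increment, "counter")
rt.run()
print(counter["n"])  # 5
```

A thread-based design would need a mutex around the counter; here the synchronization is implicit in the scheduling decision, which is the kind of hardware- and application-conscious resource management the abstract describes.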
QCLab: a framework for query compilation on modern hardware platforms
As modern in-memory database systems achieve higher and higher processing
speeds, the performance of memory becomes an increasingly limiting factor. Although there has been significant progress, the bottleneck has only shifted. While
earlier systems were optimized for memory latencies, current systems are rather
affected by the limited memory bandwidth.
Query compilation is a proven technique to address bandwidth limitations.
It translates queries via Just-In-Time compilation to native programs for the target
hardware. The compiled queries execute with very high efficiency and with only
a bare minimum of communication via memory. Despite these important
improvements, the benefit of query compilation in certain scenarios is limited.
On the one hand, query compilers typically use standard compiler technology
with relatively long compilation times. Therefore, the overall execution time can be
prolonged by the additional compilation time. On the other hand, not all emerging
database technology is compatible with the approach. Query compilation uses a
tuple-at-a-time processing style that departs from the column-at-a-time or vector-at-
a-time approaches that in-memory systems typically use. Especially data-parallel
processing techniques, e.g. SIMD or co-processing techniques, are challenging to
use in combination with the approach.
This work presents QCLab, a framework for query compilation on modern hardware
platforms. The framework contains several new query compilation techniques
that allow us to address the mentioned shortcomings and ultimately to extend the
benefit of query compilation to new workloads and platforms. The techniques
cover three aspects: compilation, communication, and processing. Together, they
serve as the basis for building highly efficient query compilers. The techniques make
efficient use of communication channels and of the large processing capacities
of modern systems. They were designed for practical use and enable efficient
processing, even when workload characteristics are challenging.
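The first shortcoming named above, compilation time prolonging overall execution, reduces to a simple cost trade-off: JIT compilation only pays off when the execution time it saves exceeds the time spent compiling. The following sketch makes that explicit; the function and all cost constants are hypothetical illustrations, not QCLab's actual cost model.

```python
# Hypothetical cost model for the compile-or-interpret decision:
# compile only if the estimated per-row savings over the whole input
# outweigh the one-time compilation cost.

def should_compile(estimated_rows, compile_cost_s,
                   interp_ns_per_row=100, compiled_ns_per_row=10):
    saved_s = estimated_rows * (interp_ns_per_row - compiled_ns_per_row) * 1e-9
    return saved_s > compile_cost_s

print(should_compile(10**9, 0.05))  # True: large scan, compilation pays off
print(should_compile(1_000, 0.05))  # False: tiny query, interpret instead
```

Under these illustrative numbers, a billion-row scan saves about 90 s of execution for 50 ms of compilation, while a thousand-row query saves microseconds; this is why faster compilation techniques, like those the framework proposes, extend the benefit of query compilation to short-running queries.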