9,371 research outputs found

    Building Efficient Query Engines in a High-Level Language

    Get PDF
    Abstraction without regret refers to the vision of using high-level programming languages for systems development without experiencing a negative impact on performance. A database system designed according to this vision offers both increased productivity and high performance, instead of sacrificing the former for the latter as is the case with existing, monolithic implementations that are hard to maintain and extend. In this article, we realize this vision in the domain of analytical query processing. We present LegoBase, a query engine written in the high-level language Scala. The key technique to regain efficiency is to apply generative programming: LegoBase performs source-to-source compilation and optimizes the entire query engine by converting the high-level Scala code to specialized, low-level C code. We show how generative programming allows to easily implement a wide spectrum of optimizations, such as introducing data partitioning or switching from a row to a column data layout, which are difficult to achieve with existing low-level query compilers that handle only queries. We demonstrate that sufficiently powerful abstractions are essential for dealing with the complexity of the optimization effort, shielding developers from compiler internals and decoupling individual optimizations from each other. We evaluate our approach with the TPC-H benchmark and show that: (a) With all optimizations enabled, LegoBase significantly outperforms a commercial database and an existing query compiler. (b) Programmers need to provide just a few hundred lines of high-level code for implementing the optimizations, instead of complicated low-level code that is required by existing query compilation approaches. (c) The compilation overhead is low compared to the overall execution time, thus making our approach usable in practice for compiling query engines

    Toward Entity-Aware Search

    Get PDF
    As the Web has evolved into a data-rich repository, with the standard "page view," current search engines are becoming increasingly inadequate for a wide range of query tasks. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. In my Ph.D. study, we focus on a novel type of Web search that is aware of data entities inside pages, a significant departure from traditional document retrieval. We study the various essential aspects of supporting entity-aware Web search. To begin with, we tackle the core challenge of ranking entities, by distilling its underlying conceptual model Impression Model and developing a probabilistic ranking framework, EntityRank, that is able to seamlessly integrate both local and global information in ranking. We also report a prototype system built to show the initial promise of the proposal. Then, we aim at distilling and abstracting the essential computation requirements of entity search. From the dual views of reasoning--entity as input and entity as output, we propose a dual-inversion framework, with two indexing and partition schemes, towards efficient and scalable query processing. Further, to recognize more entity instances, we study the problem of entity synonym discovery through mining query log data. The results we obtained so far have shown clear promise of entity-aware search, in its usefulness, effectiveness, efficiency and scalability

    TID Hash Joins

    Get PDF
    TID hash joins are a simple and memory-efficient method for processing large join queries. They are based on standard hash join algorithms but only store TID/key pairs in the hash table instead of entire tuples. This typically reduces memory requirements by more than an order of magnitude bringing substantial benefits. In particular, performance for joins on Giga-Byte relations can substantially be improved by reducing the amount of disk I/O to a large extent. Furthermore efficient processing of mixed multi-user workloads consisting of both join queries and OLTP transactions is supported. We present a detailed simulation study to analyze the performance of TID hash joins. In particular, we identify the conditions under which TID hash joins are most beneficial. Furthermore, we compare TID hash join with adaptive hash join algorithms that have been proposed to deal with mixed workloads

    MxTasks: a novel processing model to support data processing on modern hardware

    Get PDF
    The hardware landscape has changed rapidly in recent years. Modern hardware in today's servers is characterized by many CPU cores, multiple sockets, and vast amounts of main memory structured in NUMA hierarchies. In order to benefit from these highly parallel systems, the software has to adapt and actively engage with newly available features. However, the processing models forming the foundation for many performance-oriented applications have remained essentially unchanged. Threads, which serve as the central processing abstractions, can be considered a "black box" that hardly allows any transparency between the application and the system underneath. On the one hand, applications are aware of the knowledge that could assist the system in optimizing the execution, such as accessed data objects and access patterns. On the other hand, the limited opportunities for information exchange cause operating systems to make assumptions about the applications' intentions to optimize their execution, e.g., for local data access. Applications, on the contrary, implement optimizations tailored to specific situations, such as sophisticated synchronization mechanisms and hardware-conscious data structures. This work presents MxTasking, a task-based runtime environment that assists the design of data structures and applications for contemporary hardware. MxTasking rethinks the interfaces between performance-oriented applications and the execution substrate, streamlining the information exchange between both layers. By breaking patterns of processing models designed with past generations of hardware in mind, MxTasking creates novel opportunities to manage resources in a hardware- and application-conscious way. Accordingly, we question the granularity of "conventional" threads and show that fine-granular MxTasks are a viable abstraction unit for characterizing and optimizing the execution in a general way. Using various demonstrators in the context of database management systems, we illustrate the practical benefits and explore how challenges like memory access latencies and error-prone synchronization of concurrency can be addressed straightforwardly and effectively

    QCLab: a framework for query compilation on modern hardware platforms

    Get PDF
    As modern in-memory database systems achieve higher and higher processing speeds, the performance of memory becomes an increasingly limiting factor. Although there has been significant progress, the bottleneck only has shifted. While earlier systems were optimized for memory latencies, current systems are rather affected by the limited memory bandwidth. Query compilation is a proven technique to address bandwidth limitations. It translates queries via Just-In-Time compilation to native programs for the target hardware. The compiled queries execute with very high efficiency and only with a bare minimum of communication via memory. Despite these important improvements, the benefit of query compilation in certain scenarios is limited. On the one hand query compilers typically use standard compiler technology with relatively long compilation times. Therefore the overall execution time can be prolonged by the additional compilation time. On the other hand, not all emerging database technology is compatible with the approach. Query compilation uses a tuple-at-a-time processing style that departs from the column-at-a-time or vector-at- a-time approaches that in-memory systems typically use. Especially data-parallel processing techniques, e.g. SIMD or coprocessing-techniques, are challenging to use in combination with the approach. This work presents QCLab, a framework for query compilation on modern hardware platforms. The framework contains several new query compilation techniques that allow us to address the mentioned shortcomings and ultimately to extend the benefit of query compilation to new workloads and platforms. The techniques cover three aspects: compilation, communication, and processing. Together they serve as basis for building highly efficient query compilers. The techniques make efficient use of communication channels and of the large processing capacities of modern systems. They were designed for practical use and enable efficient processing, even when workload characteristics are challenging
    • …
    corecore