    Limitations of Intra-operator Parallelism Using Heterogeneous Computing Resources

    The hardware landscape is changing from homogeneous multi-core systems towards wildly heterogeneous systems combining different computing units, like CPUs and GPUs. To utilize these heterogeneous environments, database query execution has to adapt to cope with different architectures and computing behaviors. In this paper, we investigate the simple idea of partitioning an operator’s input data and processing all data partitions in parallel, one partition per computing unit. For heterogeneous systems, data has to be partitioned according to the performance of the computing units. We define a way to calculate the partition sizes, analyze parallel execution for two example database operators, and present limitations that could hinder significant performance improvements. The findings in this paper can help system developers to assess the possibilities and limitations of intra-operator parallelism in heterogeneous environments, leading to more informed decisions about whether this approach is beneficial for a given workload and hardware environment.
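
    The idea of sizing partitions according to the performance of each computing unit can be illustrated with a minimal sketch. The abstract does not give the paper's actual formula, so this assumes the simplest possible model: each unit has a fixed throughput in rows per second, and transfer and merge overheads are ignored. The names `partition_sizes` and `parallel_runtime` are hypothetical.

```python
from itertools import accumulate


def partition_sizes(total_rows, throughputs):
    """Split total_rows across units in proportion to their throughput.

    Assumption (not from the paper): throughput is a constant rows/second
    figure per unit, and no transfer or merge overhead is modeled.
    """
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("throughputs must be non-empty and positive")

    # Cumulative throughput shares give partition boundaries; rounding the
    # boundaries (rather than the sizes) keeps the sizes summing to total_rows.
    cumulative = list(accumulate(throughputs))
    total = cumulative[-1]
    boundaries = [round(total_rows * c / total) for c in cumulative]
    boundaries[-1] = total_rows
    starts = [0] + boundaries[:-1]
    return [end - start for start, end in zip(starts, boundaries)]


def parallel_runtime(sizes, throughputs):
    """Estimated runtime of the fork-join step: the slowest unit decides."""
    return max(size / t for size, t in zip(sizes, throughputs))


# Hypothetical units: one processes 100 rows/s, the other 300 rows/s.
sizes = partition_sizes(1000, [100.0, 300.0])
print(sizes, parallel_runtime(sizes, [100.0, 300.0]))
```

    Under this model all units finish at the same time, which is the best case. The overheads left out of the sketch are the kind of effect that could erode this ideal speedup in practice.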

    Heterogeneity-Aware Placement Strategies for Query Optimization

    Computing hardware is changing from systems with homogeneous CPUs to systems with heterogeneous computing units like GPUs, Many Integrated Cores, or FPGAs. This trend is caused by scaling problems of homogeneous systems, where heat dissipation and energy consumption are limiting further growth in compute performance. Heterogeneous systems provide differently optimized computing hardware, which allows different operations to be computed on the most appropriate computing unit, resulting in faster execution and less energy consumption. For database systems, this is a new opportunity to accelerate query processing, allowing faster and more interactive querying of large amounts of data. However, the current hardware trend is also a challenge, as most database systems do not support heterogeneous computing resources and it is not clear how best to support these systems. In the past, mainly single operators were ported to different computing units, showing great results but lacking system-wide application. To efficiently support heterogeneous systems, a systems approach for query processing and query optimization is needed. In this thesis, we tackle the optimization challenge in detail. As a starting point, we evaluate three different approaches on isolated use-cases to assess their advantages and limitations. First, we evaluate a fork-join approach of intra-operator parallelism, where the same operator is executed on multiple computing units at the same time, each execution with different data partitions. Second, we evaluate using one computing unit statically to accelerate one operator, which provides high code-optimization potential, due to this static and pre-known usage of hardware and software. Third, we evaluate dynamically placing operators onto computing units, depending on the operator, the available computing hardware, and the given data sizes. We argue that the first and second approach suffer from multiple overheads or high implementation costs. The third approach, dynamic placement, shows good performance, while being highly extensible to different computing units and different operator implementations. To automate this dynamic approach, we first propose general placement optimization for query processing. This general approach includes runtime estimation of operators on different computing units as well as two approaches for defining the actual operator placement according to the estimated runtimes. The two placement approaches are local optimization, which decides the placement locally at run-time, and global optimization, where the placement is decided at compile-time, while allowing a global view for enhanced data sharing. The main limitation of the latter is the high dependency on cardinality estimation of intermediate results, as estimation errors for the cardinalities propagate to the operator runtime estimation and placement optimization. Therefore, we propose adaptive placement optimization, allowing the placement optimization to become fully independent of cardinality estimation, effectively eliminating the main source of inaccuracy for runtime estimation and placement optimization. Finally, we define an adaptive placement sequence, incorporating all our proposed techniques of placement optimization. We implement this sequence as a virtualization layer between the database system and the heterogeneous hardware. Our implementation approach is based on preexisting interfaces to the database system and the hardware, allowing non-intrusive integration into existing database systems. We evaluate our techniques using two different database systems and two different OLAP benchmarks, accelerating the query processing through heterogeneous execution.

    Heterogeneity-Aware Operator Placement in Column-Store DBMS

    Due to the tremendous increase in the amount of data efficiently managed by current database systems, optimization is still one of the most challenging issues in database research. Today’s query optimizers determine the most efficient composition of physical operators to execute a given SQL query, where the underlying hardware consists of a multi-core CPU. However, hardware systems are increasingly shifting towards heterogeneity, combining a multi-core CPU with various computing units, e.g., GPU or FPGA cores. In order to efficiently utilize the provided performance capability of such heterogeneous hardware, the assignment of physical operators to computing units gains importance. In this paper, we propose a heterogeneity-aware physical operator placement strategy (HOP) for in-memory columnar database systems in a heterogeneous environment. Our placement approach takes operators from the physical query execution plan as an input and assigns them to computing units using a cost model at runtime. To enable this runtime decision, our cost model uses the characteristics of the computing units, execution properties of the operators, as well as runtime data to estimate execution costs for each unit. We evaluated our approach on full TPC-H queries within a prototype database engine. As we show, the placement in a heterogeneous hardware system has a high influence on query performance.
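
    A runtime placement decision driven by a cost model can be sketched as follows. This is not the HOP cost model, whose details the abstract does not give. It is a toy model under stated assumptions: each unit has a fixed startup overhead, a per-row transfer cost that applies only when the input currently resides on another unit, and a constant processing rate. The names `Unit`, `estimate_cost`, and `place_operator`, and all numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """Hypothetical description of one computing unit."""
    name: str
    rows_per_second: float           # assumed processing rate for the operator
    transfer_seconds_per_row: float  # assumed cost to copy input to this unit
    startup_seconds: float           # assumed fixed launch overhead


def estimate_cost(unit, rows, data_location):
    """Estimated seconds to run an operator on `unit` over `rows` input rows."""
    # Data already on the unit needs no transfer.
    transfer = 0.0 if data_location == unit.name else rows * unit.transfer_seconds_per_row
    return unit.startup_seconds + transfer + rows / unit.rows_per_second


def place_operator(units, rows, data_location):
    """Pick the unit with the lowest estimated cost for this operator."""
    return min(units, key=lambda u: estimate_cost(u, rows, data_location))


# Made-up characteristics: the GPU is faster but pays startup and transfer.
units = [
    Unit("cpu", rows_per_second=1e8, transfer_seconds_per_row=0.0, startup_seconds=0.0),
    Unit("gpu", rows_per_second=1e9, transfer_seconds_per_row=1e-9, startup_seconds=1e-3),
]

for rows in (1_000, 100_000_000):
    chosen = place_operator(units, rows, data_location="cpu")
    print(rows, "->", chosen.name)
```

    With these made-up numbers, small inputs stay on the CPU because the GPU's fixed overhead dominates, while large inputs move to the GPU. This illustrates why the row counts known only at runtime matter for the decision.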

    The HELLS-Join: A Heterogeneous Stream join for ExtremeLy Large windows

    Upcoming processors are combining different computing units in a tightly-coupled approach using a unified shared memory hierarchy. This tightly-coupled combination leads to novel properties with regard to cooperation and interaction. This paper demonstrates the advantages of those processors for a stream-join operator as an important data-intensive example. In detail, we propose our HELLS-Join approach, which employs all heterogeneous devices by outsourcing parts of the algorithm to the appropriate device. Our HELLS-Join performs better than CPU stream joins, allowing wider time windows, higher stream frequencies, and more streams to be joined than before.

    Big Data causing Big (TLB) Problems: Taming Random Memory Accesses on the GPU

    GPUs are increasingly adopted for large-scale database processing, where data accesses represent the major part of the computation. If the data accesses are irregular, like hash table accesses or random sampling, the GPU performance can suffer. Especially when scaling such accesses beyond 2 GB of data, a performance decrease of an order of magnitude is encountered. This paper analyzes the source of the slowdown through extensive micro-benchmarking, attributing the root cause to the Translation Lookaside Buffer (TLB). Using the micro-benchmarks, the TLB hierarchy and structure are fully analyzed on two different GPU architectures, identifying never-before-published TLB sizes that can be used for efficient large-scale application tuning. Based on the gained knowledge, we propose a TLB-conscious approach to mitigate the slowdown for algorithms with irregular memory access. The proposed approach is applied to two fundamental database operations, random sampling and hash-based grouping, showing that the slowdown can be dramatically reduced, resulting in a performance increase of up to 13×.

    HW/SW-database-codesign for compressed bitmap index processing

    Compressed bitmap indices are heavily used in scientific and commercial database systems because they largely improve query performance for various workloads. Early research focused on finding tailor-made index compression schemes that are amenable to modern processors. Improving performance further typically comes at the expense of a lower compression rate, which is unacceptable in many applications because of memory limitations. Alternatively, tailor-made hardware can achieve performance that is hard to reach with software running on general-purpose CPUs. In this paper, we show how to create a custom instruction set framework for compressed bitmap processing that is generic enough to implement most of the major compressed bitmap indices. For evaluation, we implemented WAH, PLWAH, and COMPAX operations using our framework and compared the resulting implementation to multiple state-of-the-art processors. We show that the custom-made bitmap processor achieves speedups of up to one order of magnitude while also using two orders of magnitude less energy compared to a modern energy-efficient Intel processor. Finally, we discuss how to embed our processor with database-specific instruction sets into database system environments.

    Query processing on low-energy many-core processors

    Aside from performance, energy efficiency is an increasing challenge in database systems. To tackle both aspects in an integrated fashion, we pursue a hardware/software co-design approach. To fulfill the energy requirement from the hardware perspective, we utilize a low-energy processor design that allows us to place hundreds to millions of chips on a single board without any thermal restrictions. Furthermore, we address the performance requirement by developing several database-specific instruction set extensions to customize each core, although not every core has all extensions. Therefore, our hardware foundation is a low-energy processor consisting of a high number of heterogeneous cores. In this paper, we introduce our hardware setup on a system level and present several challenges for query processing. Based on these challenges, we describe two implementation concepts and a comparison between these concepts. Finally, we conclude the paper with some lessons learned and an outlook on our upcoming research directions.

    The Orchestration Stack: The Impossible Task of Designing Software for Unknown Future Post-CMOS Hardware

    Future systems based on post-CMOS technologies will be wildly heterogeneous, with properties largely unknown today. This paper presents our design of a new hardware/software stack to address the challenge of preparing software development for such systems. It combines well-understood technologies from different areas, e.g., networks-on-chip, capability operating systems, flexible programming models, and model checking. We describe our approach and provide details on key technologies.

