
    SIMD-Conscious Optimization of Star Schema Query Processing

    Master's thesis (M.S.), Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2015. Advisor: Sang Kyun Cha.

    Most modern CPUs come equipped with SIMD (Single Instruction, Multiple Data) registers and instructions, which enable data-level parallelism by executing a given instruction on multiple elements of data at once. With their wide availability and compiler support, lack of need for hardware changes, and potential for boosting performance, exploiting SIMD instructions in database query processing has been the subject of some attention in the literature. Star schemas are a popular method of data mart modeling, and with the sharp rise in the need for efficient big data analysis, they serve as an important case study for OLAP performance optimization. While literature on SIMD optimization of star schema queries exists for the GPGPU domain, where the GPGPU method of execution is synonymous with the SIMD paradigm, none has explored the topic using SIMD instructions on CPUs. In this thesis, we show that by optimizing star schema query processing for SIMD instructions, a speedup in excess of four times can be achieved. Instead of relying on the traditional operator-based query processing model, we focus on the so-called invisible join, an algorithm specialized for star schema joins. We describe the steps and procedures involved in the SIMD-conscious optimization of the invisible join algorithm, and demonstrate that our SIMD optimization methods achieve up to 4.8x overall speedup over the scalar equivalent, and up to 6.4x speedup for specific operations.
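    The thesis text is not reproduced here, but as a rough illustration of the kind of SIMDification it describes, the minimal sketch below evaluates a range predicate on eight 32-bit values per instruction and emits a match bitmap, the kind of filtering step the invisible join applies to dimension tables. It assumes AVX2; the function and variable names are made up for illustration.

        #include <immintrin.h>
        #include <cstdint>
        #include <cstddef>

        // Hypothetical sketch: evaluate "lo <= col[i] <= hi" over a 32-bit
        // column, 8 values per AVX2 instruction, writing one bit per row.
        // Assumes n is a multiple of 8, lo > INT32_MIN, and hi < INT32_MAX;
        // a scalar tail loop would handle leftover rows.
        void simd_range_scan(const int32_t* col, size_t n,
                             int32_t lo, int32_t hi, uint8_t* bitmap) {
            const __m256i vlo = _mm256_set1_epi32(lo - 1);  // strict > bound
            const __m256i vhi = _mm256_set1_epi32(hi + 1);  // strict < bound
            for (size_t i = 0; i < n; i += 8) {
                __m256i v   = _mm256_loadu_si256(
                    reinterpret_cast<const __m256i*>(col + i));
                __m256i ge  = _mm256_cmpgt_epi32(v, vlo);   // v >= lo
                __m256i le  = _mm256_cmpgt_epi32(vhi, v);   // v <= hi
                __m256i hit = _mm256_and_si256(ge, le);
                // One sign bit per 32-bit lane -> 8-bit mask for these rows.
                bitmap[i / 8] = static_cast<uint8_t>(
                    _mm256_movemask_ps(_mm256_castsi256_ps(hit)));
            }
        }

    A scalar version of the same loop compares one value at a time; the abstract's 4.8x overall figure refers to the full join pipeline rather than any single loop like this one.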

    Scaling Up Concurrent Main-Memory Column-Store Scans: Towards Adaptive NUMA-aware Data and Task Placement

    Main-memory column-stores must use modern non-uniform memory access (NUMA) architectures efficiently to serve concurrent clients on big data. The efficient usage of NUMA architectures depends on the data placement and scheduling strategy of the column-store. Most column-stores choose a static strategy that involves partitioning all data across the NUMA architecture and employing a stealing-based task scheduler. In this paper, we implement different strategies for data placement and task scheduling for the case of concurrent scans, and compare these strategies with an extensive sensitivity analysis. Our most significant findings are that unnecessary partitioning can hurt throughput by up to 70%, and that stealing memory-intensive tasks can hurt throughput by up to 58%. Based on our analysis, we envision a design that adapts the data placement and task scheduling strategy to the workload.
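    As a minimal sketch of the partition-everywhere placement the paper examines (not the paper's own code), the following assumes Linux with libnuma and a made-up partition size: each NUMA node gets a node-local partition of a column, and one worker thread per node stays on the node that holds its data.

        #include <numa.h>      // link with -lnuma
        #include <thread>
        #include <vector>
        #include <cstdint>
        #include <cstddef>

        // Hypothetical sketch: place one partition of a column on each NUMA
        // node and run one scan thread per node against its local partition.
        int main() {
            if (numa_available() < 0) return 1;    // no NUMA support
            int nodes = numa_num_configured_nodes();
            size_t rows = 1u << 24;                // made-up partition size

            std::vector<int32_t*> parts(nodes);
            for (int n = 0; n < nodes; ++n)        // data placement: node-local
                parts[n] = static_cast<int32_t*>(
                    numa_alloc_onnode(rows * sizeof(int32_t), n));

            std::vector<std::thread> workers;
            for (int n = 0; n < nodes; ++n)
                workers.emplace_back([&, n] {
                    numa_run_on_node(n);           // task placement: stay local
                    int64_t sum = 0;               // toy scan of the partition
                    for (size_t i = 0; i < rows; ++i) sum += parts[n][i];
                    (void)sum;
                });
            for (auto& w : workers) w.join();
            for (int n = 0; n < nodes; ++n)
                numa_free(parts[n], rows * sizeof(int32_t));
        }

    The paper's point is that neither this static placement nor unrestricted work stealing is always right; an adaptive engine would choose per workload.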

    Optimizing Query Predicates with Disjunctions for Column Stores

    Since its inception, database research has given limited attention to optimizing predicates with disjunctions, and what little past work exists has focused on optimizations for traditional row-oriented databases. A key difference in predicate evaluation between row stores and column stores is that while row stores apply predicates to one record at a time, column stores apply predicates to sets of records. Not only must the execution engine decide the order in which to apply the predicates, but it must also decide how many times each predicate should be applied and to which sets of records it should be applied. In our work, we tackle exactly this problem. We formulate, analyze, and solve the predicate evaluation problem for column stores. Our results include proofs of various properties of the problem, and these properties have in turn allowed us to derive the first polynomial-time (i.e., O(n log n)) algorithm, ShallowFish, which evaluates predicates optimally for all predicate expressions of depth 2 or less. We capture the exact property that makes the problem harder for predicate expressions of depth 3 or greater and propose an approximate algorithm, DeepFish, which outperforms ShallowFish in these situations. Finally, we show that both ShallowFish and DeepFish outperform the corresponding state of the art by two orders of magnitude.
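    The ShallowFish and DeepFish algorithms themselves are not reproduced here; the toy sketch below (hypothetical names, with std::vector<bool> standing in for a packed bitmap) only illustrates the set-at-a-time evaluation the abstract describes: for a disjunction A OR B, the second predicate runs only on the records the first one rejected.

        #include <vector>
        #include <functional>
        #include <cstddef>

        using Bitmap = std::vector<bool>;            // stand-in for a packed bitmap
        using Pred   = std::function<bool(size_t)>;  // row id -> passes?

        // Hypothetical sketch of set-at-a-time disjunction: apply 'b' only
        // to rows that 'a' rejected, so each predicate sees a shrinking set.
        Bitmap eval_or(const Pred& a, const Pred& b, size_t n) {
            Bitmap out(n, false);
            std::vector<size_t> undecided;   // rows 'a' did not accept
            for (size_t i = 0; i < n; ++i) {
                if (a(i)) out[i] = true;
                else      undecided.push_back(i);
            }
            for (size_t i : undecided)       // 'b' runs on a smaller set
                if (b(i)) out[i] = true;
            return out;
        }

    Choosing which predicate to run first, and how the record sets shrink under nesting, is exactly the ordering problem the paper solves optimally for expressions of depth 2 or less.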

    Helmholtz Portfolio Theme Large-Scale Data Management and Analysis (LSDMA)

    The Helmholtz Association funded the "Large-Scale Data Management and Analysis" (LSDMA) portfolio theme from 2012 to 2016. Four Helmholtz centres, six universities, and another research institution in Germany joined forces to enable data-intensive science by optimising data life cycles in selected scientific communities. In our Data Life Cycle Labs, data experts performed joint R&D together with scientific communities, while the Data Services Integration Team focused on generic solutions applied by several communities.

    Resource-efficient processing of large data volumes

    The complex system environment of data processing applications makes it very challenging to achieve high resource efficiency. In this thesis, we develop solutions that improve resource efficiency at multiple system levels, focusing on three scenarios that are relevant to, but not limited to, database management systems. First, we address the challenge of understanding complex systems by analyzing memory access characteristics via efficient memory tracing. Second, we leverage information about memory access characteristics to optimize the cache usage of algorithms and to avoid cache pollution by applying hardware-based cache partitioning. Third, after optimizing resource usage within a multicore processor, we optimize resource usage across multiple computer systems by addressing the problem of resource contention during bulk loading, i.e., ingesting large volumes of data into the system. We develop a distributed bulk loading mechanism that utilizes network bandwidth and compute power more efficiently and improves both bulk loading throughput and query processing performance.
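    The thesis relies on hardware-based cache partitioning, which needs platform support and privileged configuration; as a purely software-side illustration of the same goal of avoiding cache pollution, the sketch below uses non-temporal stores so that a large, read-once output does not evict hot cache lines. It assumes AVX2 and a 32-byte-aligned destination, and is explicitly not the mechanism the thesis itself uses.

        #include <immintrin.h>
        #include <cstdint>
        #include <cstddef>

        // Software sketch of cache-pollution avoidance: streaming stores
        // bypass the cache when writing data that will not be re-read soon.
        // Assumes dst is 32-byte aligned and n is a multiple of 8.
        void copy_without_polluting(int32_t* dst, const int32_t* src, size_t n) {
            for (size_t i = 0; i < n; i += 8) {
                __m256i v = _mm256_loadu_si256(
                    reinterpret_cast<const __m256i*>(src + i));
                _mm256_stream_si256(reinterpret_cast<__m256i*>(dst + i), v);
            }
            _mm_sfence();  // order streaming stores before later reads/writes
        }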

    Engineering Aggregation Operators for Relational In-Memory Database Systems

    In this thesis, we study the design and implementation of aggregation operators in the context of relational in-memory database systems. In particular, we identify and address the following challenges: cache efficiency, CPU friendliness, parallelism within and across processors, robust handling of skewed data, adaptive processing, processing with constrained memory, and integration with modern database architectures. Our resulting algorithm outperforms the state of the art by up to 3.7x.
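    The thesis's actual operator is not shown here; as a baseline for what it improves on, the sketch below (hypothetical names) does the textbook form of parallel in-memory aggregation: each thread fills a private hash table for its chunk, and the partials are merged afterwards. Skew and cache efficiency, which the thesis targets, are exactly what this naive scheme handles poorly.

        #include <unordered_map>
        #include <vector>
        #include <thread>
        #include <algorithm>
        #include <cstdint>
        #include <cstddef>

        // Illustrative baseline: per-thread SUM-by-key into private tables
        // (no locks), then a single-threaded merge of the partial results.
        std::unordered_map<int32_t, int64_t>
        parallel_sum_by_key(const std::vector<int32_t>& keys,
                            const std::vector<int32_t>& vals,
                            unsigned nthreads) {
            std::vector<std::unordered_map<int32_t, int64_t>> locals(nthreads);
            std::vector<std::thread> pool;
            size_t chunk = (keys.size() + nthreads - 1) / nthreads;
            for (unsigned t = 0; t < nthreads; ++t)
                pool.emplace_back([&, t] {
                    size_t lo = t * chunk;
                    size_t hi = std::min(keys.size(), lo + chunk);
                    for (size_t i = lo; i < hi; ++i)
                        locals[t][keys[i]] += vals[i];  // private, lock-free
                });
            for (auto& th : pool) th.join();
            std::unordered_map<int32_t, int64_t> result;  // merge partials
            for (auto& m : locals)
                for (auto& [k, v] : m) result[k] += v;
            return result;
        }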

    Scaling Up Concurrent Analytical Workloads on Multi-Core Servers

    Today, an ever-increasing number of researchers, businesses, and data scientists collect and analyze massive amounts of data in database systems. The database system needs to process the resulting highly concurrent analytical workloads by exploiting modern multi-socket multi-core processor systems with non-uniform memory access (NUMA) architectures and increasing memory sizes. Conventional execution engines, however, are not designed for many cores, and neither scale nor perform efficiently on modern multi-core NUMA architectures. Firstly, their query-centric approach, where each query is optimized and evaluated independently, can result in unnecessary contention for hardware resources due to redundant work across queries in highly concurrent workloads. Secondly, they are unaware of the non-uniform memory access costs and the underlying hardware topology, incurring unnecessarily expensive memory accesses and bandwidth saturation.

    In this thesis, we show how these scalability and performance impediments can be solved by exploiting sharing among concurrent queries and by incorporating NUMA-aware adaptive task scheduling and data placement strategies in the execution engine.

    Regarding sharing, we identify and categorize state-of-the-art techniques for sharing data and work across concurrent queries at run-time into two categories: reactive sharing, which shares intermediate results across common query sub-plans, and proactive sharing, which builds a global query plan with shared operators to evaluate queries. We integrate the original research prototypes that introduced reactive and proactive sharing, perform a sensitivity analysis, and show how and when each technique benefits performance. Our most significant finding is that reactive and proactive sharing can be combined to exploit the advantages of both techniques for highly concurrent analytical workloads.

    Regarding NUMA-awareness, we identify, implement, and compare various combinations of task scheduling and data placement strategies under a diverse set of highly concurrent analytical workloads, using a prototype based on a commercial main-memory column-store database system. Our most significant finding is that no single task scheduling and data placement strategy is best for all workloads. Specifically, inter-socket stealing of memory-intensive tasks can hurt overall performance, and unnecessary partitioning of data across sockets incurs overhead. For this reason, we implement algorithms that adapt task scheduling and data placement to the workload at run-time.

    Our experiments show that both sharing and NUMA-awareness can significantly improve the performance and scalability of highly concurrent analytical workloads on modern multi-core servers. We therefore argue that sharing and NUMA-awareness are key to processing big data analytics faster, fully exploiting the hardware resources of modern multi-core servers, and delivering a more responsive user experience.
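    Of the two sharing techniques the thesis categorizes, reactive sharing is the easier to sketch. The toy class below (a hypothetical design, not the thesis's prototype) caches intermediate results under a canonical sub-plan key, so that concurrent queries reaching the same sub-plan reuse one computation.

        #include <unordered_map>
        #include <memory>
        #include <mutex>
        #include <string>
        #include <vector>
        #include <cstdint>

        // Toy illustration of reactive sharing: intermediate results are
        // cached per canonical sub-plan key, so concurrent queries with a
        // common sub-plan compute it once and share the result.
        class SubplanCache {
            std::mutex mu_;
            std::unordered_map<std::string,
                               std::shared_ptr<std::vector<int64_t>>> cache_;
        public:
            template <typename Compute>  // Compute: () -> std::vector<int64_t>
            std::shared_ptr<std::vector<int64_t>>
            get_or_compute(const std::string& subplan_key, Compute compute) {
                {
                    std::lock_guard<std::mutex> g(mu_);
                    auto it = cache_.find(subplan_key);
                    if (it != cache_.end()) return it->second;  // reuse
                }
                // Two concurrent misses may compute twice; a real engine
                // would coordinate. Kept simple for illustration.
                auto result = std::make_shared<std::vector<int64_t>>(compute());
                std::lock_guard<std::mutex> g(mu_);
                cache_.emplace(subplan_key, result);  // publish for others
                return result;
            }
        };

    Proactive sharing goes further by planning queries together into shared operators; the thesis's finding is that the two techniques can be combined.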