SIMD-Conscious Optimization of Star Schema Query Processing
Thesis (Master's) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2015. 차상균.
Most modern CPUs today come equipped with SIMD (Single Instruction, Multiple Data) registers and instructions, which allow for data-level parallelism by offering the ability to execute a given instruction on multiple elements of data. With its wide availability and compiler support, lack of need for hardware changes and potential for boosting performance, exploiting SIMD instructions in database query processing has been the subject of some attention in the literature.
Star schemas are a popular method of data mart modeling, and with the sharp rise in the need for efficient big data analysis, star schemas serve as an important case study for OLAP performance optimization. Whilst literature on SIMD optimization of star schema queries exists for the GPGPU domain - where the GPGPU method of execution is synonymous with the SIMD paradigm - none has explored the topic using SIMD instructions on CPUs.
In this paper, we show that by optimizing star schema query processing for SIMD instructions, a speedup of more than four times can be achieved. Instead of relying on the traditional operator-based query processing model, we focus on the so-called invisible join, an algorithm specialized for star schema joins. We describe the steps and procedures involved in the SIMD-conscious optimization of the invisible join algorithm, and demonstrate that our SIMD optimization methods achieve up to 4.8x overall speedup over the scalar equivalent, and up to 6.4x speedup for specific operations.
Abstract
Table of Contents
List of Figures
Chapter 1. Introduction
Chapter 2. Related Work
Chapter 3. Star Schema and Invisible Join
2.1 The Star Schema
2.2 The Invisible Join
Chapter 4. SIMDification of Invisible Join
4.1 Extending the Invisible Join
4.2 SIMDification of the Invisible Join
Chapter 5. Experimental Results
5.1 Experimental Setup
5.2 Overall Results
5.3 Breakdown of Results
Chapter 6. Conclusion and Future Work
References
Abstract in Korean
Acknowledgements
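As a rough illustration of the invisible join named in this abstract, the following is a minimal scalar sketch in Python. It follows the three phases in which the algorithm is commonly described (filter the dimensions, probe the foreign-key columns into bitmaps, extract the results) and is not the thesis implementation. The table names, column names, and data are hypothetical, and dimension keys are assumed to be dense array positions. Loops such as the probe in phase 2 apply the same operation to every element, which is the pattern SIMD instructions are designed for; the abstract does not state which steps the thesis actually vectorizes.

```python
# Hypothetical toy star schema stored column-wise (all names and data are illustrative).
# Assumption: dimension keys are dense positions 0..n-1, so a key doubles as an array index.
customer_region = ["ASIA", "EUROPE", "ASIA", "AMERICA"]   # dimension column
date_year = [1996, 1997, 1997]                            # dimension column

fact_custkey = [0, 1, 2, 3, 2, 0]      # foreign keys into the customer dimension
fact_datekey = [0, 1, 1, 2, 0, 2]      # foreign keys into the date dimension
fact_revenue = [10, 20, 30, 40, 50, 60]


def invisible_join(region, year):
    # Phase 1: evaluate each dimension predicate once, yielding a set of qualifying keys.
    cust_keys = {k for k, r in enumerate(customer_region) if r == region}
    date_keys = {k for k, y in enumerate(date_year) if y == year}

    # Phase 2: probe each foreign-key column to build one bitmap per dimension,
    # then AND the bitmaps to find the fact rows that pass every predicate.
    cust_bitmap = [fk in cust_keys for fk in fact_custkey]
    date_bitmap = [fk in date_keys for fk in fact_datekey]
    positions = [i for i, (c, d) in enumerate(zip(cust_bitmap, date_bitmap)) if c and d]

    # Phase 3: use the surviving positions to fetch dimension attributes and measures.
    return [
        (customer_region[fact_custkey[i]], date_year[fact_datekey[i]], fact_revenue[i])
        for i in positions
    ]


result = invisible_join("ASIA", 1997)
print(result)
```

Only fact rows 2 and 5 satisfy both predicates in this toy data, so the dimension attributes are fetched for just those two positions.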
Scaling Up Concurrent Main-Memory Column-Store Scans: Towards Adaptive NUMA-aware Data and Task Placement
Main-memory column-stores are called upon to use modern non-uniform memory access (NUMA) architectures efficiently to service concurrent clients on big data. The efficient usage of NUMA architectures depends on the data placement and scheduling strategy of the column-store. Most column-stores choose a static strategy that involves partitioning all data across the NUMA architecture, and employing a stealing-based task scheduler. In this paper, we implement different strategies for data placement and task scheduling for the case of concurrent scans. We compare these strategies with an extensive sensitivity analysis. Our most significant findings include that unnecessary partitioning can hurt throughput by up to 70%, and that stealing memory-intensive tasks can hurt throughput by up to 58%. Based on our analysis, we envision a design that adapts the data placement and task scheduling strategy to the workload.
Optimizing Query Predicates with Disjunctions for Column Stores
Since its inception, database research has given limited attention to
optimizing predicates with disjunctions. What little past work there is has
focused on optimizations for traditional row-oriented databases. A key
difference in predicate evaluation for row stores and column stores is that
while row stores apply predicates to one record at a time, column stores apply
predicates to sets of records. Not only must the execution engine decide the
order in which to apply the predicates, but it must also decide how many times
each predicate should be applied and to which sets of records it should be
applied. In our work, we tackle exactly this problem. We formulate, analyze,
and solve the predicate evaluation problem for column stores. Our results
include proofs about various properties of the problem, and in turn, these
properties have allowed us to derive the first polynomial-time (i.e., O(n log
n)) algorithm ShallowFish which evaluates predicates optimally for all
predicate expressions with a depth of 2 or less. We capture the exact property
which makes the problem more difficult for predicate expressions of depth 3 or
greater and propose an approximate algorithm DeepFish which outperforms
ShallowFish in these situations. Finally, we show that both ShallowFish and
DeepFish outperform the corresponding state of the art by two orders of
magnitude.
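To make the problem statement above concrete, here is a small Python sketch of why the choice of record set matters when predicates are applied to sets of records. It is not ShallowFish or DeepFish; it only compares a naive plan with one hand-picked pruned plan for the expression (price > 40 AND qty > 5) OR flag == "x". The columns, predicates, and cost measure (number of single-record predicate evaluations) are assumptions made for illustration.

```python
# Hypothetical columns; predicate choices and data are illustrative only.
price = [5, 50, 7, 80, 3, 90]
qty = [1, 9, 8, 2, 7, 9]
flag = ["x", "y", "y", "x", "x", "y"]

applications = 0  # counts single-record predicate evaluations (assumed cost measure)


def apply_predicate(pred, column, positions):
    """Apply one predicate to a set of record positions; return the positions that pass."""
    global applications
    applications += len(positions)
    return {i for i in positions if pred(column[i])}


def evaluate_naive(all_pos):
    # Apply every predicate to every record, then combine the position sets.
    a = apply_predicate(lambda v: v > 40, price, all_pos)
    b = apply_predicate(lambda v: v > 5, qty, all_pos)
    c = apply_predicate(lambda v: v == "x", flag, all_pos)
    return (a & b) | c


def evaluate_pruned(all_pos):
    a = apply_predicate(lambda v: v > 40, price, all_pos)
    # The second conjunct only needs the records that passed the first one.
    ab = apply_predicate(lambda v: v > 5, qty, a)
    # The disjunct only needs the records that are not already accepted.
    c = apply_predicate(lambda v: v == "x", flag, all_pos - ab)
    return ab | c


def run(strategy):
    """Reset the counter, run one strategy over all records, return (result, cost)."""
    global applications
    applications = 0
    result = strategy(set(range(len(price))))
    return result, applications


naive_result, naive_cost = run(evaluate_naive)
pruned_result, pruned_cost = run(evaluate_pruned)
print(naive_cost, pruned_cost)
```

Both plans return the same records, but the pruned plan performs fewer predicate evaluations. Choosing the order and the record sets optimally for arbitrary expressions is the problem the abstract addresses.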
Helmholtz Portfolio Theme Large-Scale Data Management and Analysis (LSDMA)
The Helmholtz Association funded the "Large-Scale Data Management and Analysis" portfolio theme from 2012 to 2016. Four Helmholtz centres, six universities and another research institution in Germany joined to enable data-intensive science by optimising data life cycles in selected scientific communities. In our Data Life Cycle Labs, data experts performed joint R&D together with scientific communities. The Data Services Integration Team focused on generic solutions applied by several communities.
Resource-efficient processing of large data volumes
The complex system environment of data processing applications makes it very challenging to achieve high resource efficiency. In this thesis, we develop solutions that improve resource efficiency at multiple system levels by focusing on three scenarios that are relevant to, but not limited to, database management systems. First, we address the challenge of understanding complex systems by analyzing memory access characteristics via efficient memory tracing. Second, we leverage information about memory access characteristics to optimize the cache usage of algorithms and to avoid cache pollution by applying hardware-based cache partitioning. Third, after optimizing resource usage within a multicore processor, we optimize resource usage across multiple computer systems by addressing the problem of resource contention for bulk loading, i.e., ingesting large volumes of data into the system. We develop a distributed bulk loading mechanism, which utilizes network bandwidth and compute power more efficiently and improves both bulk loading throughput and query processing performance.
Engineering Aggregation Operators for Relational In-Memory Database Systems
In this thesis, we study the design and implementation of aggregation operators in the context of relational in-memory database systems. In particular, we identify and address the following challenges: cache-efficiency, CPU-friendliness, parallelism within and across processors, robust handling of skewed data, adaptive processing, processing with constrained memory, and integration with modern database architectures. Our resulting algorithm outperforms the state-of-the-art by up to 3.7x.
Scaling Up Concurrent Analytical Workloads on Multi-Core Servers
Today, an ever-increasing number of researchers, businesses, and data scientists collect and analyze massive amounts of data in database systems. The database system needs to process the resulting highly concurrent analytical workloads by exploiting modern multi-socket multi-core processor systems with non-uniform memory access (NUMA) architectures and increasing memory sizes. Conventional execution engines, however, are not designed for many cores, and neither scale nor perform efficiently on modern multi-core NUMA architectures. Firstly, their query-centric approach, where each query is optimized and evaluated independently, can result in unnecessary contention for hardware resources due to redundant work found across queries in highly concurrent workloads. Secondly, they are unaware of the non-uniform memory access costs and the underlying hardware topology, incurring unnecessarily expensive memory accesses and bandwidth saturation. In this thesis, we show how these scalability and performance impediments can be solved by exploiting sharing among concurrent queries and incorporating NUMA-aware adaptive task scheduling and data placement strategies in the execution engine. Regarding sharing, we identify and categorize state-of-the-art techniques for sharing data and work across concurrent queries at run-time into two categories: reactive sharing, which shares intermediate results across common query sub-plans, and proactive sharing, which builds a global query plan with shared operators to evaluate queries. We integrate the original research prototypes that introduce reactive and proactive sharing, perform a sensitivity analysis, and show how and when each technique benefits performance. Our most significant finding is that reactive and proactive sharing can be combined to exploit the advantages of both sharing techniques for highly concurrent analytical workloads. 
Regarding NUMA-awareness, we identify, implement, and compare various combinations of task scheduling and data placement strategies under a diverse set of highly concurrent analytical workloads. We develop a prototype based on a commercial main-memory column-store database system. Our most significant finding is that there is no single strategy for task scheduling and data placement that is best for all workloads. Specifically, inter-socket stealing of memory-intensive tasks can hurt overall performance, and unnecessary partitioning of data across sockets involves an overhead. For this reason, we implement algorithms that adapt task scheduling and data placement to the workload at run-time. Our experiments show that both sharing and NUMA-awareness can significantly improve the performance and scalability of highly concurrent analytical workloads on modern multi-core servers. Thus, we argue that sharing and NUMA-awareness are key factors for supporting faster processing of big data analytical applications, fully exploiting the hardware resources of modern multi-core servers, and providing a more responsive user experience.
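The reactive sharing idea described in this abstract, reusing intermediate results across common query sub-plans, can be sketched in a few lines of Python. This is a deliberately simplified, sequential illustration rather than the thesis prototype: the plan representation (nested tuples), the operator names, and the `shared_results` cache are all hypothetical, and a real engine would share results among queries that run concurrently.

```python
# Hypothetical toy engine: a plan is a nested tuple, and equal tuples denote a common sub-plan.
table = [3, 8, 1, 9, 4, 7]
shared_results = {}   # sub-plan -> intermediate result that later queries can reuse
executions = []       # log of operators that were actually executed


def execute(plan):
    # Reuse the intermediate result if an identical sub-plan has already been evaluated.
    if plan in shared_results:
        return shared_results[plan]
    op = plan[0]
    if op == "scan":
        result = tuple(table)
    elif op == "filter_gt":
        _, threshold, child = plan
        result = tuple(v for v in execute(child) if v > threshold)
    elif op == "sum":
        result = sum(execute(plan[1]))
    elif op == "count":
        result = len(execute(plan[1]))
    else:
        raise ValueError(f"unknown operator: {op}")
    executions.append(op)
    shared_results[plan] = result
    return result


common = ("filter_gt", 5, ("scan",))   # sub-plan shared by both queries
q1 = ("sum", common)
q2 = ("count", common)
total = execute(q1)
count = execute(q2)
print(total, count, executions)
```

The scan and the filter are executed once even though both queries depend on them; the second query only runs its own top-level operator.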