1,979 research outputs found
Robust Query Optimization for Analytical Database Systems
Querying and efficiently analyzing complex data is required to gain valuable business insights, to support machine learning applications, and to make up-to-date information available. Therefore, this thesis investigates opportunities and challenges of selecting the most efficient execution strategy for analytical queries. These challenges include hard-to-capture data characteristics such as skew and correlation, the support of arbitrary data types, and the optimization time overhead of complex queries. Existing approaches often rely on optimistic assumptions about the data distribution, which can result in significant response time delays when these assumptions are not met. On the contrary, we focus on robust query optimization, emphasizing consistent query performance and applicability. Our presentation follows the general select-project-join query pattern, representing the fundamental stages of analytical query processing. To support arbitrary data types and complex filter expressions in the select stage, a novel sampling-based selectivity estimator is developed. Our approach exploits information from filter subexpressions and estimates correlations that are not captured by existing sampling-based methods. We demonstrate improved estimation accuracy and query execution time. Further, to minimize the runtime overhead of sampling, we propose new techniques that exploit access patterns and auxiliary database objects such as indices. For the join stage, we introduce a robust optimization approach by developing an upper-bound join enumeration strategy that connects accurate filter selectivity estimates –e.g., using our sampling-based approach– to join ordering. We demonstrate that join orders based on our upper-bound join ordering strategy achieve more consistent performance and faster workload execution on state-of-the-art database systems. However, besides identifying good logical join orders, it is crucial to determine appropriate physical join operators before query plan execution. To understand the importance of fine-grained physical operator selections, we exhaustively execute fixed join orders with all possible operator combinations. This analysis reveals that none of the investigated query optimizers fully reaches the potential of optimal operator decisions. Based on these insights and to achieve fine-grained operator selections for the previously determined join orders, the thesis presents a lightweight learning-based physical execution plan refinement component called. We show that this refinement component consistently outperforms existing approaches for physical operator selection while enabling a novel two-stage optimizer design. We conclude the thesis by providing a framework for the two-stage optimizer design that allows users to modify, replicate, and further analyze the concepts discussed throughout this thesis.:1 INTRODUCTION
1.1 Analytical Query Processing . . . . . . . . . . . . . . . . . . . 12
1.2 Select-Project-Join Queries . . . . . . . . . . . . . . . . . . . 13
1.3 Basics of SPJ Query Optimization . . . . . . . . . . . . . . . . . 14
1.3.1 Plan Enumeration . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.2 Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3.3 Cardinality Estimation . . . . . . . . . . . . . . . . . . . . . 15
1.4 Robust SPJ Query Optimization . . . . . . . . . . . . . . . . . . 16
1.4.1 Tail Latency Root Cause Analysis . . . . . . . . . . . . . . . . 17
1.4.2 Tenets of Robust Query Optimization . . . . . . . . . . . . . . 19
1.5 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.6 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2 SELECT (-PROJECT) STAGE
2.1 Sampling for Selectivity Estimation . . . . . . . . . . . . . . . 24
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.1 Combined Selectivity Estimation (CSE) . . . . . . . . . . . . . 29
2.2.2 Kernel Density Estimator . . . . . . . . . . . . . . . . . . . . 31
2.2.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3 Beta Estimator for 0-Tuple-Situations . . . . . . . . . . . . . . 33
2.3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.2 Beta Distribution in Non-0-TS . . . . . . . . . . . . . . . . . 35
2.3.3 Parameter Estimation in 0-TS . . . . . . . . . . . . . . . . . . 37
2.3.4 Selectivity Estimation and Predicate Ordering . . . . . . . . . 39
2.3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.4 Customized Sampling Techniques . . . . . . . . . . . . . . . . . . 53
2.4.1 Focused Sampling . . . . . . . . . . . . . . . . . . . . . . . . 54
2.4.2 Conditional Sampling . . . . . . . . . . . . . . . . . . . . . . 56
2.4.3 Zone Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3 JOIN STAGE: LOGICAL ENUMERATION
3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.1.1 Point Estimates . . . . . . . . . . . . . . . . . . . . . . . . 63
3.1.2 Join Cardinality Upper Bound . . . . . . . . . . . . . . . . . . 64
3.2 Upper Bound Join Enumeration with Synopsis (UES) . . . . . . . . . 66
3.2.1 U-Block: Simple Upper Bound for Joins . . . . . . . . . . . . . 67
3.2.2 E-Block: Customized Enumeration Scheme . . . . . . . . . . . . . 68
3.2.3 UES Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3.1 General Performance . . . . . . . . . . . . . . . . . . . . . . 72
3.3.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4 JOIN STAGE: PHYSICAL OPERATOR SELECTION
4.1 Operator Selection vs Join Ordering . . . . . . . . . . . . . . . 77
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2.1 Adaptive Query Processing . . . . . . . . . . . . . . . . . . . 80
4.2.2 Bandit Optimizer (Bao) . . . . . . . . . . . . . . . . . . . . . 81
4.3 TONIC: Learned Physical Join Operator Selection . . . . . . . . . 82
4.3.1 Query Execution Plan Synopsis (QEP-S) . . . . . . . . . . . . . 83
4.3.2 QEP-S Life-Cycle . . . . . . . . . . . . . . . . . . . . . . . . 84
4.3.3 QEP-S Design Considerations . . . . . . . . . . . . . . . . . . 87
4.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4.1 Performance Factors . . . . . . . . . . . . . . . . . . . . . . 90
4.4.2 Rate of Improvement . . . . . . . . . . . . . . . . . . . . . . 92
4.4.3 Data Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.4.4 TONIC - Runtime Traits . . . . . . . . . . . . . . . . . . . . . 97
4.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5 TWO-STAGE OPTIMIZER FRAMEWORK
5.1 Upper-Bound-Driven Join Ordering Component . . . . . . . . . . . . 101
5.2 Physical Operator Selection Component . . . . . . . . . . . . . . 103
5.3 Example Query Optimization . . . . . . . . . . . . . . . . . . . . 103
6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
A APPENDIX
A.1 Basics of Query Execution . . . . . . . . . . . . . . . . . . . . 123
A.2 Why Q? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
A.3 0-TS Proof of Unbiased Estimate . . . . . . . . . . . . . . . . . 125
A.4 UES Upper Bound Property . . . . . . . . . . . . . . . . . . . . . 127
A.5 TONIC – Selectivity-Aware Branching . . . . . . . . . . . . . . . 128
A.6 TONIC – Sequences of Query Execution . . . . . . . . . . . . . . . 12
Robust Query Optimization for Analytical Database Systems
Querying and efficiently analyzing complex data is required to gain valuable business insights, to support machine learning applications, and to make up-to-date information available. Therefore, this thesis investigates opportunities and challenges of selecting the most efficient execution strategy for analytical queries. These challenges include hard-to-capture data characteristics such as skew and correlation, the support of arbitrary data types, and the optimization time overhead of complex queries. Existing approaches often rely on optimistic assumptions about the data distribution, which can result in significant response time delays when these assumptions are not met. On the contrary, we focus on robust query optimization, emphasizing consistent query performance and applicability. Our presentation follows the general select-project-join query pattern, representing the fundamental stages of analytical query processing. To support arbitrary data types and complex filter expressions in the select stage, a novel sampling-based selectivity estimator is developed. Our approach exploits information from filter subexpressions and estimates correlations that are not captured by existing sampling-based methods. We demonstrate improved estimation accuracy and query execution time. Further, to minimize the runtime overhead of sampling, we propose new techniques that exploit access patterns and auxiliary database objects such as indices. For the join stage, we introduce a robust optimization approach by developing an upper-bound join enumeration strategy that connects accurate filter selectivity estimates –e.g., using our sampling-based approach– to join ordering. We demonstrate that join orders based on our upper-bound join ordering strategy achieve more consistent performance and faster workload execution on state-of-the-art database systems. However, besides identifying good logical join orders, it is crucial to determine appropriate physical join operators before query plan execution. To understand the importance of fine-grained physical operator selections, we exhaustively execute fixed join orders with all possible operator combinations. This analysis reveals that none of the investigated query optimizers fully reaches the potential of optimal operator decisions. Based on these insights and to achieve fine-grained operator selections for the previously determined join orders, the thesis presents a lightweight learning-based physical execution plan refinement component called. We show that this refinement component consistently outperforms existing approaches for physical operator selection while enabling a novel two-stage optimizer design. We conclude the thesis by providing a framework for the two-stage optimizer design that allows users to modify, replicate, and further analyze the concepts discussed throughout this thesis.:1 INTRODUCTION
1.1 Analytical Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 Select-Project-Join Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3 Basics of SPJ Query Optimization . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.1 Plan Enumeration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3.2 Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.3.3 Cardinality Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4 Robust SPJ Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.1 Tail Latency Root Cause Analysis . . . . . . . . . . . . . . . . . . . 17
1.4.2 Tenets of Robust Query Optimization . . . . . . . . . . . . . . . . . 19
1.5 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.6 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2 SELECT (-PROJECT) STAGE
2.1 Sampling for Selectivity Estimation . . . . . . . . . . . . . . . . . . . . . . 24
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.1 Combined Selectivity Estimation (CSE) . . . . . . . . . . . . . . . . 29
2.2.2 Kernel Density Estimator . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3 Beta Estimator for 0-Tuple-Situations . . . . . . . . . . . . . . . . . . . . . 33
2.3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.2 Beta Distribution in Non-0-TS . . . . . . . . . . . . . . . . . . . . . . 35
2.3.3 Parameter Estimation in 0-TS . . . . . . . . . . . . . . . . . . . . . . 37
2.3.4 Selectivity Estimation and Predicate Ordering . . . . . . . . . . . 39
2.3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.4 Customized Sampling Techniques . . . . . . . . . . . . . . . . . . . . . . 53
2.4.1 Focused Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.4.2 Conditional Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.4.3 Zone Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3 JOIN STAGE: LOGICAL ENUMERATION
3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.1.1 Point Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.1.2 Join Cardinality Upper Bound . . . . . . . . . . . . . . . . . . . . . 64
3.2 Upper Bound Join Enumeration with Synopsis (UES) . . . . . . . . . . . . 66
3.2.1 U-Block: Simple Upper Bound for Joins . . . . . . . . . . . . . . . . 67
3.2.2 E-Block: Customized Enumeration Scheme . . . . . . . . . . . . . 68
3.2.3 UES Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3.1 General Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.3.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4 JOIN STAGE: PHYSICAL OPERATOR SELECTION
4.1 Operator Selection vs Join Ordering . . . . . . . . . . . . . . . . . . . . . 77
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2.1 Adaptive Query Processing . . . . . . . . . . . . . . . . . . . . . . 80
4.2.2 Bandit Optimizer (Bao) . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3 TONIC: Learned Physical Join Operator Selection . . . . . . . . . . . . . 82
4.3.1 Query Execution Plan Synopsis (QEP-S) . . . . . . . . . . . . . . . 83
4.3.2 QEP-S Life-Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.3.3 QEP-S Design Considerations . . . . . . . . . . . . . . . . . . . . . . 87
4.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4.1 Performance Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.4.2 Rate of Improvement . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4.3 Data Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.4.4 TONIC - Runtime Traits . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5 TWO-STAGE OPTIMIZER FRAMEWORK
5.1 Upper-Bound-Driven Join Ordering Component . . . . . . . . . . . . . 101
5.2 Physical Operator Selection Component . . . . . . . . . . . . . . . . . . 103
5.3 Example Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 103
6 CONCLUSION 107
BIBLIOGRAPHY 109
LIST OF FIGURES 117
LIST OF TABLES 121
A APPENDIX
A.1 Basics of Query Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
A.2 Why Q? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
A.3 0-TS Proof of Unbiased Estimate . . . . . . . . . . . . . . . . . . . . . . . . 125
A.4 UES Upper Bound Property . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
A.5 TONIC – Selectivity-Aware Branching . . . . . . . . . . . . . . . . . . . . . 128
A.6 TONIC – Sequences of Query Execution . . . . . . . . . . . . . . . . . . . 12
Incremental Processing and Optimization of Update Streams
Over the recent years, we have seen an increasing number of applications in networking, sensor networks, cloud computing, and environmental monitoring, which monitor, plan, control, and make decisions over data streams from multiple sources. We are interested in extending traditional stream processing techniques to meet the new challenges of these applications. Generally, in order to support genuine continuous query optimization and processing over data streams, we need to systematically understand how to address incremental optimization and processing of update streams for a rich class of queries commonly used in the applications.
Our general thesis is that efficient incremental processing and re-optimization of update streams can be achieved by various incremental view maintenance techniques if we cast the problems as incremental view maintenance problems over data streams. We focus on two incremental processing of update streams challenges currently not addressed in existing work on stream query processing: incremental processing of transitive closure queries over data streams, and incremental re-optimization of queries. In addition to addressing these specific challenges, we also develop a working prototype system Aspen, which serves as an end-to-end stream processing system that has been deployed as the foundation for a case study of our SmartCIS application. We validate our solutions both analytically and empirically on top of our prototype system Aspen, over a variety of benchmark workloads such as TPC-H and LinearRoad Benchmarks
Cost-Based Optimization of Integration Flows
Integration flows are increasingly used to specify and execute data-intensive integration tasks between heterogeneous systems and applications. There are many different application areas such as real-time ETL and data synchronization between operational systems. For the reasons of an increasing amount of data, highly distributed IT infrastructures, and high requirements for data consistency and up-to-dateness of query results, many instances of integration flows are executed over time. Due to this high load and blocking synchronous source systems, the performance of the central integration platform is crucial for an IT infrastructure. To tackle these high performance requirements, we introduce the concept of cost-based optimization of imperative integration flows that relies on incremental statistics maintenance and inter-instance plan re-optimization. As a foundation, we introduce the concept of periodical re-optimization including novel cost-based optimization techniques that are tailor-made for integration flows. Furthermore, we refine the periodical re-optimization to on-demand re-optimization in order to overcome the problems of many unnecessary re-optimization steps and adaptation delays, where we miss optimization opportunities. This approach ensures low optimization overhead and fast workload adaptation
Database Learning: Toward a Database that Becomes Smarter Every Time
In today's databases, previous query answers rarely benefit answering future
queries. For the first time, to the best of our knowledge, we change this
paradigm in an approximate query processing (AQP) context. We make the
following observation: the answer to each query reveals some degree of
knowledge about the answer to another query because their answers stem from the
same underlying distribution that has produced the entire dataset. Exploiting
and refining this knowledge should allow us to answer queries more
analytically, rather than by reading enormous amounts of raw data. Also,
processing more queries should continuously enhance our knowledge of the
underlying distribution, and hence lead to increasingly faster response times
for future queries.
We call this novel idea---learning from past query answers---Database
Learning. We exploit the principle of maximum entropy to produce answers, which
are in expectation guaranteed to be more accurate than existing sample-based
approximations. Empowered by this idea, we build a query engine on top of Spark
SQL, called Verdict. We conduct extensive experiments on real-world query
traces from a large customer of a major database vendor. Our results
demonstrate that Verdict supports 73.7% of these queries, speeding them up by
up to 23.0x for the same accuracy level compared to existing AQP systems.Comment: This manuscript is an extended report of the work published in ACM
SIGMOD conference 201
Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications
MapReduce is a popular programming paradigm for developing large-scale,
data-intensive computation. Many frameworks that implement this paradigm have
recently been developed. To leverage these frameworks, however, developers must
become familiar with their APIs and rewrite existing code. Casper is a new tool
that automatically translates sequential Java programs into the MapReduce
paradigm. Casper identifies potential code fragments to rewrite and translates
them in two steps: (1) Casper uses program synthesis to search for a program
summary (i.e., a functional specification) of each code fragment. The summary
is expressed using a high-level intermediate language resembling the MapReduce
paradigm and verified to be semantically equivalent to the original using a
theorem prover. (2) Casper generates executable code from the summary, using
either the Hadoop, Spark, or Flink API. We evaluated Casper by automatically
converting real-world, sequential Java benchmarks to MapReduce. The resulting
benchmarks perform up to 48.2x faster compared to the original.Comment: 12 pages, additional 4 pages of references and appendi
- …