VerdictDB: Universalizing Approximate Query Processing
Despite 25 years of research in academia, approximate query processing (AQP)
has had little industrial adoption. One of the major causes of this slow
adoption is the reluctance of traditional vendors to make radical changes to
their legacy codebases, and the preoccupation of newer vendors (e.g.,
SQL-on-Hadoop products) with implementing standard features. Additionally, the
few AQP engines that are available are each tied to a specific platform and
require users to completely abandon their existing databases---an unrealistic
expectation given the infancy of the AQP technology. Therefore, we argue that a
universal solution is needed: a database-agnostic approximation engine that
will widen the reach of this emerging technology across various platforms.
Our proposal, called VerdictDB, uses a middleware architecture that requires
no changes to the backend database, and thus, can work with all off-the-shelf
engines. Operating at the driver-level, VerdictDB intercepts analytical queries
issued to the database and rewrites them into another query that, if executed
by any standard relational engine, will yield sufficient information for
computing an approximate answer. VerdictDB uses the returned result set to
compute an approximate answer and error estimates, which are then passed on to
the user or application. However, lack of access to the query execution layer
introduces significant challenges in terms of generality, correctness, and
efficiency. This paper shows how VerdictDB overcomes these challenges and
delivers up to 171× speedup (18.45× on average) for a variety of
existing engines, such as Impala, Spark SQL, and Amazon Redshift, while
incurring less than 2.6% relative error. VerdictDB is open-sourced under the
Apache License.
Comment: Extended technical report of the paper that appeared in Proceedings
of the 2018 International Conference on Management of Data, pp. 1461-1476.
ACM, 201
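The rewrite-then-estimate idea described in this abstract can be illustrated with a minimal sketch. It is not VerdictDB's implementation: the `orders` table, the `orders_sample` naming convention, the helper names, and the use of SQLite are all assumptions made for illustration. A SUM query is rewritten to run against a Bernoulli sample, and the returned aggregates are scaled into an estimate with a standard error.

```python
import math
import random
import sqlite3


def build_demo(conn, n_rows=20000, rate=0.2, seed=0):
    """Create a toy `orders` table and a Bernoulli sample of it (hypothetical schema)."""
    rng = random.Random(seed)
    prices = [(rng.uniform(1.0, 100.0),) for _ in range(n_rows)]
    conn.execute("CREATE TABLE orders (price REAL)")
    conn.execute("CREATE TABLE orders_sample (price REAL)")
    conn.executemany("INSERT INTO orders VALUES (?)", prices)
    # Each row enters the sample independently with probability `rate`.
    sample = [p for p in prices if rng.random() < rate]
    conn.executemany("INSERT INTO orders_sample VALUES (?)", sample)


def rewrite_sum(table, column):
    """Rewrite SUM(column) over `table` into a query over `<table>_sample`.

    Names are interpolated directly, so this is for trusted input only.
    """
    return (f"SELECT COUNT(*), SUM({column}), SUM({column} * {column}) "
            f"FROM {table}_sample")


def approx_sum(conn, table, column, rate):
    """Return (estimate, standard_error) for SUM(column) using only the sample."""
    n, s, s2 = conn.execute(rewrite_sum(table, column)).fetchone()
    if n == 0:
        return 0.0, 0.0
    estimate = s / rate  # scale the sample sum by the inverse sampling rate
    # Variance estimate for a Bernoulli sample: (1 - rate) / rate^2 * sum(x^2).
    variance = (1.0 - rate) / (rate * rate) * s2
    return estimate, math.sqrt(variance)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_demo(conn)
    exact = conn.execute("SELECT SUM(price) FROM orders").fetchone()[0]
    est, se = approx_sum(conn, "orders", "price", rate=0.2)
    print(f"exact={exact:.0f} approx={est:.0f} +/- {1.96 * se:.0f}")
```

The backend only ever executes standard SQL. All approximation logic lives outside the engine, which is the property the abstract emphasizes.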
Novel Selectivity Estimation Strategy for Modern DBMS
Selectivity estimation is important in query optimization, but accurate
estimation is difficult when predicates are complex. Because existing
database synopses and statistics are not helpful in such cases, we introduce a
new approach that computes the exact selectivity by running an aggregate query
during the optimization phase. Exact selectivity can be obtained without
significant overhead for in-memory and GPU-accelerated databases by adding extra
query execution calls. We implement a selection push-down extension based on the
novel selectivity estimation strategy in the MapD database system. Our approach
incurs a constant overhead of less than 30 milliseconds under all circumstances
while running on GPU. The novel strategy generates better query execution plans,
which improve performance by up to 4.8 times on TPC-H benchmark SF-50 queries
and by up to 7.3 times on star schema benchmark SF-80 queries.
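As a rough illustration of computing exact selectivity with an aggregate query, the sketch below runs one extra query that averages a 0/1 indicator of the predicate. It uses SQLite rather than MapD, and the table `t`, column `x`, and function names are hypothetical. The paper's actual integration into the optimizer is not shown.

```python
import sqlite3


def exact_selectivity(conn, table, predicate):
    """Fraction of rows in `table` satisfying `predicate`, via one aggregate query.

    The predicate is interpolated as raw SQL, so this is for trusted input only.
    """
    sql = f"SELECT AVG(CASE WHEN {predicate} THEN 1.0 ELSE 0.0 END) FROM {table}"
    (value,) = conn.execute(sql).fetchone()
    return 0.0 if value is None else value  # AVG over an empty table is NULL


def demo_connection():
    """Toy table with the integers 0..99 in column x (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    # A compound predicate that per-column statistics would struggle to estimate.
    print(exact_selectivity(conn, "t", "x < 25 AND x % 2 = 0"))
```

An optimizer could compare the returned fraction against a threshold to decide whether pushing the selection below a join is worthwhile. That decision rule is left out here because the abstract does not specify it.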
Robust Query Optimization for Analytical Database Systems
Querying and efficiently analyzing complex data is required to gain valuable business insights, to support machine learning applications, and to make up-to-date information available. Therefore, this thesis investigates opportunities and challenges of selecting the most efficient execution strategy for analytical queries. These challenges include hard-to-capture data characteristics such as skew and correlation, the support of arbitrary data types, and the optimization time overhead of complex queries. Existing approaches often rely on optimistic assumptions about the data distribution, which can result in significant response time delays when these assumptions are not met. In contrast, we focus on robust query optimization, emphasizing consistent query performance and applicability. Our presentation follows the general select-project-join query pattern, representing the fundamental stages of analytical query processing. To support arbitrary data types and complex filter expressions in the select stage, a novel sampling-based selectivity estimator is developed. Our approach exploits information from filter subexpressions and estimates correlations that are not captured by existing sampling-based methods. We demonstrate improved estimation accuracy and query execution time. Further, to minimize the runtime overhead of sampling, we propose new techniques that exploit access patterns and auxiliary database objects such as indices. For the join stage, we introduce a robust optimization approach by developing an upper-bound join enumeration strategy that connects accurate filter selectivity estimates (e.g., using our sampling-based approach) to join ordering. We demonstrate that join orders based on our upper-bound join ordering strategy achieve more consistent performance and faster workload execution on state-of-the-art database systems.
However, besides identifying good logical join orders, it is crucial to determine appropriate physical join operators before query plan execution. To understand the importance of fine-grained physical operator selections, we exhaustively execute fixed join orders with all possible operator combinations. This analysis reveals that none of the investigated query optimizers fully reaches the potential of optimal operator decisions. Based on these insights and to achieve fine-grained operator selections for the previously determined join orders, the thesis presents a lightweight learning-based physical execution plan refinement component. We show that this refinement component consistently outperforms existing approaches for physical operator selection while enabling a novel two-stage optimizer design. We conclude the thesis by providing a framework for the two-stage optimizer design that allows users to modify, replicate, and further analyze the concepts discussed throughout this thesis.
1 INTRODUCTION
1.1 Analytical Query Processing
1.2 Select-Project-Join Queries
1.3 Basics of SPJ Query Optimization
1.3.1 Plan Enumeration
1.3.2 Cost Model
1.3.3 Cardinality Estimation
1.4 Robust SPJ Query Optimization
1.4.1 Tail Latency Root Cause Analysis
1.4.2 Tenets of Robust Query Optimization
1.5 Contribution
1.6 Outline
2 SELECT (-PROJECT) STAGE
2.1 Sampling for Selectivity Estimation
2.2 Related Work
2.2.1 Combined Selectivity Estimation (CSE)
2.2.2 Kernel Density Estimator
2.2.3 Machine Learning
2.3 Beta Estimator for 0-Tuple-Situations
2.3.1 Methodology
2.3.2 Beta Distribution in Non-0-TS
2.3.3 Parameter Estimation in 0-TS
2.3.4 Selectivity Estimation and Predicate Ordering
2.3.5 Evaluation
2.4 Customized Sampling Techniques
2.4.1 Focused Sampling
2.4.2 Conditional Sampling
2.4.3 Zone Pruning
2.4.4 Discussion
2.5 Summary
3 JOIN STAGE: LOGICAL ENUMERATION
3.1 Related Work
3.1.1 Point Estimates
3.1.2 Join Cardinality Upper Bound
3.2 Upper Bound Join Enumeration with Synopsis (UES)
3.2.1 U-Block: Simple Upper Bound for Joins
3.2.2 E-Block: Customized Enumeration Scheme
3.2.3 UES Algorithm
3.3 Evaluation
3.3.1 General Performance
3.3.2 Discussion
3.4 Summary
4 JOIN STAGE: PHYSICAL OPERATOR SELECTION
4.1 Operator Selection vs Join Ordering
4.2 Related Work
4.2.1 Adaptive Query Processing
4.2.2 Bandit Optimizer (Bao)
4.3 TONIC: Learned Physical Join Operator Selection
4.3.1 Query Execution Plan Synopsis (QEP-S)
4.3.2 QEP-S Life-Cycle
4.3.3 QEP-S Design Considerations
4.4 Evaluation
4.4.1 Performance Factors
4.4.2 Rate of Improvement
4.4.3 Data Shift
4.4.4 TONIC - Runtime Traits
4.4.5 Discussion
4.5 Summary
5 TWO-STAGE OPTIMIZER FRAMEWORK
5.1 Upper-Bound-Driven Join Ordering Component
5.2 Physical Operator Selection Component
5.3 Example Query Optimization
6 CONCLUSION
BIBLIOGRAPHY
LIST OF FIGURES
LIST OF TABLES
A APPENDIX
A.1 Basics of Query Execution
A.2 Why Q?
A.3 0-TS Proof of Unbiased Estimate
A.4 UES Upper Bound Property
A.5 TONIC – Selectivity-Aware Branching
A.6 TONIC – Sequences of Query Execution
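For readers unfamiliar with sampling-based selectivity estimation, which the thesis abstract builds on, the following is a minimal sketch of the plain baseline only. It does not reproduce the thesis's estimator; the function name and toy data are assumptions. The predicate is evaluated on a uniform random sample and the matching fraction is reported, which also shows the 0-tuple situation named in the outline: a rare predicate can match no sampled row, giving an estimate of exactly zero.

```python
import random


def sample_selectivity(rows, predicate, sample_size, seed=0):
    """Estimate the fraction of `rows` satisfying `predicate` from a uniform sample."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    if not sample:
        return 0.0
    hits = sum(1 for row in sample if predicate(row))
    return hits / len(sample)


if __name__ == "__main__":
    rows = list(range(1000))
    # Common predicate: the estimate should land near the true value 0.5.
    print(sample_selectivity(rows, lambda x: x < 500, sample_size=200))
    # Rare predicate (true selectivity 0.001): the sample may contain no match.
    print(sample_selectivity(rows, lambda x: x == 7, sample_size=200))
```

Because the predicate is an arbitrary Python callable here, the sketch also reflects why sampling works for arbitrary data types and complex filter expressions: nothing needs to be known about the predicate beyond how to evaluate it on a row.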
On the selection of secondary indices in relational databases
An important problem in the physical design of databases is the selection of secondary indices. In general, this problem cannot be solved optimally because of the complexity of the selection process. Heuristics such as the well-known ADD and DROP algorithms are therefore often used. In this paper it will be shown that frequently used cost functions can be classified as super- or submodular functions. For these functions, several mathematical properties have been derived that reduce the complexity of the index selection problem. These properties will be used to develop a tool for physical database design, and they also give a mathematical foundation for the success of the aforementioned ADD and DROP algorithms.
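The ADD heuristic mentioned in this abstract can be sketched as a greedy loop: start with no indices and repeatedly add the single index that lowers total cost the most, stopping when no addition helps. The cost model below (per-query cost with the cheapest applicable index, plus a fixed maintenance cost per index) and all names and numbers are assumptions for illustration, not the paper's cost functions.

```python
# Hypothetical workload: cost of each query without any index.
BASE = {"q1": 100, "q2": 80, "q3": 60}
# Hypothetical cost of a query when a given index is available.
WITH_INDEX = {
    "idx_a": {"q1": 10},
    "idx_b": {"q2": 20, "q3": 50},
    "idx_c": {"q3": 55},
}
# Hypothetical maintenance cost of keeping each index.
MAINTENANCE = {"idx_a": 15, "idx_b": 25, "idx_c": 10}


def total_cost(chosen, base, with_index, maintenance):
    """Maintenance of chosen indices plus the cheapest access cost per query."""
    cost = sum(maintenance[i] for i in chosen)
    for query, scan_cost in base.items():
        options = [with_index[i][query] for i in chosen if query in with_index[i]]
        cost += min(options + [scan_cost])
    return cost


def add_heuristic(base, with_index, maintenance):
    """Greedy ADD: grow the index set while some single addition lowers the cost."""
    chosen = []
    best = total_cost(chosen, base, with_index, maintenance)
    while True:
        candidates = [i for i in sorted(with_index) if i not in chosen]
        scored = [(total_cost(chosen + [i], base, with_index, maintenance), i)
                  for i in candidates]
        if not scored:
            break
        cost, index = min(scored)  # ties are broken by index name
        if cost >= best:
            break
        chosen.append(index)
        best = cost
    return chosen, best


if __name__ == "__main__":
    print(add_heuristic(BASE, WITH_INDEX, MAINTENANCE))  # (['idx_a', 'idx_b'], 120)
```

In this toy model an index never saves more when added to a larger set than to a smaller one. That diminishing-returns behaviour is the kind of structure the abstract's super- and submodularity results are about. The DROP heuristic runs the same loop in reverse, starting from all candidates and removing one index at a time.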