Search CORE

56 research outputs found

QuickSel: Quick Selectivity Learning with Mixture Models

Author: Aboulnaga A.
Agrawal S.
Anagnostopoulos C.
Asparouhov T.
Chaudhuri S.
Gupta A.
Jagadish H.
Jagadish H. V.
Khachatryan A.
Kraska T.
Lam E.
Lim L.
Lin X.
Lynch C. A.
Markl V.
Markl V.
Rubner Y.
Ré C.
Ré C.
Stillger M.
Sun J.
Tzoumas K.
Van Gelder A.
Wu Y.
Yang J.
Zhang Q.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 10/04/2020
Field of study

Estimating the selectivity of a query is a key step in almost any cost-based query optimizer. Most of today's databases rely on histograms or samples that are periodically refreshed by re-scanning the data as the underlying data changes. Since frequent scans are costly, these statistics are often stale and lead to poor selectivity estimates. As an alternative to scans, query-driven histograms have been proposed, which refine the histograms based on the actual selectivities of the observed queries. Unfortunately, these approaches are either too costly to use in practice---i.e., require an exponential number of buckets---or quickly lose their advantage as they observe more queries. In this paper, we propose a selectivity learning framework, called QuickSel, which falls into the query-driven paradigm but does not use histograms. Instead, it builds an internal model of the underlying data, which can be refined significantly faster (e.g., only 1.9 milliseconds for 300 queries). This fast refinement allows QuickSel to continuously learn from each query and yield increasingly more accurate selectivity estimates over time. Unlike query-driven histograms, QuickSel relies on a mixture model and a new optimization algorithm for training its model. Our extensive experiments on two real-world datasets confirm that, given the same target accuracy, QuickSel is 34.0x-179.4x faster than state-of-the-art query-driven histograms, including ISOMER and STHoles. Further, given the same space budget, QuickSel is 26.8%-91.8% more accurate than periodically-updated histograms and samples, respectively

arXiv.org e-Print Archive

Crossref

Robust Query Optimization for Analytical Database Systems

Author: Hertzschuch Axel
Publication venue
Publication date: 09/08/2023
Field of study

Querying and efficiently analyzing complex data is required to gain valuable business insights, to support machine learning applications, and to make up-to-date information available. Therefore, this thesis investigates opportunities and challenges of selecting the most efficient execution strategy for analytical queries. These challenges include hard-to-capture data characteristics such as skew and correlation, the support of arbitrary data types, and the optimization time overhead of complex queries. Existing approaches often rely on optimistic assumptions about the data distribution, which can result in significant response time delays when these assumptions are not met. On the contrary, we focus on robust query optimization, emphasizing consistent query performance and applicability. Our presentation follows the general select-project-join query pattern, representing the fundamental stages of analytical query processing. To support arbitrary data types and complex filter expressions in the select stage, a novel sampling-based selectivity estimator is developed. Our approach exploits information from filter subexpressions and estimates correlations that are not captured by existing sampling-based methods. We demonstrate improved estimation accuracy and query execution time. Further, to minimize the runtime overhead of sampling, we propose new techniques that exploit access patterns and auxiliary database objects such as indices. For the join stage, we introduce a robust optimization approach by developing an upper-bound join enumeration strategy that connects accurate filter selectivity estimates –e.g., using our sampling-based approach– to join ordering. We demonstrate that join orders based on our upper-bound join ordering strategy achieve more consistent performance and faster workload execution on state-of-the-art database systems. However, besides identifying good logical join orders, it is crucial to determine appropriate physical join operators before query plan execution. To understand the importance of fine-grained physical operator selections, we exhaustively execute fixed join orders with all possible operator combinations. This analysis reveals that none of the investigated query optimizers fully reaches the potential of optimal operator decisions. Based on these insights and to achieve fine-grained operator selections for the previously determined join orders, the thesis presents a lightweight learning-based physical execution plan refinement component called. We show that this refinement component consistently outperforms existing approaches for physical operator selection while enabling a novel two-stage optimizer design. We conclude the thesis by providing a framework for the two-stage optimizer design that allows users to modify, replicate, and further analyze the concepts discussed throughout this thesis.:1 INTRODUCTION 1.1 Analytical Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.2 Select-Project-Join Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.3 Basics of SPJ Query Optimization . . . . . . . . . . . . . . . . . . . . . . . 14 1.3.1 Plan Enumeration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.3.2 Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.3.3 Cardinality Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.4 Robust SPJ Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . 16 1.4.1 Tail Latency Root Cause Analysis . . . . . . . . . . . . . . . . . . . 17 1.4.2 Tenets of Robust Query Optimization . . . . . . . . . . . . . . . . . 19 1.5 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.6 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2 SELECT (-PROJECT) STAGE 2.1 Sampling for Selectivity Estimation . . . . . . . . . . . . . . . . . . . . . . 24 2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.2.1 Combined Selectivity Estimation (CSE) . . . . . . . . . . . . . . . . 29 2.2.2 Kernel Density Estimator . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.2.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.3 Beta Estimator for 0-Tuple-Situations . . . . . . . . . . . . . . . . . . . . . 33 2.3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.3.2 Beta Distribution in Non-0-TS . . . . . . . . . . . . . . . . . . . . . . 35 2.3.3 Parameter Estimation in 0-TS . . . . . . . . . . . . . . . . . . . . . . 37 2.3.4 Selectivity Estimation and Predicate Ordering . . . . . . . . . . . 39 2.3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 2.4 Customized Sampling Techniques . . . . . . . . . . . . . . . . . . . . . . 53 2.4.1 Focused Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 2.4.2 Conditional Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 56 2.4.3 Zone Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 2.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3 JOIN STAGE: LOGICAL ENUMERATION 3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.1.1 Point Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.1.2 Join Cardinality Upper Bound . . . . . . . . . . . . . . . . . . . . . 64 3.2 Upper Bound Join Enumeration with Synopsis (UES) . . . . . . . . . . . . 66 3.2.1 U-Block: Simple Upper Bound for Joins . . . . . . . . . . . . . . . . 67 3.2.2 E-Block: Customized Enumeration Scheme . . . . . . . . . . . . . 68 3.2.3 UES Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.3.1 General Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3.3.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4 JOIN STAGE: PHYSICAL OPERATOR SELECTION 4.1 Operator Selection vs Join Ordering . . . . . . . . . . . . . . . . . . . . . 77 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.2.1 Adaptive Query Processing . . . . . . . . . . . . . . . . . . . . . . 80 4.2.2 Bandit Optimizer (Bao) . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.3 TONIC: Learned Physical Join Operator Selection . . . . . . . . . . . . . 82 4.3.1 Query Execution Plan Synopsis (QEP-S) . . . . . . . . . . . . . . . 83 4.3.2 QEP-S Life-Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.3.3 QEP-S Design Considerations . . . . . . . . . . . . . . . . . . . . . . 87 4.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.4.1 Performance Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 4.4.2 Rate of Improvement . . . . . . . . . . . . . . . . . . . . . . . . . . 92 4.4.3 Data Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 4.4.4 TONIC - Runtime Traits . . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5 TWO-STAGE OPTIMIZER FRAMEWORK 5.1 Upper-Bound-Driven Join Ordering Component . . . . . . . . . . . . . 101 5.2 Physical Operator Selection Component . . . . . . . . . . . . . . . . . . 103 5.3 Example Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 103 6 CONCLUSION 107 BIBLIOGRAPHY 109 LIST OF FIGURES 117 LIST OF TABLES 121 A APPENDIX A.1 Basics of Query Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 A.2 Why Q? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 A.3 0-TS Proof of Unbiased Estimate . . . . . . . . . . . . . . . . . . . . . . . . 125 A.4 UES Upper Bound Property . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 A.5 TONIC – Selectivity-Aware Branching . . . . . . . . . . . . . . . . . . . . . 128 A.6 TONIC – Sequences of Query Execution . . . . . . . . . . . . . . . . . . . 12

Technische Universität Dresden: Qucosa

Robust Query Optimization for Analytical Database Systems

Author: Hertzschuch Axel
Publication venue
Publication date: 25/09/2023
Field of study

Querying and efficiently analyzing complex data is required to gain valuable business insights, to support machine learning applications, and to make up-to-date information available. Therefore, this thesis investigates opportunities and challenges of selecting the most efficient execution strategy for analytical queries. These challenges include hard-to-capture data characteristics such as skew and correlation, the support of arbitrary data types, and the optimization time overhead of complex queries. Existing approaches often rely on optimistic assumptions about the data distribution, which can result in significant response time delays when these assumptions are not met. On the contrary, we focus on robust query optimization, emphasizing consistent query performance and applicability. Our presentation follows the general select-project-join query pattern, representing the fundamental stages of analytical query processing. To support arbitrary data types and complex filter expressions in the select stage, a novel sampling-based selectivity estimator is developed. Our approach exploits information from filter subexpressions and estimates correlations that are not captured by existing sampling-based methods. We demonstrate improved estimation accuracy and query execution time. Further, to minimize the runtime overhead of sampling, we propose new techniques that exploit access patterns and auxiliary database objects such as indices. For the join stage, we introduce a robust optimization approach by developing an upper-bound join enumeration strategy that connects accurate filter selectivity estimates –e.g., using our sampling-based approach– to join ordering. We demonstrate that join orders based on our upper-bound join ordering strategy achieve more consistent performance and faster workload execution on state-of-the-art database systems. However, besides identifying good logical join orders, it is crucial to determine appropriate physical join operators before query plan execution. To understand the importance of fine-grained physical operator selections, we exhaustively execute fixed join orders with all possible operator combinations. This analysis reveals that none of the investigated query optimizers fully reaches the potential of optimal operator decisions. Based on these insights and to achieve fine-grained operator selections for the previously determined join orders, the thesis presents a lightweight learning-based physical execution plan refinement component called. We show that this refinement component consistently outperforms existing approaches for physical operator selection while enabling a novel two-stage optimizer design. We conclude the thesis by providing a framework for the two-stage optimizer design that allows users to modify, replicate, and further analyze the concepts discussed throughout this thesis.:1 INTRODUCTION 1.1 Analytical Query Processing . . . . . . . . . . . . . . . . . . . 12 1.2 Select-Project-Join Queries . . . . . . . . . . . . . . . . . . . 13 1.3 Basics of SPJ Query Optimization . . . . . . . . . . . . . . . . . 14 1.3.1 Plan Enumeration . . . . . . . . . . . . . . . . . . . . . . . . 14 1.3.2 Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.3.3 Cardinality Estimation . . . . . . . . . . . . . . . . . . . . . 15 1.4 Robust SPJ Query Optimization . . . . . . . . . . . . . . . . . . 16 1.4.1 Tail Latency Root Cause Analysis . . . . . . . . . . . . . . . . 17 1.4.2 Tenets of Robust Query Optimization . . . . . . . . . . . . . . 19 1.5 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.6 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2 SELECT (-PROJECT) STAGE 2.1 Sampling for Selectivity Estimation . . . . . . . . . . . . . . . 24 2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.2.1 Combined Selectivity Estimation (CSE) . . . . . . . . . . . . . 29 2.2.2 Kernel Density Estimator . . . . . . . . . . . . . . . . . . . . 31 2.2.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . 32 2.3 Beta Estimator for 0-Tuple-Situations . . . . . . . . . . . . . . 33 2.3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.3.2 Beta Distribution in Non-0-TS . . . . . . . . . . . . . . . . . 35 2.3.3 Parameter Estimation in 0-TS . . . . . . . . . . . . . . . . . . 37 2.3.4 Selectivity Estimation and Predicate Ordering . . . . . . . . . 39 2.3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 2.4 Customized Sampling Techniques . . . . . . . . . . . . . . . . . . 53 2.4.1 Focused Sampling . . . . . . . . . . . . . . . . . . . . . . . . 54 2.4.2 Conditional Sampling . . . . . . . . . . . . . . . . . . . . . . 56 2.4.3 Zone Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . 58 2.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3 JOIN STAGE: LOGICAL ENUMERATION 3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.1.1 Point Estimates . . . . . . . . . . . . . . . . . . . . . . . . 63 3.1.2 Join Cardinality Upper Bound . . . . . . . . . . . . . . . . . . 64 3.2 Upper Bound Join Enumeration with Synopsis (UES) . . . . . . . . . 66 3.2.1 U-Block: Simple Upper Bound for Joins . . . . . . . . . . . . . 67 3.2.2 E-Block: Customized Enumeration Scheme . . . . . . . . . . . . . 68 3.2.3 UES Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.3.1 General Performance . . . . . . . . . . . . . . . . . . . . . . 72 3.3.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4 JOIN STAGE: PHYSICAL OPERATOR SELECTION 4.1 Operator Selection vs Join Ordering . . . . . . . . . . . . . . . 77 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.2.1 Adaptive Query Processing . . . . . . . . . . . . . . . . . . . 80 4.2.2 Bandit Optimizer (Bao) . . . . . . . . . . . . . . . . . . . . . 81 4.3 TONIC: Learned Physical Join Operator Selection . . . . . . . . . 82 4.3.1 Query Execution Plan Synopsis (QEP-S) . . . . . . . . . . . . . 83 4.3.2 QEP-S Life-Cycle . . . . . . . . . . . . . . . . . . . . . . . . 84 4.3.3 QEP-S Design Considerations . . . . . . . . . . . . . . . . . . 87 4.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.4.1 Performance Factors . . . . . . . . . . . . . . . . . . . . . . 90 4.4.2 Rate of Improvement . . . . . . . . . . . . . . . . . . . . . . 92 4.4.3 Data Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 4.4.4 TONIC - Runtime Traits . . . . . . . . . . . . . . . . . . . . . 97 4.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5 TWO-STAGE OPTIMIZER FRAMEWORK 5.1 Upper-Bound-Driven Join Ordering Component . . . . . . . . . . . . 101 5.2 Physical Operator Selection Component . . . . . . . . . . . . . . 103 5.3 Example Query Optimization . . . . . . . . . . . . . . . . . . . . 103 6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 A APPENDIX A.1 Basics of Query Execution . . . . . . . . . . . . . . . . . . . . 123 A.2 Why Q? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 A.3 0-TS Proof of Unbiased Estimate . . . . . . . . . . . . . . . . . 125 A.4 UES Upper Bound Property . . . . . . . . . . . . . . . . . . . . . 127 A.5 TONIC – Selectivity-Aware Branching . . . . . . . . . . . . . . . 128 A.6 TONIC – Sequences of Query Execution . . . . . . . . . . . . . . . 12

Technische Universität Dresden: Qucosa

Flow-Loss: Learning Cardinality Estimates That Matter

Author: Alizadeh Mohammad
Kipf Andreas
Kraska Tim
Mao Hongzi
Marcus Ryan
Negi Parimarjan
Tatbul Nesime
Publication venue
Publication date: 13/01/2021
Field of study

Previous approaches to learned cardinality estimation have focused on improving average estimation error, but not all estimates matter equally. Since learned models inevitably make mistakes, the goal should be to improve the estimates that make the biggest difference to an optimizer. We introduce a new loss function, Flow-Loss, that explicitly optimizes for better query plans by approximating the optimizer's cost model and dynamic programming search algorithm with analytical functions. At the heart of Flow-Loss is a reduction of query optimization to a flow routing problem on a certain plan graph in which paths correspond to different query plans. To evaluate our approach, we introduce the Cardinality Estimation Benchmark, which contains the ground truth cardinalities for sub-plans of over 16K queries from 21 templates with up to 15 joins. We show that across different architectures and databases, a model trained with Flow-Loss improves the cost of plans (using the PostgreSQL cost model) and query runtimes despite having worse estimation accuracy than a model trained with Q-Error. When the test set queries closely match the training queries, both models improve performance significantly over PostgreSQL and are close to the optimal performance (using true cardinalities). However, the Q-Error trained model degrades significantly when evaluated on queries that are slightly different (e.g., similar but not identical query templates), while the Flow-Loss trained model generalizes better to such situations. For example, the Flow-Loss model achieves up to 1.5x better runtimes on unseen templates compared to the Q-Error model, despite leveraging the same model architecture and training data

arXiv.org e-Print Archive

DSpace@MIT