2,886 research outputs found

    VerdictDB: Universalizing Approximate Query Processing

    Full text link
    Despite 25 years of research in academia, approximate query processing (AQP) has had little industrial adoption. One of the major causes of this slow adoption is the reluctance of traditional vendors to make radical changes to their legacy codebases, and the preoccupation of newer vendors (e.g., SQL-on-Hadoop products) with implementing standard features. Additionally, the few AQP engines that are available are each tied to a specific platform and require users to completely abandon their existing databases---an unrealistic expectation given the infancy of the AQP technology. Therefore, we argue that a universal solution is needed: a database-agnostic approximation engine that will widen the reach of this emerging technology across various platforms. Our proposal, called VerdictDB, uses a middleware architecture that requires no changes to the backend database, and thus, can work with all off-the-shelf engines. Operating at the driver-level, VerdictDB intercepts analytical queries issued to the database and rewrites them into another query that, if executed by any standard relational engine, will yield sufficient information for computing an approximate answer. VerdictDB uses the returned result set to compute an approximate answer and error estimates, which are then passed on to the user or application. However, lack of access to the query execution layer introduces significant challenges in terms of generality, correctness, and efficiency. This paper shows how VerdictDB overcomes these challenges and delivers up to 171×\times speedup (18.45×\times on average) for a variety of existing engines, such as Impala, Spark SQL, and Amazon Redshift, while incurring less than 2.6% relative error. VerdictDB is open-sourced under Apache License.Comment: Extended technical report of the paper that appeared in Proceedings of the 2018 International Conference on Management of Data, pp. 1461-1476. ACM, 201

    Novel Selectivity Estimation Strategy for Modern DBMS

    Full text link
    Selectivity estimation is important in query optimization, however accurate estimation is difficult when predicates are complex. Instead of existing database synopses and statistics not helpful for such cases, we introduce a new approach to compute the exact selectivity by running an aggregate query during the optimization phase. Exact selectivity can be achieved without significant overhead for in-memory and GPU-accelerated databases by adding extra query execution calls. We implement a selection push-down extension based on the novel selectivity estimation strategy in the MapD database system. Our approach records constant and less than 30 millisecond overheads in any circumstances while running on GPU. The novel strategy successfully generates better query execution plans which result in performance improvement up to 4.8 times from TPC-H benchmark SF-50 queries and 7.3 times from star schema benchmark SF-80 queries

    Robust Query Optimization for Analytical Database Systems

    Get PDF
    Querying and efficiently analyzing complex data is required to gain valuable business insights, to support machine learning applications, and to make up-to-date information available. Therefore, this thesis investigates opportunities and challenges of selecting the most efficient execution strategy for analytical queries. These challenges include hard-to-capture data characteristics such as skew and correlation, the support of arbitrary data types, and the optimization time overhead of complex queries. Existing approaches often rely on optimistic assumptions about the data distribution, which can result in significant response time delays when these assumptions are not met. On the contrary, we focus on robust query optimization, emphasizing consistent query performance and applicability. Our presentation follows the general select-project-join query pattern, representing the fundamental stages of analytical query processing. To support arbitrary data types and complex filter expressions in the select stage, a novel sampling-based selectivity estimator is developed. Our approach exploits information from filter subexpressions and estimates correlations that are not captured by existing sampling-based methods. We demonstrate improved estimation accuracy and query execution time. Further, to minimize the runtime overhead of sampling, we propose new techniques that exploit access patterns and auxiliary database objects such as indices. For the join stage, we introduce a robust optimization approach by developing an upper-bound join enumeration strategy that connects accurate filter selectivity estimates –e.g., using our sampling-based approach– to join ordering. We demonstrate that join orders based on our upper-bound join ordering strategy achieve more consistent performance and faster workload execution on state-of-the-art database systems. However, besides identifying good logical join orders, it is crucial to determine appropriate physical join operators before query plan execution. To understand the importance of fine-grained physical operator selections, we exhaustively execute fixed join orders with all possible operator combinations. This analysis reveals that none of the investigated query optimizers fully reaches the potential of optimal operator decisions. Based on these insights and to achieve fine-grained operator selections for the previously determined join orders, the thesis presents a lightweight learning-based physical execution plan refinement component called. We show that this refinement component consistently outperforms existing approaches for physical operator selection while enabling a novel two-stage optimizer design. We conclude the thesis by providing a framework for the two-stage optimizer design that allows users to modify, replicate, and further analyze the concepts discussed throughout this thesis.:1 INTRODUCTION 1.1 Analytical Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 12 1.2 Select-Project-Join Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.3 Basics of SPJ Query Optimization . . . . . . . . . . . . . . . . . . . . . . . 14 1.3.1 Plan Enumeration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.3.2 Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.3.3 Cardinality Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.4 Robust SPJ Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . 16 1.4.1 Tail Latency Root Cause Analysis . . . . . . . . . . . . . . . . . . . 17 1.4.2 Tenets of Robust Query Optimization . . . . . . . . . . . . . . . . . 19 1.5 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.6 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2 SELECT (-PROJECT) STAGE 2.1 Sampling for Selectivity Estimation . . . . . . . . . . . . . . . . . . . . . . 24 2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.2.1 Combined Selectivity Estimation (CSE) . . . . . . . . . . . . . . . . 29 2.2.2 Kernel Density Estimator . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.2.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.3 Beta Estimator for 0-Tuple-Situations . . . . . . . . . . . . . . . . . . . . . 33 2.3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.3.2 Beta Distribution in Non-0-TS . . . . . . . . . . . . . . . . . . . . . . 35 2.3.3 Parameter Estimation in 0-TS . . . . . . . . . . . . . . . . . . . . . . 37 2.3.4 Selectivity Estimation and Predicate Ordering . . . . . . . . . . . 39 2.3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 2.4 Customized Sampling Techniques . . . . . . . . . . . . . . . . . . . . . . 53 2.4.1 Focused Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 2.4.2 Conditional Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 56 2.4.3 Zone Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 2.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3 JOIN STAGE: LOGICAL ENUMERATION 3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.1.1 Point Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 3.1.2 Join Cardinality Upper Bound . . . . . . . . . . . . . . . . . . . . . 64 3.2 Upper Bound Join Enumeration with Synopsis (UES) . . . . . . . . . . . . 66 3.2.1 U-Block: Simple Upper Bound for Joins . . . . . . . . . . . . . . . . 67 3.2.2 E-Block: Customized Enumeration Scheme . . . . . . . . . . . . . 68 3.2.3 UES Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.3.1 General Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 72 3.3.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4 JOIN STAGE: PHYSICAL OPERATOR SELECTION 4.1 Operator Selection vs Join Ordering . . . . . . . . . . . . . . . . . . . . . 77 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.2.1 Adaptive Query Processing . . . . . . . . . . . . . . . . . . . . . . 80 4.2.2 Bandit Optimizer (Bao) . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.3 TONIC: Learned Physical Join Operator Selection . . . . . . . . . . . . . 82 4.3.1 Query Execution Plan Synopsis (QEP-S) . . . . . . . . . . . . . . . 83 4.3.2 QEP-S Life-Cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 4.3.3 QEP-S Design Considerations . . . . . . . . . . . . . . . . . . . . . . 87 4.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.4.1 Performance Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 4.4.2 Rate of Improvement . . . . . . . . . . . . . . . . . . . . . . . . . . 92 4.4.3 Data Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 4.4.4 TONIC - Runtime Traits . . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5 TWO-STAGE OPTIMIZER FRAMEWORK 5.1 Upper-Bound-Driven Join Ordering Component . . . . . . . . . . . . . 101 5.2 Physical Operator Selection Component . . . . . . . . . . . . . . . . . . 103 5.3 Example Query Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 103 6 CONCLUSION 107 BIBLIOGRAPHY 109 LIST OF FIGURES 117 LIST OF TABLES 121 A APPENDIX A.1 Basics of Query Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 A.2 Why Q? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 A.3 0-TS Proof of Unbiased Estimate . . . . . . . . . . . . . . . . . . . . . . . . 125 A.4 UES Upper Bound Property . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 A.5 TONIC – Selectivity-Aware Branching . . . . . . . . . . . . . . . . . . . . . 128 A.6 TONIC – Sequences of Query Execution . . . . . . . . . . . . . . . . . . . 12

    Robust Query Optimization for Analytical Database Systems

    Get PDF
    Querying and efficiently analyzing complex data is required to gain valuable business insights, to support machine learning applications, and to make up-to-date information available. Therefore, this thesis investigates opportunities and challenges of selecting the most efficient execution strategy for analytical queries. These challenges include hard-to-capture data characteristics such as skew and correlation, the support of arbitrary data types, and the optimization time overhead of complex queries. Existing approaches often rely on optimistic assumptions about the data distribution, which can result in significant response time delays when these assumptions are not met. On the contrary, we focus on robust query optimization, emphasizing consistent query performance and applicability. Our presentation follows the general select-project-join query pattern, representing the fundamental stages of analytical query processing. To support arbitrary data types and complex filter expressions in the select stage, a novel sampling-based selectivity estimator is developed. Our approach exploits information from filter subexpressions and estimates correlations that are not captured by existing sampling-based methods. We demonstrate improved estimation accuracy and query execution time. Further, to minimize the runtime overhead of sampling, we propose new techniques that exploit access patterns and auxiliary database objects such as indices. For the join stage, we introduce a robust optimization approach by developing an upper-bound join enumeration strategy that connects accurate filter selectivity estimates –e.g., using our sampling-based approach– to join ordering. We demonstrate that join orders based on our upper-bound join ordering strategy achieve more consistent performance and faster workload execution on state-of-the-art database systems. However, besides identifying good logical join orders, it is crucial to determine appropriate physical join operators before query plan execution. To understand the importance of fine-grained physical operator selections, we exhaustively execute fixed join orders with all possible operator combinations. This analysis reveals that none of the investigated query optimizers fully reaches the potential of optimal operator decisions. Based on these insights and to achieve fine-grained operator selections for the previously determined join orders, the thesis presents a lightweight learning-based physical execution plan refinement component called. We show that this refinement component consistently outperforms existing approaches for physical operator selection while enabling a novel two-stage optimizer design. We conclude the thesis by providing a framework for the two-stage optimizer design that allows users to modify, replicate, and further analyze the concepts discussed throughout this thesis.:1 INTRODUCTION 1.1 Analytical Query Processing . . . . . . . . . . . . . . . . . . . 12 1.2 Select-Project-Join Queries . . . . . . . . . . . . . . . . . . . 13 1.3 Basics of SPJ Query Optimization . . . . . . . . . . . . . . . . . 14 1.3.1 Plan Enumeration . . . . . . . . . . . . . . . . . . . . . . . . 14 1.3.2 Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1.3.3 Cardinality Estimation . . . . . . . . . . . . . . . . . . . . . 15 1.4 Robust SPJ Query Optimization . . . . . . . . . . . . . . . . . . 16 1.4.1 Tail Latency Root Cause Analysis . . . . . . . . . . . . . . . . 17 1.4.2 Tenets of Robust Query Optimization . . . . . . . . . . . . . . 19 1.5 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.6 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2 SELECT (-PROJECT) STAGE 2.1 Sampling for Selectivity Estimation . . . . . . . . . . . . . . . 24 2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.2.1 Combined Selectivity Estimation (CSE) . . . . . . . . . . . . . 29 2.2.2 Kernel Density Estimator . . . . . . . . . . . . . . . . . . . . 31 2.2.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . 32 2.3 Beta Estimator for 0-Tuple-Situations . . . . . . . . . . . . . . 33 2.3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.3.2 Beta Distribution in Non-0-TS . . . . . . . . . . . . . . . . . 35 2.3.3 Parameter Estimation in 0-TS . . . . . . . . . . . . . . . . . . 37 2.3.4 Selectivity Estimation and Predicate Ordering . . . . . . . . . 39 2.3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 2.4 Customized Sampling Techniques . . . . . . . . . . . . . . . . . . 53 2.4.1 Focused Sampling . . . . . . . . . . . . . . . . . . . . . . . . 54 2.4.2 Conditional Sampling . . . . . . . . . . . . . . . . . . . . . . 56 2.4.3 Zone Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . 58 2.4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3 JOIN STAGE: LOGICAL ENUMERATION 3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.1.1 Point Estimates . . . . . . . . . . . . . . . . . . . . . . . . 63 3.1.2 Join Cardinality Upper Bound . . . . . . . . . . . . . . . . . . 64 3.2 Upper Bound Join Enumeration with Synopsis (UES) . . . . . . . . . 66 3.2.1 U-Block: Simple Upper Bound for Joins . . . . . . . . . . . . . 67 3.2.2 E-Block: Customized Enumeration Scheme . . . . . . . . . . . . . 68 3.2.3 UES Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.3.1 General Performance . . . . . . . . . . . . . . . . . . . . . . 72 3.3.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4 JOIN STAGE: PHYSICAL OPERATOR SELECTION 4.1 Operator Selection vs Join Ordering . . . . . . . . . . . . . . . 77 4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 4.2.1 Adaptive Query Processing . . . . . . . . . . . . . . . . . . . 80 4.2.2 Bandit Optimizer (Bao) . . . . . . . . . . . . . . . . . . . . . 81 4.3 TONIC: Learned Physical Join Operator Selection . . . . . . . . . 82 4.3.1 Query Execution Plan Synopsis (QEP-S) . . . . . . . . . . . . . 83 4.3.2 QEP-S Life-Cycle . . . . . . . . . . . . . . . . . . . . . . . . 84 4.3.3 QEP-S Design Considerations . . . . . . . . . . . . . . . . . . 87 4.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 4.4.1 Performance Factors . . . . . . . . . . . . . . . . . . . . . . 90 4.4.2 Rate of Improvement . . . . . . . . . . . . . . . . . . . . . . 92 4.4.3 Data Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 4.4.4 TONIC - Runtime Traits . . . . . . . . . . . . . . . . . . . . . 97 4.4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5 TWO-STAGE OPTIMIZER FRAMEWORK 5.1 Upper-Bound-Driven Join Ordering Component . . . . . . . . . . . . 101 5.2 Physical Operator Selection Component . . . . . . . . . . . . . . 103 5.3 Example Query Optimization . . . . . . . . . . . . . . . . . . . . 103 6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 A APPENDIX A.1 Basics of Query Execution . . . . . . . . . . . . . . . . . . . . 123 A.2 Why Q? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 A.3 0-TS Proof of Unbiased Estimate . . . . . . . . . . . . . . . . . 125 A.4 UES Upper Bound Property . . . . . . . . . . . . . . . . . . . . . 127 A.5 TONIC – Selectivity-Aware Branching . . . . . . . . . . . . . . . 128 A.6 TONIC – Sequences of Query Execution . . . . . . . . . . . . . . . 12

    On the selection of secondary indices in relational databases

    Get PDF
    An important problem in the physical design of databases is the selection of secondary indices. In general, this problem cannot be solved in an optimal way due to the complexity of the selection process. Often use is made of heuristics such as the well-known ADD and DROP algorithms. In this paper it will be shown that frequently used cost functions can be classified as super- or submodular functions. For these functions several mathematical properties have been derived which reduce the complexity of the index selection problem. These properties will be used to develop a tool for physical database design and also give a mathematical foundation for the success of the before-mentioned ADD and DROP algorithms
    • …
    corecore