    Extending Event Sequence Processing: New Models and Optimization Techniques

    Many modern applications, including online financial feeds, tag-based mass transit systems, and RFID-based supply chain management systems, transmit real-time data streams. Event stream processing technology is needed to analyze this vast amount of sequential data and to enable online operational decision making. This dissertation develops several techniques at the core of a scalable E-Analytic system to achieve efficient, scalable, and robust in-memory multi-dimensional nested pattern analysis over high-speed event streams. First, I address the problem of processing flat pattern queries on event streams with out-of-order data arrival. I design two alternative solutions: an aggressive strategy and a conservative strategy. The aggressive strategy produces maximal output under the optimistic assumption that out-of-order event arrival is rare. The conservative strategy works under the assumption that out-of-order data may be common, and thus produces output only when its correctness can be guaranteed. Second, I integrate CEP and OLAP techniques (the E-Cube model) for efficient multi-dimensional event pattern analysis at different abstraction levels. Strategies for drill-down (refinement from abstract to specific patterns) and roll-up (generalization from specific to abstract patterns) are developed for efficient workload evaluation. I design a cost-driven adaptive optimizer called Chase that exploits reuse strategies for optimal E-Cube hierarchy execution. Third, I explore novel optimization techniques to support the high-performance processing of powerful nested CEP patterns. A CEP query language called NEEL is designed to express nested CEP pattern queries composed of sequence, negation, AND, and OR operators. To allow flexible execution ordering, I devise a normalization procedure that employs rewriting rules for flattening a nested complex event expression.
    To reduce CPU and memory consumption, I propose several strategies for efficient shared processing of groups of normalized NEEL subexpressions. Comprehensive experimental studies, using both synthetic and real data streams, demonstrate that the proposed strategies outperform alternative methods from the literature in both effectiveness and efficiency.
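
    The conservative idea in the abstract above (emit only what can no longer be invalidated by a late event) can be illustrated with a slack buffer. This is a minimal sketch and not the dissertation's algorithm: it assumes a known bound `slack` on how late an event may arrive, and the names `SlackBuffer` and `reorder` are hypothetical.

```python
import heapq


class SlackBuffer:
    """Hold events until no earlier event can still arrive.

    Assumption (not from the source): an event is never more than
    `slack` timestamp units older than the newest event seen so far.
    """

    def __init__(self, slack):
        self.slack = slack
        self.heap = []                    # min-heap of (timestamp, payload)
        self.max_seen = float("-inf")     # newest timestamp observed

    def push(self, ts, payload):
        """Add one event; return the events now safe to emit, in order."""
        heapq.heappush(self.heap, (ts, payload))
        self.max_seen = max(self.max_seen, ts)
        watermark = self.max_seen - self.slack
        ready = []
        while self.heap and self.heap[0][0] <= watermark:
            ready.append(heapq.heappop(self.heap))
        return ready


def reorder(events, slack):
    """Feed a possibly out-of-order stream through the buffer."""
    buf = SlackBuffer(slack)
    out = []
    for ts, payload in events:
        out.extend(buf.push(ts, payload))
    out.extend(sorted(buf.heap))          # end of stream: flush the rest
    return out
```

    An aggressive variant would instead emit matches immediately and compensate when a late event arrives; that requires retraction logic and is not shown here.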

    Robust Query Optimization for Analytical Database Systems

    Querying and efficiently analyzing complex data is required to gain valuable business insights, to support machine learning applications, and to make up-to-date information available. Therefore, this thesis investigates opportunities and challenges of selecting the most efficient execution strategy for analytical queries. These challenges include hard-to-capture data characteristics such as skew and correlation, the support of arbitrary data types, and the optimization time overhead of complex queries. Existing approaches often rely on optimistic assumptions about the data distribution, which can result in significant response time delays when these assumptions are not met. In contrast, we focus on robust query optimization, emphasizing consistent query performance and applicability. Our presentation follows the general select-project-join query pattern, representing the fundamental stages of analytical query processing. To support arbitrary data types and complex filter expressions in the select stage, a novel sampling-based selectivity estimator is developed. Our approach exploits information from filter subexpressions and estimates correlations that are not captured by existing sampling-based methods. We demonstrate improved estimation accuracy and query execution time. Further, to minimize the runtime overhead of sampling, we propose new techniques that exploit access patterns and auxiliary database objects such as indices. For the join stage, we introduce a robust optimization approach by developing an upper-bound join enumeration strategy that connects accurate filter selectivity estimates (e.g., obtained with our sampling-based approach) to join ordering. We demonstrate that join orders based on our upper-bound join ordering strategy achieve more consistent performance and faster workload execution on state-of-the-art database systems.
    However, besides identifying good logical join orders, it is crucial to determine appropriate physical join operators before query plan execution. To understand the importance of fine-grained physical operator selections, we exhaustively execute fixed join orders with all possible operator combinations. This analysis reveals that none of the investigated query optimizers fully reaches the potential of optimal operator decisions. Based on these insights and to achieve fine-grained operator selections for the previously determined join orders, the thesis presents a lightweight learning-based physical execution plan refinement component called TONIC. We show that this refinement component consistently outperforms existing approaches for physical operator selection while enabling a novel two-stage optimizer design. We conclude the thesis by providing a framework for the two-stage optimizer design that allows users to modify, replicate, and further analyze the concepts discussed throughout this thesis.

    Contents:
    1 INTRODUCTION
      1.1 Analytical Query Processing
      1.2 Select-Project-Join Queries
      1.3 Basics of SPJ Query Optimization
        1.3.1 Plan Enumeration
        1.3.2 Cost Model
        1.3.3 Cardinality Estimation
      1.4 Robust SPJ Query Optimization
        1.4.1 Tail Latency Root Cause Analysis
        1.4.2 Tenets of Robust Query Optimization
      1.5 Contribution
      1.6 Outline
    2 SELECT (-PROJECT) STAGE
      2.1 Sampling for Selectivity Estimation
      2.2 Related Work
        2.2.1 Combined Selectivity Estimation (CSE)
        2.2.2 Kernel Density Estimator
        2.2.3 Machine Learning
      2.3 Beta Estimator for 0-Tuple-Situations
        2.3.1 Methodology
        2.3.2 Beta Distribution in Non-0-TS
        2.3.3 Parameter Estimation in 0-TS
        2.3.4 Selectivity Estimation and Predicate Ordering
        2.3.5 Evaluation
      2.4 Customized Sampling Techniques
        2.4.1 Focused Sampling
        2.4.2 Conditional Sampling
        2.4.3 Zone Pruning
        2.4.4 Discussion
      2.5 Summary
    3 JOIN STAGE: LOGICAL ENUMERATION
      3.1 Related Work
        3.1.1 Point Estimates
        3.1.2 Join Cardinality Upper Bound
      3.2 Upper Bound Join Enumeration with Synopsis (UES)
        3.2.1 U-Block: Simple Upper Bound for Joins
        3.2.2 E-Block: Customized Enumeration Scheme
        3.2.3 UES Algorithm
      3.3 Evaluation
        3.3.1 General Performance
        3.3.2 Discussion
      3.4 Summary
    4 JOIN STAGE: PHYSICAL OPERATOR SELECTION
      4.1 Operator Selection vs Join Ordering
      4.2 Related Work
        4.2.1 Adaptive Query Processing
        4.2.2 Bandit Optimizer (Bao)
      4.3 TONIC: Learned Physical Join Operator Selection
        4.3.1 Query Execution Plan Synopsis (QEP-S)
        4.3.2 QEP-S Life-Cycle
        4.3.3 QEP-S Design Considerations
      4.4 Evaluation
        4.4.1 Performance Factors
        4.4.2 Rate of Improvement
        4.4.3 Data Shift
        4.4.4 TONIC - Runtime Traits
        4.4.5 Discussion
      4.5 Summary
    5 TWO-STAGE OPTIMIZER FRAMEWORK
      5.1 Upper-Bound-Driven Join Ordering Component
      5.2 Physical Operator Selection Component
      5.3 Example Query Optimization
    6 CONCLUSION
    BIBLIOGRAPHY
    LIST OF FIGURES
    LIST OF TABLES
    A APPENDIX
      A.1 Basics of Query Execution
      A.2 Why Q?
      A.3 0-TS Proof of Unbiased Estimate
      A.4 UES Upper Bound Property
      A.5 TONIC – Selectivity-Aware Branching
      A.6 TONIC – Sequences of Query Execution

    An Efficient Query Optimizer with Materialized Intermediate Views in Distributed and Cloud Environment

    In a cloud computing environment, the hardware resources required to execute a query on a distributed relational database system are scaled up or down according to query workload performance. Complex queries require a large amount of resources to complete efficiently. These resource requirements can be reduced by minimizing query execution time, which maximizes resource utilization and lowers the cost paid by customers. Complex queries and batch queries contain some common subexpressions. If these common subexpressions are evaluated once and their results are cached, the results can be reused when executing later queries. In this research, we propose a query optimization algorithm that stores the intermediate results of queries and uses these by-products to execute future queries. Extensive experiments with a simulation model were carried out to test the algorithm's efficiency.
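
    The reuse of cached intermediate results described above can be sketched as follows. This is an illustrative toy evaluator, not the paper's algorithm: queries are assumed to be nested tuples, tables are lists of dicts, and the names `evaluate`, `cache`, and `stats` are hypothetical.

```python
def evaluate(expr, tables, cache, stats):
    """Evaluate a nested-tuple query, reusing cached intermediate results.

    Assumed (hypothetical) expression forms:
      "table_name"                         -> base table
      ("filter", child, column, value)     -> rows of child where column == value
      ("join", left, right, key)           -> equi-join of left and right on key
    """
    if isinstance(expr, str):              # leaf: base table, never cached
        return tables[expr]
    if expr in cache:                      # common subexpression seen before
        stats["hits"] += 1
        return cache[expr]
    op = expr[0]
    if op == "filter":
        _, child, column, value = expr
        rows = [r for r in evaluate(child, tables, cache, stats)
                if r[column] == value]
    elif op == "join":
        _, left, right, key = expr
        right_rows = evaluate(right, tables, cache, stats)
        rows = [{**l, **r}
                for l in evaluate(left, tables, cache, stats)
                for r in right_rows if l[key] == r[key]]
    else:
        raise ValueError(f"unknown operator: {op}")
    cache[expr] = rows                     # materialize the intermediate result
    return rows


tables = {
    "orders": [
        {"oid": 1, "cid": 1, "status": "open"},
        {"oid": 2, "cid": 2, "status": "closed"},
        {"oid": 3, "cid": 2, "status": "open"},
    ],
    "customers": [{"cid": 1, "name": "Ann"}, {"cid": 2, "name": "Bob"}],
}
shared = ("filter", "orders", "status", "open")    # appears in both queries
q1 = ("join", shared, "customers", "cid")
q2 = ("filter", shared, "cid", 1)

cache, stats = {}, {"hits": 0}
r1 = evaluate(q1, tables, cache, stats)
r2 = evaluate(q2, tables, cache, stats)    # reuses the cached result of `shared`
```

    A real system would also need an eviction policy and invalidation when base tables change; both are omitted in this sketch.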

    An early look at the LDBC Social Network Benchmark's Business Intelligence workload

    In this short paper, we provide an early look at the LDBC Social Network Benchmark's Business Intelligence (BI) workload, which tests graph data management systems on a graph business analytics workload. Its queries involve complex aggregations and navigations (joins) that touch large data volumes, which is typical in BI workloads, yet they depend heavily on graph functionality such as connectivity tests and path finding. We outline the motivation for this new benchmark, which we derived from many interactions with the graph database industry and its users, and situate it in a scenario of social network analysis. The workload was designed by taking into account technical "chokepoints" identified by database system architects from academia and industry, which we also describe and map to the queries. We present reference implementations in openCypher, PGQL, SPARQL, and SQL, and preliminary results of SNB BI on a number of graph data management systems.

    Just-in-time Data Distribution for Analytical Query Processing

    Distributed processing commonly requires data to be spread across machines using a priori static or hash-based data allocation. In this paper, we explore an alternative approach that starts from a master node in control of the complete database and a variable number of worker nodes for delegated query processing. Data is shipped just-in-time to the worker nodes using a need-to-know policy, and is reused, if possible, in subsequent queries. A bidding mechanism among the workers yields a schedule with the most efficient reuse of previously shipped data, minimizing the data transfer costs. Just-in-time data shipment allows our system to benefit from locally available idle resources to boost overall performance. The system is maintenance-free and allocation is fully transparent to users. Our experiments show that the proposed adaptive distributed architecture is a viable and flexible alternative for small-scale MapReduce-type settings.
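
    The bidding idea above can be illustrated with a toy scheduler. This is a sketch under simplifying assumptions, not the paper's mechanism: a bid is assumed to be just the number of data fragments a worker would still need to receive, ties go to the worker holding less data, and the names `bid` and `schedule` are hypothetical.

```python
def bid(worker_cache, needed):
    """Bid = number of needed fragments the worker does not hold yet."""
    return len(needed - worker_cache)


def schedule(queries, worker_names):
    """Assign each query to the worker with the lowest bid.

    queries: list of (query_id, set_of_fragment_names)
    Returns a list of (query_id, winner, fragments_shipped).
    """
    caches = {w: set() for w in worker_names}   # fragments shipped so far
    plan = []
    for query_id, needed in queries:
        # Lowest transfer volume wins; ties favor the emptier worker.
        winner = min(worker_names,
                     key=lambda w: (bid(caches[w], needed), len(caches[w])))
        shipped = needed - caches[winner]       # ship only what is missing
        caches[winner] |= shipped
        plan.append((query_id, winner, sorted(shipped)))
    return plan


plan = schedule(
    [("q1", {"a", "b"}), ("q2", {"b", "c"}), ("q3", {"x"})],
    ["w1", "w2"],
)
```

    In this run, q2 goes to the worker that already holds fragment b, so only c is shipped. A realistic bid would presumably also account for worker load and fragment sizes.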

    Mining Query Plans for Finding Candidate Queries and Sub-Queries for Materialized Views in BI Systems Without Cube Generation

    Materialized views are important for optimizing Business Intelligence (BI) systems that are designed without data cubes. Selecting candidate queries for materialized views from a large number of queries is a challenging task. Most past work finds frequent queries in the past workload and creates materialized views from them, either by manually analyzing the workload or by applying approximate string matching algorithms to the query text. Most existing methods suggest complete queries but ignore query components such as sub-queries when creating materialized views. This paper presents a novel method that determines on which queries and query components materialized views can be created to optimize aggregate and join queries, by mining a database of query execution plans, which take the form of binary trees. The proposed algorithm optimized significantly more queries because it selects queries based on their execution plan trees rather than on the query text used by traditional methods. To select a correct set of queries to be optimized using materialized views, the paper proposes an efficient, specialized frequent tree component mining algorithm with novel heuristics to prune the search space. These frequent components are used to determine the possible set of candidate queries for the creation of materialized views. Experiments on standard, real, and synthetic data sets, together with the theoretical basis, showed that the proposed method optimizes a large number of queries with fewer materialized views and significantly improves performance compared to traditional methods.