75 research outputs found

    JoinGym: An Efficient Query Optimization Environment for Reinforcement Learning

    In this paper, we present JoinGym, an efficient and lightweight query optimization environment for reinforcement learning (RL). Join order selection (JOS) is a classic NP-hard combinatorial optimization problem from database query optimization and can serve as a practical testbed for the generalization capabilities of RL algorithms. We describe how to formulate each of the left-deep and bushy variants of the JOS problem as a Markov Decision Process (MDP), and we provide an implementation adhering to the standard Gymnasium API. We highlight that our implementation, JoinGym, is based entirely on offline traces of all possible joins, which enables RL practitioners to test their methods easily and quickly on a realistic data management problem without needing to set up any systems. Moreover, we also provide all possible join traces on 3300 novel SQL queries generated from the IMDB dataset. Upon benchmarking popular RL algorithms, we find that at least one method can obtain near-optimal performance on train-set queries, but its performance degrades by several orders of magnitude on test-set queries. This gap motivates further research into RL algorithms that generalize well in multi-task combinatorial optimization problems. Comment: We will make all the queries available soo

    Constructing Optimal Bushy Processing Trees for Join Queries is NP-hard

    We show that constructing optimal bushy processing trees for join queries is NP-hard. More specifically, we show that even the construction of optimal bushy trees for computing the cross product of a set of relations is NP-hard

    Dynamic programming strikes back

    Two highly efficient algorithms are known for optimally ordering joins while avoiding cross products: DPccp, which is based on dynamic programming, and Top-Down Partition Search, which is based on memoization. Both have two severe limitations: they handle only (1) simple (binary) join predicates and (2) inner joins. However, real queries may contain complex join predicates involving more than two relations, as well as outer joins and other non-inner joins. Taking the most efficient known join-ordering algorithm, DPccp, as a starting point, we first develop a new algorithm, DPhyp, which is capable of handling complex join predicates efficiently. We do so by modeling the query graph as a (variant of a) hypergraph and then reasoning about its connected subgraphs. Then, we present a technique that exploits this capability to efficiently handle the widest class of non-inner joins dealt with so far. Our experimental results show that this reformulation of non-inner joins as complex predicates can improve optimization time by orders of magnitude compared to known algorithms dealing with complex join predicates and non-inner joins. Once again, this gives dynamic programming a distinct advantage over current memoization techniques
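
To make the idea of dynamic programming over connected subgraphs concrete, here is a minimal sketch of a naive subset-based join-ordering DP that avoids cross products. It is not DPccp or DPhyp: it enumerates all subsets rather than only connected-subgraph/complement pairs, and it handles only binary inner-join predicates. The function name `best_join_order`, the input format, and the cost model (sum of intermediate result sizes) are assumptions made for illustration.

```python
from itertools import combinations


def best_join_order(cardinalities, selectivity):
    """Naive DP over relation subsets; bushy plans, no cross products.

    cardinalities: dict mapping relation name -> row count
    selectivity:   dict mapping frozenset({a, b}) -> join selectivity
                   (one entry per binary join predicate; hypothetical format)
    Returns (cost, cardinality, plan) or None if the query graph is disconnected.
    Assumed cost model: sum of the sizes of all intermediate results.
    """
    rels = sorted(cardinalities)
    # best[S] holds the cheapest plan found for each connected subset S.
    best = {frozenset([r]): (0.0, float(cardinalities[r]), r) for r in rels}

    def crossing_selectivity(left, right):
        # Product of selectivities of predicates linking left and right;
        # None if no predicate connects them (that would be a cross product).
        sel = None
        for edge, s in selectivity.items():
            a, b = tuple(edge)
            if (a in left and b in right) or (a in right and b in left):
                sel = s if sel is None else sel * s
        return sel

    for size in range(2, len(rels) + 1):
        for subset in combinations(rels, size):
            whole = frozenset(subset)
            first, rest = subset[0], subset[1:]
            # Pin the first relation to the left side so each split is seen once.
            for k in range(size - 1):
                for extra in combinations(rest, k):
                    left = frozenset((first,) + extra)
                    right = whole - left
                    if left not in best or right not in best:
                        continue
                    sel = crossing_selectivity(left, right)
                    if sel is None:
                        continue
                    lcost, lcard, lplan = best[left]
                    rcost, rcard, rplan = best[right]
                    card = lcard * rcard * sel
                    cost = lcost + rcost + card
                    if whole not in best or cost < best[whole][0]:
                        best[whole] = (cost, card, (lplan, rplan))

    return best.get(frozenset(rels))


# Chain query A - B - C with made-up statistics.
cards = {"A": 10, "B": 100, "C": 1000}
sels = {frozenset({"A", "B"}): 0.1, frozenset({"B", "C"}): 0.01}
result = best_join_order(cards, sels)
```

For this chain query the sketch joins A with B first, because that intermediate result is smaller than the result of joining B with C. The subset {A, C} is never considered as a building block, since no predicate connects those two relations.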

    Multi-Mode Stream Processing For Hopping Window Queries

    Window constraints are mechanisms to bound the tuples processed by continuous queries specified over unbounded data streams. While sliding window queries move the constraint window upon the arrival of each individual tuple, hopping window queries instead move the window by a fixed amount after some period, thus periodically refreshing their results. We observe that for large hops, techniques like delta result updating may not be efficient, as large portions of the tuples in the current window will differ from the previous window and thus must be maintained. On the other hand, the complete result updating technique, which computes the next result based on the complete current window and has been found to be less suitable for sliding window queries, can now be shown to be superior in performance for some hopping window queries. A trade-off emerges between the complete result method, which has a lower per-tuple processing cost but potentially processes redundant results, and the delta result method, which has no redundant processing but pays a higher per-tuple processing cost. On top of that, strict non-monotonic operators, such as the difference operator, cause premature expiration due to operator semantics. Negative tuples are needed for this kind of special expiration, and such negative tuples add extra burden to the stream engine. Thus, in stream processing, it is typically suggested that the difference operator be placed on top of the query plan despite its potential ability to reduce the cardinality of the stream. With this thesis, we introduce a complete solution for hopping window query processing, which includes an optimizer for generalized hopping window query optimization that exploits both processing techniques within one integrated query plan, along with query plan rewriting.
First, we design the query operators to be multi-mode, that is, able to take either a delta or a complete result as input and produce either a delta or a complete result as output. Second, we design a cost model to choose the optimal mode for each operator. Third, our optimizer aims to configure each operator within a query plan to work in the suitable mode so as to achieve minimum overall processing cost. Last but not least, two query optimization techniques have been adopted. One explores all possibilities of pushing the difference operator down past joins using dynamic programming while assigning the optimal mode at the same time; the other applies a heuristic difference push-down rule. The proposed techniques have been implemented within the WPI stream query engine, called CAPE. Finally, we show the benefit of our solution with a vast number of experimental results

    Dynamic Optimization and Migration of Continuous Queries Over Data Streams

    Continuous queries process real-time streaming data and output results in streams for a wide range of applications. Due to fluctuating stream characteristics, a streaming database system needs to dynamically adapt query execution. This dissertation proposes novel solutions to continuous query adaptation in three core areas, namely dynamic query optimization, dynamic plan migration and partitioned query adaptation. Runtime query optimization needs to efficiently generate plans that satisfy both CPU and memory resource constraints. Existing work focuses on minimizing intermediate query results, which decreases memory and CPU usage simultaneously. However, doing so cannot assure that both resource constraints are satisfied, because memory and CPU can be either positively or negatively correlated. This part of the dissertation proposes efficient optimization strategies that utilize both types of correlation to search the entire query plan space in polynomial time, where a typical exhaustive search would take at least exponential time. Extensive experimental evaluations have demonstrated the effectiveness of the proposed strategies. Dynamic plan migration is concerned with the on-the-fly transition from one continuous plan to a semantically equivalent yet more efficient plan. It is essential for guaranteeing the continuation and repeatability of dynamic query optimization. However, this research area has been largely neglected in the current literature. The second part of this dissertation proposes migration strategies that dynamically migrate continuous queries while guaranteeing the integrity of the query results, meaning there are no missing, duplicate or incorrect results. The extensive experimental evaluations show that the proposed strategies vary significantly in terms of output rate and memory usage given distinct system configurations and stream workloads.
Partitioned query processing is effective for processing continuous queries with large stateful operators in a distributed system. Dynamic load redistribution is necessary to balance uneven workloads across machines due to changing stream properties. However, existing solutions generally assume static query plans without runtime query optimization. This part of the dissertation evaluates the benefits of applying query optimization in partitioned query processing and shows a dramatic performance improvement of more than 300%. Several load balancing strategies are then proposed that take into account the heterogeneity of plan shapes across machines caused by dynamic query optimization. The effectiveness of the proposed strategies is analyzed through extensive experiments using a cluster

    Scalable Integration View Computation and Maintenance with Parallel, Adaptive and Grouping Techniques

    Materialized integration views constructed by integrating data from multiple distributed data sources help to achieve better access, reliable performance, and high availability for a wide range of applications. In this dissertation, we propose parallel, adaptive, and grouping techniques to address scalability challenges in high-performance integration view computation and maintenance due to increasingly large data sources and high rates of source updates. State-of-the-art parallel integration view computation makes the common assumption that maximal pipelined parallelism leads to superior performance. We instead propose segmented bushy parallel processing, which combines pipelined parallelism with alternate forms of parallelism to achieve an overall more effective strategy. Experimental studies conducted over a cluster of high-performance PCs confirm that the proposed strategy achieves an average improvement of 50% in total processing time in comparison to existing solutions. Run-time adaptation becomes critical for parallel integration view computation due to its long-running and memory-intensive nature. We investigate two types of state-level adaptation, namely state spill and state relocation, to address run-time memory shortage. We propose lazy-disk and active-disk approaches that integrate both adaptations to maximize run-time query throughput in a memory-constrained environment. We also propose global throughput-oriented state adaptation strategies for computation plans with multiple state-intensive operators. Extensive experiments confirm the effectiveness of our proposed adaptation solutions. Once results have been computed and materialized, it is typically more efficient to maintain them incrementally than to recompute them in full. However, state-of-the-art incremental view maintenance requires O(n^2) maintenance queries, with n being the number of data sources that the view is defined upon.
Moreover, existing approaches do not exploit view definitions and data source processing capabilities to further improve view maintenance performance. We propose novel grouping maintenance algorithms that dramatically reduce the number of maintenance queries to O(n). A cost-based view maintenance framework has been proposed to generate optimized maintenance plans tuned to particular environmental settings. Extensive experimental studies verify the effectiveness of our maintenance algorithms as well as the maintenance framework