128 research outputs found

    Beyond Worst-Case Analysis for Joins with Minesweeper

    Full text link
    We describe a new algorithm, Minesweeper, that is able to satisfy stronger runtime guarantees than previous join algorithms (colloquially, `beyond worst-case guarantees') for data in indexed search trees. Our first contribution is developing a framework to measure this stronger notion of complexity, which we call {\it certificate complexity}, that extends notions of Barbay et al. and Demaine et al.; a certificate is a set of propositional formulae that certifies that the output is correct. This notion captures a natural class of join algorithms. In addition, the certificate allows us to define a strictly stronger notion of runtime complexity than traditional worst-case guarantees. Our second contribution is to develop a dichotomy theorem for the certificate-based notion of complexity. Roughly, we show that Minesweeper evaluates β\beta-acyclic queries in time linear in the certificate plus the output size, while for any β\beta-cyclic query there is some instance that takes superlinear time in the certificate (and for which the output is no larger than the certificate size). We also extend our certificate-complexity analysis to queries with bounded treewidth and the triangle query.Comment: [This is the full version of our PODS'2014 paper.

    Box Covers and Domain Orderings for Beyond Worst-Case Join Processing

    Get PDF
    Recent beyond worst-case optimal join algorithms Minesweeper and its generalization Tetris have brought the theory of indexing and join processing together by developing a geometric framework for joins. These algorithms take as input an index B\mathcal{B}, referred to as a box cover, that stores output gaps that can be inferred from traditional indexes, such as B+ trees or tries, on the input relations. The performances of these algorithms highly depend on the certificate of B\mathcal{B}, which is the smallest subset of gaps in B\mathcal{B} whose union covers all of the gaps in the output space of a query QQ. We study how to generate box covers that contain small size certificates to guarantee efficient runtimes for these algorithms. First, given a query QQ over a set of relations of size NN and a fixed set of domain orderings for the attributes, we give a O~(N)\tilde{O}(N)-time algorithm called GAMB which generates a box cover for QQ that is guaranteed to contain the smallest size certificate across any box cover for QQ. Second, we show that finding a domain ordering to minimize the box cover size and certificate is NP-hard through a reduction from the 2 consecutive block minimization problem on boolean matrices. Our third contribution is a O~(N)\tilde{O}(N)-time approximation algorithm called ADORA to compute domain orderings, under which one can compute a box cover of size O~(Kr)\tilde{O}(K^r), where KK is the minimum box cover for QQ under any domain ordering and rr is the maximum arity of any relation. This guarantees certificates of size O~(Kr)\tilde{O}(K^r). We combine ADORA and GAMB with Tetris to form a new algorithm we call TetrisReordered, which provides several new beyond worst-case bounds. On infinite families of queries, TetrisReordered's runtimes are unboundedly better than the bounds stated in prior work

    Optimal Joins Using Compact Data Structures

    Get PDF
    Worst-case optimal join algorithms have gained a lot of attention in the database literature. We now count with several algorithms that are optimal in the worst case, and many of them have been implemented and validated in practice. However, the implementation of these algorithms often requires an enhanced indexing structure: to achieve optimality we either need to build completely new indexes, or we must populate the database with several instantiations of indexes such as B+-trees. Either way, this means spending an extra amount of storage space that may be non-negligible. We show that optimal algorithms can be obtained directly from a representation that regards the relations as point sets in variable-dimensional grids, without the need of extra storage. Our representation is a compact quadtree for the static indexes, and a dynamic quadtree sharing subtrees (which we dub a qdag) for intermediate results. We develop a compositional algorithm to process full join queries under this representation, and show that the running time of this algorithm is worst-case optimal in data complexity. Remarkably, we can extend our framework to evaluate more expressive queries from relational algebra by introducing a lazy version of qdags (lqdags). Once again, we can show that the running time of our algorithms is worst-case optimal

    Instance and Output Optimal Parallel Algorithms for Acyclic Joins

    Full text link
    Massively parallel join algorithms have received much attention in recent years, while most prior work has focused on worst-optimal algorithms. However, the worst-case optimality of these join algorithms relies on hard instances having very large output sizes, which rarely appear in practice. A stronger notion of optimality is {\em output-optimal}, which requires an algorithm to be optimal within the class of all instances sharing the same input and output size. An even stronger optimality is {\em instance-optimal}, i.e., the algorithm is optimal on every single instance, but this may not always be achievable. In the traditional RAM model of computation, the classical Yannakakis algorithm is instance-optimal on any acyclic join. But in the massively parallel computation (MPC) model, the situation becomes much more complicated. We first show that for the class of r-hierarchical joins, instance-optimality can still be achieved in the MPC model. Then, we give a new MPC algorithm for an arbitrary acyclic join with load O ({\IN \over p} + {\sqrt{\IN \cdot \OUT} \over p}), where \IN,\OUT are the input and output sizes of the join, and pp is the number of servers in the MPC model. This improves the MPC version of the Yannakakis algorithm by an O (\sqrt{\OUT \over \IN} ) factor. Furthermore, we show that this is output-optimal when \OUT = O(p \cdot \IN), for every acyclic but non-r-hierarchical join. Finally, we give the first output-sensitive lower bound for the triangle join in the MPC model, showing that it is inherently more difficult than acyclic joins
    • …
    corecore