
    A survey of parallel execution strategies for transitive closure and logic programs

    An important feature of database technology of the nineties is the use of parallelism for speeding up the execution of complex queries. This technology is being tested in several experimental database architectures and a few commercial systems for conventional select-project-join queries. In particular, hash-based fragmentation is used to distribute data to disks under the control of different processors in order to perform selections and joins in parallel. With the development of new query languages, and in particular with the definition of transitive closure queries and of more general logic programming queries, the new dimension of recursion has been added to query processing. Recursive queries are complex; at the same time, their regular structure is particularly suited for parallel execution, and parallelism may give a high efficiency gain. We survey the approaches to parallel execution of recursive queries that have been presented in the recent literature. We observe that research on parallel execution of recursive queries is separated into two distinct subareas, one focused on the transitive closure of Relational Algebra expressions, the other on the optimization of more general Datalog queries. Though the subareas seem radically different because of the approaches and formalisms used, they have many common features. This is not surprising, because most typical Datalog queries can be solved by means of the transitive closure of simple algebraic expressions. We first analyze the relationship between the transitive closure of expressions in Relational Algebra and Datalog programs. We then review sequential methods for evaluating transitive closure, distinguishing between iterative and direct methods. We address the parallelization of these methods, discussing the various forms that parallelism can take. Data fragmentation plays an important role in obtaining parallel execution; we describe hash-based and semantic fragmentation.
Finally, we consider Datalog queries and present general methods for parallel rule execution; when these methods are applied to linear Datalog queries, we recognize their similarities to the methods reviewed previously. We also provide a quantitative analysis that shows the impact of the initial data distribution on the performance of these methods.
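The iterative evaluation of transitive closure that the survey contrasts with direct methods can be illustrated by the classic semi-naive strategy, which joins only the tuples derived in the previous iteration against the base relation. The sketch below is a minimal in-memory illustration, not code from the paper:

```python
def transitive_closure(edges):
    """Semi-naive iterative transitive closure of a binary relation
    given as a set of (source, target) pairs."""
    closure = set(edges)   # all tuples discovered so far
    delta = set(edges)     # tuples that are new in the last iteration
    while delta:
        # Join only the newly derived tuples with the base relation,
        # the key idea distinguishing semi-naive from naive iteration.
        new_tuples = {(a, c)
                      for (a, b) in delta
                      for (b2, c) in edges
                      if b == b2} - closure
        closure |= new_tuples
        delta = new_tuples
    return closure

# Example: the chain 1 -> 2 -> 3 yields the derived pair (1, 3).
print(sorted(transitive_closure({(1, 2), (2, 3)})))
# [(1, 2), (1, 3), (2, 3)]
```

In a parallel setting of the kind the survey discusses, the edge relation would be fragmented (e.g., by hashing the join attribute) so that each processor runs this loop over its own fragment.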

    From Massive Parallelization to Quantum Computing: Seven Novel Approaches to Query Optimization

    The goal of query optimization is to map a declarative query (describing data to generate) to a query plan (describing how to generate the data) with optimal execution cost. Query optimization is required to support declarative query interfaces. It is a core problem in the area of database systems and has received tremendous attention in the research community, starting with an initial publication in 1979. In this thesis, we revisit the query optimization problem, motivated by several developments that have changed its context but are not yet reflected in prior literature. First, advances in query execution platforms and processing techniques have changed the context of query optimization. Novel provisioning models and processing techniques such as Cloud computing, crowdsourcing, or approximate processing make it possible to trade off different execution cost metrics (e.g., execution time versus monetary execution fees in the case of Cloud computing). This makes it necessary to compare alternative execution plans according to multiple cost metrics during query optimization. Although this is a common scenario nowadays, the literature on query optimization with multiple cost metrics (a generalization of the classical problem variant with a single execution cost metric) is surprisingly sparse. Whereas prior methods take hours to optimize even moderately sized queries when considering multiple cost metrics, we propose a multitude of approaches that make query optimization in such scenarios practical. A second development that we address in this thesis is the availability of novel software and hardware platforms that can be exploited for optimization. We show that integer programming solvers, massively parallel clusters (which nowadays are commonly used for query execution), and adiabatic quantum annealers enable us to solve query optimization problem instances that are far beyond the capabilities of prior approaches.
In summary, we propose seven novel approaches to query optimization that significantly increase the size of the problem instances that can be addressed (measured by the query size and by the number of execution cost metrics considered). These approaches fall into three broad categories: moving query optimization before run time to relax constraints on optimization time; trading optimization time for relaxed optimality guarantees (leading to approximation schemes, incremental algorithms, and randomized algorithms for query optimization with multiple cost metrics); and reducing optimization time by leveraging novel software and hardware platforms (integer programming solvers, massively parallel clusters, and adiabatic quantum annealers). The approaches are novel because they address new problem variants of query optimization introduced in this thesis, because they are the first of their kind for their respective problem variant (e.g., we propose the first randomized algorithm for query optimization with multiple cost metrics), or because they have never been used for optimization problems in the database domain (e.g., this is the first time that quantum computing is used to solve a database-specific optimization problem).
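Comparing plans under multiple cost metrics, as the abstract describes, means no single "cheapest" plan exists in general; a standard formalization keeps the Pareto frontier of plans. The sketch below illustrates that comparison step only; the plan names and cost values are invented for illustration and are not taken from the thesis:

```python
def dominates(a, b):
    """Cost vector a dominates b if a is no worse in every metric
    and strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_frontier(plans):
    """Keep only plans not dominated by any other plan.
    `plans` maps a plan name to a (time, monetary_fee) cost vector."""
    return {name: cost for name, cost in plans.items()
            if not any(dominates(other, cost)
                       for o, other in plans.items() if o != name)}

plans = {
    "p1": (10.0, 5.0),   # fast but expensive
    "p2": (30.0, 1.0),   # slow but cheap
    "p3": (35.0, 6.0),   # worse than p1 in both metrics
}
print(pareto_frontier(plans))
# {'p1': (10.0, 5.0), 'p2': (30.0, 1.0)}
```

With a single cost metric the frontier collapses to one plan, which is why the multi-metric variant is a strict generalization of classical query optimization.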

    Distributed top-k aggregation queries at large

    Top-k query processing is a fundamental building block for efficient ranking in a large number of applications. Efficiency is a central issue, especially in distributed settings, where the data is spread across different nodes in a network. This paper introduces novel optimization methods for top-k aggregation queries in such distributed environments. The optimizations can be applied to all algorithms that fall into the frameworks of the prior TPUT and KLEE methods. The optimizations address three degrees of freedom: 1) hierarchically grouping input lists into top-k operator trees and optimizing the tree structure, 2) computing data-adaptive scan depths for different input sources, and 3) data-adaptive sampling of a small subset of input sources in scenarios with hundreds or thousands of query-relevant network nodes. All optimizations are based on a statistical cost model that utilizes local synopses, e.g., in the form of histograms, efficiently computed convolutions, and estimators based on order statistics. The paper presents comprehensive experiments, with three different real-life datasets and using the ns-2 network simulator for a packet-level simulation of a large Internet-style network.
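The TPUT framework that these optimizations build on aggregates per-node scores by sum and prunes items using a threshold derived from partial sums. The following is a heavily simplified sketch in the spirit of TPUT's phased protocol; the data, variable names, and the collapsing of the final lookup phase are all illustrative simplifications, not the paper's algorithm:

```python
from collections import defaultdict
import heapq

def tput_topk(node_lists, k):
    """Simplified TPUT-style top-k over `node_lists`, a list of
    per-node dicts mapping item -> local score; total score is the sum."""
    m = len(node_lists)
    # Phase 1: each node reports its local top-k; form partial sums.
    partial = defaultdict(float)
    for scores in node_lists:
        for item, s in heapq.nlargest(k, scores.items(), key=lambda kv: kv[1]):
            partial[item] += s
    tau = sorted(partial.values(), reverse=True)[k - 1]   # k-th partial sum
    # An item unseen so far must score at least tau/m at some node
    # to reach a total of tau, so tau/m is a safe per-node threshold.
    threshold = tau / m
    # Phase 2: fetch every local score above the threshold and aggregate.
    totals = defaultdict(float)
    for scores in node_lists:
        for item, s in scores.items():
            if s >= threshold:
                totals[item] += s
    # Phase 3 (collapsed here): exact totals for surviving candidates.
    exact = {item: sum(scores.get(item, 0.0) for scores in node_lists)
             for item in totals}
    return heapq.nlargest(k, exact.items(), key=lambda kv: kv[1])

nodes = [{"a": 5.0, "b": 3.0, "c": 1.0},
         {"a": 4.0, "c": 2.0, "d": 6.0}]
print(tput_topk(nodes, 2))
# [('a', 9.0), ('d', 6.0)]
```

The paper's contributions then tighten exactly these knobs: how deep each node scans, how lists are grouped into operator trees, and which sources are sampled rather than contacted.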

    Extending a multi-set relational algebra to a parallel environment

    Parallel database systems are very likely the future of high-performance, data-intensive applications. In the past decade, many parallel database systems have been developed, together with many languages and approaches to specify operations in these systems. A common formal background is still missing, however. This paper proposes an extended relational algebra for this purpose, based on the well-known standard relational algebra. The extended algebra provides both complete database manipulation language features, and data distribution and process allocation primitives to describe parallelism. It is defined in terms of multi-sets of tuples to allow handling of duplicates and to obtain a close connection to the world of high-performance data processing. Due to its algebraic nature, the language is well suited for optimization and parallelization through expression rewriting. The proposed language can be used as a database manipulation language on its own, as has been done in the PRISMA parallel database project, or as a formal basis for other languages, like SQL.
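The multi-set (bag) semantics the abstract refers to differs from set semantics mainly in that duplicates are tracked through every operator. A small illustrative sketch of bag-style selection, projection, and union, using multiplicity counts (not the paper's formal algebra):

```python
from collections import Counter

def select(rel, pred):
    """Bag selection: a tuple keeps its full multiplicity if it satisfies pred."""
    return Counter({t: n for t, n in rel.items() if pred(t)})

def project(rel, idx):
    """Bag projection: duplicates are preserved, not eliminated,
    so multiplicities of tuples that collapse together add up."""
    out = Counter()
    for t, n in rel.items():
        out[tuple(t[i] for i in idx)] += n
    return out

def bag_union(r, s):
    """Additive bag union: multiplicities add."""
    return r + s

# R(a, b) with the tuple (1, 'x') occurring twice:
R = Counter({(1, "x"): 2, (2, "y"): 1})
print(project(R, [1]))
# Counter({('x',): 2, ('y',): 1})
```

Because SQL itself has bag semantics, defining the algebra over multi-sets is what lets it serve as a formal basis for SQL, as the abstract notes.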

    Forecasting the cost of processing multi-join queries via hashing for main-memory databases (Extended version)

    Database management systems (DBMSs) carefully optimize complex multi-join queries to avoid expensive disk I/O. As servers today feature tens or hundreds of gigabytes of RAM, a significant fraction of many analytic databases becomes memory-resident. Even after careful tuning for an in-memory environment, a linear disk I/O cost model such as the one implemented in PostgreSQL can select multi-join query plans whose response time over memory-resident data is up to 2X that of the optimal plan. This paper introduces a memory I/O cost model to identify good evaluation strategies for complex query plans with multiple hash-based equi-joins over memory-resident data. The proposed cost model is carefully validated for accuracy using three different systems, including an Amazon EC2 instance, to control for hardware-specific differences. Prior work in parallel query evaluation has advocated right-deep and bushy trees for multi-join queries due to their greater parallelization and pipelining potential. A surprising finding is that the conventional wisdom from shared-nothing disk-based systems does not directly apply to the modern shared-everything memory hierarchy. As corroborated by our model, the performance gap between the optimal left-deep and right-deep query plan can grow to about 10X as the number of joins in the query increases.
    Comment: 15 pages, 8 figures, extended version of the paper to appear in SoCC'1
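The operator at the heart of the paper's cost model is the in-memory hash equi-join: build a hash table on one input, then probe it with the other. A minimal sketch (the schemas and data are invented for illustration; real systems build on the smaller input to keep the hash table cache-resident):

```python
from collections import defaultdict

def hash_join(build, probe, build_key, probe_key):
    """In-memory hash equi-join of two lists of dict rows."""
    # Build phase: hash table keyed on the join attribute of `build`.
    table = defaultdict(list)
    for row in build:
        table[row[build_key]].append(row)
    # Probe phase: stream `probe` and emit merged matching rows.
    return [{**b, **p}
            for p in probe
            for b in table.get(p[probe_key], [])]

orders = [{"oid": 1, "cid": 7}, {"oid": 2, "cid": 8}]
custs = [{"cid": 7, "name": "ann"}]
print(hash_join(custs, orders, "cid", "cid"))
# [{'cid': 7, 'name': 'ann', 'oid': 1}]
```

In a left-deep plan each join's output is the probe input of the next join, so only one hash table is live at a time; in a right-deep plan all hash tables are built up front and probed in a pipeline, which is exactly the memory-traffic trade-off the paper's model captures.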