    LSQB: A large-scale subgraph query benchmark

    We introduce LSQB, a new large-scale subgraph query benchmark. LSQB tests the performance of database management systems on an important class of subgraph queries overlooked by existing benchmarks. The focus of LSQB is subgraph matching, that is, matching a labelled structural graph pattern. In relational terms, the benchmark tests DBMSs' join performance as a choke point, since subgraph matching is equivalent to multi-way joins between base Vertex and base Edge tables on ID attributes. The benchmark focuses on read-heavy workloads by relying on global queries, which have been ignored by prior benchmarks. Global queries, also referred to as unseeded queries, are queries that are constrained only by labels on the query vertices and edges. LSQB contains a total of nine queries and leverages the LDBC social network data generator for scalability. The benchmark has gained both academic and industrial interest and is used internally by at least five different vendors.
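To make the join view of subgraph matching concrete, the following minimal sketch counts the matches of a two-edge labelled pattern with a hash join over a toy Edge table. The data, the labels and the helper names (`edges_with`, `count_two_hop`) are assumptions made for illustration; this is not one of the nine LSQB queries.

```python
from collections import defaultdict

# Hypothetical toy data: base Vertex table (id -> label) and base Edge table.
VERTEX = {1: "Person", 2: "Person", 3: "Person", 10: "Tag", 11: "Tag"}
EDGE = [
    (1, "knows", 2),
    (2, "knows", 3),
    (1, "knows", 3),
    (2, "hasInterest", 10),
    (3, "hasInterest", 10),
    (3, "hasInterest", 11),
]

def edges_with(src_label, edge_label, dst_label):
    """Select (src, dst) pairs whose edge and endpoint labels match the pattern."""
    return [
        (s, d) for s, l, d in EDGE
        if l == edge_label and VERTEX[s] == src_label and VERTEX[d] == dst_label
    ]

def count_two_hop(first, second):
    """Count matches of (a)-[first]->(b)-[second]->(c) via a hash join on b."""
    # Build side: number of outgoing `second` edges per source vertex id.
    fanout = defaultdict(int)
    for s, _ in second:
        fanout[s] += 1
    # Probe side: each `first` edge (a, b) joins with every `second` edge leaving b.
    return sum(fanout[b] for _, b in first)

knows = edges_with("Person", "knows", "Person")
interest = edges_with("Person", "hasInterest", "Tag")
print(count_two_hop(knows, interest))  # prints 5
```

The query is unseeded in the sense used above: no vertex ID is fixed, so every edge with a matching label takes part in the join.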

    Extending dynamic-programming-based plan generators: beyond pure enumeration

    The query optimizer plays an important role in a database management system supporting a declarative query language, such as SQL. One of its central components is the plan generator, which is responsible for determining the optimal join order of a query. Plan generators based on dynamic programming have been known for several decades. However, significant progress in this field has been made only recently. This includes the emergence of highly efficient enumeration algorithms and the ability to optimize a wide range of queries by supporting complex join predicates. This thesis builds upon these recent advancements by providing a framework for extending the aforementioned algorithms. To this end, a modular design is proposed that allows individual parts of the plan generator to be exchanged, thus enabling the implementor to add new features at will. This is demonstrated using two previously unsolved problems as examples, namely the correct and complete reordering of different types of join operators, and the efficient reordering of join operators and grouping operators.
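As background, a dynamic-programming plan generator can be sketched in a few lines. The version below is a deliberately naive bottom-up enumeration over relation subsets, with cross products excluded and the sum of intermediate result sizes used as the cost. The query graph, the cardinalities, the selectivities and all function names are assumptions for illustration; it is neither one of the efficient enumeration algorithms nor the modular framework described in the thesis.

```python
from itertools import combinations

# Hypothetical chain query A - B - C - D; all numbers are illustrative.
CARD = {"A": 1000, "B": 100, "C": 10, "D": 500}
SEL = {frozenset("AB"): 0.01, frozenset("BC"): 0.1, frozenset("CD"): 0.002}

def cardinality(rels):
    """Estimated result size of joining `rels`, assuming independent predicates."""
    size = 1.0
    for r in rels:
        size *= CARD[r]
    for edge, sel in SEL.items():
        if edge <= rels:
            size *= sel
    return size

def connected(left, right):
    """True if at least one join predicate links the two relation sets."""
    return any(e & left and e & right for e in SEL)

def best_plan(relations):
    """Bottom-up DP over subsets; returns (cost, join tree) for all relations."""
    table = {frozenset([r]): (0.0, r) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, size)):
            for k in range(1, size):
                for left in map(frozenset, combinations(sorted(subset), k)):
                    right = subset - left
                    # Skip halves without a plan (disconnected) and cross products.
                    if left not in table or right not in table:
                        continue
                    if not connected(left, right):
                        continue
                    cost = table[left][0] + table[right][0] + cardinality(subset)
                    if subset not in table or cost < table[subset][0]:
                        table[subset] = (cost, (table[left][1], table[right][1]))
    return table[frozenset(relations)]

cost, tree = best_plan(sorted(CARD))
print(round(cost, 3), tree)
```

Because plans for smaller subsets are stored in `table` before larger subsets are considered, each sub-plan is costed once and reused.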

    Optimization of Boolean expressions for main memory database systems

    With the ubiquity of main memory databases, which are increasingly replacing the old disk-oriented databases, relations are being stored in denormalized form in order to increase query throughput. As a result, the dominance of join operators in terms of costs is being replaced by the costs of evaluating selection predicates. Boolean expressions containing selection predicates connected both conjunctively and disjunctively have thus far been optimized by rather simple heuristics, which leaves a large optimization potential unharvested. To exacerbate the matter, such heuristics rely on the independent predicate selectivity assumption, which typically does not hold, and on the constant predicate cost assumption, which does not hold for main memory database systems either. In this thesis we tackle the problem of optimizing Boolean expressions without relying on either the independence assumption or the constant predicate cost assumption. We present optimization algorithms for queries containing both conjunctively and disjunctively connected predicates, together with a cost model which precisely captures CPU architectural characteristics such as branch misprediction. Our optimization algorithms achieve the optimum in terms of plan quality and thus harvest the entire optimization potential inherent in Boolean expressions.
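For context, the kind of simple heuristic referred to above can be sketched as follows: under the independence and constant-cost assumptions, the predicates of a purely conjunctive expression are ordered by ascending rank (selectivity - 1) / cost. The predicate names, selectivities and costs are assumptions for illustration. This is the baseline that the thesis moves beyond, not its algorithm, and it models neither disjunctions nor branch misprediction.

```python
from itertools import permutations

# Hypothetical predicates: (name, selectivity, per-tuple evaluation cost).
PREDICATES = [("p1", 0.9, 1.0), ("p2", 0.1, 5.0), ("p3", 0.5, 2.0)]

def expected_cost(order):
    """Expected per-tuple cost of a short-circuit AND evaluated in `order`."""
    cost, surviving = 0.0, 1.0
    for _, selectivity, pred_cost in order:
        cost += surviving * pred_cost   # only surviving tuples reach this predicate
        surviving *= selectivity        # independence assumption
    return cost

def rank_order(preds):
    """Classic heuristic: sort by ascending (selectivity - 1) / cost."""
    return sorted(preds, key=lambda p: (p[1] - 1.0) / p[2])

ordered = rank_order(PREDICATES)
print([name for name, _, _ in ordered], expected_cost(ordered))

# Brute force over all orders, feasible only for a handful of predicates.
print(min(expected_cost(p) for p in permutations(PREDICATES)))
```

On this toy input the rank order matches the brute-force minimum. That agreement depends on both assumptions holding, which is exactly what the abstract calls into question.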

    Discovery and application of data dependencies

    Advisor: Prof. Dr. Eduardo Cunha de Almeida. Thesis (doctorate) - Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 08/09/2020. Includes references: p. 126-140. Area of concentration: Computer Science.
Abstract: Data dependencies (or dependencies, for short) have a fundamental role in many facets of data management. As a result, recent research has been continually driving contributions to central problems in connection with dependencies. This thesis makes contributions that reach two of these problems. The first problem regards the discovery of dependencies of high expressive power. The goal is to replace the error-prone process of manual design of dependencies with an algorithm capable of discovering dependencies using only data. In this thesis, we study the discovery of denial constraints, a type of dependency that circumvents many expressiveness drawbacks. Denial constraints have enough expressive power to generalize other important types of dependencies and to express complex business rules. However, their discovery is computationally hard since it regards a search space that is bigger than the search space seen in the discovery of simpler dependencies. This thesis introduces novel algorithmic techniques in the form of an algorithm for the discovery of denial constraints. We evaluate the design of our algorithm in a variety of scenarios: real and synthetic datasets, and a varying number of records and columns. Our evaluation shows that, compared to state-of-the-art solutions, our algorithm significantly improves the efficiency of denial constraint discovery in terms of runtime.
The second problem concerns the application of dependencies in data management. We first study the application of dependencies for improving data consistency, a critical aspect of data quality. A common way to model data inconsistencies is by identifying violations of dependencies. In that context, this thesis presents a method that extends our algorithm for the discovery of denial constraints such that it can return reliable results even if the algorithm runs on data containing some inconsistent records. A central insight is that it is possible to extract evidence from datasets to discover denial constraints that almost hold in the dataset. Our evaluation shows that our method returns denial constraints that can identify, with good precision and recall, inconsistencies in the input dataset. This thesis makes one more contribution regarding the application of dependencies for improving data consistency. It presents a system for detecting violations of dependencies efficiently. We perform an extensive evaluation of our system that includes comparisons with several different approaches, real-world and synthetic data, and various kinds of denial constraints. We show that the tested commercial database management systems start underperforming for relatively small datasets and production dependencies in the form of denial constraints. Our system, in turn, is up to three orders of magnitude faster than related solutions, especially for larger datasets and massive numbers of identified violations.
Our final contribution regards the application of dependencies in query optimization. In particular, this thesis presents a system for the automatic discovery and selection of functional dependencies that potentially improve query executions. Our system combines representations from the functional dependencies discovered in a dataset with representations of the query workloads that run for that dataset. This combination guides the selection of functional dependencies that can produce query rewritings for the incoming queries. Our experimental evaluation shows that our system selects relevant functional dependencies, which can help in reducing the overall query response time.
Keywords: Data profiling. Data quality. Data cleaning. Data dependencies. Query execution.
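To illustrate what a denial constraint expresses, the sketch below checks one constraint over a toy relation with a naive pairwise scan. The relation, the column names and the constraint itself are assumptions for illustration. The quadratic scan only shows the semantics of a violation; it is not the efficient detection system described in the abstract.

```python
from itertools import permutations
import operator

# Hypothetical relation; column names and values are illustrative only.
ROWS = [
    {"id": 1, "state": "PR", "salary": 3000, "tax": 300},
    {"id": 2, "state": "PR", "salary": 4000, "tax": 250},
    {"id": 3, "state": "SP", "salary": 5000, "tax": 100},
]

# A denial constraint forbids any tuple pair (t, s) satisfying all predicates:
# not (t.state == s.state and t.salary > s.salary and t.tax < s.tax)
DC = [("state", operator.eq), ("salary", operator.gt), ("tax", operator.lt)]

def violations(rows, dc):
    """Return id pairs (t, s) that satisfy every predicate of the constraint."""
    return [
        (t["id"], s["id"])
        for t, s in permutations(rows, 2)
        if all(op(t[col], s[col]) for col, op in dc)
    ]

print(violations(ROWS, DC))  # prints [(2, 1)]
```

Tuple 2 earns more than tuple 1 in the same state yet pays less tax, so the pair (2, 1) violates the constraint. Tuple 3 is in a different state and therefore never satisfies the equality predicate.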

    Algorithms for Efficient Top-Down Join Enumeration

    For a DBMS that provides support for a declarative query language like SQL, the query optimizer is a crucial piece of software. The declarative nature of a query allows it to be translated into many equivalent evaluation plans. The process of choosing a suitable plan from all alternatives is known as query optimization. This choice is based on a cost model and statistics over the data. Essential for the cost of a plan is the execution order of join operations in its operator tree, since the runtime of plans with different join orders can vary by several orders of magnitude. An exhaustive search for an optimal solution over all possible operator trees is computationally infeasible. To decrease complexity, the search space must be restricted. Therefore, a well-accepted heuristic is applied: all possible bushy join trees are considered, while cross products are excluded from the search. There are two efficient approaches to identify the best plan: bottom-up and top-down join enumeration. However, only the top-down approach allows for branch-and-bound pruning, which can improve compile time by several orders of magnitude while still preserving optimality. Hence, this thesis focuses on top-down join enumeration. In the first part, we present two efficient graph-partitioning algorithms suitable for top-down join enumeration. However, as we will see, there are two severe limitations: the proposed algorithms can handle only (1) simple (binary) join predicates and (2) inner joins. Therefore, the second part adapts one of the proposed partitioning strategies to overcome those limitations. Furthermore, we propose a more generic partitioning framework that enables every graph-partitioning algorithm to handle join predicates involving more than two relations, as well as outer joins and other non-inner joins. As we will see, our framework is more efficient than the adapted graph-partitioning algorithm.
The third part of this thesis discusses the two branch-and-bound pruning strategies that can be found in the literature. We present seven advancements to the combined strategy that improve pruning (1) in terms of effectiveness, (2) in terms of robustness and, most importantly, (3) by avoiding the worst-case behavior otherwise observed. Different experiments evaluate the performance improvements of our proposed methods. We use the TPC-H, TPC-DS and SQLite test suite benchmarks to evaluate our combined contributions.
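The interplay of top-down enumeration, memoization and branch-and-bound pruning can be sketched as follows. This is a simplified form of accumulated-cost bounding with a naive generate-and-test partitioner, written under assumed cardinalities and selectivities. All names are hypothetical, and the code reproduces neither the graph-partitioning algorithms nor the seven pruning advancements of the thesis.

```python
from itertools import combinations

# Hypothetical chain query A - B - C - D; all numbers are illustrative.
CARD = {"A": 1000, "B": 100, "C": 10, "D": 500}
SEL = {frozenset("AB"): 0.01, frozenset("BC"): 0.1, frozenset("CD"): 0.002}

def cardinality(rels):
    """Estimated join result size; also the cost charged for producing it."""
    size = 1.0
    for r in rels:
        size *= CARD[r]
    for edge, sel in SEL.items():
        if edge <= rels:
            size *= sel
    return size

def is_connected(rels):
    """Check that `rels` induces a connected subgraph of the query graph."""
    rels = set(rels)
    seen, stack = set(), [next(iter(rels))]
    while stack:
        r = stack.pop()
        if r not in seen:
            seen.add(r)
            stack.extend(x for e in SEL if r in e for x in e if x in rels)
    return seen == rels

def partitions(rels):
    """Naive generate-and-test: split into two connected, linked halves."""
    items = sorted(rels)
    for k in range(1, len(items)):
        for left in map(frozenset, combinations(items, k)):
            right = rels - left
            linked = any(e & left and e & right for e in SEL)
            if linked and is_connected(left) and is_connected(right):
                yield left, right

memo, lower_bound = {}, {}

def top_down(rels, budget):
    """Return the optimal (cost, tree) if its cost is <= budget, else None."""
    if budget < 0:
        return None
    if len(rels) == 1:
        return (0.0, next(iter(rels)))
    if rels in memo:
        return memo[rels] if memo[rels][0] <= budget else None
    if lower_bound.get(rels, -1.0) >= budget:
        return None              # an earlier call already failed with this budget
    own, best, limit = cardinality(rels), None, budget
    for left, right in partitions(rels):
        l = top_down(left, limit - own)
        if l is None:
            continue             # pruned: the left side alone exceeds the bound
        r = top_down(right, limit - own - l[0])
        if r is None:
            continue
        best = (own + l[0] + r[0], (l[1], r[1]))
        limit = best[0]          # tighten the bound for the remaining partitions
    if best is None:
        lower_bound[rels] = max(lower_bound.get(rels, -1.0), budget)
        return None
    memo[rels] = best
    return best

print(top_down(frozenset(CARD), float("inf")))
```

A call that fails records its budget as a lower bound on the optimal cost of that relation set, so later calls with the same or a smaller budget can return immediately. A plan is memoized only when it is optimal, which is what allows pruning without giving up optimality.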