
    Operationalizing and automating data governance

    The ability to cross data from multiple sources represents a competitive advantage for organizations. Yet, the governance of the data lifecycle, from the data sources to valuable insights, is largely performed in an ad-hoc or manual manner. This is especially concerning in scenarios where tens or hundreds of continuously evolving data sources produce semi-structured data. To overcome this challenge, we develop a framework for operationalizing and automating data governance. For the former, we propose a zoned data lake architecture and a set of data governance processes that allow the systematic ingestion, transformation, and integration of data from heterogeneous sources, in order to make them readily available to business users. For the latter, we propose a set of metadata artifacts that allow the automatic execution of data governance processes, addressing a wide range of data management challenges. We showcase the usefulness of the proposed approach on a real-world use case, stemming from a collaborative project with the World Health Organization for the management and analysis of data about Neglected Tropical Diseases. Overall, this work helps organizations adopt data-driven strategies by providing a cohesive framework for operationalizing and automating data governance.

    This work was partly supported by the DOGO4ML project, funded by the Spanish Ministerio de Ciencia e Innovación under project PID2020-117191RB-I00/AEI/10.13039/501100011033. Sergi Nadal is partly supported by the Spanish Ministerio de Ciencia e Innovación, as well as the European Union - NextGenerationEU, under project FJC2020-045809-I/AEI/10.13039/501100011033.
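    To make the metadata-driven automation idea concrete, the following is a minimal sketch of how a declarative metadata artifact could drive the ingestion of one semi-structured source from a landing zone into a formatted zone. The artifact fields, zone names, and the `who_ntd_reports` identifier are hypothetical illustrations, not the metadata model or governance processes defined in the paper.

```python
# Minimal sketch of metadata-driven ingestion for a zoned data lake.
# The artifact structure and zone names are hypothetical illustrations.
import json
from pathlib import Path

# A metadata artifact describing one semi-structured source and how it
# should move from the landing zone into the formatted zone.
source_artifact = {
    "source_id": "who_ntd_reports",              # hypothetical source name
    "format": "json",
    "landing_path": "landing/who_ntd_reports",
    "formatted_path": "formatted/who_ntd_reports",
    "attributes": ["country", "disease", "year", "cases"],
}

def ingest(artifact: dict) -> None:
    """Copy records from the landing zone into the formatted zone,
    keeping only the attributes declared in the metadata artifact."""
    landing = Path(artifact["landing_path"])
    formatted = Path(artifact["formatted_path"])
    if not landing.exists():
        return
    formatted.mkdir(parents=True, exist_ok=True)
    for raw_file in landing.glob("*.json"):
        records = json.loads(raw_file.read_text())
        cleaned = [
            {attr: rec.get(attr) for attr in artifact["attributes"]}
            for rec in records
        ]
        (formatted / raw_file.name).write_text(json.dumps(cleaned, indent=2))

if __name__ == "__main__":
    ingest(source_artifact)
```

    The point of the sketch is that the ingestion code is generic: adding or evolving a source only requires registering a new artifact, which is what makes the governance processes automatable.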

    Data-Efficient Learned Database Components

    While databases are the backbone of many software systems, database components such as query optimizers often have to be redesigned to meet the increasing variety in workloads, data, and hardware designs, which incurs significant engineering effort. Recently, it was thus proposed to replace DBMS components such as optimizers and cardinality estimators with ML models, which not only eliminates this engineering effort but also provides superior performance for many components. The predominant approach to derive such learned components is workload-driven learning, where tens of thousands of queries have to be executed first to obtain the necessary training data. Unfortunately, this training data collection, which can take days even for medium-sized datasets, has to be repeated for every new database (i.e., every combination of dataset, schema, and workload) a component should be deployed for. This is especially problematic for cloud databases such as Snowflake or Redshift, since the effort has to be incurred for every customer. This dissertation thus proposes data-efficient learned database components, which either reduce or fully eliminate the high cost of training data collection. In particular, three directions are proposed: (i) we first aim to reduce the number of training queries needed for workload-driven components, before we (ii) propose data-driven learning, which uses the data stored in the database as training data instead of queries, and (iii) introduce zero-shot learned components, which generalize to new databases out of the box, so that no training data collection is required.

    First, we strive to reduce the number of training queries required for workload-driven components by using simulation models that convey the basic trade-offs of the underlying problem, e.g., that in database partitioning the network cost of shuffling tuples for joins is the dominating factor. This substantially reduces the number of training queries, since the basic principles are already covered by the simulation model and only subtleties not captured by it have to be learned from observed query executions; we demonstrate this for the problem of database partitioning. An alternative direction is to incorporate domain knowledge (e.g., encoding in a cost model that scan costs increase linearly with the number of tuples) into components by designing them using differentiable programming. This significantly reduces the number of learnable parameters and thus the number of required training queries, as we demonstrate for the problem of cost estimation in databases.

    While both approaches reduce the number of training queries, a significant number is still required for unseen databases. This motivates our second direction, data-driven learning. In particular, we propose to train database components by learning the data distribution present in a database instead of observing query executions. This not only completely eliminates the need to collect training queries but can even improve the state of the art in problems such as cardinality estimation or AQP.
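    As a toy illustration of the data-driven idea, the sketch below fits per-column histograms on the stored data and answers cardinality estimates without executing a single training query. The actual data-driven approaches rely on much richer density models; the per-column histograms and the independence assumption across columns are deliberate simplifications for illustration only.

```python
# Toy data-driven cardinality estimator: fit only on the stored data,
# never on query executions. A deliberately simplistic stand-in for the
# far richer density models used by real data-driven approaches.
import numpy as np

class HistogramCardinalityEstimator:
    def __init__(self, bins: int = 32):
        self.bins = bins
        self.histograms = {}   # column -> (bin probabilities, bin edges)
        self.num_rows = 0

    def fit(self, table: dict) -> None:
        """Learn one histogram per column from the data itself."""
        self.num_rows = len(next(iter(table.values())))
        for col, values in table.items():
            counts, edges = np.histogram(values, bins=self.bins)
            self.histograms[col] = (counts / self.num_rows, edges)

    def estimate(self, predicates: dict) -> float:
        """Estimate the cardinality of conjunctive range predicates,
        assuming independence across columns."""
        selectivity = 1.0
        for col, (lo, hi) in predicates.items():
            probs, edges = self.histograms[col]
            # Count only bins that lie entirely inside the predicate range.
            in_range = (edges[:-1] >= lo) & (edges[1:] <= hi)
            selectivity *= probs[in_range].sum()
        return selectivity * self.num_rows

# Usage: fit on synthetic data, then answer a query without ever running it.
rng = np.random.default_rng(0)
table = {
    "age": rng.integers(18, 80, 100_000),
    "salary": rng.normal(50_000, 15_000, 100_000),
}
est = HistogramCardinalityEstimator()
est.fit(table)
print(est.estimate({"age": (30, 40), "salary": (40_000, 60_000)}))
```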
    While we demonstrate the applicability to a wide range of additional database tasks, such as the completion of incomplete relational datasets, data-driven learning is only useful for problems where the data distribution provides sufficient information for the underlying task. For tasks where observations of query executions are indispensable, such as cost estimation, data-driven learning cannot be leveraged. In a third direction, we thus propose zero-shot learned database components, which are applicable to a broader set of tasks, including those that require observations of queries. In particular, motivated by recent advances in transfer learning, we propose to pretrain a model once on a variety of databases and workloads, allowing the component to generalize to unseen databases out of the box. Hence, similar to data-driven learning, no training queries have to be collected. In this dissertation, we demonstrate that zero-shot learning can indeed yield learned cost models that predict query latencies on entirely unseen databases more accurately than state-of-the-art workload-driven approaches, which require tens of thousands of query executions on every unseen database.

    Overall, the proposed techniques yield state-of-the-art performance for many database tasks while significantly reducing or completely eliminating the expensive training data collection for unseen databases. However, while the proposed directions address the prevalent data inefficiency of learned database components, there are still many opportunities to improve learned components in the future. First, the robustness and debuggability of learned components should be improved, since as of today they do not offer the same transparency as standard code in databases, which can make them less attractive for deployment in production systems. Moreover, to increase the applicability of data-driven models, it is desirable to broaden the coverage of supported queries, e.g., queries involving wildcard predicates on string columns, which are currently not supported by data-driven learning. Finally, we envision that a broader set of tasks should be supported by zero-shot models in the future (e.g., query optimization), potentially converging towards complete zero-shot learned systems.
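    The zero-shot direction can be illustrated, under strong simplifying assumptions, by a cost model trained once on plans from a few databases using only database-agnostic plan features and then applied to a plan from an unseen database without any new query executions. The feature set, the made-up training plans, and the linear model below are placeholders, not the actual model architecture or training data used in the dissertation.

```python
# Sketch of the zero-shot idea: train a cost model once on plans from
# several databases using only transferable, database-agnostic features,
# then predict latencies on an unseen database without new training queries.
# Features, training plans, and the linear model are illustrative only.
import numpy as np

def featurize(plan: dict) -> np.ndarray:
    """Database-agnostic plan features: operator counts, tuples, widths."""
    return np.array([
        plan["num_scans"],
        plan["num_joins"],
        plan["est_input_tuples"],
        plan["avg_tuple_width"],
        1.0,  # bias term
    ])

# Hypothetical plans collected once, on databases A and B only.
train_plans = [
    {"num_scans": 1, "num_joins": 0, "est_input_tuples": 1e5, "avg_tuple_width": 40, "latency_ms": 120},
    {"num_scans": 2, "num_joins": 1, "est_input_tuples": 5e5, "avg_tuple_width": 60, "latency_ms": 900},
    {"num_scans": 3, "num_joins": 2, "est_input_tuples": 2e6, "avg_tuple_width": 80, "latency_ms": 4100},
]

X = np.stack([featurize(p) for p in train_plans])
y = np.array([p["latency_ms"] for p in train_plans])
weights, *_ = np.linalg.lstsq(X, y, rcond=None)

# Zero-shot prediction for a plan on an unseen database C: because the
# features carry no database-specific identifiers, no new data is collected.
unseen_plan = {"num_scans": 2, "num_joins": 1, "est_input_tuples": 8e5, "avg_tuple_width": 50}
print(featurize(unseen_plan) @ weights)
```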

    Automatically configuring parallelism for hybrid layouts

    Distributed processing frameworks process data in parallel by dividing it into multiple partitions, each of which is processed in a separate task. The number of tasks is typically determined by the total file size. However, this can lead to launching more tasks than needed in the case of hybrid layouts, because such layouts allow reading less data for certain operations (e.g., projection, selection). Over-provisioning tasks may increase job execution time and waste significant computing resources, since each task introduces extra overhead (e.g., initialization, garbage collection). To enable a more efficient use of resources and reduce job execution time, we propose a cost-based approach that decides the number of tasks based on the data actually being read. The proposed cost model can also be used in a multi-objective approach to decide both the number of tasks and the number of machines for execution.
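    A minimal sketch of the underlying idea, under assumed constants: size parallelism by the bytes a hybrid layout actually reads for a query (after projection and selection) rather than by the total file size. The 128 MB-per-task default and the example fractions are hypothetical placeholders, not the cost model proposed in the paper.

```python
# Sketch: derive the task count from the data actually read by a hybrid
# layout instead of the total file size. Constants and fractions are made up.
def num_tasks(total_bytes: float,
              projected_fraction: float,
              selectivity: float,
              target_bytes_per_task: float = 128 * 1024 ** 2) -> int:
    """Choose a task count from the bytes the query will actually scan."""
    # Hybrid layouts skip columns (projection) and row groups (selection),
    # so only a fraction of the file is read.
    bytes_read = total_bytes * projected_fraction * selectivity
    return max(1, round(bytes_read / target_bytes_per_task))

# Example: a 10 GB file where the query reads 2 of 10 columns (20%) and
# selects ~30% of the row groups -> ~5 tasks instead of the ~80 tasks that
# naive file-size-based provisioning (10 GB / 128 MB) would launch.
print(num_tasks(10 * 1024 ** 3, projected_fraction=0.2, selectivity=0.3))
```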