52 research outputs found

The VADA Architecture for Cost-Effective Data Wrangling

Author: Abel Edward
Civili Cristina
Fernandes Alvaro A.A.
Gottlob Georg
Keane John A.
Koehler Martin
Konstantinou Nikolaos
Libkin Leonid
Neumayr Bernd
Paton Norman W.
Sallinger Emanuel
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2017

Data wrangling, the multi-faceted process by which the data required by an application is identified, extracted, cleaned and integrated, is often cumbersome and labor intensive. In this paper, we present an architecture that supports a complete data wrangling lifecycle, orchestrates components dynamically, builds on automation wherever possible, is informed by whatever data is available, refines automatically produced results in the light of feedback, takes into account the user’s priorities, and supports data scientists with diverse skill sets. The architecture is demonstrated in practice for wrangling property sales and open government data.

Crossref

Edinburgh Research Explorer

Oxford University Research Archive

The University of Manchester - Institutional Repository

The Vadalog System: Datalog-based Reasoning for Knowledge Graphs

Author: Bellomarini Luigi
Gottlob Georg
Sallinger Emanuel
Publication date: 01/05/2018

Over the past years, there has been a resurgence of Datalog-based systems in the database community as well as in industry. In this context, it has been recognized that to handle the complex knowledge-based scenarios encountered today, such as reasoning over large knowledge graphs, Datalog has to be extended with features such as existential quantification. Yet, Datalog-based reasoning in the presence of existential quantification is in general undecidable. Many efforts have been made to define decidable fragments. Warded Datalog+/- is a very promising one, as it captures PTIME complexity while allowing ontological reasoning. Yet so far, no implementation of Warded Datalog+/- was available. In this paper we present the Vadalog system, a Datalog-based system for performing complex logic reasoning tasks, such as those required in advanced knowledge graphs. The Vadalog system is Oxford's contribution to the VADA research programme, a joint effort of the universities of Oxford, Manchester and Edinburgh and around 20 industrial partners. As the main contribution of this paper, we illustrate the first implementation of Warded Datalog+/-, a high-performance Datalog+/- system utilizing an aggressive termination control strategy. We also provide a comprehensive experimental evaluation. Comment: Extended version of VLDB paper <https://doi.org/10.14778/3213880.3213888>

arXiv.org e-Print Archive

Crossref

Oxford University Research Archive

Fairness in Data Wrangling

Author: Fernandes Alvaro A.A.
Konstantinou Nikolaos
Mazilu Lacramioara
Paton Norman W.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 10/09/2020

Crossref

The University of Manchester - Institutional Repository

Data Wrangling for Big Data: Challenges and Opportunities

Author: Furche Tim
Gottlob Georg
Libkin Leonid
Orsi Giorgio
Paton Norman W.
Publication date: 01/01/2016

University of Birmingham Research Portal

Edinburgh Research Explorer

The University of Manchester - Institutional Repository

Feedback Driven Improvement of Data Preparation Pipelines

Author: Konstantinou Nikolaos
Paton Norman
Publication date: 01/01/2019

The University of Manchester - Institutional Repository

Data context informed data wrangling

Author: Abel Edward
Bogatu Alex
Civili Cristina
Fernandes Alvaro A. A.
Keane John
Koehler Martin
Konstantinou Nikolaos
Libkin Leonid
Paton Norman W.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/12/2017

The process of preparing potentially large and complex data sets for further analysis or manual examination is often called data wrangling. In classical warehousing environments, the steps in such a process have been carried out using Extract-Transform-Load platforms, with significant manual involvement in specifying, configuring or tuning many of them. Cost-effective data wrangling processes need to ensure that data wrangling steps benefit from automation wherever possible. In this paper, we define a methodology to fully automate an end-to-end data wrangling process incorporating data context, which associates portions of a target schema with potentially spurious extensional data of types that are commonly available. Instance-based evidence together with data profiling paves the way to inform automation in several steps within the wrangling process, specifically, matching, mapping validation, value format transformation, and data repair. The approach is evaluated with real estate data showing substantial improvements in the results of automated wrangling.

arXiv.org e-Print Archive

Crossref

Edinburgh Research Explorer

The University of Manchester - Institutional Repository

Dataset Discovery in Data Lakes

Author: Bogatu Alex
Fernandes Alvaro
Konstantinou Nikolaos
Paton Norman
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 27/05/2020

Data analytics stands to benefit from the increasing availability of datasets that are held without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which, by processes like data wrangling, specific target datasets can be constructed that enable value-adding analytics. Given the potential vastness of such data lakes, the issue arises of how to pull out of the lake those datasets that might contribute to wrangling out a given target. We refer to this as the problem of dataset discovery in data lakes and this paper contributes an effective and efficient solution to it. Our approach uses features of the values in a dataset to construct hash-based indexes that map those features into a uniform distance space. This makes it possible to define similarity distances between features and to take those distances as measurements of relatedness w.r.t. a target table. Given the latter (and exemplar tuples), our approach returns the most related tables in the lake. We provide a detailed description of the approach and report on empirical results for two forms of relatedness (unionability and joinability) comparing them with prior work, where pertinent, and showing significant improvements in all of precision, recall, target coverage, indexing and discovery times.

arXiv.org e-Print Archive

Crossref

The University of Manchester - Institutional Repository

Quarry: A user-centered big data integration platform

Author: Abelló Gamazo Alberto
Bilalli Besim
Jovanovic Petar
Nadal Francesch Sergi
Romero Moral Óscar
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/02/2021

Obtaining valuable insights and actionable knowledge from data requires cross-analysis of domain data typically coming from various sources. Doing so inevitably imposes burdensome processes of unifying different data formats, discovering integration paths, and all this given specific analytical needs of a data analyst. Along with large volumes of data, the variety of formats, data models, and semantics drastically contribute to the complexity of such processes. Although there have been many attempts to automate various processes along the Big Data pipeline, no unified platforms accessible by users without technical skills (like statisticians or business analysts) have been proposed. In this paper, we present a Big Data integration platform (Quarry) that uses hypergraph-based metadata to facilitate (and largely automate) the integration of domain data coming from a variety of sources, and provides an intuitive interface to assist end users both in: (1) data exploration with the goal of discovering potentially relevant analysis facets, and (2) consolidation and deployment of data flows which integrate the data, and prepare them for further analysis (descriptive or predictive), visualization, and/or publishing. We validate Quarry’s functionalities with the use case of World Health Organization (WHO) epidemiologists and data analysts in their fight against Neglected Tropical Diseases (NTDs). This work is partially supported by GENESIS project, funded by the Spanish Ministerio de Ciencia, Innovación y Universidades under project TIN2016-79269-R. Peer Reviewed. Postprint (author's final draft)

UPCommons. Portal del coneixement obert de la UPC

Operationalizing and automating data governance

Author: Bilalli Besim
Jovanovic Petar
Nadal Francesch Sergi
Romero Moral Óscar
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 10/12/2022

The ability to cross data from multiple sources represents a competitive advantage for organizations. Yet, the governance of the data lifecycle, from the data sources into valuable insights, is largely performed in an ad-hoc or manual manner. This is specifically concerning in scenarios where tens or hundreds of continuously evolving data sources produce semi-structured data. To overcome this challenge, we develop a framework for operationalizing and automating data governance. For the first, we propose a zoned data lake architecture and a set of data governance processes that allow the systematic ingestion, transformation and integration of data from heterogeneous sources, in order to make them readily available for business users. For the second, we propose a set of metadata artifacts that allow the automatic execution of data governance processes, addressing a wide range of data management challenges. We showcase the usefulness of the proposed approach using a real world use case, stemming from the collaborative project with the World Health Organization for the management and analysis of data about Neglected Tropical Diseases. Overall, this work contributes on facilitating organizations the adoption of data-driven strategies into a cohesive framework operationalizing and automating data governance. This work was partly supported by the DOGO4ML project, funded by the Spanish Ministerio de Ciencia e Innovación under project PID2020-117191RB-I00/AEI/10.13039/501100011033. Sergi Nadal is partly supported by the Spanish Ministerio de Ciencia e Innovación, as well as the European Union - NextGenerationEU, under project FJC2020-045809-I/AEI/10.13039/501100011033. Peer Reviewed. Postprint (published version)

UPCommons. Portal del coneixement obert de la UPC