29 research outputs found
BClean: A Bayesian Data Cleaning System
There is a considerable body of work on data cleaning which employs various
principles to rectify erroneous data and transform a dirty dataset into a
cleaner one. One of prevalent approaches is probabilistic methods, including
Bayesian methods. However, existing probabilistic methods often assume a
simplistic distribution (e.g., Gaussian distribution), which is frequently
underfitted in practice, or they necessitate experts to provide a complex prior
distribution (e.g., via a programming language). This requirement is both
labor-intensive and costly, rendering these methods less suitable for
real-world applications. In this paper, we propose BClean, a Bayesian Cleaning
system that features automatic Bayesian network construction and user
interaction. We recast the data cleaning problem as a Bayesian inference that
fully exploits the relationships between attributes in the observed dataset and
any prior information provided by users. To this end, we present an automatic
Bayesian network construction method that extends a structure learning-based
functional dependency discovery method with similarity functions to capture the
relationships between attributes. Furthermore, our system allows users to
modify the generated Bayesian network in order to specify prior information or
correct inaccuracies identified by the automatic generation process. We also
design an effective scoring model (called the compensative scoring model)
necessary for the Bayesian inference. To enhance the efficiency of data
cleaning, we propose several approximation strategies for the Bayesian
inference, including graph partitioning, domain pruning, and pre-detection. By
evaluating on both real-world and synthetic datasets, we demonstrate that
BClean is capable of achieving an F-measure of up to 0.9 in data cleaning,
outperforming existing Bayesian methods by 2% and other data cleaning methods
by 15%.Comment: Our source code is available at https://github.com/yyssl88/BClea
Knowledge Graph Imputation
Knowledge graphs are one of the most important resources of information in many applications such as question answering and social networks. These knowledge graphs however, are often far from complete as there are so many missing properties and links between entities. This greatly affects their usefulness in applications that they are used in. Many methods have been proposed to alleviate this problem. One of the most prominent and studied subjects in this area are the graph embedding and link prediction methods. However, these methods only consider the relations between entities in knowledge graphs and completely ignore their literal values and properties that account for 41% of the facts in the knowledge graph YAGO4. They also do not scale for large knowledge graphs and their inference process for imputing missing links is by nature quadratic with respect to the number of entities in the knowledge graph. Furthermore, the embedding vectors that represent entities and relations might not be able to capture information that is necessary for inference for millions of entities that exist in large-scale knowledge graphs. We present a novel method based on the HoloClean’s framework — a powerful cleaning tool for relational data. Our system is designed based on the open-source HoloClean and can be used to integrate multiple and different signals from various knowledge graph completion methods which allows us to holistically tackle this problem. We have done a thorough experiment on the YAGO4 dataset with 5M entities and 20M facts and we were able to enlarge the knowledge graph by roughly 12% with an average reconstruction precision of 0.81 on 162 different classes
Extracting and Cleaning RDF Data
The RDF data model has become a prevalent format to represent heterogeneous data because of its versatility. The capability of dismantling information from its native formats and representing it in triple format offers a simple yet powerful way of modelling data that is obtained from multiple sources. In addition, the triple format and schema constraints of the RDF model make the RDF data easy to process as labeled, directed graphs.
This graph representation of RDF data supports higher-level analytics by enabling querying using different techniques and querying languages, e.g., SPARQL. Anlaytics that require structured data are supported by transforming the graph data on-the-fly to populate the target schema that is needed for downstream analysis. These target schemas are defined by downstream applications according to their information need.
The flexibility of RDF data brings two main challenges. First, the extraction of RDF data is a complex task that may involve domain expertise about the information required to be extracted for different applications. Another significant aspect of analyzing RDF data is its quality, which depends on multiple factors including the reliability of data sources and the accuracy of the extraction systems. The quality of the analysis depends mainly on the quality of the underlying data. Therefore, evaluating and improving the quality of RDF data has a direct effect on the correctness of downstream analytics.
This work presents multiple approaches related to the extraction and quality evaluation of RDF data. To cope with the large amounts of data that needs to be extracted, we present DSTLR, a scalable framework to extract RDF triples from semi-structured and unstructured data sources. For rare entities that fall on the long tail of information, there may not be enough signals to support high-confidence extraction. Towards this problem, we present an approach to estimate property values for long tail entities. We also present multiple algorithms and approaches that focus on the quality of RDF data. These include discovering quality constraints from RDF data, and utilizing machine learning techniques to repair errors in RDF data
CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks
Data quality affects machine learning (ML) model performances, and data
scientists spend considerable amount of time on data cleaning before model
training. However, to date, there does not exist a rigorous study on how
exactly cleaning affects ML -- ML community usually focuses on developing ML
algorithms that are robust to some particular noise types of certain
distributions, while database (DB) community has been mostly studying the
problem of data cleaning alone without considering how data is consumed by
downstream ML analytics. We propose a CleanML study that systematically
investigates the impact of data cleaning on ML classification tasks. The
open-source and extensible CleanML study currently includes 14 real-world
datasets with real errors, five common error types, seven different ML models,
and multiple cleaning algorithms for each error type (including both commonly
used algorithms in practice as well as state-of-the-art solutions in academic
literature). We control the randomness in ML experiments using statistical
hypothesis testing, and we also control false discovery rate in our experiments
using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a
systematic way to derive many interesting and nontrivial observations. We also
put forward multiple research directions for researchers.Comment: published in ICDE 202
Investigating the attainment of optimum data quality for EHR Big Data: proposing a new methodological approach
The value derivable from the use of data is continuously increasing since some years. Both commercial and non-commercial organisations have realised the immense benefits that might be derived if all data at their disposal could be analysed and form the basis of decision taking. The technological tools required to produce, capture, store, transmit and analyse huge amounts of data form the background to the development of the phenomenon of Big Data. With Big Data, the aim is to be able to generate value from huge amounts of data, often in non-structured format and produced extremely frequently. However, the potential value derivable depends on general level of governance of data, more precisely on the quality of the data. The field of data quality is well researched for traditional data uses but is still in its infancy for the Big Data context. This dissertation focused on investigating effective methods to enhance data quality for Big Data. The principal deliverable of this research is in the form of a methodological approach which can be used to optimize the level of data quality in the Big Data context. Since data quality is contextual, (that is a non-generalizable field), this research study focuses on applying the methodological approach in one use case, in terms of the Electronic Health Records (EHR).
The first main contribution to knowledge of this study systematically investigates which data quality dimensions (DQDs) are most important for EHR Big Data. The two most important dimensions ascertained by the research methods applied in this study are accuracy and completeness. These are two well-known dimensions, and this study confirms that they are also very important for EHR Big Data. The second important contribution to knowledge is an investigation into whether Artificial Intelligence with a special focus upon machine learning could be used in improving the detection of dirty data, focusing on the two data quality dimensions of accuracy and completeness. Regression and clustering algorithms proved to be more adequate for accuracy and completeness related issues respectively, based on the experiments carried out. However, the limits of implementing and using machine learning algorithms for detecting data quality issues for Big Data were also revealed and discussed in this research study. It can safely be deduced from the knowledge derived from this part of the research study that use of machine learning for enhancing data quality issues detection is a promising area but not yet a panacea which automates this entire process. The third important contribution is a proposed guideline to undertake data repairs most efficiently for Big Data; this involved surveying and comparing existing data cleansing algorithms against a prototype developed for data reparation. Weaknesses of existing algorithms are highlighted and are considered as areas of practice which efficient data reparation algorithms must focus upon.
Those three important contributions form the nucleus for a new data quality methodological approach which could be used to optimize Big Data quality, as applied in the context of EHR. Some of the activities and techniques discussed through the proposed methodological approach can be transposed to other industries and use cases to a large extent. The proposed data quality methodological approach can be used by practitioners of Big Data Quality who follow a data-driven strategy. As opposed to existing Big Data quality frameworks, the proposed data quality methodological approach has the advantage of being more precise and specific. It gives clear and proven methods to undertake the main identified stages of a Big Data quality lifecycle and therefore can be applied by practitioners in the area.
This research study provides some promising results and deliverables. It also paves the way for further research in the area. Technical and technological changes in Big Data is rapidly evolving and future research should be focusing on new representations of Big Data, the real-time streaming aspect, and replicating same research methods used in this current research study but on new technologies to validate current results
Scaling Machine Learning Data Repair Systems for Sparse Datasets
Machine learning data repair systems (e.g. HoloClean) have achieved state-of-the-art performance for the data repair problem on many datasets. However, these systems face significant challenges with sparse datasets. In this work, the challenges presented by such datasets to machine learning data repair systems are investigated. Dataset-independent methods are presented to mitigate the effects of data sparseness. Finally, experimental results are validated on a large, sparse real-world dataset: Census. Showing that the problem size can be reduced by more than 70%, saving significant computational costs, while still getting high accuracy data repairs (94.5% accuracy)
Can Foundation Models Wrangle Your Data?
Foundation Models (FMs) are models trained on large corpora of data that, at
very large scale, can generalize to new tasks without any task-specific
finetuning. As these models continue to grow in size, innovations continue to
push the boundaries of what these models can do on language and image tasks.
This paper aims to understand an underexplored area of FMs: classical data
tasks like cleaning and integration. As a proof-of-concept, we cast five data
cleaning and integration tasks as prompting tasks and evaluate the performance
of FMs on these tasks. We find that large FMs generalize and achieve SoTA
performance on data cleaning and integration tasks, even though they are not
trained for these data tasks. We identify specific research challenges and
opportunities that these models present, including challenges with private and
domain specific data, and opportunities to make data management systems more
accessible to non-experts. We make our code and experiments publicly available
at: https://github.com/HazyResearch/fm_data_tasks.Comment: 12 pages, 5 figures; additional experiments, typo corrections,
modifications to Section 5 (Research Agenda