Improving Data Quality by Leveraging Statistical Relational Learning
Digitally collected data suffers from many data quality issues, such as duplicate, incorrect, or incomplete data. A common
approach for counteracting these issues is to formulate a set of data cleaning rules to identify and repair incorrect, duplicate and
missing data. Data cleaning systems must be able to treat data quality rules holistically, to incorporate heterogeneous constraints
within a single routine, and to automate data curation. We propose an approach to data cleaning based on statistical relational
learning (SRL). We argue that a formalism, Markov logic, is a natural fit for modeling data quality rules. Our approach
allows for the usage of probabilistic joint inference over interleaved data cleaning rules to improve data quality. Furthermore, it
eliminates the need to specify the order of rule execution. We describe how data quality rules expressed as formulas in first-order
logic directly translate into the predictive model in our SRL framework.
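As an illustration of the idea of treating a cleaning rule as a soft first-order constraint, here is a minimal hypothetical sketch (not the paper's actual system; the `zip -> city` rule, the `violations` helper, and the `WEIGHT` value are all invented for this example). In Markov logic, each violated grounding of a weighted formula adds the rule's weight to the model's energy, so repairs that remove violations become more probable under joint inference.

```python
# Hypothetical sketch, not the paper's system: a data quality rule written
# as a soft (weighted) first-order constraint, Markov-logic style.
# Rule: records that agree on "zip" should agree on "city" (zip -> city).

records = [
    {"id": 1, "zip": "10001", "city": "New York"},
    {"id": 2, "zip": "10001", "city": "NYC"},      # probable violation
    {"id": 3, "zip": "60601", "city": "Chicago"},
]

def violations(recs, lhs, rhs):
    """Ground the rule over all record pairs and collect violating pairs."""
    bad = []
    for i, a in enumerate(recs):
        for b in recs[i + 1:]:
            if a[lhs] == b[lhs] and a[rhs] != b[rhs]:
                bad.append((a["id"], b["id"]))
    return bad

# Each violated grounding contributes the rule's weight to the energy.
WEIGHT = 2.0  # illustrative weight for this soft rule
penalty = WEIGHT * len(violations(records, "zip", "city"))
print(penalty)
```

A hard constraint would correspond to an infinite weight; giving rules finite weights is what lets several interleaved, possibly conflicting rules be resolved jointly rather than in a fixed execution order.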
CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks
Data quality affects machine learning (ML) model performances, and data
scientists spend a considerable amount of time on data cleaning before model
training. However, to date, there does not exist a rigorous study on how
exactly cleaning affects ML -- ML community usually focuses on developing ML
algorithms that are robust to some particular noise types of certain
distributions, while database (DB) community has been mostly studying the
problem of data cleaning alone without considering how data is consumed by
downstream ML analytics. We propose a CleanML study that systematically
investigates the impact of data cleaning on ML classification tasks. The
open-source and extensible CleanML study currently includes 14 real-world
datasets with real errors, five common error types, seven different ML models,
and multiple cleaning algorithms for each error type (including both commonly
used algorithms in practice as well as state-of-the-art solutions in academic
literature). We control the randomness in ML experiments using statistical
hypothesis testing, and we also control false discovery rate in our experiments
using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a
systematic way to derive many interesting and nontrivial observations. We also
put forward multiple research directions for researchers. Comment: published in ICDE 202
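The Benjamini-Yekutieli procedure mentioned above can be sketched generically (this is a standard textbook implementation, not the CleanML code): given p-values from many cleaning-vs-ML hypothesis tests, it decides which results to report while controlling the false discovery rate even under arbitrary dependence between the tests.

```python
# Generic Benjamini-Yekutieli (BY) step-up procedure for FDR control.
import math

def benjamini_yekutieli(p_values, alpha=0.05):
    """Return booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    c_m = sum(1.0 / k for k in range(1, m + 1))  # harmonic correction term
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * alpha / (m * c(m)).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected

print(benjamini_yekutieli([0.001, 0.02, 0.04, 0.30]))
```

The harmonic factor `c(m)` makes BY more conservative than Benjamini-Hochberg, which is the price paid for validity without independence assumptions between the per-dataset experiments.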
A constrained clustering approach to duplicate detection among relational data
This paper proposes an approach to detect duplicates among relational data. Traditional methods for record linkage or duplicate detection work on a set of records that have no explicit relations with each other; such records can be formatted into a single database table for processing. However, there are situations in which records from different sources cannot be flattened into one table, and records within one source have certain (semantic) relations between them. The duplicate detection problem for such relational data records/instances can be dealt with by formatting them into several tables and applying traditional methods to each table. However, because the relations among the original data records are ignored, this approach generates poor or inconsistent results. This paper analyzes the characteristics of relational data and proposes a particular clustering approach to perform duplicate detection. This approach incorporates constraint rules derived from the characteristics of relational data and therefore yields better and more consistent results, as our experiments show. © Springer-Verlag Berlin Heidelberg 2007
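To make the idea of constraint rules concrete, here is a small hypothetical sketch (not the paper's algorithm; the `clusters` function and the example records are invented for illustration): candidate duplicate pairs are merged with union-find, but a merge is refused when it would unite a cannot-link pair, e.g. two records that the relational structure already marks as distinct, such as two different children of the same parent record.

```python
# Illustrative constrained clustering for duplicate detection:
# merge similar record pairs unless a cannot-link constraint
# (derived from the relational structure) forbids the merge.

def clusters(similar_pairs, cannot_link):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        parent[x] = root  # path compression
        return root

    for a, b in similar_pairs:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        # Refuse the merge if it would put a cannot-link pair together.
        if any({find(x), find(y)} == {ra, rb} for x, y in cannot_link):
            continue
        parent[rb] = ra

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(g) for g in groups.values())

# "a2" and "c1" look similar, but "a1" and "a2" are known-distinct
# records from the same source, so transitively merging them is blocked.
print(clusters([("a1", "b1"), ("b1", "c1"), ("a2", "c1")],
               [("a1", "a2")]))
```

Without the constraint check, pairwise similarity alone would transitively collapse all four records into one cluster, which is exactly the kind of inconsistent result the abstract attributes to relation-ignorant methods.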