Search CORE

12 research outputs found

HoloDetect: Few-Shot Learning for Error Detection

Author: Bengio Yoshua
Elmagarmid Ahmed K.
Globerson Amir
Goodfellow Ian
Guo Chuan
Hinton G. E.
Rahm Erhard
Ratcliff John W.
Zhang Yu
Zhu Xiaojin
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 03/04/2019
Field of study

We introduce a few-shot learning framework for error detection. We show that data augmentation (a form of weak supervision) is key to training high-quality, ML-based error detection models that require minimal human involvement. Our framework consists of two parts: (1) an expressive model to learn rich representations that capture the inherent syntactic and semantic heterogeneity of errors; and (2) a data augmentation model that, given a small seed of clean records, uses dataset-specific transformations to automatically generate additional training data. Our key insight is to learn data augmentation policies from the noisy input dataset in a weakly supervised manner. We show that our framework detects errors with an average precision of ~94% and an average recall of ~93% across a diverse array of datasets that exhibit different types and amounts of errors. We compare our approach to a comprehensive collection of error detection methods, ranging from traditional rule-based methods to ensemble-based and active learning approaches. We show that data augmentation yields an average improvement of 20 F1 points while it requires access to 3x fewer labeled examples compared to other ML approaches.Comment: 18 pages

arXiv.org e-Print Archive

Crossref

A Formal Framework for Probabilistic Unclean Databases

Author: De Sa Christopher
Ilyas Ihab F.
Kimelfeld Benny
Rekatsinas Theodoros
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 22nd International Conference on Database Theory (ICDT 2019)
Publication date: 01/01/2019
Field of study

Most theoretical frameworks that focus on data errors and inconsistencies follow logic-based reasoning. Yet, practical data cleaning tools need to incorporate statistical reasoning to be effective in real-world data cleaning tasks. Motivated by empirical successes, we propose a formal framework for unclean databases, where two types of statistical knowledge are incorporated: The first represents a belief of how intended (clean) data is generated, and the second represents a belief of how noise is introduced in the actual observed database. To capture this noisy channel model, we introduce the concept of a Probabilistic Unclean Database (PUD), a triple that consists of a probabilistic database that we call the intention, a probabilistic data transformator that we call the realization and captures how noise is introduced, and an observed unclean database that we call the observation. We define three computational problems in the PUD framework: cleaning (infer the most probable intended database, given a PUD), probabilistic query answering (compute the probability of an answer tuple over the unclean observed database), and learning (estimate the most likely intention and realization models of a PUD, given examples as training data). We illustrate the PUD framework on concrete representations of the intention and realization, show that they generalize traditional concepts of repairs such as cardinality and value repairs, draw connections to consistent query answering, and prove tractability results. We further show that parameters can be learned in some practical instantiations, and in fact, prove that under certain conditions we can learn a PUD directly from a single dirty database without any need for clean examples

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

BClean: A Bayesian Data Cleaning System

Author: Huang Sifan
Mao Rui
Miao Yukai
Onizuka Makoto
Qin Jianbin
Wang Yaoshu
Xiao Chuan
Zhang Yifan
Zhu Jing
Publication venue
Publication date: 11/11/2023
Field of study

There is a considerable body of work on data cleaning which employs various principles to rectify erroneous data and transform a dirty dataset into a cleaner one. One of prevalent approaches is probabilistic methods, including Bayesian methods. However, existing probabilistic methods often assume a simplistic distribution (e.g., Gaussian distribution), which is frequently underfitted in practice, or they necessitate experts to provide a complex prior distribution (e.g., via a programming language). This requirement is both labor-intensive and costly, rendering these methods less suitable for real-world applications. In this paper, we propose BClean, a Bayesian Cleaning system that features automatic Bayesian network construction and user interaction. We recast the data cleaning problem as a Bayesian inference that fully exploits the relationships between attributes in the observed dataset and any prior information provided by users. To this end, we present an automatic Bayesian network construction method that extends a structure learning-based functional dependency discovery method with similarity functions to capture the relationships between attributes. Furthermore, our system allows users to modify the generated Bayesian network in order to specify prior information or correct inaccuracies identified by the automatic generation process. We also design an effective scoring model (called the compensative scoring model) necessary for the Bayesian inference. To enhance the efficiency of data cleaning, we propose several approximation strategies for the Bayesian inference, including graph partitioning, domain pruning, and pre-detection. By evaluating on both real-world and synthetic datasets, we demonstrate that BClean is capable of achieving an F-measure of up to 0.9 in data cleaning, outperforming existing Bayesian methods by 2% and other data cleaning methods by 15%.Comment: Our source code is available at https://github.com/yyssl88/BClea

arXiv.org e-Print Archive

BigDansing

Author: Khayyat Zuhair
Ilyas Ihab F.
Ouzzani Mourad
Papotti Paolo
Quiané-Ruiz Jorge-Arnulfo
Tang Nan
Yin Si
Madden Samuel R
Jindal Alekh
Publication venue: Association for Computing Machinery
Publication date: 15/01/2003
Field of study

Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing takes these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized joins operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms

Electronic Archive of Poltava University of Economics and Trade

DSpace@MIT

Crossref

Електронний архів Полтавського університету економіки і торгівлі (Electronic archive of Poltava University of Economics and Trade)

BigDansing

Author: Ilyas Ihab F.
Jindal Alekh
Khayyat Zuhair
Madden Samuel R
Ouzzani Mourad
Papotti Paolo
Quiané-Ruiz Jorge-Arnulfo
Tang Nan
Yin Si
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/05/2013
Field of study

DSpace@MIT

SQL extensions for domain agnostic data representational consistency

Author: Yi Bingyu
Publication venue: 'University of Queensland Library'
Publication date: 30/01/2017
Field of study

University of Queensland eSpace

Towards dependable data repairing with fixing rules

Author: Abiteboul S.
Batini C.
Bravo L.
Cong G.
Fellegi I.
Herzog T. N.
Naumann F.
Raman V.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2014
Field of study

One of the main challenges that data cleaning systems face is to automatically identify and repair data errors in a depend-able manner. Though data dependencies (a.k.a. integrity constraints) have been widely studied to capture errors in data, automated and dependable data repairing on these errors has remained a notoriously hard problem. In this work, we introduce an automated approach for dependably repairing data errors, based on a novel class of fixing rules. A fixing rule contains an evidence pattern, a set of nega-tive patterns, and a fact value. The heart of fixing rules is deterministic: given a tuple, the evidence pattern and the negative patterns of a fixing rule are combined to precisely capture which attribute is wrong, and the fact indicates how to correct this error. We study several fundamental prob-lems associated with fixing rules, and establish their com-plexity. We develop ecient algorithms to check whether a set of fixing rules is consistent, and discuss approaches to resolve inconsistent fixing rules. We also devise ecient algorithms for repairing data errors using fixing rules. We experimentally demonstrate that our techniques outperform other automated algorithms in terms of the accuracy of re-pairing data errors, using both real-life and synthetic data. 1

CiteSeerX

Crossref