156 research outputs found

    BClean: A Bayesian Data Cleaning System

    Full text link
    There is a considerable body of work on data cleaning which employs various principles to rectify erroneous data and transform a dirty dataset into a cleaner one. One of prevalent approaches is probabilistic methods, including Bayesian methods. However, existing probabilistic methods often assume a simplistic distribution (e.g., Gaussian distribution), which is frequently underfitted in practice, or they necessitate experts to provide a complex prior distribution (e.g., via a programming language). This requirement is both labor-intensive and costly, rendering these methods less suitable for real-world applications. In this paper, we propose BClean, a Bayesian Cleaning system that features automatic Bayesian network construction and user interaction. We recast the data cleaning problem as a Bayesian inference that fully exploits the relationships between attributes in the observed dataset and any prior information provided by users. To this end, we present an automatic Bayesian network construction method that extends a structure learning-based functional dependency discovery method with similarity functions to capture the relationships between attributes. Furthermore, our system allows users to modify the generated Bayesian network in order to specify prior information or correct inaccuracies identified by the automatic generation process. We also design an effective scoring model (called the compensative scoring model) necessary for the Bayesian inference. To enhance the efficiency of data cleaning, we propose several approximation strategies for the Bayesian inference, including graph partitioning, domain pruning, and pre-detection. By evaluating on both real-world and synthetic datasets, we demonstrate that BClean is capable of achieving an F-measure of up to 0.9 in data cleaning, outperforming existing Bayesian methods by 2% and other data cleaning methods by 15%.Comment: Our source code is available at https://github.com/yyssl88/BClea

    Bayesian Network Induction with Incomplete Private Data

    Get PDF
    A Bayesian network is a graphical model for representing probabilistic relationships among a set of variables. It is an important model for business analysis. Bayesian network learning methods have been applied to business analysis where data privacy is not considered. However, how to learn a Bayesian network over private data presents a much greater challenge. In this paper, we develop an approach to tackle the problem of Bayesian network induction on private data which may contain missing values. The basic idea of our proposed approach is that we combine randomization technique with Expectation Maximization (EM) algorithm. The purpose of using randomization is to disguise the raw data. EM algorithm is applied for missing values in the private data set. We also present a method to conduct Bayesian network construction, which is one of data mining computations, from the disguised data

    A measure of statistical complexity based on predictive information

    Full text link
    We introduce an information theoretic measure of statistical structure, called 'binding information', for sets of random variables, and compare it with several previously proposed measures including excess entropy, Bialek et al.'s predictive information, and the multi-information. We derive some of the properties of the binding information, particularly in relation to the multi-information, and show that, for finite sets of binary random variables, the processes which maximises binding information are the 'parity' processes. Finally we discuss some of the implications this has for the use of the binding information as a measure of complexity.Comment: 4 pages, 3 figure

    Data Mining Applications in Banking Sector While Preserving Customer Privacy

    Get PDF
    In real-life data mining applications, organizations cooperate by using each other’s data on the same data mining task for more accurate results, although they may have different security and privacy concerns. Privacy-preserving data mining (PPDM) practices involve rules and techniques that allow parties to collaborate on data mining applications while keeping their data private. The objective of this paper is to present a number of PPDM protocols and show how PPDM can be used in data mining applications in the banking sector. For this purpose, the paper discusses homomorphic cryptosystems and secure multiparty computing. Supported by experimental analysis, the paper demonstrates that data mining tasks such as clustering and Bayesian networks (association rules) that are commonly used in the banking sector can be efficiently and securely performed. This is the first study that combines PPDM protocols with applications for banking data mining. Doi: 10.28991/ESJ-2022-06-06-014 Full Text: PD
    corecore