BClean: A Bayesian Data Cleaning System
There is a considerable body of work on data cleaning which employs various
principles to rectify erroneous data and transform a dirty dataset into a
cleaner one. One prevalent approach is the use of probabilistic methods, including
Bayesian methods. However, existing probabilistic methods often assume a
simplistic distribution (e.g., a Gaussian distribution), which frequently
underfits in practice, or they require experts to provide a complex prior
distribution (e.g., via a programming language). This requirement is both
labor-intensive and costly, rendering these methods less suitable for
real-world applications. In this paper, we propose BClean, a Bayesian Cleaning
system that features automatic Bayesian network construction and user
interaction. We recast the data cleaning problem as a Bayesian inference that
fully exploits the relationships between attributes in the observed dataset and
any prior information provided by users. To this end, we present an automatic
Bayesian network construction method that extends a structure learning-based
functional dependency discovery method with similarity functions to capture the
relationships between attributes. Furthermore, our system allows users to
modify the generated Bayesian network in order to specify prior information or
correct inaccuracies identified by the automatic generation process. We also
design an effective scoring model (called the compensative scoring model)
necessary for the Bayesian inference. To enhance the efficiency of data
cleaning, we propose several approximation strategies for the Bayesian
inference, including graph partitioning, domain pruning, and pre-detection. By
evaluating on both real-world and synthetic datasets, we demonstrate that
BClean is capable of achieving an F-measure of up to 0.9 in data cleaning,
outperforming existing Bayesian methods by 2% and other data cleaning methods
by 15%.
Comment: Our source code is available at https://github.com/yyssl88/BClea
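To make the idea of cleaning as Bayesian inference concrete, here is a toy sketch in Python. It is not BClean's algorithm: it assumes a hand-specified one-edge network (a hypothetical `zip` attribute as the parent of `city`), estimates P(city | zip) from raw counts, and replaces each `city` cell with the most probable value given its `zip`. The function names and the data are invented for illustration, and the compensative scoring model, automatic network construction, and approximation strategies named in the abstract are not modeled.

```python
from collections import Counter, defaultdict


def fit_conditional_counts(rows, parent, child):
    """Count how often each child value co-occurs with each parent value."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[parent]][row[child]] += 1
    return counts


def repair_column(rows, parent, child):
    """Replace each child cell with argmax_v P(child = v | parent).

    Probabilities are proportional to the co-occurrence counts, so the
    most frequent child value for a given parent value wins.
    """
    counts = fit_conditional_counts(rows, parent, child)
    repaired = []
    for row in rows:
        best_value, _ = counts[row[parent]].most_common(1)[0]
        repaired.append({**row, child: best_value})
    return repaired


# Hypothetical dirty table: one misspelled city among consistent rows.
dirty = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New Yrok"},
    {"zip": "10001", "city": "New York"},
    {"zip": "94105", "city": "San Francisco"},
    {"zip": "94105", "city": "San Francisco"},
]
clean = repair_column(dirty, parent="zip", child="city")
```

In this sketch the misspelled `New Yrok` is outvoted by the three correct rows sharing its `zip`. A real system would also weigh the observed value itself and handle errors in the parent attribute, which this example ignores.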
Bayesian Network Induction with Incomplete Private Data
A Bayesian network is a graphical model for representing probabilistic relationships among a set of variables, and it is an important model for business analysis. Bayesian network learning methods have been applied to business analysis in settings where data privacy is not considered. Learning a Bayesian network over private data, however, presents a much greater challenge. In this paper, we develop an approach to Bayesian network induction on private data that may contain missing values. The basic idea is to combine a randomization technique with the Expectation Maximization (EM) algorithm: randomization disguises the raw data, and the EM algorithm handles the missing values in the private data set. We also present a method for constructing a Bayesian network, which is a data mining computation, from the disguised data.
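The abstract does not say which randomization technique is used. As one possible illustration, the Python sketch below assumes Warner-style randomized response on a single binary attribute: each value is kept with probability `p` and flipped otherwise, and the true proportion of ones is recovered from the disguised data by inverting lambda = p * pi + (1 - p) * (1 - pi). It covers only the disguising step and a closed-form estimate. It does not implement the EM step for missing values or the network construction, and all names are hypothetical.

```python
import random


def disguise(bits, p, rng):
    """Keep each bit with probability p, flip it with probability 1 - p."""
    return [b if rng.random() < p else 1 - b for b in bits]


def estimate_true_rate(disguised, p):
    """Estimate the true fraction of ones from randomized-response data.

    The observed fraction satisfies lambda = p*pi + (1 - p)*(1 - pi),
    so pi = (lambda - (1 - p)) / (2p - 1). The result is clipped to [0, 1].
    """
    if p == 0.5:
        raise ValueError("p = 0.5 destroys all information about the data")
    observed = sum(disguised) / len(disguised)
    estimate = (observed - (1 - p)) / (2 * p - 1)
    return min(1.0, max(0.0, estimate))


# Hypothetical private attribute: 30% ones among 10,000 records.
rng = random.Random(0)
private_bits = [1] * 3000 + [0] * 7000
public_bits = disguise(private_bits, p=0.8, rng=rng)
recovered_rate = estimate_true_rate(public_bits, p=0.8)
```

No individual record in `public_bits` can be trusted, yet the aggregate rate is recoverable. That aggregate recoverability is what makes it plausible to learn conditional probability tables from disguised data.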
A measure of statistical complexity based on predictive information
We introduce an information theoretic measure of statistical structure,
called 'binding information', for sets of random variables, and compare it with
several previously proposed measures including excess entropy, Bialek et al.'s
predictive information, and the multi-information. We derive some of the
properties of the binding information, particularly in relation to the
multi-information, and show that, for finite sets of binary random variables,
the processes which maximise binding information are the 'parity' processes.
Finally we discuss some of the implications this has for the use of the binding
information as a measure of complexity.
Comment: 4 pages, 3 figures
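The parity result can be checked numerically on a small case. The abstract does not state the formula, so the Python sketch below assumes the definition B(X) = H(X) - sum_i H(X_i | all other variables), alongside the multi-information I(X) = sum_i H(X_i) - H(X). The function names are illustrative, and a 'parity' process is taken here to be the uniform distribution over binary vectors with an even number of ones.

```python
from collections import defaultdict
from itertools import product
from math import log2


def entropy(pmf):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)


def marginal(pmf, keep):
    """Marginal distribution over the variable indices listed in keep."""
    out = defaultdict(float)
    for outcome, p in pmf.items():
        out[tuple(outcome[i] for i in keep)] += p
    return dict(out)


def multi_information(pmf, n):
    """Sum of single-variable entropies minus the joint entropy."""
    return sum(entropy(marginal(pmf, [i])) for i in range(n)) - entropy(pmf)


def binding_information(pmf, n):
    """Joint entropy minus the sum of H(X_i | rest), assumed definition."""
    joint = entropy(pmf)
    residual = 0.0
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        # Chain rule: H(X_i | rest) = H(X) - H(rest).
        residual += joint - entropy(marginal(pmf, rest))
    return joint - residual


def parity_pmf(n):
    """Uniform distribution over n-bit vectors with an even number of ones."""
    states = [x for x in product((0, 1), repeat=n) if sum(x) % 2 == 0]
    return {x: 1 / len(states) for x in states}


def independent_pmf(n):
    """Uniform distribution over all n-bit vectors (independent fair bits)."""
    return {x: 1 / 2 ** n for x in product((0, 1), repeat=n)}


parity_binding = binding_information(parity_pmf(3), 3)
parity_multi = multi_information(parity_pmf(3), 3)
```

For three bits, the parity distribution has a joint entropy of 2 bits and every variable is fully determined by the other two. Under the assumed definition this gives a binding information of 2 bits and a multi-information of 1 bit. Independent fair bits give 0 for both measures.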
Data Mining Applications in Banking Sector While Preserving Customer Privacy
In real-life data mining applications, organizations cooperate by using each other’s data on the same data mining task for more accurate results, although they may have different security and privacy concerns. Privacy-preserving data mining (PPDM) practices involve rules and techniques that allow parties to collaborate on data mining applications while keeping their data private. The objective of this paper is to present a number of PPDM protocols and show how PPDM can be used in data mining applications in the banking sector. For this purpose, the paper discusses homomorphic cryptosystems and secure multiparty computing. Supported by experimental analysis, the paper demonstrates that data mining tasks such as clustering and Bayesian networks (association rules) that are commonly used in the banking sector can be efficiently and securely performed. This is the first study that combines PPDM protocols with applications for banking data mining. DOI: 10.28991/ESJ-2022-06-06-014