Search CORE

156 research outputs found

BClean: A Bayesian Data Cleaning System

Author: Huang Sifan
Mao Rui
Miao Yukai
Onizuka Makoto
Qin Jianbin
Wang Yaoshu
Xiao Chuan
Zhang Yifan
Zhu Jing
Publication venue
Publication date: 11/11/2023
Field of study

There is a considerable body of work on data cleaning which employs various principles to rectify erroneous data and transform a dirty dataset into a cleaner one. One of prevalent approaches is probabilistic methods, including Bayesian methods. However, existing probabilistic methods often assume a simplistic distribution (e.g., Gaussian distribution), which is frequently underfitted in practice, or they necessitate experts to provide a complex prior distribution (e.g., via a programming language). This requirement is both labor-intensive and costly, rendering these methods less suitable for real-world applications. In this paper, we propose BClean, a Bayesian Cleaning system that features automatic Bayesian network construction and user interaction. We recast the data cleaning problem as a Bayesian inference that fully exploits the relationships between attributes in the observed dataset and any prior information provided by users. To this end, we present an automatic Bayesian network construction method that extends a structure learning-based functional dependency discovery method with similarity functions to capture the relationships between attributes. Furthermore, our system allows users to modify the generated Bayesian network in order to specify prior information or correct inaccuracies identified by the automatic generation process. We also design an effective scoring model (called the compensative scoring model) necessary for the Bayesian inference. To enhance the efficiency of data cleaning, we propose several approximation strategies for the Bayesian inference, including graph partitioning, domain pruning, and pre-detection. By evaluating on both real-world and synthetic datasets, we demonstrate that BClean is capable of achieving an F-measure of up to 0.9 in data cleaning, outperforming existing Bayesian methods by 2% and other data cleaning methods by 15%.Comment: Our source code is available at https://github.com/yyssl88/BClea

arXiv.org e-Print Archive

Bayesian Network Induction with Incomplete Private Data

Author: Chang LiWu
Matwin Stan
Zhan Justin
Publication venue: AIS Electronic Library (AISeL)
Publication date: 05/12/2004
Field of study

A Bayesian network is a graphical model for representing probabilistic relationships among a set of variables. It is an important model for business analysis. Bayesian network learning methods have been applied to business analysis where data privacy is not considered. However, how to learn a Bayesian network over private data presents a much greater challenge. In this paper, we develop an approach to tackle the problem of Bayesian network induction on private data which may contain missing values. The basic idea of our proposed approach is that we combine randomization technique with Expectation Maximization (EM) algorithm. The purpose of using randomization is to disguise the raw data. EM algorithm is applied for missing values in the private data set. We also present a method to conduct Bayesian network construction, which is one of data mining computations, from the disguised data

AIS Electronic Library (AISeL)

A measure of statistical complexity based on predictive information

Author: Abdallah Samer A.
Plumbley Mark D.
Publication venue
Publication date: 01/01/2010
Field of study

We introduce an information theoretic measure of statistical structure, called 'binding information', for sets of random variables, and compare it with several previously proposed measures including excess entropy, Bialek et al.'s predictive information, and the multi-information. We derive some of the properties of the binding information, particularly in relation to the multi-information, and show that, for finite sets of binary random variables, the processes which maximises binding information are the 'parity' processes. Finally we discuss some of the implications this has for the use of the binding information as a measure of complexity.Comment: 4 pages, 3 figure

arXiv.org e-Print Archive

UCL Discovery

Surrey Research Insight

軌道予測に基づく高安全知能自動車に関する研究

Author: 孫博
Publication venue
Publication date: 20/04/2011
Field of study

Tohoku University亀山充

Tohoku University Repository (TOUR) / 東北大学機関リポジトリ

Data Mining Applications in Banking Sector While Preserving Customer Privacy

Author: Doğuç Özge
Publication venue: 'Ital Publication'
Publication date: 01/01/2022
Field of study

In real-life data mining applications, organizations cooperate by using each other’s data on the same data mining task for more accurate results, although they may have different security and privacy concerns. Privacy-preserving data mining (PPDM) practices involve rules and techniques that allow parties to collaborate on data mining applications while keeping their data private. The objective of this paper is to present a number of PPDM protocols and show how PPDM can be used in data mining applications in the banking sector. For this purpose, the paper discusses homomorphic cryptosystems and secure multiparty computing. Supported by experimental analysis, the paper demonstrates that data mining tasks such as clustering and Bayesian networks (association rules) that are commonly used in the banking sector can be efficiently and securely performed. This is the first study that combines PPDM protocols with applications for banking data mining. Doi: 10.28991/ESJ-2022-06-06-014 Full Text: PD

Emerging Science Journal (ESJ)

İstanbul Medipol University Institutional Repository