
    Tpda2 Algorithm for Learning Bn Structure From Missing Value and Outliers in Data Mining

    The Three-Phase Dependency Analysis (TPDA) algorithm has been shown to be among the most efficient structure-learning algorithms, requiring at most O(N^4) Conditional Independence (CI) tests. By integrating TPDA with a node topological sort algorithm, it can learn Bayesian Network (BN) structure from data with missing values (the TPDA1 algorithm). Outliers can then be reduced by applying an outlier detection and removal algorithm as a pre-processing step for TPDA1. The proposed TPDA2 algorithm combines these ideas: outlier detection and removal, TPDA, and node topological sort.

    Being Bayesian about learning Gaussian Bayesian networks from incomplete data

    We propose a Bayesian model averaging (BMA) approach for inferring the structure of Gaussian Bayesian networks (BNs) from incomplete data, i.e. from data with missing values. Our method builds on the ‘Bayesian metric for Gaussian networks having score equivalence’ (BGe score) and we make the assumption that the unobserved data points are ‘missing completely at random’. We present a Markov Chain Monte Carlo sampling algorithm that allows for simultaneously sampling directed acyclic graphs (DAGs) as well as the values of the unobserved data points. We empirically cross-compare the network reconstruction accuracy of the new BMA approach with two non-Bayesian approaches for dealing with incomplete BN data, namely the classical structural Expectation Maximisation (EM) approach and the more recently proposed node average likelihood (NAL) method. For the empirical evaluation we use synthetic data from a benchmark Gaussian BN and real wet-lab protein phosphorylation data from the RAF signalling pathway.

    A comparison of strategies for missing values in data on machine learning classification algorithms

    Abstract: Dealing with missing values is an important feature-engineering task in data science, as they can degrade the predictive accuracy of machine learning classification models. However, it is often unclear what the underlying cause of the missing values in real-life data is, i.e. which missing-data mechanism produces the missingness. It therefore becomes necessary to evaluate several missing-data approaches for a given dataset. In this paper, we perform a comparative study of several approaches for handling missing values: listwise deletion, mean imputation, mode imputation, k-nearest neighbors, expectation-maximization, and multiple imputation by chained equations (MICE). The comparison is performed on two real-world datasets using the following evaluation metrics: accuracy, root mean squared error, receiver operating characteristic, and the F1 score. Most classifiers performed well across the missing-data strategies; however, the support vector classifier performed marginally better overall on the numerical data, and the naïve Bayes classifier on the categorical data, compared with the other evaluated methods.
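The kind of comparison this abstract describes can be sketched with scikit-learn. The dataset (iris), the 10% MCAR missingness rate, and the classifier (Gaussian naïve Bayes) below are illustrative assumptions, not the paper's actual experimental setup:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

# Introduce ~10% missing values completely at random (MCAR).
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.1] = np.nan

# Three of the imputation strategies compared in the paper.
strategies = {
    "mean": SimpleImputer(strategy="mean"),
    "mode": SimpleImputer(strategy="most_frequent"),
    "knn": KNNImputer(n_neighbors=5),
}

# Score each impute-then-classify pipeline with 5-fold cross-validation.
for name, imputer in strategies.items():
    pipe = make_pipeline(imputer, GaussianNB())
    acc = cross_val_score(pipe, X_miss, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {acc:.3f}")
```

Wrapping the imputer and classifier in a single pipeline keeps imputation inside each cross-validation fold, avoiding leakage from the held-out split into the fitted imputer.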