
    Missing Value Imputation With Unsupervised Backpropagation

    Many data mining and data analysis techniques operate on dense matrices or complete tables of data. Real-world data sets, however, often contain unknown values. Even many classification algorithms that are designed to tolerate missing values still exhibit deteriorated accuracy. One approach to handling missing values is to fill in (impute) the missing values. In this paper, we present a technique for unsupervised learning called Unsupervised Backpropagation (UBP), which trains a multi-layer perceptron to fit the manifold sampled by a set of observed point-vectors. We evaluate UBP on the task of imputing missing values in datasets, and show that UBP predicts missing values with significantly lower sum-squared error than other collaborative filtering and imputation techniques. We also demonstrate with 24 datasets and 9 supervised learning algorithms that classification accuracy is usually higher when randomly withheld values are imputed using UBP rather than with other methods.
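    The core idea, learning a latent vector per row and a shared decoder jointly by gradient descent on the observed entries only, can be sketched as follows. This is a minimal illustration that substitutes a linear decoder for the multi-layer perceptron the paper actually trains; the function name, hyperparameters, and update rule are assumptions for the sketch, not the authors' implementation.

    ```python
    import numpy as np

    def ubp_impute(X, mask, k=2, lr=0.01, epochs=5000, seed=0):
        """Impute missing entries of X (mask is True where observed).

        Linear sketch of the UBP idea: learn a latent vector per row (V)
        and a shared decoder (W) so that V @ W matches the observed
        entries; missing entries are read off the reconstruction.
        The paper uses a multi-layer perceptron instead of W.
        """
        rng = np.random.default_rng(seed)
        n, d = X.shape
        X = np.where(mask, X, 0.0)               # neutralize missing entries
        V = rng.normal(scale=0.1, size=(n, k))   # latent point-vectors
        W = rng.normal(scale=0.1, size=(k, d))   # shared decoder weights
        for _ in range(epochs):
            R = (V @ W - X) * mask               # error on observed entries only
            V, W = V - lr * (R @ W.T), W - lr * (V.T @ R)
        return np.where(mask, X, V @ W)          # keep observed, fill missing
    ```

    With enough latent capacity, the observed entries pin down the reconstruction, and the missing cells are filled by the learned low-dimensional structure rather than by a column statistic.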

    EVALUATING ALTERNATIVE METHODS OF DEALING WITH MISSING OBSERVATIONS - AN ECONOMIC APPLICATION

    This paper compares methods to remedy missing value problems in survey data. The commonly used methods to deal with this issue are to delete observations that have missing values (case deletion), replace missing values with the sample mean (mean imputation), or substitute a fitted value from an auxiliary regression (regression imputation). These methods are easy to implement but have potentially serious drawbacks such as bias and inefficiency. In addition, these methods treat imputed values as known, so they ignore the uncertainty due to 'missingness', which can result in underestimating the standard errors. An alternative method is Multiple Imputation (MI). In this paper, Expectation Maximization (EM) and Data Augmentation (DA) are used to create multiple complete datasets, each with different imputed values due to random draws. EM is essentially maximum-likelihood estimation, exploiting the interdependency between missing values and model parameters. DA estimates the distribution of missing values given the observed data and the model parameters through Markov Chain Monte Carlo (MCMC). These multiple datasets are subsequently combined into a single inference, incorporating the uncertainty due to the missingness. Results from a Monte Carlo experiment using pseudo data show that MI is superior to the other methods for the problem posed here.
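    The baseline methods the paper compares, mean imputation, regression imputation, and a pooled multiple imputation, can be sketched in a few lines. This is an illustrative simplification: the paper's MI builds on EM and MCMC-based Data Augmentation, whereas the sketch below simply adds residual noise to regression draws and averages the completed datasets; all function names are placeholders.

    ```python
    import numpy as np

    def mean_impute(y):
        """Replace NaNs in a 1-D array with the mean of the observed values."""
        y = y.astype(float).copy()
        y[np.isnan(y)] = np.nanmean(y)
        return y

    def regression_impute(y, x):
        """Fill NaNs in y with fitted values from an OLS fit of y on x
        (x assumed fully observed)."""
        y = y.astype(float).copy()
        obs = ~np.isnan(y)
        b, a = np.polyfit(x[obs], y[obs], 1)   # slope, intercept
        y[~obs] = a + b * x[~obs]
        return y

    def multiple_impute(y, x, m=20, seed=0):
        """Crude multiple imputation: draw each completed dataset by adding
        residual noise to the regression fill, then pool the m datasets by
        averaging (pooling of variances is omitted for brevity)."""
        rng = np.random.default_rng(seed)
        y = y.astype(float)
        obs = ~np.isnan(y)
        b, a = np.polyfit(x[obs], y[obs], 1)
        resid_sd = np.std(y[obs] - (a + b * x[obs]))
        draws = []
        for _ in range(m):
            yi = y.copy()
            yi[~obs] = a + b * x[~obs] + rng.normal(0, resid_sd, (~obs).sum())
            draws.append(yi)
        return np.mean(draws, axis=0)
    ```

    The random draws are what distinguish MI from single regression imputation: the spread across the m completed datasets carries the uncertainty that treating imputed values as known would discard.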

    Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

    Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyze the predictive performance after using the k-NN missing data imputation technique, to see whether it is better to tolerate missing data or to impute missing values first and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. We also found that both C4.5 and k-NN are little affected by the missingness mechanism, but that the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly when the missing data percentage exceeds 40%.
