Missing Value Imputation With Unsupervised Backpropagation
Many data mining and data analysis techniques operate on dense matrices or
complete tables of data. Real-world data sets, however, often contain unknown
values. Even many classification algorithms that are designed to operate with
missing values still exhibit deteriorated accuracy. One approach to handling
missing values is to fill in (impute) the missing values. In this paper, we
present a technique for unsupervised learning called Unsupervised
Backpropagation (UBP), which trains a multi-layer perceptron to fit to the
manifold sampled by a set of observed point-vectors. We evaluate UBP with the
task of imputing missing values in datasets, and show that UBP is able to
predict missing values with significantly lower sum-squared error than other
collaborative filtering and imputation techniques. We also demonstrate with 24
datasets and 9 supervised learning algorithms that classification accuracy is
usually higher when randomly-withheld values are imputed using UBP, rather than
with other methods.
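The core idea in the abstract — training a multi-layer perceptron so that learned per-row latent inputs, pushed through the network, reproduce the observed entries — can be sketched as follows. This is a minimal numpy illustration of the general approach, not the paper's actual UBP algorithm; the three-stage training the authors use, and all hyperparameters here (latent size, hidden size, learning rate), are simplified assumptions. It assumes every column has at least one observed value.

```python
import numpy as np

def ubp_impute(X, latent_dim=2, hidden=8, epochs=3000, lr=0.1, seed=0):
    """Toy UBP-style imputation: jointly learn one latent vector per row and
    a one-hidden-layer MLP mapping latents to columns, by gradient descent
    on squared error over the OBSERVED entries only; impute from the fit."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mask = ~np.isnan(X)                        # True where a value is observed
    m = mask.sum()                             # number of observed entries
    V = rng.normal(0.0, 0.1, (n, latent_dim))  # latent inputs, also learned
    W1 = rng.normal(0.0, 0.1, (latent_dim, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d))
    b2 = np.nanmean(X, axis=0)                 # start outputs at column means
    for _ in range(epochs):
        H = np.tanh(V @ W1 + b1)               # forward pass
        P = H @ W2 + b2
        E = np.where(mask, P - X, 0.0)         # error on observed entries only
        dH = (E @ W2.T) * (1.0 - H ** 2)       # backprop through tanh
        W2 -= lr * (H.T @ E) / m
        b2 -= lr * E.sum(axis=0) / m
        W1 -= lr * (V.T @ dH) / m
        b1 -= lr * dH.sum(axis=0) / m
        V -= lr * (dH @ W1.T)                  # each row updates its own latent
    P = np.tanh(V @ W1 + b1) @ W2 + b2
    return np.where(mask, X, P)                # keep observed values as-is
```

Backpropagating into the inputs `V` as well as the weights is what makes the method unsupervised: the latents end up parameterizing the low-dimensional manifold the observed vectors were sampled from.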
Evaluating Alternative Methods of Dealing With Missing Observations - An Economic Application
This paper compares methods for remedying missing-value problems in survey data. Commonly used approaches are to delete observations that have missing values (case deletion), replace missing values with the sample mean (mean imputation), or substitute a fitted value from an auxiliary regression (regression imputation). These methods are easy to implement but have potentially serious drawbacks such as bias and inefficiency. In addition, they treat imputed values as known, ignoring the uncertainty due to 'missingness', which can result in underestimated standard errors. An alternative method is Multiple Imputation (MI). In this paper, Expectation Maximization (EM) and Data Augmentation (DA) are used to create multiple complete datasets, each with different imputed values due to random draws. EM is essentially maximum-likelihood estimation, exploiting the interdependency between missing values and model parameters. DA estimates the distribution of missing values given the observed data and the model parameters through Markov Chain Monte Carlo (MCMC). The multiple datasets are then combined into a single imputation that incorporates the uncertainty due to missingness. Results from a Monte Carlo experiment using pseudo data show that MI is superior to the other methods for the problem posed here.
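The two single-imputation baselines the abstract criticizes can be sketched in a few lines of numpy; the MI procedure itself (repeated EM/DA draws and Rubin-style combination) is beyond a short sketch, so only the simple methods are shown. Function names and the OLS-via-`lstsq` formulation are illustrative assumptions, and the regression variant assumes the predictor columns are fully observed.

```python
import numpy as np

def mean_impute(X):
    """Replace each missing entry with its column's observed mean."""
    col_means = np.nanmean(X, axis=0)
    out = X.copy()
    rows, cols = np.where(np.isnan(out))
    out[rows, cols] = col_means[cols]
    return out

def regression_impute(X, target):
    """Fill missing entries of column `target` with OLS predictions from
    the fully observed columns, fitting on complete cases only."""
    out = X.copy()
    predictors = [j for j in range(X.shape[1])
                  if j != target and not np.isnan(X[:, j]).any()]
    obs = ~np.isnan(X[:, target])              # rows where the target is known
    A = np.column_stack([np.ones(obs.sum()), X[obs][:, predictors]])
    beta, *_ = np.linalg.lstsq(A, X[obs, target], rcond=None)
    miss = ~obs
    Am = np.column_stack([np.ones(miss.sum()), X[miss][:, predictors]])
    out[miss, target] = Am @ beta              # plug in the fitted values
    return out
```

Both functions return a single "completed" matrix with the imputed values treated as known, which is exactly the source of the underestimated standard errors that MI is designed to avoid.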
Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation
Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyze the predictive performance after applying the k-NN missing data imputation technique, to see whether it is better to tolerate missing data or to impute missing values before applying the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. We also found that both C4.5 and k-NN are little affected by the missingness mechanism, but that the missing data pattern and the missing data percentage have a strong negative impact upon prediction (or imputation) accuracy, particularly when the missing data percentage exceeds 40%.
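A basic k-NN imputation of the kind evaluated above can be sketched in numpy: for each incomplete row, find the k nearest fully observed rows (measuring distance only over the columns that row actually has) and fill the gaps with their mean. This is a minimal illustration, not the paper's exact configuration; the distance metric, the donor-mean aggregation, and `k=2` are assumptions, and it assumes at least k complete rows exist.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each incomplete row from its k nearest complete rows,
    comparing rows only on the features the incomplete row has observed."""
    out = X.copy()
    complete = ~np.isnan(X).any(axis=1)        # rows with no missing values
    donors = X[complete]
    for i in np.where(~complete)[0]:
        obs = ~np.isnan(X[i])                  # this row's observed columns
        dist = np.sqrt(((donors[:, obs] - X[i, obs]) ** 2).sum(axis=1))
        nearest = donors[np.argsort(dist)[:k]] # k closest complete rows
        out[i, ~obs] = nearest[:, ~obs].mean(axis=0)
    return out
```

In the study's setup, the completed matrix produced by a step like this is then handed to C4.5, so imputation quality feeds directly into downstream prediction accuracy.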