Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

Albrecht; Austin; Baird; Batista; Boehm; Boehm; Breiman; Briand; Briand; Briand; Brockmeier; Cartwright; Cheung; Clark; Feelders; Finnie; Gama; Gray; Holte; Jain; Jeffery; Jun Liu; Jönsson; Kemerer; Khotanzad; Kibler; Kim; Kitchenham; Kohavi; Little; Little; Little; Little; Little; Martin Shepperd; Miranda; Myrtveit; Pickard; Putnam; Qinbao Song; Quinlan; Robins; Rubin; Rubin; Rubin; Rubin; Samson; Selby; Shao; Shepperd; Shepperd; Siedelecki; Song; Song; Srinivasan; Strike; Tabachnick; Tay; Walkerden; Walston; Xiangru Chen


Publication date: 1 December 2008
Publisher: Elsevier BV

Abstract

Missing data is a widespread problem that can undermine the construction of effective prediction systems. We investigate C4.5, a common machine learning technique that can tolerate missing values, for predicting cost using six real-world software project databases. We analyze predictive performance after applying k-NN missing data imputation to determine whether it is better to tolerate missing data or to impute the missing values and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, whereas the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly if the missing data percentage exceeds 40%.
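To make the imputation step concrete, the following is a minimal sketch of k-NN imputation on numeric data. It is not the authors' implementation: the abstract does not state the distance measure, the value of k, or how categorical attributes are handled, so the Euclidean distance over observed attributes, the mean of the neighbours as the fill value, the function name `knn_impute`, and the toy project attributes are all assumptions made for illustration.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries using the mean of the k nearest complete rows.

    Assumption: all attributes are numeric and neighbours are drawn only
    from rows that have no missing values.
    """
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("need at least one complete row to impute from")

    imputed = []
    for row in rows:
        missing = [j for j, v in enumerate(row) if v is None]
        if not missing:
            imputed.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        # Distance uses only the attributes observed in the incomplete row.
        def dist(other):
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=dist)[:k]
        filled = list(row)
        for j in missing:
            filled[j] = sum(n[j] for n in neighbours) / len(neighbours)
        imputed.append(filled)
    return imputed

# Hypothetical project records: [size, team size, effort]; None marks a gap.
projects = [
    [10.0, 2.0, 100.0],
    [12.0, 2.0, 120.0],
    [50.0, 8.0, 900.0],
    [11.0, 2.0, None],
]
result = knn_impute(projects, k=2)
```

In the comparison the abstract describes, a completed data set like `result` would then be passed to the C4.5 learner, and its predictions compared with those of C4.5 trained directly on the data with the missing values left in place.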
