Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation
Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyze the predictive performance after using the k-NN missing data imputation technique to see whether it is better to tolerate missing data or to impute missing values and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, whereas the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly if the missing data percentage exceeds 40%.
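As an illustration of the imputation step described in this abstract, the following is a minimal sketch of k-NN imputation in plain Python. It is not the authors' implementation: the function name `knn_impute`, the use of `None` to mark missing values, the distance (root mean squared difference over the columns observed in both rows), and the choice to average the donors' values are all assumptions made for this example.

```python
import math


def knn_impute(rows, k=2):
    """Return a copy of rows with each None replaced by the mean of that
    column over the k nearest rows that have the column observed."""

    def distance(a, b):
        # Compare only the columns that are observed in both rows.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows where column j is observed
            # and which share at least one observed column with this row.
            donors = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                d = distance(row, other)
                if d != math.inf:
                    donors.append((d, other[j]))
            donors.sort(key=lambda pair: pair[0])
            nearest = donors[:k]
            if nearest:
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Hypothetical project records: (size, team, effort); the last effort is missing.
projects = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [9.0, 9.0, 100.0],
    [1.0, 2.0, None],
]
completed = knn_impute(projects, k=2)
```

The completed table could then be passed to any decision tree learner; the alternative studied in the abstract is to hand the incomplete table directly to C4.5 and rely on its built-in handling of missing values.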
Augmenting Naive Bayes Classifiers with Statistical Language Models
We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier which allows for a local Markov dependence among observations; a model we refer to as the Chain Augmented Naive Bayes (CAN) Bayes classifier. CAN models have two advantages over standard naive Bayes classifiers. First, they relax some of the independence assumptions of naive Bayes (allowing a local Markov chain dependence in the observed variables) while still permitting efficient inference and learning. Second, they permit straightforward application of sophisticated smoothing techniques from statistical language modeling, which allows one to obtain better parameter estimates than the standard Laplace smoothing used in naive Bayes classification. In this paper, we introduce CAN models and apply them to various text classification problems. To demonstrate the language-independent and task-independent nature of these classifiers, we present experimental results on several text classification problems (authorship attribution, text genre classification, and topic detection) in several languages (Greek, English, Japanese and Chinese). We then systematically study the key factors in the CAN model that can influence the classification performance, and analyze the strengths and weaknesses of the model.
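The core idea of this abstract, replacing the per-class unigram model of naive Bayes with a per-class n-gram language model, can be sketched as follows. This is a simplified illustration rather than the authors' code: the class name `BigramChainNB` is hypothetical, the model is fixed to bigrams, and it uses add-alpha (Laplace-style) smoothing for brevity, whereas the abstract refers to more sophisticated language-model smoothing.

```python
import math
from collections import Counter, defaultdict


class BigramChainNB:
    """Classifier that scores a document under one bigram model per class."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                    # add-alpha smoothing constant
        self.bigrams = defaultdict(Counter)   # label -> counts of (prev, word)
        self.contexts = defaultdict(Counter)  # label -> counts of prev
        self.doc_counts = Counter()           # label -> number of documents
        self.vocab = set()

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.doc_counts[label] += 1
            prev = "<s>"  # start-of-document marker
            for word in tokens:
                self.bigrams[label][(prev, word)] += 1
                self.contexts[label][prev] += 1
                self.vocab.add(word)
                prev = word
        return self

    def log_score(self, tokens, label):
        # log P(c) + sum_i log P(w_i | w_{i-1}, c)
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen words
        prev = "<s>"
        for word in tokens:
            num = self.bigrams[label][(prev, word)] + self.alpha
            den = self.contexts[label][prev] + self.alpha * vocab_size
            score += math.log(num / den)
            prev = word
        return score

    def predict(self, tokens):
        return max(self.doc_counts, key=lambda c: self.log_score(tokens, c))


# Toy training data, invented for this example.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["stocks", "fell", "today"],
    ["stocks", "rose", "today"],
]
labels = ["pets", "pets", "finance", "finance"]
model = BigramChainNB().fit(docs, labels)
```

Because each word is conditioned on its predecessor, the score depends on word order, which a standard bag-of-words naive Bayes classifier ignores.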
Closed-Loop Learning of Visual Control Policies
In this paper we present a general, flexible framework for learning mappings
from images to actions by interacting with the environment. The basic idea is
to introduce a feature-based image classifier in front of a reinforcement
learning algorithm. The classifier partitions the visual space according to the
presence or absence of few highly informative local descriptors that are
incrementally selected in a sequence of attempts to remove perceptual aliasing.
We also address the problem of fighting overfitting in such a greedy algorithm.
Finally, we show how high-level visual features can be generated when the power
of local descriptors is insufficient for completely disambiguating the aliased
states. This is done by building a hierarchy of composite features that consist
of recursive spatial combinations of visual features. We demonstrate the
efficacy of our algorithms by solving three visual navigation tasks and a
visual version of the classical Car on the Hill control problem.
A concept drift-tolerant case-base editing technique
© 2015 Elsevier B.V. All rights reserved. The evolving nature and accumulating volume of real-world data inevitably give rise to the so-called "concept drift" issue, causing many deployed Case-Based Reasoning (CBR) systems to require additional maintenance procedures. In Case-base Maintenance (CBM), case-base editing strategies to revise the case-base have proven to be effective instance selection approaches for handling concept drift. Motivated by current issues related to CBR techniques in handling concept drift, we present a two-stage case-base editing technique. In Stage 1, we propose a Noise-Enhanced Fast Context Switch (NEFCS) algorithm, which targets the removal of noise in a dynamic environment, and in Stage 2, we develop an innovative Stepwise Redundancy Removal (SRR) algorithm, which reduces the size of the case-base by eliminating redundancies while preserving the case-base coverage. Experimental evaluations on several public real-world datasets show that our case-base editing technique significantly improves accuracy compared to other case-base editing approaches on concept drift tasks, while preserving its effectiveness on static tasks.