21,002 research outputs found
Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation
Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyze the predictive performance after using the k-NN missing data imputation technique to see whether it is better to tolerate missing data or to try to impute missing values and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, but the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly if the missing data percentage exceeds 40%.
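A minimal sketch of the comparison this abstract describes, using scikit-learn rather than the paper's tools: C4.5 is not available there, so a CART-style DecisionTreeRegressor stands in for it, the data are synthetic, and the "tolerate missing data" arm is approximated by listwise deletion (C4.5's own fractional-instance handling is not replicated). All names and parameters below are illustrative assumptions.

```python
# Sketch: k-NN imputation followed by a decision tree vs. a crude deletion baseline.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                                   # project features (synthetic)
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=100)    # "cost" target (synthetic)

# Inject roughly 20% missing values completely at random (MCAR).
mask = rng.random(size=X.shape) < 0.20
X_missing = X.copy()
X_missing[mask] = np.nan

# Strategy 1: k-NN imputation, then a decision tree (CART standing in for C4.5).
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X_missing)
tree = DecisionTreeRegressor(max_depth=4, random_state=0)
score_imputed = cross_val_score(tree, X_imputed, y, cv=5).mean()

# Strategy 2: drop incomplete cases and fit the same tree (stand-in for tolerating
# missing data; not equivalent to C4.5's native missing-value mechanism).
complete = ~np.isnan(X_missing).any(axis=1)
score_deleted = cross_val_score(tree, X_missing[complete], y[complete], cv=5).mean()

print(f"R^2 with k-NN imputation:   {score_imputed:.3f}")
print(f"R^2 with listwise deletion: {score_deleted:.3f}")
```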
Using classifiers to predict linear feedback shift registers
Proceedings of: IEEE 35th International Carnahan Conference on Security Technology, October 16-19, 2001, London. Previously (J.C. Hernandez et al., 2000), some new ideas justifying the use of artificial intelligence techniques in cryptanalysis were presented. The main objective of that paper was to show that the theoretical next-bit prediction problem can be transformed into a classification problem, and that this classification problem can be solved with the aid of some AI algorithms. In particular, the authors showed how a well-known classifier called C4.5 could predict the next bit generated by a linear feedback shift register (LFSR, a widely used model of pseudorandom number generator) very efficiently and, most importantly, without any previous knowledge of the model used. In this work, we look for other classifiers, apart from C4.5, that could be useful in the prediction of LFSRs. We conclude that the selection of C4.5 by Hernandez et al. was adequate, because it shows the best accuracy of all the classifiers tested. However, we have found other classifiers that produce interesting results, and we suggest that these algorithms should be taken into account in the future when trying to predict more complex LFSR-based models. Finally, we show some other properties that make the C4.5 algorithm the best choice for this particular cryptanalytic problem.
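A rough sketch of how next-bit prediction can be cast as classification, as the abstract describes: generate an LFSR output stream, turn each sliding window of previous bits into a feature vector labeled with the bit that follows, and train a classifier on those pairs. The tap positions, window length, and the use of scikit-learn's CART tree in place of C4.5 are all assumptions made for illustration.

```python
# Sketch: frame LFSR next-bit prediction as a supervised classification problem.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def lfsr_bits(seed_bits, taps, n):
    """Generate n output bits from a Fibonacci LFSR with the given tap positions."""
    state = list(seed_bits)
    out = []
    for _ in range(n):
        out.append(state[-1])
        feedback = 0
        for t in taps:
            feedback ^= state[t]
        state = [feedback] + state[:-1]
    return out

bits = lfsr_bits(seed_bits=[1, 0, 1, 1, 0, 0, 1, 0], taps=[0, 2, 3, 4], n=2000)

# Each sample is a sliding window of w previous output bits; the label is the next bit.
w = 16
X = [bits[i:i + w] for i in range(len(bits) - w)]
y = [bits[i + w] for i in range(len(bits) - w)]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("next-bit prediction accuracy:", clf.score(X_te, y_te))
```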
Ensembles of probability estimation trees for customer churn prediction
Customer churn prediction is one of the most important elements of a company's Customer Relationship Management (CRM) strategy. In this study, two strategies are investigated to increase the lift performance of ensemble classification models, i.e. (i) using probability estimation trees (PETs) instead of standard decision trees as base classifiers, and (ii) implementing alternative fusion rules based on lift weights for the combination of ensemble members' outputs. Experiments are conducted for four popular ensemble strategies on five real-life churn data sets. In general, the results demonstrate how lift performance can be substantially improved by using alternative base classifiers and fusion rules. However, the effect varies for the different ensemble strategies. In particular, the results indicate an increase of lift performance of (i) Bagging by implementing C4.4 base classifiers, (ii) the Random Subspace Method (RSM) by using lift-weighted fusion rules, and (iii) AdaBoost by implementing both.
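A hedged sketch of the two ideas in the abstract, not the paper's exact protocol: bag unpruned trees whose leaf frequencies receive a Laplace correction (one reading of a C4.4-style PET), then fuse the members' probability outputs with weights derived from each member's top-decile lift on held-out data. The data set, member count, and the particular lift-weighting scheme are illustrative assumptions.

```python
# Sketch: bagging of Laplace-corrected probability estimation trees, lift-weighted fusion.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_pet(X, y):
    """Unpruned tree plus Laplace-corrected churn probability per leaf (C4.4-style PET)."""
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    leaves = tree.apply(X)
    pos, tot = {}, {}
    for leaf, label in zip(leaves, y):
        tot[leaf] = tot.get(leaf, 0) + 1
        pos[leaf] = pos.get(leaf, 0) + int(label == 1)
    proba = {leaf: (pos[leaf] + 1) / (tot[leaf] + 2) for leaf in tot}  # (k+1)/(n+2)
    return tree, proba

def pet_scores(tree, proba, X):
    return np.array([proba.get(leaf, 0.5) for leaf in tree.apply(X)])

def top_decile_lift(y_true, scores):
    """Churn rate among the top 10% scored cases, relative to the overall churn rate."""
    top = np.argsort(scores)[::-1][: max(1, len(scores) // 10)]
    return y_true[top].mean() / y_true.mean()

X, y = make_classification(n_samples=2000, weights=[0.9], flip_y=0.05, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
members, weights = [], []
for _ in range(25):                              # bagging: one PET per bootstrap resample
    idx = rng.integers(0, len(X_tr), len(X_tr))
    tree, proba = fit_pet(X_tr[idx], y_tr[idx])
    members.append((tree, proba))
    weights.append(top_decile_lift(y_val, pet_scores(tree, proba, X_val)))

weights = np.array(weights) / np.sum(weights)    # lift-weighted fusion rule
fused = sum(w * pet_scores(t, p, X_val) for w, (t, p) in zip(weights, members))
print("ensemble top-decile lift:", round(top_decile_lift(y_val, fused), 2))
```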
Solving Multiclass Learning Problems via Error-Correcting Output Codes
Multiclass learning problems involve finding a definition for an unknown
function f(x) whose range is a discrete set containing k > 2 values (i.e., k
``classes''). The definition is acquired by studying collections of training
examples of the form [x_i, f(x_i)]. Existing approaches to multiclass learning
problems include direct application of multiclass algorithms such as the
decision-tree algorithms C4.5 and CART, application of binary concept learning
algorithms to learn individual binary functions for each of the k classes, and
application of binary concept learning algorithms with distributed output
representations. This paper compares these three approaches to a new technique
in which error-correcting codes are employed as a distributed output
representation. We show that these output representations improve the
generalization performance of both C4.5 and backpropagation on a wide range of
multiclass learning tasks. We also demonstrate that this approach is robust
with respect to changes in the size of the training sample, the assignment of
distributed representations to particular classes, and the application of
overfitting avoidance techniques such as decision-tree pruning. Finally, we
show that, like the other methods, the error-correcting code technique can
provide reliable class probability estimates. Taken together, these results
demonstrate that error-correcting output codes provide a general-purpose method
for improving the performance of inductive learning programs on multiclass
problems.
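A small sketch of output coding on a multiclass task, with the caveat that scikit-learn's OutputCodeClassifier draws random code words rather than the designed error-correcting codes studied in the paper, and that a CART decision tree stands in for C4.5.

```python
# Sketch: distributed output coding for a k > 2 classification problem.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OutputCodeClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)          # 10-class problem
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Each class is assigned a binary code word; one binary tree is trained per code bit,
# and a test point gets the class whose code word is closest to the predicted bits.
ecoc = OutputCodeClassifier(DecisionTreeClassifier(random_state=0),
                            code_size=4.0, random_state=0).fit(X_tr, y_tr)
print("output-coding accuracy:", round(ecoc.score(X_te, y_te), 3))

# Baseline: a single multiclass decision tree on the same split.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("single-tree accuracy:   ", round(tree.score(X_te, y_te), 3))
```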
Cue Phrase Classification Using Machine Learning
Cue phrases may be used in a discourse sense to explicitly signal discourse
structure, but also in a sentential sense to convey semantic rather than
structural information. Correctly classifying cue phrases as discourse or
sentential is critical in natural language processing systems that exploit
discourse structure, e.g., for performing tasks such as anaphora resolution and
plan recognition. This paper explores the use of machine learning for
classifying cue phrases as discourse or sentential. Two machine learning
programs (Cgrendel and C4.5) are used to induce classification models from sets
of pre-classified cue phrases and their features in text and speech. Machine
learning is shown to be an effective technique not only for automating the
generation of classification models, but also for improving upon previous
results. When compared to manually derived classification models already in the
literature, the learned models often perform with higher accuracy and contain
new linguistic insights into the data. In addition, the ability to
automatically construct classification models makes it easier to comparatively
analyze the utility of alternative feature representations of the data.
Finally, the ease of retraining makes the learning approach more scalable and
flexible than manual methods.
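A toy sketch of the learning setup the abstract describes: induce a classification model from pre-classified cue phrase examples, with a CART decision tree standing in for C4.5 and Cgrendel. The feature names and the tiny example table below are hypothetical, not taken from the paper's corpora.

```python
# Sketch: learn a discourse-vs-sentential classifier from labeled cue phrase features.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Each example: textual/prosodic features of one occurrence of a cue phrase (hypothetical).
examples = [
    {"token": "now",  "position": "first",  "preceded_by_pause": True},
    {"token": "now",  "position": "medial", "preceded_by_pause": False},
    {"token": "well", "position": "first",  "preceded_by_pause": True},
    {"token": "so",   "position": "medial", "preceded_by_pause": False},
    {"token": "so",   "position": "first",  "preceded_by_pause": True},
    {"token": "like", "position": "medial", "preceded_by_pause": False},
]
labels = ["discourse", "sentential", "discourse", "sentential", "discourse", "sentential"]

model = make_pipeline(DictVectorizer(sparse=False),
                      DecisionTreeClassifier(random_state=0))
model.fit(examples, labels)

new_use = {"token": "now", "position": "first", "preceded_by_pause": True}
print(model.predict([new_use])[0])   # expected: "discourse"
```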
Classifying Cue Phrases in Text and Speech Using Machine Learning
Cue phrases may be used in a discourse sense to explicitly signal discourse
structure, but also in a sentential sense to convey semantic rather than
structural information. This paper explores the use of machine learning for
classifying cue phrases as discourse or sentential. Two machine learning
programs (Cgrendel and C4.5) are used to induce classification rules from sets
of pre-classified cue phrases and their features. Machine learning is shown to
be an effective technique not only for automating the generation of
classification rules, but also for improving upon previous results.
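Since this version of the study emphasizes induced classification rules, a brief follow-on sketch: a decision tree learned from labeled cue phrase features (again hypothetical features, not the paper's) can be printed as an if-then structure with scikit-learn's export_text, loosely analogous to deriving a rule set from a C4.5 tree.

```python
# Sketch: render an induced decision tree as readable if-then rules.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

examples = [
    {"position": "first",  "preceded_by_pause": True},
    {"position": "medial", "preceded_by_pause": False},
    {"position": "first",  "preceded_by_pause": False},
    {"position": "medial", "preceded_by_pause": True},
]
labels = ["discourse", "sentential", "sentential", "discourse"]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(examples)
tree = DecisionTreeClassifier(random_state=0).fit(X, labels)

# Print the induced model as a rule-like structure.
print(export_text(tree, feature_names=list(vec.get_feature_names_out())))
```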