
    Can k-NN imputation improve the performance of C4.5 with small software project data sets? A comparative evaluation

    Missing data is a widespread problem that can affect the ability to use data to construct effective prediction systems. We investigate a common machine learning technique that can tolerate missing values, namely C4.5, to predict cost using six real-world software project databases. We analyze the predictive performance after applying the k-NN missing data imputation technique, to see whether it is better to tolerate missing data or to impute missing values and then apply the C4.5 algorithm. For the investigation, we simulated three missingness mechanisms, three missing data patterns, and five missing data percentages. We found that k-NN imputation can improve the prediction accuracy of C4.5. At the same time, both C4.5 and k-NN are little affected by the missingness mechanism, but the missing data pattern and the missing data percentage have a strong negative impact on prediction (or imputation) accuracy, particularly if the missing data percentage exceeds 40%.
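
    A minimal sketch of the imputation arm described above, assuming scikit-learn: a KNNImputer fills in simulated missing values before a CART decision tree (standing in for C4.5, which scikit-learn does not implement) is trained and scored. The synthetic data and the 40% missingness rate are illustrative assumptions echoing the threshold reported in the abstract.

    ```python
    # Hypothetical sketch: k-NN imputation followed by a decision tree.
    # CART (DecisionTreeClassifier) approximates C4.5; the data is synthetic.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.impute import KNNImputer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=8, random_state=0)

    # Simulate MCAR missingness at the 40% level past which the abstract
    # reports a strong drop in accuracy.
    rng = np.random.default_rng(0)
    X_missing = X.copy()
    X_missing[rng.random(X.shape) < 0.40] = np.nan

    model = make_pipeline(KNNImputer(n_neighbors=5),
                          DecisionTreeClassifier(random_state=0))
    print(cross_val_score(model, X_missing, y, cv=5).mean())
    ```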

    Using classifiers to predict linear feedback shift registers

    Proceedings of: IEEE 35th International Carnahan Conference on Security Technology, October 16-19, 2001, London. Previously (J.C. Hernandez et al., 2000), some new ideas that justify the use of artificial intelligence techniques in cryptanalysis were presented. The main objective of that paper was to show that the theoretical next bit prediction problem can be transformed into a classification problem, and that this classification problem could be solved with the aid of some AI algorithms. In particular, they showed how a well-known classifier called c4.5 could predict the next bit generated by a linear feedback shift register (LFSR, a widely used model of pseudorandom number generator) very efficiently and, most importantly, without any previous knowledge of the model used. The authors look for other classifiers, apart from c4.5, that could be useful in the prediction of LFSRs. We conclude that the selection of c4.5 by Hernandez et al. was adequate, because it shows the best accuracy of all the classifiers tested. However, we have found other classifiers that produce interesting results, and we suggest that these algorithms should be taken into account in the future when trying to predict more complex LFSR-based models. Finally, we show some other properties that make the c4.5 algorithm the best choice for this particular cryptanalytic problem.
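
    The next-bit-prediction-as-classification setup can be sketched in a few lines, assuming scikit-learn: emit a bit stream from a toy Fibonacci LFSR, label each sliding window with the bit that follows it, and fit a decision tree. The register, taps, and window size are illustrative choices, and CART stands in for c4.5.

    ```python
    # Hypothetical sketch: cast LFSR next-bit prediction as classification.
    from sklearn.tree import DecisionTreeClassifier

    def lfsr_bits(seed, taps, n):
        """Fibonacci LFSR: output the last bit, feed back the XOR of the
        tapped positions, and shift."""
        state = list(seed)
        out = []
        for _ in range(n):
            out.append(state[-1])
            feedback = 0
            for t in taps:
                feedback ^= state[t]
            state = [feedback] + state[:-1]
        return out

    bits = lfsr_bits(seed=[1, 0, 0, 1, 1], taps=[0, 2], n=2000)

    # Each 8-bit window is a training example; the following bit is its label.
    window = 8
    X = [bits[i:i + window] for i in range(len(bits) - window)]
    y = [bits[i + window] for i in range(len(bits) - window)]

    clf = DecisionTreeClassifier().fit(X[:1500], y[:1500])
    print("held-out next-bit accuracy:", clf.score(X[1500:], y[1500:]))
    ```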

    Ensembles of probability estimation trees for customer churn prediction

    Customer churn prediction is one of the most important elements of a company's Customer Relationship Management (CRM) strategy. In this study, two strategies are investigated to increase the lift performance of ensemble classification models, i.e. (i) using probability estimation trees (PETs) instead of standard decision trees as base classifiers; and (ii) implementing alternative fusion rules based on lift weights for the combination of ensemble members' outputs. Experiments are conducted for four popular ensemble strategies on five real-life churn data sets. In general, the results demonstrate how lift performance can be substantially improved by using alternative base classifiers and fusion rules. However, the effect varies for the different ensemble strategies. In particular, the results indicate an increase of lift performance of (i) Bagging by implementing C4.4 base classifiers, (ii) the Random Subspace Method (RSM) by using lift-weighted fusion rules, and (iii) AdaBoost by implementing both.
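
    A minimal sketch of the two ideas combined, assuming scikit-learn and synthetic churn-like data: bagged unpruned trees serve as approximate PETs (scikit-learn applies no Laplace correction, unlike C4.4), and each member's probabilities are fused with weights proportional to its top-decile lift on a validation split.

    ```python
    # Hypothetical sketch: bagged probability estimation trees with
    # lift-weighted fusion. Data, ensemble size, and splits are illustrative.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    def top_decile_lift(y_true, scores):
        """Churn rate among the 10% highest-scored cases over the base rate."""
        k = max(1, len(scores) // 10)
        top = np.argsort(scores)[-k:]
        return y_true[top].mean() / y_true.mean()

    X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

    rng = np.random.default_rng(0)
    members, weights = [], []
    for _ in range(25):                 # bagging: bootstrap + unpruned tree
        idx = rng.integers(0, len(X_tr), len(X_tr))
        tree = DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx])
        members.append(tree)
        weights.append(top_decile_lift(y_va, tree.predict_proba(X_va)[:, 1]))

    w = np.array(weights) / sum(weights)   # lift-weighted fusion rule
    fused = sum(wi * m.predict_proba(X_va)[:, 1] for wi, m in zip(w, members))
    print("ensemble top-decile lift:", top_decile_lift(y_va, fused))
    ```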

    Solving Multiclass Learning Problems via Error-Correcting Output Codes

    Multiclass learning problems involve finding a definition for an unknown function f(x) whose range is a discrete set containing k > 2 values (i.e., k ``classes''). The definition is acquired by studying collections of training examples of the form [x_i, f(x_i)]. Existing approaches to multiclass learning problems include direct application of multiclass algorithms such as the decision-tree algorithms C4.5 and CART, application of binary concept learning algorithms to learn individual binary functions for each of the k classes, and application of binary concept learning algorithms with distributed output representations. This paper compares these three approaches to a new technique in which error-correcting codes are employed as a distributed output representation. We show that these output representations improve the generalization performance of both C4.5 and backpropagation on a wide range of multiclass learning tasks. We also demonstrate that this approach is robust with respect to changes in the size of the training sample, the assignment of distributed representations to particular classes, and the application of overfitting avoidance techniques such as decision-tree pruning. Finally, we show that, like the other methods, the error-correcting code technique can provide reliable class probability estimates. Taken together, these results demonstrate that error-correcting output codes provide a general-purpose method for improving the performance of inductive learning programs on multiclass problems.
    Comment: See http://www.jair.org/ for any accompanying file
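
    scikit-learn ships this technique as OutputCodeClassifier: each class is assigned a binary code word, one binary learner is trained per code bit, and prediction decodes to the nearest code word. In the illustrative sketch below, a CART tree stands in for C4.5 and the code_size value is arbitrary.

    ```python
    # Hypothetical sketch: error-correcting output codes over a 10-class task.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import cross_val_score
    from sklearn.multiclass import OutputCodeClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_digits(return_X_y=True)   # k = 10 classes, i.e. k > 2

    ecoc = OutputCodeClassifier(
        DecisionTreeClassifier(random_state=0),
        code_size=4,        # 4 * 10 = 40 code bits -> 40 binary learners
        random_state=0,
    )
    print(cross_val_score(ecoc, X, y, cv=5).mean())
    ```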

    Cue Phrase Classification Using Machine Learning

    Cue phrases may be used in a discourse sense to explicitly signal discourse structure, but also in a sentential sense to convey semantic rather than structural information. Correctly classifying cue phrases as discourse or sentential is critical in natural language processing systems that exploit discourse structure, e.g., for performing tasks such as anaphora resolution and plan recognition. This paper explores the use of machine learning for classifying cue phrases as discourse or sentential. Two machine learning programs (Cgrendel and C4.5) are used to induce classification models from sets of pre-classified cue phrases and their features in text and speech. Machine learning is shown to be an effective technique not only for automating the generation of classification models, but also for improving upon previous results. When compared to manually derived classification models already in the literature, the learned models often perform with higher accuracy and contain new linguistic insights into the data. In addition, the ability to automatically construct classification models makes it easier to comparatively analyze the utility of alternative feature representations of the data. Finally, the ease of retraining makes the learning approach more scalable and flexible than manual methods.
    Comment: 42 pages, uses jair.sty, theapa.bst, theapa.st
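
    A minimal sketch of this learning setup, assuming scikit-learn: each cue phrase occurrence becomes a feature dictionary with a discourse/sentential label, and a decision tree (standing in for C4.5) induces the classification model. The feature names and toy rows are invented for illustration, loosely echoing the paper's text and speech features.

    ```python
    # Hypothetical sketch: discourse vs. sentential cue phrase classification.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    # Invented training rows; real features would come from annotated corpora.
    examples = [
        ({"token": "now",  "sentence_initial": True,  "after_pause": True},  "discourse"),
        ({"token": "now",  "sentence_initial": False, "after_pause": False}, "sentential"),
        ({"token": "well", "sentence_initial": True,  "after_pause": True},  "discourse"),
        ({"token": "like", "sentence_initial": False, "after_pause": False}, "sentential"),
    ]
    X = [features for features, _ in examples]
    y = [label for _, label in examples]

    model = make_pipeline(DictVectorizer(), DecisionTreeClassifier()).fit(X, y)
    print(model.predict([{"token": "now", "sentence_initial": True,
                          "after_pause": True}]))
    ```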

    Classifying Cue Phrases in Text and Speech Using Machine Learning

    Cue phrases may be used in a discourse sense to explicitly signal discourse structure, but also in a sentential sense to convey semantic rather than structural information. This paper explores the use of machine learning for classifying cue phrases as discourse or sentential. Two machine learning programs (Cgrendel and C4.5) are used to induce classification rules from sets of pre-classified cue phrases and their features. Machine learning is shown to be an effective technique not only for automating the generation of classification rules, but also for improving upon previous results.
    Comment: 8 pages, PostScript file, to appear in the Proceedings of AAAI-9