7 research outputs found

    Hierarchical linear support vector machine

    This is the author’s version of a work that was accepted for publication in Pattern Recognition. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Pattern Recognition, Vol. 45, Iss. 12, (2012) DOI: 10.1016/j.patcog.2012.06.002

    The increasing size and dimensionality of real-world datasets make it necessary to design algorithms that are efficient not only in training but also in prediction. In applications such as credit card fraud detection, the classifier needs to predict an event in 10 ms at most. In these environments, constraints on prediction speed heavily outweigh training costs. We propose a new classification method, called a Hierarchical Linear Support Vector Machine (H-LSVM), based on the construction of an oblique decision tree in which each node split is obtained as a Linear Support Vector Machine. Although other methods have been proposed to break the data space down into subregions to speed up Support Vector Machines, the H-LSVM algorithm is a very simple model that is efficient in training and, above all, in prediction for large-scale datasets. Only a few hyperplanes need to be evaluated in the prediction step, no kernel computation is required, and the tree structure makes parallelization possible. In experiments with medium and large datasets, the H-LSVM reduces the prediction cost considerably while achieving classification results closer to those of the non-linear SVM than to those of the linear SVM.

    The authors would like to thank the anonymous reviewers for their comments, which helped improve the manuscript. I.R.-L. is supported by an FPU Grant from Universidad Autónoma de Madrid, and partially supported by the Universidad Autónoma de Madrid-IIC Chair and TIN2010-21575-C02-01. R.H. acknowledges partial support by ONR N00014-07-1-0741, USARIEM-W81XWH-10-C-0040 (ELINTRIX) and JPL-2012-1455933.
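
The prediction step described in the H-LSVM abstract (a few hyperplane evaluations along one root-to-leaf path, with no kernel computation) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Split` and `Leaf` classes, the `predict` function, and the hand-picked weights are hypothetical, and in the actual method each hyperplane would come from training a linear SVM at that node.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    label: int  # class predicted for every point that reaches this leaf


@dataclass
class Split:
    w: Sequence[float]           # hyperplane normal (assumed to come from a linear SVM)
    b: float                     # hyperplane offset
    left: Union["Split", Leaf]   # subtree taken when w.x + b < 0
    right: Union["Split", Leaf]  # subtree taken when w.x + b >= 0


def predict(node, x):
    """Walk one root-to-leaf path; return (label, hyperplanes evaluated)."""
    evaluated = 0
    while isinstance(node, Split):
        # One dot product per visited node: no kernel evaluations are needed.
        score = sum(wi * xi for wi, xi in zip(node.w, x)) + node.b
        evaluated += 1
        node = node.right if score >= 0 else node.left
    return node.label, evaluated


# Toy tree with made-up weights, for illustration only (not trained).
tree = Split(
    w=[1.0, 0.0], b=0.0,
    left=Leaf(-1),
    right=Split(w=[0.0, 1.0], b=-1.0, left=Leaf(-1), right=Leaf(1)),
)

print(predict(tree, [-2.0, 5.0]))  # (-1, 1): decided after one hyperplane
print(predict(tree, [1.0, 2.0]))   # (1, 2): decided after two hyperplanes
```

The cost of one prediction is bounded by the tree depth times the input dimension, which is the property the abstract relies on for tight latency budgets.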

    The Impact of Overfitting and Overgeneralization on the Classification Accuracy in Data Mining

    Current classification approaches usually do not try to achieve a balance between fitting and generalization when they infer models from training data. Such approaches ignore the possibility of different penalty costs for the false-positive, false-negative, and unclassifiable types. Thus, their performances may not be optimal or may even be coincidental. This dissertation analyzes the above issues in depth. It also proposes two new approaches, called the Homogeneity-Based Algorithm (HBA) and the Convexity-Based Algorithm (CBA), to address these issues. These new approaches aim at optimally balancing the data fitting and generalization behaviors of models when some traditional classification approaches are used. The approaches first define the total misclassification cost (TC) as a weighted function of the three penalty costs and their corresponding error rates. The approaches then partition the training data into regions. In the HBA, the partitioning is done according to some homogeneous properties derivable from the training data. Meanwhile, the CBA employs some convex properties to derive regions. A traditional classification method is then used in conjunction with the HBA and CBA. Finally, the approaches apply a genetic approach to determine the optimal levels of fitting and generalization. The TC serves as the fitness function in this genetic approach. Real-life datasets from a wide spectrum of domains were used to better understand the effectiveness of the HBA and CBA. The computational results have indicated that both the HBA and CBA might potentially fill a critical gap in the implementation of current or future classification approaches. Furthermore, the results have also shown that when the penalty cost of an error type was changed, the corresponding error rate followed stepwise patterns. The finding of stepwise patterns of classification errors can assist researchers in determining applicable penalties for classification errors. Thus, the dissertation also proposes a binary search approach (BSA) to produce those patterns. Real-life datasets were utilized to demonstrate the BSA.
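
The total misclassification cost (TC) mentioned in this abstract can be sketched as follows. The abstract only says that TC is a weighted function of the three penalty costs and their error rates; the plain weighted sum below is an assumption, and the function name, the candidate names `tight_fit` and `loose_fit`, and all numbers are hypothetical.

```python
def total_misclassification_cost(c_fp, c_fn, c_uc, rate_fp, rate_fn, rate_uc):
    """Assumed form of TC: each penalty cost times its error rate, summed."""
    return c_fp * rate_fp + c_fn * rate_fn + c_uc * rate_uc


# Made-up error rates (false positive, false negative, unclassifiable)
# for two hypothetical models with different fitting levels.
candidates = {
    "tight_fit": (0.02, 0.10, 0.15),
    "loose_fit": (0.08, 0.03, 0.01),
}


def best_candidate(c_fp, c_fn, c_uc):
    """Pick the candidate with the lowest TC, using TC as a fitness score."""
    return min(
        candidates,
        key=lambda name: total_misclassification_cost(
            c_fp, c_fn, c_uc, *candidates[name]
        ),
    )


# The preferred model changes when the penalty costs change.
print(best_candidate(c_fp=1, c_fn=20, c_uc=3))  # loose_fit
print(best_candidate(c_fp=20, c_fn=1, c_uc=3))  # tight_fit
```

A search procedure such as the genetic approach named in the abstract would use a score of this kind to compare candidate fitting and generalization levels; the simple `min` over two candidates stands in for that search here.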

    Generalization Bounds for Decision Trees

    We derive a new bound on the error rate for decision trees. The bound depends both on the structure of the tree and the specific sample (not just the size of the sample). This bound is tighter than traditional bounds for unbalanced trees and justifies "compositional" algorithms for constructing decision trees.

    1 Introduction

    The problem of over-fitting is central to both the theory and practice of machine learning. Intuitively, one over-fits by using too many parameters in the concept, e.g., fitting an nth order polynomial to n data points. One under-fits by using too few parameters, e.g., fitting a linear curve to clearly quadratic data. The fundamental question is how many parameters, or what concept size, should one allow for a given amount of training data. A standard theoretical approach is to prove a bound on generalization error as a function of the training error and the concept size (or VC dimension). One can then select a concept minimizing this bound, i.e., optimizing a cert..
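
The over-fitting intuition in the introduction above (a polynomial with as many coefficients as data points) can be made concrete with a small sketch. The data, the function names `lagrange_eval` and `fit_line`, and the choice of a least-squares line as the simpler concept are all assumptions made for illustration; none of this comes from the paper.

```python
from fractions import Fraction


def lagrange_eval(xs, ys, x):
    """Evaluate at x the unique polynomial of degree len(xs) - 1 through (xs, ys)."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = Fraction(yi)
        for j, xj in enumerate(xs):
            if j != i:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total


def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept) as exact fractions."""
    n = len(xs)
    mean_x = Fraction(sum(xs), n)
    mean_y = Fraction(sum(ys), n)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up sample: the trend y = x with +/-1 noise on some points.
xs = [0, 1, 2, 3, 4]
ys = [0, 2, 1, 4, 3]

# Five coefficients for five points: zero training error...
assert all(lagrange_eval(xs, ys, x) == y for x, y in zip(xs, ys))

# ...but a wild prediction just outside the sample.
print(lagrange_eval(xs, ys, 5))  # -25

# Two coefficients: nonzero training error, sensible prediction.
slope, intercept = fit_line(xs, ys)
print(slope * 5 + intercept)     # 22/5
```

The interpolating polynomial has zero training error yet predicts -25 at x = 5, while the two-parameter line predicts 22/5, close to the trend value of 5. A bound that penalizes concept size is meant to prefer the second model despite its higher training error.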