5 research outputs found

    Cost-Sensitive Boosting


    Hierarchical linear support vector machine

    This is the author’s version of a work that was accepted for publication in Pattern Recognition. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Pattern Recognition, Vol. 45, Iss. 12 (2012), DOI: 10.1016/j.patcog.2012.06.002.

    The increasing size and dimensionality of real-world datasets make it necessary to design algorithms that are efficient not only in the training process but also in the prediction phase. In applications such as credit card fraud detection, the classifier must predict an event in 10 ms at most. In these environments the prediction-speed constraint heavily outweighs the training cost. We propose a new classification method, called the Hierarchical Linear Support Vector Machine (H-LSVM), based on the construction of an oblique decision tree in which each node split is obtained as a linear Support Vector Machine. Although other methods have been proposed that break the data space down into subregions to speed up Support Vector Machines, the H-LSVM algorithm is a very simple and efficient model in training and, above all, in prediction for large-scale datasets. Only a few hyperplanes need to be evaluated in the prediction step, no kernel computation is required, and the tree structure makes parallelization possible. In experiments with medium and large datasets, the H-LSVM reduces the prediction cost considerably while achieving classification results closer to those of the non-linear SVM than to those of the linear case.

    The authors would like to thank the anonymous reviewers for their comments, which helped improve the manuscript. I.R.-L. is supported by an FPU Grant from Universidad Autónoma de Madrid, and partially supported by the Universidad Autónoma de Madrid-IIC Chair and TIN2010-21575-C02-01. R.H. acknowledges partial support by ONR N00014-07-1-0741, USARIEM-W81XWH-10-C-0040 (ELINTRIX) and JPL-2012-1455933.
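The prediction step the abstract describes (evaluating only a few hyperplanes along a root-to-leaf path, with no kernel computation) can be sketched as follows. The tree shape and hyperplane parameters here are invented for illustration; they are not the paper's training procedure or learned model.

```python
class HLSVMNode:
    """Node of an oblique decision tree whose split is a linear-SVM hyperplane."""
    def __init__(self, w=None, b=0.0, left=None, right=None, label=None):
        self.w, self.b = w, b              # hyperplane parameters (None at a leaf)
        self.left, self.right = left, right
        self.label = label                 # class label stored at a leaf

def predict(node, x):
    """Descend from the root: each step costs one dot product, no kernels."""
    while node.label is None:
        side = sum(wi * xi for wi, xi in zip(node.w, x)) + node.b
        node = node.left if side <= 0 else node.right
    return node.label

# Illustrative two-level tree (hand-set hyperplanes, not learned ones).
leaf0, leaf1a, leaf1b = HLSVMNode(label=0), HLSVMNode(label=1), HLSVMNode(label=1)
child = HLSVMNode(w=[0.0, 1.0], b=-0.5, left=leaf0, right=leaf1a)
root = HLSVMNode(w=[1.0, 0.0], b=0.0, left=child, right=leaf1b)
print(predict(root, [-1.0, 0.2]))   # two dot products reach a leaf
```

Because only the hyperplanes on one root-to-leaf path are evaluated, prediction cost grows with tree depth rather than with the number of support vectors, which is the source of the speed-up the abstract claims.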

    A model-driven approach to imbalanced data learning

    Ph.D. thesis (Doctor of Philosophy)

    Minimizing Recommended Error Costs Under Noisy Inputs in Rule-Based Expert Systems

    This dissertation develops methods to minimize recommendation error costs when inputs to a rule-based expert system are prone to errors. The problem often arises in web-based applications where data are inherently noisy or provided by users who perceive some benefit from falsifying inputs. Prior studies proposed methods that attempted to minimize the probability of recommendation error, but did not take into account the relative costs of different types of errors. In situations where these differences are significant, an approach that minimizes the expected misclassification error costs has advantages over extant methods that ignore these costs. Building on the existing literature, two new techniques, Cost-Based Input Modification (CBIM) and Cost-Based Knowledge-Base Modification (CBKM), were developed and evaluated. Each method takes as inputs (1) the joint probability distribution of a set of rules, (2) the distortion matrix for input noise, characterized by the probability distribution of the observed input vectors conditioned on their true values, and (3) the misclassification cost for each type of recommendation error. Under CBIM, for any observed input vector v, the recommendation is based on a modified input vector v' such that the expected error costs are minimized. Under CBKM, the rule base itself is modified to minimize the expected cost of error. The proposed methods were investigated as follows: as a control, in the special case where the costs associated with different types of errors are identical, the recommendations under these methods were compared for consistency with those obtained under extant methods. Next, the relative advantages of CBIM and CBKM were compared as (1) the noise level changed, and (2) the structure of the cost matrix varied.
    As expected, CBKM and CBIM outperformed the extant Knowledge Base Modification (KM) and Input Modification (IM) methods over a wide range of input distortions and cost matrices, with some restrictions. Under the control, with constant misclassification costs, the new methods performed equally with the extant methods. As misclassification costs increased, CBKM outperformed KM and CBIM outperformed IM. As the cost matrices were varied to increase misclassification-cost asymmetry and order, the performance advantage of CBKM and CBIM grew. At very low distortion levels, CBKM and CBIM underperformed as error probability became more significant in each method's estimation. Additionally, CBKM outperformed CBIM over a wide range of input distortions, as its technique of modifying the original knowledge base outperformed the technique of modifying inputs to an unmodified decision tree.
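The CBIM idea described above can be illustrated with a toy sketch: for each observed input vector, choose the recommendation of a modified input v' that minimizes the expected misclassification cost over the posterior of true inputs. The two-bit rule base, the independent-bit-flip distortion model, the uniform prior, and the cost values below are all invented for illustration and are not taken from the dissertation.

```python
from itertools import product

# Hypothetical two-bit rule base: recommend 1 only when both inputs are set.
def rule_base(v):
    return 1 if v == (1, 1) else 0

INPUTS = list(product((0, 1), repeat=2))
PRIOR = {v: 0.25 for v in INPUTS}       # P(true input vector), assumed uniform
FLIP = 0.1                              # per-bit distortion probability (assumed)
# COST[(correct_rec, given_rec)]: missing a "1" case is 5x worse here (assumed).
COST = {(0, 0): 0.0, (1, 1): 0.0, (0, 1): 1.0, (1, 0): 5.0}

def p_obs_given_true(obs, true):
    """Distortion matrix entry: independent bit flips with probability FLIP."""
    p = 1.0
    for o, t in zip(obs, true):
        p *= (1.0 - FLIP) if o == t else FLIP
    return p

def cbim_recommendation(obs):
    """Recommend via the modified input v' whose recommendation minimizes
    the expected misclassification cost over the posterior of true inputs."""
    def expected_cost(rec):
        return sum(PRIOR[t] * p_obs_given_true(obs, t) * COST[(rule_base(t), rec)]
                   for t in INPUTS)
    return min((rule_base(v) for v in INPUTS), key=expected_cost)
```

With symmetric costs this reduces to picking the most probable correct recommendation, matching the control condition described above; asymmetric costs shift the decision toward avoiding the expensive error type.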

    Mining the risk types of human papillomavirus (HPV) by AdaCost

    Abstract. Human Papillomavirus (HPV) infection is known as the main cause of cervical cancer, which is a leading cause of cancer deaths in women worldwide. Because there are more than 100 HPV types, it is critical to discriminate the HPVs related to cervical cancer from those that are not. In this paper, we classify the risk type of HPVs using their textual descriptions. The important issue in this problem is to distinguish false negatives from false positives: we must identify high-risk HPVs even at the expense of misclassifying some low-risk HPVs. For this purpose, AdaCost, a cost-sensitive learner, is adopted to assign different costs to training examples. The experimental results on the HPV sequence database show that considering costs yields higher performance. The F-score is higher than the accuracy, which implies that most high-risk HPVs are found.
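The cost-sensitive reweighting this abstract relies on can be illustrated with a simplified boosting-style weight update, in which misclassified high-cost (high-risk) examples gain extra weight for the next round. This is a sketch in the spirit of AdaCost; the cost-adjustment factor used here is an assumed form, not the paper's exact beta function.

```python
import math

def cost_sensitive_reweight(weights, correct, costs, alpha=0.5):
    """One boosting round: raise the weights of misclassified examples, and
    raise them more when the example carries a higher misclassification cost.
    Simplified stand-in for AdaCost's cost-adjustment function (assumed form)."""
    new_w = []
    for w, ok, c in zip(weights, correct, costs):
        beta = 1.0 / (1.0 + c) if ok else 1.0 + c   # cost adjustment (assumed)
        new_w.append(w * math.exp(-alpha if ok else alpha) * beta)
    total = sum(new_w)
    return [w / total for w in new_w]

w = cost_sensitive_reweight(
    weights=[0.25, 0.25, 0.25, 0.25],
    correct=[True, True, False, False],
    costs=[0.1, 0.1, 0.1, 0.9],        # last example: a high-risk HPV (assumed)
)
# the misclassified high-cost example ends up with the largest weight
```

Driving the learner to concentrate on costly mistakes in this way is what lets the classifier trade a few low-risk misclassifications for finding most high-risk HPVs, as the abstract reports.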