    An Efficient Mixed Integer Programming Algorithm for Minimizing the Training Sample Misclassification Cost in Two-group Classification

    In this paper, we introduce the Divide and Conquer (D&C) algorithm, a computationally efficient algorithm for determining classification rules which minimize the training sample misclassification cost in two-group classification. This classification rule can be derived using mixed integer programming (MIP) techniques. However, it is well-documented that the complexity of MIP-based classification problems grows exponentially as a function of the size of the training sample and the number of attributes describing the observations, requiring special-purpose algorithms to solve even small problems within a reasonable computational time. The D&C algorithm derives its name from the fact that it relies, among other techniques, on partitioning the problem into smaller, more easily handled subproblems, rendering it substantially faster than previously proposed algorithms. The D&C algorithm solves the problem to the exact optimal solution (i.e., it is not a heuristic that approximates the solution), and allows for the analysis of much larger training samples than previous methods. For instance, our computational experiments indicate that, on average, the D&C algorithm solves problems with 2 attributes and 500 observations more than 3 times faster, and problems with 5 attributes and 100 observations over 50 times faster, than Soltysik and Yarnold's software, which may be the fastest existing algorithm. We believe that the D&C algorithm contributes significantly to the field of classification analysis, because it substantially widens the array of data sets that can be analyzed meaningfully using methods which require MIP techniques, in particular methods which seek to minimize the misclassification cost in the training sample. The programs implementing the D&C algorithm are available from the authors upon request.
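
    The underlying model can be written as a compact big-M mixed integer program: a binary variable flags each misclassified observation, and the objective sums the associated costs. Below is a minimal sketch of that classic formulation in Python with PuLP; the toy data, unit cost weights, big-M value, and margin EPS are illustrative assumptions, and the sketch shows only the base MIP, not the D&C partitioning scheme itself.

        # Minimal sketch of the classic big-M MIP behind misclassification-cost
        # minimization with a linear rule (score >= EPS for group 1, <= -EPS for
        # group 2). Illustrative only: toy data, unit costs, M and EPS assumed.
        import pulp

        X = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0), (4.0, 4.0)]  # observations
        y = [+1, +1, -1, -1]                                   # group labels
        cost = [1.0, 1.0, 1.0, 1.0]                            # misclassification costs
        M, EPS = 100.0, 1e-3                                   # big-M constant, margin

        prob = pulp.LpProblem("min_misclassification_cost", pulp.LpMinimize)
        w = [pulp.LpVariable(f"w{j}") for j in range(2)]       # attribute weights (free)
        b = pulp.LpVariable("b")                               # cutoff (free)
        z = [pulp.LpVariable(f"z{i}", cat="Binary") for i in range(len(X))]

        prob += pulp.lpSum(cost[i] * z[i] for i in range(len(X)))  # total cost
        for i, xi in enumerate(X):
            score = pulp.lpSum(w[j] * xi[j] for j in range(2)) - b
            if y[i] == +1:    # group-1 points should score at least +EPS
                prob += score >= EPS - M * z[i]
            else:             # group-2 points should score at most -EPS
                prob += score <= -EPS + M * z[i]

        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        print("misclassified:", [i for i in range(len(X)) if z[i].value() > 0.5])

    The exponential growth mentioned above comes from the binary vector z: in the worst case the solver must branch over up to 2^n assignments, which is what makes decomposition schemes such as D&C attractive.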

    Nontraditional Approaches to Statistical Classification: Some Perspectives on Lp-Norm Methods

    The body of literature on classification methods which estimate boundaries between the groups (classes) by optimizing a function of the L_{p}-norm distances of observations in each group from these boundaries is maturing fast. The number of published research articles on this topic, especially on mathematical programming (MP) formulations and techniques for L_{p}-norm classification, is now sizable. This paper highlights historical developments that have defined the field, and looks ahead at challenges that may shape new research directions in the next decade. In the first part, the paper summarizes basic concepts and ideas, and briefly reviews past research. Throughout, an attempt is made to integrate a number of the most important L_{p}-norm methods proposed to date within a unified framework, emphasizing their conceptual differences and similarities, rather than focusing on mathematical detail. In the second part, the paper discusses several potential directions for future research in this area. The long-term prospects of L_{p}-norm classification (and discriminant) research may well hinge on whether the channels of communication can be improved between, on the one hand, researchers active in L_{p}-norm classification, who tend to have their roots primarily in the decision sciences, management sciences, computer science, and engineering, and, on the other hand, practitioners and researchers in the statistical classification community. This paper offers potential reasons for the lack of communication between these groups, and suggests ways in which L_{p}-norm research may be strengthened from a statistical viewpoint. The results obtained in L_{p}-norm classification studies are clearly relevant and of importance to all researchers and practitioners active in classification and discriminant analysis. The paper also briefly discusses artificial neural networks, a promising nontraditional method for classification which has recently emerged, and suggests that it may be useful to explore hybrid classification methods that take advantage of the complementary strengths of different methods, e.g., neural network and L_{p}-norm methods.
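
    As a concrete instance of the class of methods surveyed here, the L_{1}-norm "minimize the sum of deviations" (MSD) model fits a linear boundary by minimizing the total exterior deviation of observations from it, which is an ordinary linear program. The sketch below sets it up with scipy; the toy data and the mean-difference normalization used to rule out the trivial solution w = 0 are illustrative assumptions.

        # Minimal sketch of the L1-norm MSD ("minimize sum of deviations") LP.
        # Assumptions: toy data; a mean-difference normalization, one common
        # device to rule out the trivial solution w = 0, b = 0.
        import numpy as np
        from scipy.optimize import linprog

        X1 = np.array([[1.0, 2.0], [2.0, 3.0]])    # group-1 observations
        X2 = np.array([[3.0, 1.0], [4.0, 1.5]])    # group-2 observations
        n1, p = X1.shape
        n = n1 + len(X2)

        # variable vector: [w_1..w_p, b, d_1..d_n]; minimize sum of deviations
        c = np.concatenate([np.zeros(p + 1), np.ones(n)])

        # group 1 should satisfy w.x - b >= -d_i  ->  -w.x + b - d_i <= 0
        # group 2 should satisfy w.x - b <=  d_i  ->   w.x - b - d_i <= 0
        A_ub = np.zeros((n, p + 1 + n))
        A_ub[:n1, :p], A_ub[:n1, p] = -X1, 1.0
        A_ub[n1:, :p], A_ub[n1:, p] = X2, -1.0
        A_ub[np.arange(n), p + 1 + np.arange(n)] = -1.0
        b_ub = np.zeros(n)

        # normalization keeps the boundary non-trivial
        A_eq = np.zeros((1, p + 1 + n))
        A_eq[0, :p] = X1.mean(axis=0) - X2.mean(axis=0)
        b_eq = np.array([1.0])

        bounds = [(None, None)] * (p + 1) + [(0, None)] * n   # w, b free; d >= 0
        res = linprog(c, A_ub, b_ub, A_eq, b_eq, bounds=bounds, method="highs")
        w, b = res.x[:p], res.x[p]
        print("w =", w, "b =", b, "total deviation =", res.fun)

    Replacing the L_{1} objective with another L_{p}-norm changes only the objective function, which is why these methods are naturally treated within a single framework.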

    Nonparametric Two-Group Classification: Concepts and a SAS-Based Software Package

    In this paper, we introduce BestClass, a set of SAS macros, available in both mainframe and workstation environments, designed for solving two-group classification problems using a class of recently developed nonparametric classification methods. The criteria used to estimate the classification function are based on either minimizing a function of the absolute deviations from the surface which separates the groups, or directly minimizing a function of the number of misclassified entities in the training sample. The solution techniques used by BestClass to estimate the classification rule utilize the mathematical programming routines of the SAS/OR® software. Recently, a number of research studies have reported that under certain data conditions this class of classification methods can provide more accurate classification results than existing methods, such as Fisher's linear discriminant function and logistic regression. However, these robust classification methods have not yet been implemented in the major statistical packages, and hence are beyond the reach of statistical analysts who are unfamiliar with mathematical programming techniques. We use a limited simulation experiment and an example to compare and contrast properties of the methods included in BestClass with those of existing parametric and nonparametric methods. We believe that BestClass contributes significantly to the field of nonparametric classification analysis, in that it provides the statistical community with convenient access to this recently developed class of methods. BestClass is available from the authors.
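
    For context, the sketch below mimics the kind of comparison reported in such studies: parametric rules fitted to an outlier-contaminated training sample, the data condition under which the nonparametric MP methods are said to do well. The simulated data and the scikit-learn models are illustrative stand-ins; they are not the BestClass macros, which run inside SAS.

        # Illustrative stand-in for a BestClass-style comparison: fit two
        # parametric rules to an outlier-contaminated sample and report
        # resubstitution accuracy. Simulated data; not the SAS macros.
        import numpy as np
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))   # group 1
        X2 = rng.normal([3.0, 3.0], 1.0, size=(50, 2))   # group 2
        X1[:5] += [12.0, 12.0]                           # contaminate group 1
        X, y = np.vstack([X1, X2]), np.array([0] * 50 + [1] * 50)

        for model in (LinearDiscriminantAnalysis(), LogisticRegression()):
            acc = model.fit(X, y).score(X, y)            # resubstitution accuracy
            print(f"{type(model).__name__}: {acc:.2f}")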

    Supervised classification and mathematical optimization

    Data Mining techniques often require the solution of optimization problems. Supervised Classification, and, in particular, Support Vector Machines, can be seen as a paradigmatic instance. In this paper, some links between Mathematical Optimization methods and Supervised Classification are emphasized. It is shown that many different areas of Mathematical Optimization play a central role in off-the-shelf Supervised Classification methods. Moreover, Mathematical Optimization turns out to be extremely useful for addressing important issues in Classification, such as identifying relevant variables, improving the interpretability of classifiers, or dealing with vagueness/noise in the data.
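
    For reference, the paradigmatic instance mentioned here, the soft-margin Support Vector Machine, is itself a convex quadratic optimization problem in the weight vector w, intercept b, and slack variables \xi_i:

        \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
        \quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ \ge\ 1 - \xi_i,
        \qquad \xi_i \ \ge\ 0, \quad i = 1, \dots, n.

    The margin term \|w\|^2 controls generalization, while the slack terms penalize violations; the optimization links discussed in the paper (duality, kernelization, integer extensions for variable selection) start from formulations of this kind.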

    Multi-group support vector machines with measurement costs: a biobjective approach

    Support Vector Machines have been shown to perform well in many practical classification settings. In this paper we propose, for multi-group classification, a biobjective optimization model in which we consider not only the generalization ability (modelled through the margin maximization), but also the costs associated with the features. These costs are not limited to monetary payments: they can also refer to risk, computational effort, space requirements, etc. We introduce a biobjective mixed integer problem, for which Pareto optimal solutions are obtained. Those Pareto optimal solutions correspond to different classification rules, among which the user would choose the one yielding the most appropriate compromise between the cost and the expected misclassification rate.
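
    A cheap way to visualize the trade-off the model captures is to sweep over feature subsets on a small toy problem, score each by (measurement cost, training error), and keep the non-dominated pairs. The sketch below does exactly that by brute force; it is an illustrative shortcut that only works for a handful of features, whereas the paper obtains Pareto optimal solutions from a biobjective mixed integer model. The data and costs are assumptions.

        # Brute-force Pareto frontier over feature subsets: trade measurement
        # cost against training error. Illustrative only; the paper solves a
        # biobjective mixed integer model rather than enumerating subsets.
        from itertools import combinations
        import numpy as np
        from sklearn.svm import LinearSVC

        rng = np.random.default_rng(1)
        X = rng.normal(size=(80, 3))
        y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # feature 2 carries no signal
        feature_cost = [5.0, 2.0, 1.0]                  # assumed measurement costs

        points = []
        for k in range(1, 4):
            for subset in combinations(range(3), k):
                cols = list(subset)
                err = 1.0 - LinearSVC().fit(X[:, cols], y).score(X[:, cols], y)
                points.append((sum(feature_cost[j] for j in cols), err, subset))

        # keep the non-dominated (cost, error) pairs: the Pareto frontier
        pareto = [p for p in points
                  if not any(q[0] <= p[0] and q[1] <= p[1] and q[:2] != p[:2]
                             for q in points)]
        for cst, err, subset in sorted(pareto):
            print(f"features {subset}: cost {cst:.1f}, training error {err:.3f}")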

    Mathematical Programming Formulations for Two-group Classification with Binary Variables

    In this paper, we introduce a nonparametric mathematical programming (MP) approach for solving the binary variable classification problem. In practice, there exists substantial interest in the binary variable classification problem. For instance, medical diagnoses are often based on the presence or absence of relevant symptoms, and binary variable classification has long been used as a means to predict (diagnose) the nature of the medical condition of patients. Our research is motivated by the fact that none of the existing statistical methods for binary variable classification -- parametric and nonparametric alike -- are fully satisfactory. The general class of MP classification methods facilitates a geometric interpretation, and MP-based classification rules have intuitive appeal because of their potentially robust properties. These intuitive arguments appear to have merit, and a number of research studies have confirmed that MP methods can indeed yield effective classification rules under certain non-normal data conditions, for instance if the data set is outlier-contaminated or highly skewed. However, the MP-based approach in general lacks a probabilistic foundation, leaving the assessment of its classification performance ad hoc. Our proposed nonparametric mixed integer programming (MIP) formulation for the binary variable classification problem not only has a geometric interpretation, but is also consistent with the Bayes decision theoretic approach. Therefore, our proposed formulation possesses a strong probabilistic foundation. We also introduce a linear programming (LP) formulation which parallels the concepts underlying the MIP formulation, but does not possess the decision theoretic justification. An additional advantage of both our LP and MIP formulations is that, because the attribute variables are binary, the training sample observations can be partitioned into multinomial cells, allowing for a substantial reduction in the number of binary and deviational variables, so that our formulations can be used to analyze training samples of almost any size. We illustrate our formulations using an example problem, and use three real data sets to compare their classification performance with that of a variety of parametric and nonparametric statistical methods. For each of these data sets, our proposed formulation yields the minimum possible number of misclassifications, using both the resubstitution and the leave-one-out methods.
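
    The cell-reduction idea is easy to see in code: with p binary attributes there are at most 2^p distinct attribute patterns, so the training sample can be collapsed into per-cell group counts before any MIP/LP variables are created. The toy example below is a hedged illustration of that preprocessing step, not the authors' formulation itself.

        # Collapse a binary-attribute training sample into multinomial cells:
        # one decision variable per distinct pattern instead of one per
        # observation. Toy data; illustrates the preprocessing step only.
        from collections import Counter

        # toy sample: (binary attribute pattern, group label)
        sample = [((1, 0, 1), 1), ((1, 0, 1), 1), ((1, 0, 1), 2),
                  ((0, 1, 0), 2), ((0, 1, 0), 2), ((1, 1, 0), 1)]

        cells = Counter()                     # (pattern, group) -> frequency
        for pattern, group in sample:
            cells[(pattern, group)] += 1

        patterns = sorted({pattern for pattern, _ in cells})
        for pattern in patterns:
            n1, n2 = cells[(pattern, 1)], cells[(pattern, 2)]
            print(f"cell {pattern}: group-1 count {n1}, group-2 count {n2}")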