
    Mathematical Programming Formulations for Two-group Classification with Binary Variables

    In this paper, we introduce a nonparametric mathematical programming (MP) approach for solving the binary variable classification problem. In practice, there is substantial interest in this problem. For instance, medical diagnoses are often based on the presence or absence of relevant symptoms, and binary variable classification has long been used as a means to predict (diagnose) the nature of the medical condition of patients. Our research is motivated by the fact that none of the existing statistical methods for binary variable classification -- parametric and nonparametric alike -- is fully satisfactory. The general class of MP classification methods facilitates a geometric interpretation, and MP-based classification rules have intuitive appeal because of their potentially robust properties. These intuitive arguments appear to have merit, and a number of research studies have confirmed that MP methods can indeed yield effective classification rules under certain non-normal data conditions, for instance if the data set is outlier-contaminated or highly skewed. However, the MP-based approach in general lacks a probabilistic foundation, necessitating an ad hoc assessment of its classification performance. Our proposed nonparametric mixed integer programming (MIP) formulation for the binary variable classification problem not only has a geometric interpretation, but is also consistent with the Bayes decision theoretic approach. Therefore, our proposed formulation possesses a strong probabilistic foundation. We also introduce a linear programming (LP) formulation which parallels the concepts underlying the MIP formulation, but does not possess the decision theoretic justification. 
An additional advantage of both our LP and MIP formulations is that, because the attribute variables are binary, the training sample observations can be partitioned into multinomial cells. This allows for a substantial reduction in the number of binary and deviational variables, so that our formulations can be used to analyze training samples of almost any size. We illustrate our formulations using an example problem, and use three real data sets to compare their classification performance with that of a variety of parametric and nonparametric statistical methods. For each of these data sets, our proposed formulation yields the minimum possible number of misclassifications, using both the resubstitution and the leave-one-out methods.
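
    The partition into multinomial cells mentioned in the abstract can be illustrated with a small sketch. This is not the paper's MIP or LP formulation. It only shows how observations with identical binary attribute patterns collapse into cells, and it swaps in a simple per-cell majority rule, which coincides with the plug-in Bayes rule when misclassification costs are equal and priors are estimated by the sample proportions. The function names and the toy data are hypothetical.

```python
from collections import Counter

def partition_into_cells(X, y):
    """Count training observations per (binary pattern, group) pair."""
    cells = {}
    for row, label in zip(X, y):
        cells.setdefault(tuple(row), Counter())[label] += 1
    return cells

def majority_rule(cells):
    """Assign each observed cell to its most frequent group.

    Ties go to the smallest group label (an arbitrary assumption).
    """
    return {cell: min(c, key=lambda g: (-c[g], g)) for cell, c in cells.items()}

def resubstitution_errors(cells, rule):
    """Count training observations that the rule misclassifies."""
    return sum(n for cell, c in cells.items()
               for g, n in c.items() if g != rule[cell])

# Toy training sample (made up): three binary attributes, groups 1 and 2.
X = [(1, 0, 1), (1, 0, 1), (1, 0, 1), (0, 1, 0),
     (0, 1, 0), (1, 1, 0), (1, 1, 0), (1, 1, 0)]
y = [1, 1, 2, 2, 2, 1, 2, 2]

cells = partition_into_cells(X, y)   # 8 observations collapse into 3 cells
rule = majority_rule(cells)
errors = resubstitution_errors(cells, rule)
print(len(cells), rule, errors)
```

    With p binary attributes there are at most 2^p cells, so the number of cell-level quantities a formulation has to handle is bounded by the number of distinct patterns rather than by the number of training observations.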

    Pattern recognition with discrete and mixed data : theory and practice

    This thesis is devoted to the analysis of medical databases in the context of pattern recognition. It covers both theory and practical applications, and its scope includes questions and problems that arise when applying pattern recognition methods and techniques to this type of data. The goal of applying statistical pattern recognition techniques to medical records is to classify the (disease) patterns that may be present in such records in terms of the information they contain. Typically, a medical record contains a description of history, symptoms, results from laboratory tests, signals, etc., all related to a given patient, i.e. all the information normally required by a physician when making a diagnosis and/or a prognosis. Pattern recognition may be used to obtain procedures (computer-implemented algorithms) that assign diagnostic or prognostic classes to a given patient on the basis of the information also used by a physician. These procedures are intended not to replace but to assist the physician in the decision-making process. The procedures are called classifiers or discriminants, and the symptoms, signals, etc., are called features. Each individual record is termed an object, and a collection of objects with qualitatively and/or quantitatively similar characteristics, as established by an expert, is called a class. It should be clear that pattern recognition can be applied to a wide variety of areas and problems, of which (computer-aided) medical decision making is just one example. To arrive at a classifier, and restricting ourselves to what is called supervised learning, a set of objects known a priori to belong to two or more classes (depending on the problem at hand) is needed. In this set, each object must be represented by a group of features and have a class assigned to it. The role of medical databases is now clear: they are the set of objects required for supervised learning.