11 research outputs found

    On the optimal usage of labelled examples in semi-supervised multi-class classification problems

    In recent years, the performance of semi-supervised learning has been theoretically investigated. However, most of this theoretical development has focussed on binary classification problems. In this paper, we take it a step further by extending the work of Castelli and Cover [1] [2] to the multi-class paradigm. In particular, we consider the key problem in semi-supervised learning of classifying an unseen instance x into one of K different classes, using a training dataset sampled from a mixture density distribution and composed of l labelled records and u unlabelled examples. Even under the assumption of identifiability of the mixture and having infinite unlabelled examples, labelled records are needed to determine the K decision regions. Therefore, in this paper, we first investigate the minimum number of labelled examples needed to accomplish that task. Then, we propose an optimal multi-class learning algorithm which is a generalisation of the optimal procedure proposed in the literature for binary problems. Finally, we make use of this generalisation to study the probability of error when the binary class constraint is relaxed.

    Frugal hypothesis testing and classification

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 157-175). The design and analysis of decision rules using detection theory and statistical learning theory is important because decision making under uncertainty is pervasive. Three perspectives on limiting the complexity of decision rules are considered in this thesis: geometric regularization, dimensionality reduction, and quantization or clustering. Controlling complexity often reduces resource usage in decision making and improves generalization when learning decision rules from noisy samples. A new margin-based classifier with decision boundary surface area regularization and optimization via variational level set methods is developed. This novel classifier is termed the geometric level set (GLS) classifier. A method for joint dimensionality reduction and margin-based classification with optimization on the Stiefel manifold is developed. This dimensionality reduction approach is extended for information fusion in sensor networks. A new distortion is proposed for the quantization or clustering of prior probabilities appearing in the thresholds of likelihood ratio tests. This distortion is given the name mean Bayes risk error (MBRE). The quantization framework is extended to model human decision making and discrimination in segregated populations. by Kush R. Varshney. Ph.D.

    Theoretical and methodological advances in semi-supervised learning and the class-imbalance problem.

    201 p. This work focuses on the theoretical and practical generalisation of two well-known and challenging situations in machine learning to classification problems in which the assumption of a single binary class does not hold. Semi-supervised learning is a technique that uses large amounts of unlabelled data to improve the performance of supervised learning when the labelled dataset is very small. Specifically, this work contributes powerful and computationally efficient methodologies for learning, in a semi-supervised way, classifiers for multiple class variables. The fundamental limits of semi-supervised learning in multi-class problems are also investigated theoretically. The class-imbalance problem arises when the target variables have a probability distribution imbalanced enough to distort the solutions proposed by traditional supervised learning algorithms. In this project, a theoretical framework is proposed to separate the distortion produced by class imbalance from other factors that affect classifier accuracy. This framework is mainly used to recommend classifier evaluation metrics for this situation. Finally, a measure of the degree of class imbalance in a dataset, correlated with the resulting loss of accuracy, is also proposed. Intelligent Systems Grou

    Theoretical and Methodological Advances in Semi-supervised Learning and the Class-Imbalance Problem

    This paper focuses on the theoretical and practical generalization of two known and challenging situations from the field of machine learning to classification problems in which the assumption of having a single binary class is not fulfilled. Semi-supervised learning is a technique that uses large amounts of unlabeled data to improve the performance of supervised learning when the labeled data set is very limited. Specifically, this work contributes powerful and computationally efficient methodologies to learn, in a semi-supervised way, classifiers for multiple class variables. The fundamental limits of semi-supervised learning in multi-class problems are also investigated theoretically. The problem of class imbalance appears when the target variables present a probability distribution imbalanced enough to distort the solutions proposed by traditional supervised learning algorithms. In this project, a theoretical framework is proposed to separate the distortion produced by class imbalance from other factors that affect the accuracy of classifiers. This framework is mainly used to recommend classifier assessment metrics in this situation. Finally, a measure of the degree of class imbalance in a data set, correlated with the resulting loss of accuracy, is also proposed.

    Auxiliary Marker-Assisted Classification in the Absence of Class Identifiers

    Constructing classification rules for accurate diagnosis of a disorder is an important goal in medical practice. In many clinical applications, no clinically significant anatomical or physiological deviation exists to identify the gold standard disease status to inform development of classification algorithms. Despite the absence of perfect disease class identifiers, there are usually one or more disease-informative auxiliary markers along with feature variables comprising known symptoms. Existing statistical learning approaches do not effectively draw information from auxiliary prognostic markers. We propose a large margin classification method, with particular emphasis on the support vector machine (SVM), assisted by available informative markers in order to classify disease without knowing a subject’s true disease status. We view this task as statistical learning in the presence of missing data, and introduce a pseudo-EM algorithm to the classification. A major distinction from a regular EM algorithm is that we do not model the distribution of missing data given the observed feature variables either parametrically or semiparametrically. We also propose a sparse variable selection method embedded in the pseudo-EM algorithm. Theoretical examination shows that the proposed classification rule is Fisher consistent, and that under a linear rule, the proposed selection has an oracle variable selection property and the estimated coefficients are asymptotically normal. We apply the methods to build decision rules for including subjects in clinical trials of a new psychiatric disorder and present four applications to data available at the UCI Machine Learning Repository.

    Context Awareness for Navigation Applications

    This thesis examines the topic of context awareness for navigation applications and asks the question, “What are the benefits and constraints of introducing context awareness in navigation?” Context awareness can be defined as a computer’s ability to understand the situation or context in which it is operating. In particular, we are interested in how context awareness can be used to understand the navigation needs of people using mobile computers, such as smartphones, but context awareness can also benefit other types of navigation users, such as maritime navigators. There are countless other potential applications of context awareness, but this thesis focuses on applications related to navigation. For example, if a smartphone-based navigation system can understand when a user is walking, driving a car, or riding a train, then it can adapt its navigation algorithms to improve positioning performance. We argue that the primary set of tools available for generating context awareness is machine learning. Machine learning is, in fact, a collection of many different algorithms and techniques for developing “computer systems that automatically improve their performance through experience” [1]. This thesis examines systematically the ability of existing algorithms from machine learning to endow computing systems with context awareness. Specifically, we apply machine learning techniques to tackle three different tasks related to context awareness and having applications in the field of navigation: (1) to recognize the activity of a smartphone user in an indoor office environment, (2) to recognize the mode of motion that a smartphone user is undergoing outdoors, and (3) to determine the optimal path of a ship traveling through ice-covered waters. The diversity of these tasks was chosen intentionally to demonstrate the breadth of problems encompassed by the topic of context awareness. 
During the course of studying context awareness, we adopted two conceptual “frameworks,” which we find useful for the purpose of solidifying the abstract concepts of context and context awareness. The first such framework is based strongly on the writings of a rhetorician from Hellenistic Greece, Hermagoras of Temnos, who defined seven elements of “circumstance”. We adopt these seven elements to describe contextual information. The second framework, which we dub the “context pyramid”, describes the processing of raw sensor data into contextual information in terms of six different levels. At the top of the pyramid is “rich context”, where the information is expressed in prose, and the goal for the computer is to mimic the way that a human would describe a situation. We are still a long way off from computers being able to match a human’s ability to understand and describe context, but this thesis improves the state of the art in context awareness for navigation applications. For some particular tasks, machine learning has succeeded in outperforming humans, and in the future there are likely to be tasks in navigation where computers outperform humans. One example might be the route optimization task described above. This is an example of a task where many different types of information must be fused in non-obvious ways, and it may be that computer algorithms can find better routes through ice-covered waters than even well-trained human navigators. This thesis provides only preliminary evidence of this possibility, and future work is needed to further develop the techniques outlined here. The same can be said of the other two navigation-related tasks examined in this thesis.

    Genetic algorithm-neural network: feature extraction for bioinformatics data.

    With the advance of gene expression data in the bioinformatics field, the questions which frequently arise, for both computer and medical scientists, are which genes are significantly involved in discriminating cancer classes and which genes are significant with respect to a specific cancer pathology. Numerous computational analysis models have been developed to identify informative genes from the microarray data; however, the integrity of the reported genes is still uncertain. This is mainly due to the misconception of the objectives of microarray study. Furthermore, the application of various preprocessing techniques in the microarray data has jeopardised the quality of the microarray data. As a result, the integrity of the findings has been compromised by the improper use of techniques and the ill-conceived objectives of the study. This research proposes an innovative hybridised model based on genetic algorithms (GAs) and artificial neural networks (ANNs), to extract the highly differentially expressed genes for a specific cancer pathology. The proposed method can efficiently extract the informative genes from the original data set and this has reduced the gene variability errors incurred by the preprocessing techniques. The novelty of the research comes from two perspectives. Firstly, the research emphasises extracting informative features from a high dimensional and highly complex data set, rather than improving classification results. Secondly, it uses an ANN to compute the fitness function of the GA, which is rare in the context of feature extraction. Two benchmark microarray data sets have been used to investigate the prominent genes expressed in the tumour development and the results show that the genes respond to different stages of tumourigenesis (i.e. different fitness precision levels) which may be useful for early malignancy detection. The extraction ability of the proposed model is validated based on the expected results in the synthetic data sets.
In addition, two bioassay data sets have been used to examine the efficiency of the proposed model to extract significant features from the large, imbalanced and multiple data representation bioassay data.