93 research outputs found

    A novel spectral-spatial co-training algorithm for the transductive classification of hyperspectral imagery data

    The automatic classification of hyperspectral data is made complex by several factors, such as the high cost of true sample labeling coupled with the high number of spectral bands, as well as the spatial correlation of the spectral signature. In this paper, a transductive collective classifier is proposed for dealing with all these factors in hyperspectral image classification. The transductive inference paradigm allows us to reduce the inference error for the given set of unlabeled data, as sparsely labeled pixels are learned by accounting for both labeled and unlabeled information. The collective inference paradigm allows us to manage the spatial correlation between spectral responses of neighboring pixels, as interacting pixels are labeled simultaneously. In particular, the innovative contribution of this study includes: (1) the design of an application-specific co-training schema to use both spectral information and spatial information, iteratively extracted at the object (set of pixels) level via collective inference; (2) the formulation of a spatial-aware example selection schema that accounts for the spatial correlation of predicted labels to augment training sets during iterative learning; and (3) the investigation of a diversity class criterion that allows us to speed up co-training classification. Experimental results validate the accuracy and efficiency of the proposed spectral-spatial, collective, co-training strategy.

    Specificity prediction of adenylation domains in nonribosomal peptide synthetases (NRPS) using transductive support vector machines (TSVMs)

    We present a new support vector machine (SVM)-based approach to predict the substrate specificity of subtypes of a given protein sequence family. We demonstrate the usefulness of this method on the example of aryl acid-activating and amino acid-activating adenylation domains (A domains) of nonribosomal peptide synthetases (NRPS). The residues of gramicidin synthetase A that are 8 Å around the substrate amino acid and corresponding positions of other adenylation domain sequences with 397 known and unknown specificities were extracted and used to encode this physico-chemical fingerprint into normalized real-valued feature vectors based on the physico-chemical properties of the amino acids. The SVM software package SVM(light) was used for training and classification, with transductive SVMs to take advantage of the information inherent in unlabeled data. Specificities for very similar substrates that frequently show cross-specificities were pooled to the so-called composite specificities and predictive models were built for them. The reliability of the models was confirmed in cross-validations and in comparison with a currently used sequence-comparison-based method. When comparing the predictions for 1230 NRPS A domains that are currently detectable in UniProt, the new method was able to give a specificity prediction in an additional 18% of the cases compared with the old method. For 70% of the sequences both methods agreed, for <6% they did not, mainly on low-confidence predictions by the existing method. None of the predictive methods could infer any specificity for 2.4% of the sequences, suggesting completely new types of specificity.

    Apprentissage semi-supervisé pour les SVMS et leurs variantes (Semi-supervised learning for SVMs and their variants)

    Pattern recognition is a highly interesting area of artificial intelligence. To solve pattern recognition problems, classifiers are built from prototypes of the data to be recognized together with their class membership. This is called supervised learning. Today, given the large volumes of data available, the cost of labeling data has become exorbitant, so it is impractical, even impossible, to label all the available data. Since the performance of a classifier is known to depend on the amount of training data, the main question that arises is how to improve the training of a classifier by adding unlabeled data to the training set. The learning technique that answers this question is called semi-supervised learning. The support vector machine (SVM) and its variant, the Least-Squares SVM (LS-SVM), are particular classifiers based on the margin-maximization principle, which gives them strong generalization ability. In our research, we considered the semi-supervised training of these machines and proposed several training techniques to accomplish this task. First, we used Bayesian inference to estimate the model parameters and the labels. We thus developed Bayesian formulations with one and two levels of inference, which are then applied to SVMs and LS-SVMs in the semi-supervised learning setting. Second, we proposed to improve the self-training technique by using a generative classifier to help the main discriminative classifier, trained in a semi-supervised manner, to label the data. We call this strategy Help-Training (Apprentissage soutenu), and we successfully applied it to SVMs and the LS-SVM variant. Our semi-supervised learning algorithms were tested on artificial and real data and gave encouraging results. This validation was supported by an analysis showing the advantages and limitations of each of the methods developed.
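
    The Help-Training idea summarized above (a generative helper supporting a discriminative classifier during self-training) can be illustrated with a toy sketch. This is one possible reading of the abstract, not the authors' algorithm: the helper is a per-class 1-D Gaussian, the discriminative model is a simple midpoint threshold standing in for an SVM, and every function name here (`fit_gaussians`, `fit_threshold`, `helped_self_training`) is hypothetical.

```python
import math

def fit_gaussians(xs, ys):
    """Generative helper (assumed form): one 1-D Gaussian (mean, std) per class."""
    params = {}
    for c in (0, 1):
        pts = [x for x, y in zip(xs, ys) if y == c]
        mean = sum(pts) / len(pts)
        var = sum((p - mean) ** 2 for p in pts) / len(pts)
        params[c] = (mean, math.sqrt(var) + 1e-3)  # small offset keeps std > 0
    return params

def log_density(x, mean, std):
    """Gaussian log-density up to an additive constant."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std)

def fit_threshold(xs, ys):
    """Discriminative stand-in (not an SVM): midpoint between the closest
    opposite-class points. Assumes class 0 lies below class 1 and the
    labeled data are separable."""
    top0 = max(x for x, y in zip(xs, ys) if y == 0)
    bottom1 = min(x for x, y in zip(xs, ys) if y == 1)
    return (top0 + bottom1) / 2

def helped_self_training(xs, ys, unlabeled):
    """Move pool points into the training set one at a time."""
    xs, ys, pool = list(xs), list(ys), list(unlabeled)
    while pool:
        params = fit_gaussians(xs, ys)
        thr = fit_threshold(xs, ys)
        # The helper selects the pool point with the largest log-likelihood gap.
        best = max(
            pool,
            key=lambda u: abs(log_density(u, *params[1]) - log_density(u, *params[0])),
        )
        pool.remove(best)
        xs.append(best)
        ys.append(int(best > thr))  # the main classifier assigns the label
    return fit_threshold(xs, ys)

print(helped_self_training([0.0, 1.0, 9.0, 10.0], [0, 0, 1, 1], [2.0, 8.0, 3.0, 7.0]))
```

    In this sketch the helper only decides which unlabeled point is added next, while the threshold classifier supplies the label; whether the original method divides the work this way is an assumption.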

    Master of Science

    Multiple Instance Learning (MIL) is a type of supervised learning with missing data. Here, each example (a.k.a. bag) has one or more instances. In the training set, we have only labels at bag level. The task is to label both bags and instances from the test set. In most practical MIL problems, there is a relationship between the instances of a bag. Capturing this relationship may help learn the underlying concept better. We present an algorithm that uses the structure of bags along with the features of instances. The key idea is to allow a structured support vector machine (SVM) to "guess" at the true underlying structure, so long as it is consistent with the bag labels. This idea is formalized and a new cutting plane algorithm is proposed for optimization. To verify this idea, we implemented our algorithm for a particular kind of structure: hidden Markov models. We performed experiments on three datasets and found this algorithm to work better than the existing algorithms in MIL. We present the details of these experiments and the effects of varying different hyperparameters. The key contribution of our work is a very simple loss function with only one hyperparameter that needs to be tuned using a small portion of the training set. The thesis of this work is that it is possible and desirable to exploit the structural relationship between instances in a bag, even though that structure is not observed at training time (i.e., correct labels for all the instances are unknown). Our work opens a new direction for solving the MIL problem. We suggest a few ideas to further our work in this direction.
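
    The requirement that a guessed instance labeling stay consistent with the bag label can be made concrete with a minimal sketch. It assumes the standard MIL convention, which the abstract does not state, that a bag is positive when at least one of its instances is positive. The names `bag_label` and `consistent` are hypothetical, and this is not the structured-SVM algorithm of the thesis.

```python
from typing import List

def bag_label(instance_labels: List[int]) -> int:
    """Return 1 if any instance is positive, else 0 (assumed MIL convention)."""
    return int(any(lbl == 1 for lbl in instance_labels))

def consistent(guessed: List[int], observed_bag_label: int) -> bool:
    """Check that a guessed instance labeling agrees with the observed bag label."""
    return bag_label(guessed) == observed_bag_label

# A positive bag admits any guess that contains at least one positive instance.
print(consistent([0, 1, 0], 1))
print(consistent([0, 0, 0], 1))
```

    Under this assumption, a learner is free to choose among all instance labelings for which `consistent` holds, which is the space the abstract's "guess" ranges over.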

    Doctor of Philosophy

    With the tremendous growth of data produced in recent years, it is impossible to identify patterns or test hypotheses without reducing data size. Data mining is an area of science that extracts useful information from the data by discovering patterns and structures present in the data. In this dissertation, we will largely focus on clustering, which is often the first step in any exploratory data mining task, where items that are similar to each other are grouped together, making downstream data analysis robust. Different clustering techniques have different strengths, and the resulting groupings provide different perspectives on the data. Due to the unsupervised nature of the task, i.e., the lack of domain experts who can label the data, validation of results is very difficult. While there are measures that compute "goodness" scores for clustering solutions as a whole, there are few methods that validate the assignment of individual data items to their clusters. To address these challenges we focus on developing a framework that can generate, compare, combine, and evaluate different solutions to make more robust and significant statements about the data. In the first part of this dissertation, we present fast and efficient techniques to generate and combine different clustering solutions. We build on some recent ideas on efficient representations of clusters of partitions to develop a well-founded metric that is spatially aware to compare clusterings. With the ability to compare clusterings, we describe a heuristic to combine different solutions to produce a single high-quality clustering. We also introduce a Markov chain Monte Carlo approach to sample different clusterings from the entire landscape to provide the users with a variety of choices. In the second part of this dissertation, we build certificates for individual data items and study their influence on effective data reduction. We present a geometric approach by defining regions of influence for data items and clusters and use this to develop adaptive sampling techniques to speed up machine learning algorithms. This dissertation is therefore a systematic approach to study the landscape of clusterings in an attempt to provide a better understanding of the data.
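
    As a point of reference for what comparing two clusterings means, the sketch below computes the Rand index, a standard pair-counting measure. It is a swapped-in illustration only: it ignores where points lie in space, so it is not the spatially aware metric the dissertation develops, and the function name `rand_index` is an assumption.

```python
from itertools import combinations
from typing import Sequence

def rand_index(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of item pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two items
    together, or both put them apart. Cluster ids themselves do not matter.
    """
    if len(a) != len(b):
        raise ValueError("both clusterings must label the same items")
    pairs = list(combinations(range(len(a)), 2))
    if not pairs:
        return 1.0  # zero or one item: trivially identical
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# Same grouping under renamed cluster ids, then a crossed grouping.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

    A purely label-based score like this treats all disagreements alike, which is one motivation for metrics that also account for the geometry of the clusters.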