371 research outputs found

    Bounded Coordinate-Descent for Biological Sequence Classification in High Dimensional Predictor Space

    We present a framework for discriminative sequence classification where the learner works directly in the high dimensional predictor space of all subsequences in the training set. This is made possible by a new coordinate-descent algorithm coupled with a bound on the magnitude of the gradient, which allows discriminative subsequences to be selected quickly. We characterize the loss functions for which our generic learning algorithm can be applied and present concrete implementations for logistic regression (binomial log-likelihood loss) and support vector machines (squared hinge loss). Application of our algorithm to protein remote homology detection and remote fold recognition results in performance comparable to that of state-of-the-art methods (e.g., kernel support vector machines). Unlike state-of-the-art classifiers, the resulting classification models are simply lists of weighted discriminative subsequences and can thus be interpreted and related to the biological problem.

    Remote Homology Detection of Protein Sequences

    The classification of protein sequences using string kernels provides valuable insights for protein function prediction. Almost all string kernels are based on patterns that are not independent, and therefore the associated scores are obtained using a set of redundant features. In this talk we will discuss how a class of patterns, called Irredundant, is specifically designed to address this issue. Loosely speaking, the set of Irredundant patterns is the smallest class of independent patterns that can describe all patterns in a string. We present a classification method based on the statistics of these patterns, named Irredundant Class. Results on benchmark data show that Irredundant Class outperforms most of the string kernel methods previously proposed, and that it achieves results as good as the current state-of-the-art methods with fewer patterns. Unfortunately, we show that the information carried by the irredundant patterns cannot be easily interpreted, thus alternative notions are needed.

    Probabilistic Inference of Biological Networks via Data Integration


    Probabilistic multiple kernel learning

    The integration of multiple and possibly heterogeneous information sources for an overall decision-making process has been an open and unresolved research direction in computing science since its very beginning. This thesis attempts to address parts of that direction by proposing probabilistic data integration algorithms for multiclass decisions where an observation of interest is assigned to one of many categories based on a plurality of information channels.

    Sequence-based protein classification: binary Profile Hidden Markov Models and propositionalisation

    Detecting similarity in biological sequences is a key element to understanding the mechanisms of life. Researchers infer potential structural, functional or evolutionary relationships from similarity. However, the concept of similarity is complex in biology. Sequences consist of different molecules with different chemical properties, have short and long distance interactions, form 3D structures and change through evolutionary processes. Amino acids are one of the key molecules of life. Most importantly, a sequence of amino acids constitutes the building block for proteins which play an essential role in cellular processes. This thesis investigates similarity amongst proteins. In this area of research there are two important and closely related classification tasks – the detection of similar proteins and the discrimination amongst them. Hidden Markov Models (HMMs) have been successfully applied to the detection task as they model sequence similarity very well. From a Machine Learning point of view these HMMs are essentially one-class classifiers trained solely on a small number of similar proteins neglecting the vast number of dissimilar ones. Our basic assumption is that integrating this neglected information will be highly beneficial to the classification task. Thus, we transform the problem representation from a one-class to a binary one. Equipped with the necessary sound understanding of Machine Learning, especially concerning problem representation and statistically significant evaluation, our work pursues and combines two different avenues on this aforementioned transformation. First, we introduce a binary HMM that discriminates significantly better than the standard one, even when only a fraction of the negative information is used. Second, we interpret the HMM as a structured graph of information. This information cannot be accessed by highly optimised standard Machine Learning classifiers as they expect a fixed length feature vector representation. Propositionalisation is a technique to transform the former representation into the latter. This thesis introduces new propositionalisation techniques. The change in representation changes the learning problem from a one-class, generative to a propositional, discriminative one. It is a common assumption that discriminative techniques are better suited for classification tasks, and our results validate this assumption. We suggest a new way to significantly improve on discriminative power and runtime by means of terminating the time-intense training of HMMs early, subsequently applying propositionalisation and classifying with a discriminative, binary learner.

    Two-Stage Fuzzy Multiple Kernel Learning Based on Hilbert-Schmidt Independence Criterion

    © 1993-2012 IEEE. Multiple kernel learning (MKL) is a principled approach to kernel combination and selection for a variety of learning tasks, such as classification, clustering, and dimensionality reduction. In this paper, we develop a novel fuzzy multiple kernel learning model based on the Hilbert-Schmidt independence criterion (HSIC) for classification, which we call HSIC-FMKL. In this model, we first propose an HSIC Lasso-based MKL formulation, which not only has a clear statistical interpretation, namely that minimally redundant kernels with maximum dependence on the output labels are found and combined, but also enables the globally optimal solution to be computed efficiently by solving a Lasso optimization problem. Since the traditional support vector machine (SVM) is sensitive to outliers or noise in the dataset, fuzzy SVM (FSVM) is used to select the prediction hypothesis once the optimal kernel has been obtained. The main advantage of FSVM is that we can associate a fuzzy membership with each data point such that these data points can have different effects on the training of the learning machine. We propose a new fuzzy membership function using a heuristic strategy based on the HSIC. The proposed HSIC-FMKL is a two-stage kernel learning approach and the HSIC is applied in both stages. We perform extensive experiments on real-world datasets from the UCI benchmark repository and the application domain of computational biology which validate the superiority of the proposed model in terms of prediction accuracy.

    An automated approach to remote protein homology classification

    The classification of protein structures into evolutionary superfamilies, for example in the CATH or SCOP domain structure databases, although performed with varying degrees of automation, has remained a largely subjective activity guided by expert knowledge. The huge expansion of the Protein Data Bank (PDB), partly due to the structural genomics initiatives, has posed significant challenges to maintaining the coverage of these structural classification resources. This is because the high degree of manual assessment currently involved has affected their ability to keep pace with high throughput structure determination. This thesis presents an evaluation of different methods used in remote homologue detection which was performed to identify the most powerful approaches currently available. The design and implementation of new protocols suitable for remote homologue detection was informed by an analysis of the extent to which different homologous superfamilies in CATH evolve in sequence, structure and function and characterisation of the mechanisms by which this occurs. This analysis revealed that relatives in some highly populated CATH superfamilies have diverged considerably in their structures. In diverse relatives, significant variations are observed in the secondary structure embellishments decorating the common structural core for the superfamily. There are also differences in the packing angles between secondary structures. Information on the variability observed in CATH superfamilies is collated in an established web resource, the Dictionary of Homologous Superfamilies, which has been expanded and improved in a number of ways. A new structural comparison algorithm, CATHEDRAL, is described. This was developed to cope with the structural variation observed across CATH superfamilies and to improve the automatic recognition of domain boundaries in multidomain structures. CATHEDRAL combines both secondary structure matching and accurate residue alignment in an iterative protocol for determining the location of previously observed folds in novel multi-domain structures. A rigorous benchmarking protocol is also described that assesses the performance of CATHEDRAL against other leading structural comparison methods. The optimisation and benchmarking of several other methods for detecting homology are subsequently presented. These include methods which exploit Hidden Markov Models (HMMs) to detect sequence similarity and methods that attempt to assess functional similarity. Finally, an automated machine learning approach to detecting homologous relationships between proteins is presented which combines information on sequence, structure and functional similarity. This was able to identify over 85% of the homologous relationships in the CATH classification at a 5% error rate. This thesis was gratefully supported by the Biotechnology and Biological Sciences Research Council.