10,410 research outputs found

    Local protein structure prediction using discriminative models

    Get PDF
    BACKGROUND: In recent years protein structure prediction methods using local structure information have shown promising improvements. The quality of new fold predictions has risen significantly and in fold recognition incorporation of local structure predictions led to improvements in the accuracy of results. We developed a local structure prediction method to be integrated into either fold recognition or new fold prediction methods. For each local sequence window of a protein sequence the method predicts probability estimates for the sequence to attain particular local structures from a set of predefined local structure candidates. The first step is to define a set of local structure representatives based on clustering recurrent local structures. In the second step a discriminative model is trained to predict the local structure representative given local sequence information. RESULTS: The step of clustering local structures yields an average RMSD quantization error of 1.19 Ă… for 27 structural representatives (for a fragment length of 7 residues). In the prediction step the area under the ROC curve for detection of the 27 classes ranges from 0.68 to 0.88. CONCLUSION: The described method yields probability estimates for local protein structure candidates, giving signals for all kinds of local structure. These local structure predictions can be incorporated either into fold recognition algorithms to improve alignment quality and the overall prediction accuracy or into new fold prediction methods

    Multiple instance learning for sequence data with across bag dependencies

    Full text link
    In Multiple Instance Learning (MIL) problem for sequence data, the instances inside the bags are sequences. In some real world applications such as bioinformatics, comparing a random couple of sequences makes no sense. In fact, each instance may have structural and/or functional relations with instances of other bags. Thus, the classification task should take into account this across bag relation. In this work, we present two novel MIL approaches for sequence data classification named ABClass and ABSim. ABClass extracts motifs from related instances and use them to encode sequences. A discriminative classifier is then applied to compute a partial classification result for each set of related sequences. ABSim uses a similarity measure to discriminate the related instances and to compute a scores matrix. For both approaches, an aggregation method is applied in order to generate the final classification result. We applied both approaches to solve the problem of bacterial Ionizing Radiation Resistance prediction. The experimental results of the presented approaches are satisfactory

    Kernel methods in genomics and computational biology

    Full text link
    Support vector machines and kernel methods are increasingly popular in genomics and computational biology, due to their good performance in real-world applications and strong modularity that makes them suitable to a wide range of problems, from the classification of tumors to the automatic annotation of proteins. Their ability to work in high dimension, to process non-vectorial data, and the natural framework they provide to integrate heterogeneous data are particularly relevant to various problems arising in computational biology. In this chapter we survey some of the most prominent applications published so far, highlighting the particular developments in kernel methods triggered by problems in biology, and mention a few promising research directions likely to expand in the future

    Inferring Network Mechanisms: The Drosophila melanogaster Protein Interaction Network

    Get PDF
    Naturally occurring networks exhibit quantitative features revealing underlying growth mechanisms. Numerous network mechanisms have recently been proposed to reproduce specific properties such as degree distributions or clustering coefficients. We present a method for inferring the mechanism most accurately capturing a given network topology, exploiting discriminative tools from machine learning. The Drosophila melanogaster protein network is confidently and robustly (to noise and training data subsampling) classified as a duplication-mutation-complementation network over preferential attachment, small-world, and other duplication-mutation mechanisms. Systematic classification, rather than statistical study of specific properties, provides a discriminative approach to understand the design of complex networks.Comment: 19 pages, 5 figure

    Non-Gaussian Discriminative Factor Models via the Max-Margin Rank-Likelihood

    Full text link
    We consider the problem of discriminative factor analysis for data that are in general non-Gaussian. A Bayesian model based on the ranks of the data is proposed. We first introduce a new {\em max-margin} version of the rank-likelihood. A discriminative factor model is then developed, integrating the max-margin rank-likelihood and (linear) Bayesian support vector machines, which are also built on the max-margin principle. The discriminative factor model is further extended to the {\em nonlinear} case through mixtures of local linear classifiers, via Dirichlet processes. Fully local conjugacy of the model yields efficient inference with both Markov Chain Monte Carlo and variational Bayes approaches. Extensive experiments on benchmark and real data demonstrate superior performance of the proposed model and its potential for applications in computational biology.Comment: 14 pages, 7 figures, ICML 201

    ProtNN: Fast and Accurate Nearest Neighbor Protein Function Prediction based on Graph Embedding in Structural and Topological Space

    Full text link
    Studying the function of proteins is important for understanding the molecular mechanisms of life. The number of publicly available protein structures has increasingly become extremely large. Still, the determination of the function of a protein structure remains a difficult, costly, and time consuming task. The difficulties are often due to the essential role of spatial and topological structures in the determination of protein functions in living cells. In this paper, we propose ProtNN, a novel approach for protein function prediction. Given an unannotated protein structure and a set of annotated proteins, ProtNN finds the nearest neighbor annotated structures based on protein-graph pairwise similarities. Given a query protein, ProtNN finds the nearest neighbor reference proteins based on a graph representation model and a pairwise similarity between vector embedding of both query and reference protein-graphs in structural and topological spaces. ProtNN assigns to the query protein the function with the highest number of votes across the set of k nearest neighbor reference proteins, where k is a user-defined parameter. Experimental evaluation demonstrates that ProtNN is able to accurately classify several datasets in an extremely fast runtime compared to state-of-the-art approaches. We further show that ProtNN is able to scale up to a whole PDB dataset in a single-process mode with no parallelization, with a gain of thousands order of magnitude of runtime compared to state-of-the-art approaches

    EnzyNet: enzyme classification using 3D convolutional neural networks on spatial representation

    Full text link
    During the past decade, with the significant progress of computational power as well as ever-rising data availability, deep learning techniques became increasingly popular due to their excellent performance on computer vision problems. The size of the Protein Data Bank has increased more than 15 fold since 1999, which enabled the expansion of models that aim at predicting enzymatic function via their amino acid composition. Amino acid sequence however is less conserved in nature than protein structure and therefore considered a less reliable predictor of protein function. This paper presents EnzyNet, a novel 3D-convolutional neural networks classifier that predicts the Enzyme Commission number of enzymes based only on their voxel-based spatial structure. The spatial distribution of biochemical properties was also examined as complementary information. The 2-layer architecture was investigated on a large dataset of 63,558 enzymes from the Protein Data Bank and achieved an accuracy of 78.4% by exploiting only the binary representation of the protein shape. Code and datasets are available at https://github.com/shervinea/enzynet.Comment: 11 pages, 6 figure
    • …
    corecore