651 research outputs found

    Proteome coverage prediction with infinite Markov models

    Motivation: Liquid chromatography tandem mass spectrometry (LC-MS/MS) is the predominant method to comprehensively characterize complex protein mixtures such as samples from prefractionated or complete proteomes. In order to maximize proteome coverage for the studied sample, i.e. identify as many traceable proteins as possible, LC-MS/MS experiments are typically repeated extensively and the results combined. Proteome coverage prediction is the task of estimating the number of peptide discoveries of future LC-MS/MS experiments. Proteome coverage prediction is important to enhance the design of efficient proteomics studies. To date, there does not exist any method to reliably estimate the increase of proteome coverage at an early stage. Results: We propose an extended infinite Markov model DiriSim to extrapolate the progression of proteome coverage based on a small number of already performed LC-MS/MS experiments. The method explicitly accounts for the uncertainty of peptide identifications. We tested DiriSim on a set of 37 LC-MS/MS experiments of a complete proteome sample and demonstrated that DiriSim correctly predicts the coverage progression already from a small subset of experiments. The predicted progression enabled us to specify maximal coverage for the test sample. We demonstrated that quality requirements on the final proteome map impose an upper bound on the number of useful experiment repetitions and limit the achievable proteome coverage.

    REPARATION: ribosome profiling assisted (re-)annotation of bacterial genomes

    Prokaryotic genome annotation is highly dependent on automated methods, as manual curation cannot keep up with the exponential growth of sequenced genomes. Current automated methods depend heavily on sequence composition and often underestimate the complexity of the proteome. We developed RibosomeE Profiling Assisted (re-)AnnotaTION (REPARATION), a de novo machine learning algorithm that takes advantage of experimental protein synthesis evidence from ribosome profiling (Ribo-seq) to delineate translated open reading frames (ORFs) in bacteria, independent of genome annotation (https://github.com/Biobix/REPARATION). REPARATION evaluates all possible ORFs in the genome and estimates minimum thresholds based on a growth curve model to screen for spurious ORFs. We applied REPARATION to three annotated bacterial species to obtain a more comprehensive mapping of their translation landscape in support of experimental data. In all cases, we identified hundreds of novel (small) ORFs including variants of previously annotated ORFs and >70% of all (variants of) annotated protein coding ORFs were predicted by REPARATION to be translated. Our predictions are supported by matching mass spectrometry proteomics data, sequence composition and conservation analysis. REPARATION is unique in that it makes use of experimental translation evidence to intrinsically perform a de novo ORF delineation in bacterial genomes irrespective of the sequence features linked to open reading frames.

    Testing statistical hypothesis on random trees and applications to the protein classification problem

    Efficient automatic protein classification is of central importance in genomic annotation. As an independent way to check the reliability of the classification, we propose a statistical approach to test if two sets of protein domain sequences coming from two families of the Pfam database are significantly different. We model protein sequences as realizations of Variable Length Markov Chains (VLMC) and we use the context trees as a signature of each protein family. Our approach is based on a Kolmogorov–Smirnov-type goodness-of-fit test proposed by Balding et al. [Limit theorems for sequences of random trees (2008), DOI: 10.1007/s11749-008-0092-z]. The test statistic is a supremum over the space of trees of a function of the two samples; its computation grows, in principle, exponentially fast with the maximal number of nodes of the potential trees. We show how to transform this problem into a max-flow problem over a related graph, which can be solved using a Ford–Fulkerson algorithm in time polynomial in that number. We apply the test to 10 randomly chosen protein domain families from the seed of the Pfam-A database (high quality, manually curated families). The test shows that the distributions of context trees coming from different families are significantly different. We emphasize that this is a novel mathematical approach to validate the automatic clustering of sequences in any context. We also study the performance of the test via simulations on Galton–Watson related processes. Comment: Published at http://dx.doi.org/10.1214/08-AOAS218 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
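    The max-flow step mentioned in this abstract can be illustrated with a generic solver. The sketch below is a minimal Ford–Fulkerson variant that finds augmenting paths by breadth-first search (Edmonds–Karp). The graph, the node names and the function name `max_flow` are hypothetical; the abstract does not describe how the authors build their graph from the context trees, so this shows only the general technique.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Ford-Fulkerson with BFS augmenting paths (Edmonds-Karp).

    capacity: dict mapping node -> {neighbour: edge capacity}.
    Returns the value of a maximum flow from source to sink.
    """
    # Residual graph: copy forward capacities, add reverse edges with capacity 0.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)

    flow = 0
    while True:
        # Breadth-first search for a path with spare residual capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path left, so the flow is maximal

        # Find the bottleneck capacity along the path found.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]

        # Push the bottleneck amount along the path and update reverse edges.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


# Hypothetical toy network, not taken from the paper.
toy = {"s": {"a": 3, "b": 2}, "a": {"b": 1, "t": 2}, "b": {"t": 3}, "t": {}}
print(max_flow(toy, "s", "t"))  # 5
```

    Choosing augmenting paths by breadth-first search bounds the running time polynomially in the size of the graph, in line with the polynomial-time claim in the abstract.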

    Gene ontology based transfer learning for protein subcellular localization

    Background: Prediction of protein subcellular localization generally involves many complex factors, and using only one or two aspects of data information may not tell the true story. For this reason, some recent predictive models are deliberately designed to integrate multiple heterogeneous data sources for exploiting multi-aspect protein feature information. Gene ontology, hereinafter referred to as GO, uses a controlled vocabulary to depict biological molecules or gene products in terms of biological process, molecular function and cellular component. With the rapid expansion of annotated protein sequences, gene ontology has become a general protein feature that can be used to construct predictive models in computational biology. Existing models generally either concatenated the GO terms into a flat binary vector or applied majority-vote based ensemble learning for protein subcellular localization, neither of which can estimate the individual discriminative abilities of the three aspects of gene ontology. Results: In this paper, we propose a Gene Ontology Based Transfer Learning Model (GO-TLM) for large-scale protein subcellular localization. The model transfers the signature-based homologous GO terms to the target proteins, and further constructs a reliable learning system to reduce the adverse effect of the potential false GO terms that result from evolutionary divergence. We derive three GO kernels from the three aspects of gene ontology to measure the GO similarity of two proteins, and derive two other spectrum kernels to measure the similarity of two protein sequences. We use simple non-parametric cross validation to explicitly weigh the discriminative abilities of the five kernels, such that the time and space complexities are greatly reduced when compared to the complicated semi-definite programming and semi-indefinite linear programming. The five kernels are then linearly merged into one single kernel for protein subcellular localization. We evaluate GO-TLM performance against three baseline models, MultiLoc, MultiLoc-GO and Euk-mPLoc, on the benchmark datasets the baseline models adopted. 5-fold cross validation experiments show that GO-TLM achieves substantial accuracy improvements over the baseline models: 80.38% against 67.40% for Euk-mPLoc, an increase of 12.98 percentage points; 96.65% and 96.27% against 89.60% and 89.60% for MultiLoc-GO, increases of 7.05 and 6.67 points on the MultiLoc plant and MultiLoc animal datasets, respectively; and 97.14%, 95.90% and 96.85% against 83.70%, 90.10% and 85.70% for MultiLoc-GO, increases of 13.44, 5.80 and 11.15 points on the BaCelLoc plant, BaCelLoc fungi and BaCelLoc animal datasets, respectively. For the BaCelLoc independent sets, GO-TLM achieves 81.25%, 80.45% and 79.46% on the BaCelLoc plant holdout, BaCelLoc fungi holdout and BaCelLoc animal holdout datasets, respectively, against 76.00%, 60.00% and 73.00% for the baseline MultiLoc-GO, increases of 5.25, 20.45 and 6.46 points, respectively. Conclusions: Since direct homology-based GO term transfer may be prone to introducing noise and outliers to the target protein, we design an explicitly weighted kernel learning system (called Gene Ontology Based Transfer Learning Model, GO-TLM) to transfer to the target protein the known knowledge about related homologous proteins, which can reduce the risk of outliers and share knowledge between homologous proteins, and thus achieve better predictive performance for protein subcellular localization. Cross validation and independent test experimental results show that the homology-based GO term transfer and explicitly weighing the GO kernels substantially improve the prediction performance.
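    The linear merging of kernels described in this abstract can be sketched as a weighted sum of kernel matrices. The function name `combine_kernels`, the toy matrices and the weights below are hypothetical. The abstract says the weights come from cross validation but does not give the rule, so the weights are simply passed in as placeholders.

```python
def combine_kernels(kernels, weights):
    """Merge square kernel matrices of equal size into one weighted linear sum.

    kernels: list of n x n matrices (lists of lists).
    weights: one non-negative number per kernel; normalised here to sum to 1.
    """
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    total = float(sum(weights))
    norm = [w / total for w in weights]
    n = len(kernels[0])
    return [
        [sum(w * k[i][j] for w, k in zip(norm, kernels)) for j in range(n)]
        for i in range(n)
    ]


# Two toy 2 x 2 kernels standing in for a GO kernel and a spectrum kernel.
go_kernel = [[1.0, 0.5], [0.5, 1.0]]
spectrum_kernel = [[1.0, 0.0], [0.0, 1.0]]

# Placeholder weights; in the paper they would reflect cross-validated
# discriminative ability, by a rule the abstract does not specify.
merged = combine_kernels([go_kernel, spectrum_kernel], [3, 1])
print(merged)
```

    A non-negative weighted sum of valid kernels is again a valid kernel, so the merged matrix can be handed to any kernel classifier.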

    Going the distance for protein function prediction: a new distance metric for protein interaction networks

    Due to an error introduced in the production process, the x-axes in the first panels of Figure 1 and Figure 7 are not formatted correctly. The correct Figure 1 can be viewed here: http://dx.doi.org/10.1371/annotation/343bf260-f6ff-48a2-93b2-3cc79af518a9 In protein-protein interaction (PPI) networks, functional similarity is often inferred based on the function of directly interacting proteins, or more generally, some notion of interaction network proximity among proteins in a local neighborhood. Prior methods typically measure proximity as the shortest-path distance in the network, but this has only a limited ability to capture fine-grained neighborhood distinctions, because most proteins are close to each other, and there are many ties in proximity. We introduce diffusion state distance (DSD), a new metric based on a graph diffusion property, designed to capture finer-grained distinctions in proximity for transfer of functional annotation in PPI networks. We present a tool that, when input a PPI network, will output the DSD distances between every pair of proteins. We show that replacing the shortest-path metric by DSD improves the performance of classical function prediction methods across the board. MC, HZ, NMD and LJC were supported in part by National Institutes of Health (NIH) R01 grant GM080330. JP was supported in part by NIH grant R01 HD058880. This material is based upon work supported by the National Science Foundation under grant numbers CNS-0905565, CNS-1018266, CNS-1012910, and CNS-1117039, and supported by the Army Research Office under grant W911NF-11-1-0227 (to MEC). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
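    The abstract does not define DSD beyond calling it a graph diffusion property. The sketch below assumes one common formulation: for each node, record the expected number of visits to every node by a random walk of k steps starting there, then take the L1 distance between two such visit-count vectors. The function names, the adjacency-matrix input and the default k are assumptions for illustration and do not describe the authors' tool.

```python
def transition_matrix(adj):
    """Row-normalise an adjacency matrix (assumes every node has a neighbour)."""
    return [[a / sum(row) for a in row] for row in adj]


def visit_counts(P, start, k):
    """Expected visits to each node by a k-step random walk from `start`.

    The walk's starting position counts as one visit (step 0).
    """
    n = len(P)
    dist = [0.0] * n
    dist[start] = 1.0
    counts = dist[:]
    for _ in range(k):
        # Advance the walk's probability distribution by one step.
        dist = [sum(dist[u] * P[u][v] for u in range(n)) for v in range(n)]
        counts = [c + d for c, d in zip(counts, dist)]
    return counts


def dsd(adj, a, b, k=5):
    """L1 distance between the k-step visit-count vectors of nodes a and b."""
    P = transition_matrix(adj)
    ha = visit_counts(P, a, k)
    hb = visit_counts(P, b, k)
    return sum(abs(x - y) for x, y in zip(ha, hb))


# Path graph 0 - 1 - 2 as a toy network.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(dsd(path, 0, 2, k=1))  # 2.0
print(dsd(path, 0, 1, k=1))  # 1.0
```

    Because the distance compares whole visit-count vectors instead of hop counts, two pairs of nodes at the same shortest-path distance can receive different values, which is the finer-grained behavior the abstract describes.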

    Graph Theory and Networks in Biology

    In this paper, we present a survey of the use of graph theoretical techniques in Biology. In particular, we discuss recent work on identifying and modelling the structure of bio-molecular networks, as well as the application of centrality measures to interaction networks and research on the hierarchical structure of such networks and network motifs. Work on the link between structural network properties and dynamics is also described, with emphasis on synchronization and disease propagation. Comment: 52 pages, 5 figures, Survey Paper

    Computational Labeling, Partitioning, and Balancing of Molecular Networks

    Recent advances in high throughput techniques enable large-scale molecular quantification with high accuracy, including mRNAs, proteins and metabolites. Differential expression of these molecules in case and control samples provides a way to select phenotype-associated molecules with statistically significant changes. However, given the significance ranking list of molecular changes, how those molecules work together to drive phenotype formation is still unclear. In particular, the changes in molecular quantities are insufficient to interpret the changes in their functional behavior. My study is aimed at answering this question by integrating molecular network data to systematically model and estimate the changes of molecular functional behaviors. We build three computational models to label, partition, and balance molecular networks using modern machine learning techniques. (1) Due to the incompleteness of protein functional annotation, we develop AptRank, an adaptive PageRank model for protein function prediction on bilayer networks. By integrating Gene Ontology (GO) hierarchy with protein-protein interaction network, our AptRank outperforms four state-of-the-art methods in a comprehensive evaluation using benchmark datasets. (2) We next extend our AptRank into a network partitioning method, BioSweeper, to identify functional network modules in which molecules share similar functions and also densely connect to each other. Compared to traditional network partitioning methods using only network connections, BioSweeper, which integrates the GO hierarchy, can automatically identify functionally enriched network modules. (3) Finally, we conduct a differential interaction analysis, namely difFBA, on protein-protein interaction networks by simulating protein fluxes using flux balance analysis (FBA). 
We test difFBA using quantitative proteomic data from colon cancer, and demonstrate that difFBA offers more insight into functional changes in molecular behavior than protein quantity changes alone. We conclude that our integrative network model increases the observational dimensions of complex biological systems, and enables us to more deeply understand the causal relationships between genotypes and phenotypes.