From sequence to structure to networks
A report on the 7th European Conference on Computational Biology (ECCB), Cagliari, Italy, 22-26 September 2008
Advantages of combined transmembrane topology and signal peptide prediction—the Phobius web server
When using conventional transmembrane topology and signal peptide predictors, such as TMHMM and SignalP, there is a substantial overlap between these two types of predictions. Applying these methods to five complete proteomes, we found that 30–65% of all predicted signal peptides and 25–35% of all predicted transmembrane topologies overlap. This impairs predictions for 5–10% of the proteome, making it an important issue in protein annotation.
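The overlap described above can be illustrated with a minimal sketch. It assumes, hypothetically, that one predictor reports the last residue of a predicted signal peptide and another reports transmembrane segments as 1-based inclusive (start, end) pairs; the function name and data layout are illustrative only and do not correspond to the output format of TMHMM, SignalP, or Phobius.

```python
def overlapping_tm_segments(signal_peptide_end, tm_segments):
    """Return predicted TM segments that overlap a predicted signal peptide.

    signal_peptide_end: last residue (1-based) of the predicted signal
        peptide, or None if no signal peptide was predicted (hypothetical
        representation).
    tm_segments: list of (start, end) tuples, 1-based and inclusive.
    """
    if signal_peptide_end is None:
        return []
    # The signal peptide covers residues 1..signal_peptide_end, so a segment
    # overlaps it exactly when the segment starts at or before that residue.
    return [(start, end) for start, end in tm_segments
            if start <= signal_peptide_end]


# Example: a signal peptide predicted for residues 1-22 and two predicted
# helices; the first helix conflicts with the signal peptide prediction.
conflicts = overlapping_tm_segments(22, [(5, 24), (60, 80)])
```

A protein with a non-empty result is one where the two separate predictions contradict each other, which is the kind of case a combined predictor is meant to resolve.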
Retention Time and Fragmentation Predictors Increase Confidence in Identification of Common Variant Peptides
Precision medicine focuses on adapting care to the individual profile of patients, for example, accounting for their unique genetic makeup. Being able to account for the effect of genetic variation on the proteome holds great promise toward this goal. However, identifying the protein products of genetic variation using mass spectrometry has proven very challenging. Here we show that the identification of variant peptides can be improved by the integration of retention time and fragmentation predictors into a unified proteogenomic pipeline. By combining these intrinsic peptide characteristics using the search-engine post-processor Percolator, we demonstrate improved discrimination power between correct and incorrect peptide-spectrum matches. Our results demonstrate that the drop in performance that is induced when expanding a protein sequence database can be compensated, hence enabling efficient identification of genetic variation products in proteomics data. We anticipate that this enhancement of proteogenomic pipelines can provide a more refined picture of the unique proteome of patients and thereby contribute to improving patient care.
ABRF Proteome Informatics Research Group (iPRG) 2016 Study: Inferring Proteoforms from Bottom-up Proteomics Data.
This report presents the results from the 2016 Association of Biomolecular Resource Facilities Proteome Informatics Research Group (iPRG) study on proteoform inference and false discovery rate (FDR) estimation from bottom-up proteomics data. For this study, 3 replicate Q Exactive Orbitrap liquid chromatography-tandem mass spectrometry datasets were generated from each of
Predicting transmembrane topology and signal peptides with hidden Markov models
Transmembrane proteins make up a large and important class of proteins.
About 20% of all genes encode transmembrane proteins. They control both
substances and information going in and out of a cell. Yet basic
knowledge about membrane insertion and folding is sparse, and our ability
to identify, over-express, purify, and crystallize transmembrane proteins
lags far behind the field of water-soluble proteins.
It is difficult to determine the three-dimensional structures of
transmembrane proteins. Therefore, researchers normally attempt to
determine their topology, i.e. which parts of the protein are buried in
the membrane, and on which side of the membrane the other parts are
located.
Proteins aimed for export have an N-terminal sequence known as a signal
peptide that is inserted into the membrane and cleaved off. The same
mechanism that inserts transmembrane proteins into their membranes also
handles the export of protein with signal peptides. Transmembrane helices
and signal peptides thus have many features in common.
In silico methods for predicting transmembrane topology and methods for
predicting signal peptides from amino acid sequence are a fast and
relatively accurate alternative to biochemical experiments. A methodology
called hidden Markov models (HMMs) has proved particularly useful for
these and other prediction tasks.
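
As a rough illustration of how an HMM can label a sequence, the sketch below runs Viterbi decoding on a toy two-state model. Everything in it is an assumption made for illustration: the two states ("M" for membrane, "L" for loop), the reduction of residues to hydrophobic ("h") or polar ("p"), and all probabilities are invented and are far simpler than the models used in the methods described here.

```python
import math

# Toy model (all values are illustrative assumptions, not trained parameters).
STATES = ("L", "M")                      # L = loop, M = membrane helix
START_P = {"L": 0.5, "M": 0.5}
TRANS_P = {"L": {"L": 0.9, "M": 0.1},
           "M": {"L": 0.1, "M": 0.9}}
EMIT_P = {"L": {"h": 0.2, "p": 0.8},     # loops emit mostly polar residues
          "M": {"h": 0.8, "p": 0.2}}     # helices emit mostly hydrophobic ones


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observation sequence."""
    # Log probabilities avoid numerical underflow on long sequences.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    backpointers = []
    for obs in observations[1:]:
        prev = scores[-1]
        column, pointers = {}, {}
        for s in states:
            # Best predecessor state for ending in state s at this position.
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            column[s] = (prev[best] + math.log(trans_p[best][s])
                         + math.log(emit_p[s][obs]))
            pointers[s] = best
        scores.append(column)
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


labels = "".join(viterbi("pppphhhhhhpppp", STATES, START_P, TRANS_P, EMIT_P))
```

With these toy parameters a run of six hydrophobic residues is long enough to justify two state switches, whereas a single hydrophobic residue is not, which is how the transition probabilities act as a smoothing prior on segment length.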
In this thesis, properties of transmembrane topology predictors and
signal peptide predictors are investigated. It includes three novel
HMM-based prediction methods.
i) A combined transmembrane topology and signal peptide predictor,
Phobius. The paper shows that cross predictions, i.e. signal peptides
predicted as transmembrane helices and vice versa, are a common problem.
About 10% of the genes in E. coli have overlapping signal peptide and
transmembrane helix predictions by conventional predictors. We were able
to dramatically lower these false cross predictions.
ii) A method for detecting remote G protein-coupled receptor (GPCR)
families, GPCRHMM. GPCRs are a very large and divergent superfamily of
transmembrane proteins. We designed a hidden Markov model based on the
topological regions of the superfamily. We searched five genomes and
predicted 120 previously unannotated sequences as possible GPCRs. The
majority of these predictions (102) were in C. elegans, but 4 were found
in human and 7 in mouse. We also conclude that a family of odorant
receptors in Drosophila are not GPCRs.
iii) A method to improve predictions with HMMs of generic sequence features
(such as transmembrane segments or signal peptides) by including
homologs. We show that the performance of Phobius using this decoder was
significantly better than with other decoders.
We also assessed the difficulty of benchmark sets used in transmembrane
topology prediction. By studying the level of agreement between different
predictors applied to typical benchmark sets and whole proteome sets, we
concluded that the benchmark sets are far easier to predict than reality.
In other words, the accuracies reported in benchmark studies are
exaggerated.
The thesis also includes a paper presenting a hypothesis of the transmembrane
topology of presenilin, a protein involved in the development of
Alzheimer's disease. By comparing the output of several transmembrane
topology predictors with experimental results from previous studies, a
novel nine-transmembrane topology with an extracellular C-terminus was
elucidated.
A simple null model for inferences from network enrichment analysis.
A prevailing technique for inferring function from lists of identifications from high-throughput molecular biology experiments is over-representation analysis, in which the identifications are compared to predefined sets of related genes, often referred to as pathways. As at least some pathways are known to be incomplete in their annotation, algorithmic efforts have been made to complement them with information from functional association networks. While the terminology varies in the literature, we will here refer to such methods as Network Enrichment Analysis (NEA). Traditionally, the significance of inferences from NEA has been assigned using a null model constructed from randomizations of the network. Here we instead argue for a null model that relates more directly to the set of genes being studied, and have designed a dynamic programming algorithm that calculates the distribution of NEA scores, making it possible to assign unbiased mid p values to inferences. We also implemented a random sampling method that carries out the same task. We demonstrate that our method obtains superior statistical calibration compared to the popular NEA inference engine BinoX, while also providing statistics that are easier to interpret.
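To make the notion of a mid p value concrete, here is a minimal sketch under stated assumptions. It uses the standard definition for a discrete statistic, P(X > x) + 0.5 · P(X = x), and applies it to a plain hypergeometric over-representation null. This is not the dynamic programming algorithm described in the abstract, whose details are not given here; the function names and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_pmf(k, population, annotated, drawn):
    """P(X = k) when `drawn` genes are sampled without replacement from
    `population` genes, of which `annotated` belong to the pathway."""
    lowest = max(0, drawn - (population - annotated))
    highest = min(annotated, drawn)
    if k < lowest or k > highest:
        return 0.0
    return (comb(annotated, k) * comb(population - annotated, drawn - k)
            / comb(population, drawn))


def mid_p_value(observed, pmf, support):
    """Upper-tail mid p value: P(X > observed) + 0.5 * P(X = observed).

    pmf: function mapping a score to its null probability.
    support: iterable of all scores with non-zero probability.
    """
    tail = sum(pmf(k) for k in support if k > observed)
    return tail + 0.5 * pmf(observed)


# Hypothetical example: 10 genes in total, 3 in the pathway, 4 identified,
# of which 2 fall in the pathway.
null = lambda k: hypergeom_pmf(k, population=10, annotated=3, drawn=4)
p_mid = mid_p_value(2, null, range(0, 4))
```

Counting only half of the probability of the observed score is what distinguishes the mid p value from the ordinary p value, which for discrete statistics tends to be conservative.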