1,838 research outputs found

    Unravelling the architecture of membrane proteins with conditional random fields

    In this thesis we use Conditional Random Fields (CRFs) as a sequential classifier to predict the location of transmembrane helical regions in membrane proteins. CRFs allow for a seamless and principled integration of biological domain knowledge into the model and are known to have several advantages over other approaches. We have used this flexibility to incorporate several biologically inspired features into the model. We compared our approach with twenty-eight other methods and achieved the highest score in the percentage of residues predicted correctly. We have also carried out experiments comparing CRFs against Maximum Entropy Markov Models (MEMMs). Our results confirm that CRFs overcome the label bias problem, which is known to afflict MEMMs. Furthermore, we have used CRFs to analyze the architecture of the protein complex Cytochrome c oxidase and have reproduced the results obtained from physical experiments.

    DeepSig: Deep learning improves signal peptide detection in proteins

    Motivation: The identification of signal peptides in protein sequences is an important step toward protein localization and function characterization. Results: Here, we present DeepSig, an improved approach for signal peptide detection and cleavage-site prediction based on deep learning methods. Comparative benchmarks performed on an updated independent dataset of proteins show that DeepSig is the current best performing method, scoring better than other available state-of-the-art approaches on both signal peptide detection and precise cleavage-site identification. Availability and implementation: DeepSig is available as both a standalone program and a web server at https://deepsig.biocomp.unibo.it. All datasets used in this study can be obtained from the same website.

    Finding functional motifs in protein sequences with deep learning and natural language models

    Recently, the prediction of structural and functional motifs in protein sequences has taken advantage of powerful machine learning-based approaches. Protein encoding now adopts protein language models, which surpass standard procedures. Different combinations of machine learning methods and encoding schemes are available for predicting different structural and functional motifs. Particularly interesting is the adoption of protein language models to encode proteins in addition to evolutionary information and physicochemical parameters. A thorough analysis of recent predictors developed for annotating transmembrane regions, sorting signals, lipidation sites and phosphorylation sites allows us to investigate the state of the art, focusing on the relevance of protein language models for the different tasks. This analysis highlights that more experimental data are necessary to exploit the powerful machine learning methods available.

    Orientation-dependent backbone-only residue pair scoring functions for fixed backbone protein design

    Background: Empirical scoring functions have proven useful in protein structure modeling. Most such scoring functions depend on protein side chain conformations. However, backbone-only scoring functions do not require computationally intensive structure optimization and so are well suited to protein design, which requires fast score evaluation. Furthermore, scoring functions that account for the distinctive relative position and orientation preferences of residue pairs are expected to be more accurate than those that depend only on the separation distance. Results: Residue pair scoring functions for fixed backbone protein design were derived using only backbone geometry. Unlike previous studies that used spherical harmonics to fit 2D angular distributions, Gaussian Mixture Models were used to fit the full 3D (position only) and 6D (position and orientation) distributions of residue pairs. The performance of the 1D (residue separation only), 3D, and 6D scoring functions was compared by their ability to identify correct threading solutions for a non-redundant benchmark set of protein backbone structures. The threading accuracy was found to increase steadily with increasing dimension, with the 6D scoring function achieving the highest accuracy. Furthermore, the 3D and 6D scoring functions were shown to outperform side chain-dependent empirical potentials from three other studies. Next, two computational methods that take advantage of the speed and pairwise form of these new backbone-only scoring functions were investigated. The first is a procedure that exploits available sequence data by averaging scores over threading solutions for homologs. This was evaluated by applying it to the challenging problem of identifying interacting transmembrane alpha-helices and was found to further improve prediction accuracy. The second is a protein design method for determining the optimal sequence for a backbone structure by applying Belief Propagation optimization using the 6D scoring functions. The sensitivity of this method to backbone structure perturbations was compared with that of fixed-backbone all-atom modeling by determining the similarities between optimal sequences for two different backbone structures within the same protein family. The results showed that the design method using 6D scoring functions was more robust to small variations in backbone structure than the all-atom design method. Conclusions: Backbone-only residue pair scoring functions that account for all six relative degrees of freedom are the most accurate, and including the scores of homologs further improves the accuracy in threading applications. The 6D scoring function outperformed several side chain-dependent potentials while avoiding time-consuming and error-prone side chain structure prediction. These scoring functions are particularly useful as an initial filter in protein design problems before applying all-atom modeling.

    TMFoldRec: a statistical potential-based transmembrane protein fold recognition tool.

    BACKGROUND: Transmembrane proteins (TMPs) are key components of signal transduction, cell-cell adhesion, and the transport of energy and material into and out of cells. For a deep understanding of these processes, structure determination of transmembrane proteins is indispensable. However, due to technical difficulties, only a few transmembrane protein structures have been determined experimentally. Large-scale genomic sequencing provides increasing amounts of sequence information on the proteins and whole proteomes of living organisms, which poses a challenge for bioinformatics: how structural information can be gained from a sequence. RESULTS: Here, we present a novel method, TMFoldRec, for fold prediction of membrane segments in transmembrane proteins. TMFoldRec, which is based on statistical potentials, was tested on a benchmark set containing 124 TMP chains from the PDBTM database. Using a 10-fold jackknife method, the native folds were correctly identified in 77 % of the cases. This accuracy exceeds that of state-of-the-art methods. In addition, a key feature of the TMFoldRec algorithm is its ability to estimate the reliability of the prediction and to decide, with an accuracy of 70 %, whether the obtained lowest-energy structure is the native one. CONCLUSION: These results imply that the membrane-embedded parts of TMPs, rather than the soluble parts, dictate the TM structures. Moreover, the reliability scores attached to predictions make our algorithm applicable to proteome-wide analyses. AVAILABILITY: The program is available upon request for academic use.

    Inferring diffusion in single live cells at the single molecule level

    The movement of molecules inside living cells is a fundamental feature of biological processes. The ability to both observe and analyse the details of molecular diffusion in vivo at the single molecule and single cell level can add significant insight into understanding molecular architectures of diffusing molecules and the nanoscale environment in which the molecules diffuse. The tool of choice for monitoring dynamic molecular localization in live cells is fluorescence microscopy, especially when total internal reflection fluorescence (TIRF) is combined with fluorescent protein (FP) reporters, which offers exceptional imaging contrast for dynamic processes in the cell membrane under relatively physiological conditions compared to competing single molecule techniques. There exist several different complex modes of diffusion, and discriminating these from each other is challenging at the molecular level due to underlying stochastic behaviour. Analysis is traditionally performed using mean square displacements of tracked particles; however, this generally requires more data points than is typical for single FP tracks due to photophysical instability. Presented here is a novel approach allowing robust Bayesian ranking of diffusion processes (BARD) to discriminate multiple complex modes probabilistically. It is a computational approach which biologists can use to understand single molecule features in live cells. Comment: combined ms (1-37 pages, 8 figures) and SI (38-55, 3 figures)

    Structural approaches to protein sequence analysis

    Various protein sequence analysis techniques are described, aimed at improving the prediction of protein structure by means of pattern matching. To investigate the possibility that improvements in amino acid comparison matrices could result in improvements in the sensitivity and accuracy of protein sequence alignments, a method for rapidly calculating amino acid mutation data matrices from large sequence data sets is presented. The method is then applied to the membrane-spanning segments of integral membrane proteins in order to investigate the nature of amino acid mutability in a lipid environment. Whilst purely sequence-analytic techniques work well for cases where some residual sequence similarity remains between a newly characterized protein and a protein of known 3-D structure, in the harder cases there is little or no sequence similarity with which to recognize proteins with similar folding patterns. In the light of these limitations, a new approach to protein fold recognition is described, which uses a statistically derived pairwise potential to evaluate the compatibility between a test sequence and a library of structural templates derived from solved crystal structures. The method, which is called optimal sequence threading, proves to be highly successful, and is able to detect the common TIM barrel fold between a number of enzyme sequences, which has not been achieved by any previous sequence analysis technique. Finally, a new method for the prediction of the secondary structure and topology of membrane proteins is described. The method employs a set of statistical tables compiled from well-characterized membrane protein data, and a novel dynamic programming algorithm to recognize membrane topology models by expectation maximization. The statistical tables show definite biases towards certain amino acid species on the inside, middle and outside of a cellular membrane.