    Bayesian machine learning methods for predicting protein-peptide interactions and detecting mosaic structures in DNA sequences alignments

    Short well-defined domains known as peptide recognition modules (PRMs) regulate many important protein-protein interactions involved in the formation of macromolecular complexes and biochemical pathways. High-throughput experiments like yeast two-hybrid and phage display are expensive and intrinsically noisy; it would therefore be desirable to target informative interactions and pursue in silico approaches. We propose a probabilistic discriminative approach for predicting PRM-mediated protein-protein interactions from sequence data. The model suffered from over-fitting, so Laplacian regularisation was found to be important in achieving a reasonable generalisation performance. A hybrid approach yielded the best performance, where the binding site motifs were initialised with the predictions of a generative model. We also propose another discriminative model which can be applied to all sequences present in the organism at a significantly lower computational cost. This is due to its additional assumption that the underlying binding sites tend to be similar. It is difficult to distinguish between the binding site motifs of the PRM due to the small number of instances of each binding site motif. However, closely related species are expected to share similar binding sites, which would be expected to be highly conserved. We investigated rate variation along DNA sequence alignments, modelling confounding effects such as recombination. Traditional approaches to phylogenetic inference assume that a single phylogenetic tree can represent the relationships and divergences between the taxa. However, taxa sequences exhibit varying levels of conservation, e.g. due to regulatory elements and active binding sites, and certain bacteria and viruses undergo interspecific recombination. We propose a phylogenetic factorial hidden Markov model to infer recombination and rate variation.
We examined the performance of our model and inference scheme on various synthetic alignments, and compared it to state-of-the-art breakpoint models. We investigated three DNA sequence alignments: one of maize actin genes, one bacterial (Neisseria), and the other of HIV-1. Inference is carried out in the Bayesian framework, using Reversible Jump Markov Chain Monte Carlo.

    MRFalign: Protein Homology Detection through Alignment of Markov Random Fields

    Sequence-based protein homology detection has been extensively studied and so far the most sensitive method is based upon comparison of protein sequence profiles, which are derived from multiple sequence alignment (MSA) of sequence homologs in a protein family. A sequence profile is usually represented as a position-specific scoring matrix (PSSM) or an HMM (Hidden Markov Model) and accordingly PSSM-PSSM or HMM-HMM comparison is used for homolog detection. This paper presents a new homology detection method MRFalign, consisting of three key components: 1) a Markov Random Fields (MRF) representation of a protein family; 2) a scoring function measuring similarity of two MRFs; and 3) an efficient ADMM (Alternating Direction Method of Multipliers) algorithm aligning two MRFs. Compared to an HMM, which can only model very short-range residue correlation, MRFs can model long-range residue interaction patterns and thus encode information for the global 3D structure of a protein family. Consequently, MRF-MRF comparison for remote homology detection is expected to be much more sensitive than HMM-HMM or PSSM-PSSM comparison. Experiments confirm that MRFalign outperforms several popular HMM or PSSM-based methods in terms of both alignment accuracy and remote homology detection and that MRFalign works particularly well for mainly beta proteins. For example, tested on the benchmark SCOP40 (8353 proteins) for homology detection, PSSM-PSSM and HMM-HMM succeed on 48% and 52% of proteins, respectively, at superfamily level, and on 15% and 27% of proteins, respectively, at fold level. In contrast, MRFalign succeeds on 57.3% and 42.5% of proteins at superfamily and fold level, respectively. This study implies that long-range residue interaction patterns are very helpful for sequence-based homology detection. The software is available for download at http://raptorx.uchicago.edu/download/. Comment: Accepted by both RECOMB 2014 and PLOS Computational Biology.
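As a rough illustration of the sequence-profile idea this abstract starts from, the sketch below builds a position-specific scoring matrix (PSSM) from a toy alignment. The alignment, the uniform background frequency, the pseudocount, and the function names are all assumptions made for illustration. It shows the PSSM baseline only and is not MRFalign's MRF representation or scoring function.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    pssm = []
    for col in range(length):
        # Count residues in this column; gaps and unknown symbols are ignored.
        counts = Counter(seq[col] for seq in msa if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score_sequence(pssm, seq):
    """Sum the column scores of an ungapped sequence of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


# Hypothetical four-sequence alignment, purely for demonstration.
toy_msa = ["ACDE", "ACDF", "ACEE", "GCDE"]
pssm = build_pssm(toy_msa)
```

A PSSM of this kind scores each column independently, which is the limitation the abstract contrasts with MRFs that also capture dependencies between positions.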

    Probabilistic protein homology modeling

    Searching sequence databases and building 3D models for proteins are important tasks for biologists. When the structure of a query protein is given, its function can be inferred. However, experimental methods for structure prediction are both expensive and time consuming. Fully automatic homology modeling refers to building a 3D model for a query sequence from an alignment to related homologous proteins with known structure (templates) by a computer. Current prediction servers can provide accurate models within a few hours to days. Our group has developed HHpred, which is one of the top performing structure prediction servers in the field. In general, homology based structure modeling consists of four steps: (1) finding homologous templates in a database, (2) selecting and (3) aligning templates to the query, (4) building a 3D model based on the alignment. In part one of this thesis, we will present improvements of steps (2) and (4). Specifically, homology modeling has been shown to work best when multiple templates are selected instead of only a single one. Yet, current servers are using rather ad-hoc approaches to combine information from multiple templates. We provide a rigorous statistical framework for multi-template homology modeling. Given an alignment, we employ Modeller to calculate the most probable structure for a query. The 3D model is obtained by optimally satisfying spatial restraints derived from the alignment and expressed as probability density functions. We find that the query’s atomic distance restraints can be accurately described by two-component Gaussian mixtures. Moreover, we derive statistical weights to quantify the redundancy among related templates. This allows us to apply the standard rules of probability theory to combine restraints from several templates. Together with a heuristic template selection strategy, we have implemented this approach within HHpred and could significantly improve model quality.
Furthermore, we took part in CASP, a community wide competition for structure prediction, where we were ranked first in template based modeling and, at the same time, were more than 450 times faster than all other top servers. Homology modeling heavily relies on detecting and correctly aligning templates to the query sequence (steps (1) and (3) from above). But remote homologies are difficult to detect and hard to align on a pure sequence level. Hence, modern tools are based on profiles instead of sequences. A profile summarizes the evolutionary history of a given sequence and consists of position specific amino acid probabilities for each residue. In addition to the similarity score between profile columns, most methods use extra terms that compare 1D structural properties such as secondary structure or solvent accessibility. These can be predicted from local profile windows. In the second part of this thesis, we develop a new score that is independent of any predefined structural property. For this purpose, we learn a library of 32 profile patterns that are most conserved in alignments of remotely homologous, structurally aligned proteins. Each so called “context state” in the library consists of a 13-residue sequence profile. We integrate the new context score into our HMM-HMM alignment tool HHsearch and significantly improve the sensitivity and precision of alignments, especially for difficult pairwise alignments. Taken together, we introduced probabilistic methods to improve all four main steps in homology based structure prediction.
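The abstract describes distance restraints as two-component Gaussian mixtures that are combined across templates using statistical weights. The sketch below shows one way such a combination could look. The combination rule (a weighted sum of log densities), the mixture parameters, the template weights, and the grid search are assumptions for illustration, not the formulas used in HHpred or Modeller.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, components):
    """components: list of (mixing_weight, mean, sd); mixing weights sum to 1."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)


def combined_log_density(x, restraints):
    """restraints: list of (template_weight, components).

    Assumed rule: each template contributes its log density scaled by its
    weight, so redundant templates can be down-weighted.
    """
    return sum(tw * math.log(mixture_pdf(x, comps)) for tw, comps in restraints)


def most_probable_distance(restraints, lo=0.0, hi=20.0, step=0.01):
    """Grid search for the distance (in angstroms) with the highest combined density."""
    n_points = int(round((hi - lo) / step)) + 1
    grid = [lo + i * step for i in range(n_points)]
    return max(grid, key=lambda x: combined_log_density(x, restraints))


# Hypothetical restraints for one atom pair: a narrow component for well
# aligned regions plus a broad component that absorbs alignment errors.
template_a = [(0.8, 6.0, 0.5), (0.2, 6.0, 3.0)]
template_b = [(0.7, 6.4, 0.5), (0.3, 6.4, 3.0)]
restraints = [(1.0, template_a), (0.5, template_b)]
best_distance = most_probable_distance(restraints)
```

With these toy numbers the most probable distance falls between the two template means and lies closer to the template that carries the larger weight.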

    Using structural bioinformatics to investigate the impact of non synonymous SNPs and disease mutations: scope and limitations

    BACKGROUND: Linking structural effects of mutations to functional outcomes is a major issue in structural bioinformatics, and many tools and studies have shown that specific structural properties such as stability and residue burial can be used to distinguish neutral variations and disease associated mutations. RESULTS: We have investigated 39 structural properties on a set of SNPs and disease mutations from the Uniprot Knowledge Base that could be mapped on high quality crystal structures and show that none of these properties can be used as a sole classification criterion to separate the two data sets. Furthermore, we have reviewed the annotation process from mutation to result and identified the liabilities in each step. CONCLUSION: Although excellent annotation results of various research groups underline the great potential of using structural bioinformatics to investigate the mechanisms underlying disease, the interpretation of such annotations cannot always be extrapolated to proteome wide variation studies. Difficulties for large-scale studies can be found both on the technical level, i.e. the scarcity of data and the incompleteness of the structural tool suites, and on the conceptual level, i.e. the correct interpretation of the results in a cellular context.

    Discriminative motif discovery in DNA and protein sequences using the DEME algorithm

    BACKGROUND: Motif discovery aims to detect short, highly conserved patterns in a collection of unaligned DNA or protein sequences. Discriminative motif finding algorithms aim to increase the sensitivity and selectivity of motif discovery by utilizing a second set of sequences, and searching only for patterns that can differentiate the two sets of sequences. Potential applications of discriminative motif discovery include discovering transcription factor binding site motifs in ChIP-chip data and finding protein motifs involved in thermal stability using sets of orthologous proteins from thermophilic and mesophilic organisms. RESULTS: We describe DEME, a discriminative motif discovery algorithm for use with protein and DNA sequences. Input to DEME is two sets of sequences: a "positive" set and a "negative" set. DEME represents motifs using a probabilistic model, and uses a novel combination of global and local search to find the motif that optimally discriminates between the two sets of sequences. DEME is unique among discriminative motif finders in that it uses an informative Bayesian prior on protein motif columns, allowing it to incorporate prior knowledge of residue characteristics. We also introduce four synthetic discriminative motif discovery problems that are designed for evaluating discriminative motif finders in various biologically motivated contexts. We test DEME using these synthetic problems and on two biological problems: finding yeast transcription factor binding motifs in ChIP-chip data, and finding motifs that discriminate between groups of thermophilic and mesophilic orthologous proteins. CONCLUSION: Using artificial data, we show that DEME is more effective than a non-discriminative approach when there are "decoy" motifs or when a variant of the motif is present in the "negative" sequences.
With real data, we show that DEME is as good as, but not better than, non-discriminative algorithms at discovering yeast transcription factor binding motifs. We also show that DEME can find highly informative thermal-stability protein motifs. Binaries for the stand-alone program DEME are free for academic use and are available at http://bioinformatics.org.au/deme/.
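To make the idea of a discriminative motif score concrete, the sketch below scores a candidate DNA motif by how much better it matches a "positive" set than a "negative" set. The objective (difference of mean best-window log-odds scores), the uniform background, the toy sequences, and all function names are assumptions for illustration. It is not DEME's probabilistic model, prior, or search procedure.

```python
import math

DNA = "ACGT"


def pwm_from_consensus(consensus, match=0.85):
    """Build a simple position weight matrix that favours the consensus base."""
    other = (1.0 - match) / 3.0
    return [{b: (match if b == c else other) for b in DNA} for c in consensus]


def best_window_score(pwm, seq, background=0.25):
    """Highest log-odds score of the motif over all windows of the sequence."""
    width = len(pwm)
    best = float("-inf")
    for start in range(len(seq) - width + 1):
        score = sum(
            math.log2(pwm[i][seq[start + i]] / background) for i in range(width)
        )
        best = max(best, score)
    return best


def discrimination(pwm, positives, negatives):
    """Mean best-window score in the positive set minus that in the negative set."""
    pos = sum(best_window_score(pwm, s) for s in positives) / len(positives)
    neg = sum(best_window_score(pwm, s) for s in negatives) / len(negatives)
    return pos - neg


# Hypothetical sequences: only the positives contain the site TGACTCA.
positives = ["AATGACTCAGG", "CCTGACTCATT"]
negatives = ["AATTTTAAAGG", "CCGGCCGGCCA"]
motif = pwm_from_consensus("TGACTCA")
margin = discrimination(motif, positives, negatives)
```

A motif that occurs equally often in both sets gets a margin near zero under this score, which is the intuition behind penalising "decoy" motifs.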

    Protein Long Local Structure Prediction

    A relevant and accurate description of three-dimensional (3D) protein structures can be achieved by characterizing recurrent local structures. In a previous study, we developed a library of 120 3D structural prototypes encompassing all known 11-residue-long local protein structures and ensuring a good quality of structural approximation. A local structure prediction method was also proposed. Here, overlapping properties of local protein structures in global ones are taken into account to characterize frequent local networks. At the same time, we propose a new long local structure prediction strategy which involves the use of evolutionary information coupled with Support Vector Machines (SVMs). Our prediction is evaluated by a stringent geometrical assessment. Every local structure prediction with a Cα RMSD less than 2.5 Å from the true local structure is considered as correct. A global prediction rate of 63.1% is then reached, corresponding to an improvement of 7.7 points compared with the previous strategy. In the same way, the prediction of 88.33% of the 120 structural classes is improved with 8.65% mean gain. 85.33% of proteins have better prediction results with a 9.43% average gain. An analysis of prediction rate per local network also supports the global improvement and gives insights into the potential of our method for predicting super local structures. Moreover, a confidence index for the direct estimation of prediction quality is proposed. Finally, our method is proved to be very competitive with cutting-edge strategies encompassing three categories of local structure predictions. Proteins 2009. (c) 2009 Wiley-Liss, Inc.

    HH-suite3 for fast remote homology detection and deep protein annotation.

    BACKGROUND: HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous proteins. RESULTS: We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced various other speed-ups. These accelerated the search methods HHsearch by a factor of 4 and HHblits by a factor of 2 over the previous version 2.0.16. HHblits3 is ∼10× faster than PSI-BLAST and ∼20× faster than HMMER3. Jobs to perform HHsearch and HHblits searches with many query profile HMMs can be parallelized over cores and over cluster servers using OpenMP and message passing interface (MPI). The free, open-source, GPLv3-licensed software is available at https://github.com/soedinglab/hh-suite. CONCLUSION: The added functionalities and increased speed of HHsearch and HHblits should facilitate their use in large-scale protein structure and function prediction, e.g. in metagenomics and genomics projects.
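For readers unfamiliar with the Viterbi algorithm named in this abstract, the sketch below shows the basic dynamic programme on a toy two-state HMM. It is a plain scalar version for a single observation sequence. It is not the SIMD-vectorized pairwise profile HMM alignment implemented in HH-suite, and the states, probabilities, and function name are assumptions for illustration.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable state path and its log probability."""
    # best[t][s]: log probability of the best path that ends in state s at step t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Pick the predecessor that maximises the path score into s.
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical model: a "core" state that mostly emits hydrophobic (H)
# symbols and a "loop" state that mostly emits polar (P) symbols.
states = ["core", "loop"]
log_start = {s: math.log(0.5) for s in states}
log_trans = {
    "core": {"core": math.log(0.9), "loop": math.log(0.1)},
    "loop": {"core": math.log(0.1), "loop": math.log(0.9)},
}
log_emit = {
    "core": {"H": math.log(0.9), "P": math.log(0.1)},
    "loop": {"H": math.log(0.1), "P": math.log(0.9)},
}
path, log_prob = viterbi("HHHPPP", states, log_start, log_trans, log_emit)
```

The inner loop over states is independent for each state at a given step, which is the kind of regular structure that a vectorized implementation can exploit.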
