Bayesian machine learning methods for predicting protein-peptide interactions and detecting mosaic structures in DNA sequences alignments
Short well-defined domains known as peptide recognition modules (PRMs) regulate many important protein-protein interactions involved in the formation of macromolecular complexes
and biochemical pathways. High-throughput experiments like yeast two-hybrid and phage
display are expensive and intrinsically noisy; it would therefore be desirable to target informative interactions and pursue in silico approaches. We propose a probabilistic discriminative
approach for predicting PRM-mediated protein-protein interactions from sequence data. The
model suffered from over-fitting, so Laplacian regularisation was found to be important in
achieving a reasonable generalisation performance. A hybrid approach yielded the best performance, where the binding site motifs were initialised with the predictions of a generative
model. We also propose another discriminative model which can be applied to all sequences
present in the organism at a significantly lower computational cost. This is due to its additional
assumption that the underlying binding sites tend to be similar.
It is difficult to distinguish between the binding site motifs of the PRM due to the small
number of instances of each binding site motif. However, closely related species are expected
to share similar binding sites, which would be expected to be highly conserved. We investigated
rate variation along DNA sequence alignments, modelling confounding effects such as recombination. Traditional approaches to phylogenetic inference assume that a single phylogenetic
tree can represent the relationships and divergences between the taxa. However, taxa sequences
exhibit varying levels of conservation, e.g. due to regulatory elements and active binding sites,
and certain bacteria and viruses undergo interspecific recombination. We propose a phylogenetic factorial hidden Markov model to infer recombination and rate variation. We examined
the performance of our model and inference scheme on various synthetic alignments, and compared it to state-of-the-art breakpoint models. We investigated three DNA sequence alignments:
one of maize actin genes, one bacterial (Neisseria), and one of HIV-1. Inference is carried
out in the Bayesian framework, using Reversible Jump Markov Chain Monte Carlo.
MRFalign: Protein Homology Detection through Alignment of Markov Random Fields
Sequence-based protein homology detection has been extensively studied and so
far the most sensitive method is based upon comparison of protein sequence
profiles, which are derived from multiple sequence alignment (MSA) of sequence
homologs in a protein family. A sequence profile is usually represented as a
position-specific scoring matrix (PSSM) or an HMM (Hidden Markov Model) and
accordingly PSSM-PSSM or HMM-HMM comparison is used for homolog detection. This
paper presents a new homology detection method MRFalign, consisting of three
key components: 1) a Markov Random Fields (MRF) representation of a protein
family; 2) a scoring function measuring similarity of two MRFs; and 3) an
efficient ADMM (Alternating Direction Method of Multipliers) algorithm aligning
two MRFs. Compared to HMMs, which can only model very short-range residue
correlations, MRFs can model long-range residue interaction patterns and thus
encode information about the global 3D structure of a protein family.
Consequently, MRF-MRF comparison for remote homology detection should be much
more sensitive than HMM-HMM or PSSM-PSSM comparison. Experiments confirm that
MRFalign outperforms several popular HMM or PSSM-based methods in terms of both
alignment accuracy and remote homology detection and that MRFalign works
particularly well for mainly beta proteins. For example, tested on the
benchmark SCOP40 (8353 proteins) for homology detection, PSSM-PSSM and HMM-HMM
succeed on 48% and 52% of proteins, respectively, at superfamily level, and on
15% and 27% of proteins, respectively, at fold level. In contrast, MRFalign
succeeds on 57.3% and 42.5% of proteins at superfamily and fold level,
respectively. This study implies that long-range residue interaction patterns
are very helpful for sequence-based homology detection. The software is
available for download at http://raptorx.uchicago.edu/download/.
Comment: Accepted by both RECOMB 2014 and PLOS Computational Biology.
Probabilistic protein homology modeling
Searching sequence databases and building 3D models for proteins are important tasks
for biologists. When the structure of a query protein is given, its function can be inferred. However, experimental methods for structure determination are both expensive and
time consuming. Fully automatic homology modeling refers to building a 3D model for
a query sequence from an alignment to related homologous proteins with known structure (templates) by a computer. Current prediction servers can provide accurate models
within a few hours to days. Our group has developed HHpred, which is one of the top
performing structure prediction servers in the field.
In general, homology based structure modeling consists of four steps: (1) finding homologous templates in a database, (2) selecting and (3) aligning templates to the query, (4)
building a 3D model based on the alignment.
In part one of this thesis, we will present improvements of steps (2) and (4). Specifically,
homology modeling has been shown to work best when multiple templates are selected
instead of only a single one. Yet, current servers are using rather ad-hoc approaches to
combine information from multiple templates. We provide a rigorous statistical framework for multi-template homology modeling. Given an alignment, we employ Modeller to calculate the most probable structure for a query. The 3D model is obtained
by optimally satisfying spatial restraints derived from the alignment and expressed as
probability density functions. We find that the query’s atomic distance restraints can
be accurately described by two-component Gaussian mixtures. Moreover, we derive statistical weights to quantify the redundancy among related templates. This allows us to
apply the standard rules of probability theory to combine restraints from several templates. Together with a heuristic template selection strategy, we have implemented this
approach within HHpred and could significantly improve model quality. Furthermore,
we took part in CASP, a community wide competition for structure prediction, where
we were ranked first in template based modeling and, at the same time, were more than
450 times faster than all other top servers.
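To illustrate how mixture-shaped distance restraints from several templates could be combined, here is a minimal sketch. It is not the HHpred or Modeller implementation: the function names, the mixture parameters, the template weights, and the choice to combine templates as a weighted sum of log-densities are all assumptions made for illustration.

```python
import math

def gaussian_pdf(x, mean, sigma):
    """Density of a normal distribution N(mean, sigma^2) at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, components):
    """Density of a Gaussian mixture; components is a list of
    (mixing_weight, mean, sigma) tuples whose mixing weights sum to 1."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)

def combined_log_density(x, template_restraints):
    """Weighted sum of log mixture densities over templates.
    template_restraints is a list of (template_weight, components).
    Down-weighting redundant templates this way is an assumption,
    not the formula derived in the thesis."""
    return sum(tw * math.log(mixture_pdf(x, comps))
               for tw, comps in template_restraints)

# Hypothetical restraints on one atom-atom distance (in angstroms):
# a narrow component around the template distance plus a broad one.
restraints = [
    (1.0, [(0.8, 5.0, 0.5), (0.2, 5.0, 2.0)]),  # template 1
    (0.5, [(0.8, 6.0, 0.5), (0.2, 6.0, 2.0)]),  # template 2, down-weighted
]

# Scan candidate distances and keep the one with the highest combined score.
grid = [4.0 + 0.01 * i for i in range(301)]
best_distance = max(grid, key=lambda d: combined_log_density(d, restraints))
```

In this toy setup the best-scoring distance falls between the two template distances; a real modeling program would optimize all restraints jointly over 3D coordinates rather than scanning a single distance.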
Homology modeling heavily relies on detecting and correctly aligning templates to the
query sequence (steps (1) and (3) from above). But remote homologies are difficult to
detect and hard to align on a pure sequence level. Hence, modern tools are based on
profiles instead of sequences. A profile summarizes the evolutionary history of a given
sequence and consists of position specific amino acid probabilities for each residue. In
addition to the similarity score between profile columns, most methods use extra terms
that compare 1D structural properties such as secondary structure or solvent accessibility. These can be predicted from local profile windows.
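As a small illustration of what a sequence profile contains, the sketch below estimates position-specific amino acid probabilities from a toy alignment. The function name build_profile, the uniform pseudocount, and the toy alignment are assumptions for illustration; real profile tools use sequence weighting and more elaborate pseudocount schemes.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one dict per alignment column, mapping each amino acid
    to its estimated probability at that position."""
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        # Count residues in this column; gaps and unknown symbols are ignored.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({aa: (counts[aa] + pseudocount) / total
                        for aa in AMINO_ACIDS})
    return profile

# Hypothetical three-sequence alignment with one gap.
toy_alignment = ["ACD", "ACE", "A-D"]
profile = build_profile(toy_alignment)
```

Each column of the result sums to one, and conserved positions (here the first column) concentrate probability on a single residue.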
In the second part of this thesis, we develop a new score that is independent of any predefined structural property. For this purpose, we learn a library of 32 profile patterns that
are most conserved in alignments of remotely homologous, structurally aligned proteins.
Each so-called “context state” in the library consists of a 13-residue sequence profile.
We integrate the new context score into our HMM-HMM alignment tool HHsearch and
significantly improve the sensitivity and precision of alignments, especially for difficult pairwise alignments.
Taken together, we introduced probabilistic methods to improve all four main steps in
homology based structure prediction.
Using structural bioinformatics to investigate the impact of non synonymous SNPs and disease mutations: scope and limitations
BACKGROUND: Linking structural effects of mutations to functional outcomes is a major issue in structural bioinformatics, and many tools and studies have shown that specific structural properties such as stability and residue burial can be used to distinguish neutral variations and disease-associated mutations. RESULTS: We have investigated 39 structural properties on a set of SNPs and disease mutations from the Uniprot Knowledge Base that could be mapped on high quality crystal structures and show that none of these properties can be used as a sole classification criterion to separate the two data sets. Furthermore, we have reviewed the annotation process from mutation to result and identified the liabilities in each step. CONCLUSION: Although excellent annotation results of various research groups underline the great potential of using structural bioinformatics to investigate the mechanisms underlying disease, the interpretation of such annotations cannot always be extrapolated to proteome-wide variation studies. Difficulties for large-scale studies can be found both on the technical level, i.e. the scarcity of data and the incompleteness of the structural tool suites, and on the conceptual level, i.e. the correct interpretation of the results in a cellular context.
Discriminative motif discovery in DNA and protein sequences using the DEME algorithm
BACKGROUND: Motif discovery aims to detect short, highly conserved patterns in a collection of unaligned DNA or protein sequences. Discriminative motif finding algorithms aim to increase the sensitivity and selectivity of motif discovery by utilizing a second set of sequences, and searching only for patterns that can differentiate the two sets of sequences. Potential applications of discriminative motif discovery include discovering transcription factor binding site motifs in ChIP-chip data and finding protein motifs involved in thermal stability using sets of orthologous proteins from thermophilic and mesophilic organisms. RESULTS: We describe DEME, a discriminative motif discovery algorithm for use with protein and DNA sequences. The input to DEME is two sets of sequences: a "positive" set and a "negative" set. DEME represents motifs using a probabilistic model, and uses a novel combination of global and local search to find the motif that optimally discriminates between the two sets of sequences. DEME is unique among discriminative motif finders in that it uses an informative Bayesian prior on protein motif columns, allowing it to incorporate prior knowledge of residue characteristics. We also introduce four synthetic discriminative motif discovery problems that are designed for evaluating discriminative motif finders in various biologically motivated contexts. We test DEME using these synthetic problems and on two biological problems: finding yeast transcription factor binding motifs in ChIP-chip data, and finding motifs that discriminate between groups of thermophilic and mesophilic orthologous proteins. CONCLUSION: Using artificial data, we show that DEME is more effective than a non-discriminative approach when there are "decoy" motifs or when a variant of the motif is present in the "negative" sequences. With real data, we show that DEME is as good as, but not better than, non-discriminative algorithms at discovering yeast transcription factor binding motifs. We also show that DEME can find highly informative thermal-stability protein motifs. Binaries for the stand-alone program DEME are free for academic use and are available at http://bioinformatics.org.au/deme/.
Protein Long Local Structure Prediction
A relevant and accurate description of three-dimensional (3D) protein structures can be achieved by characterizing recurrent local structures. In a previous study, we developed a library of 120 3D structural prototypes encompassing all known 11-residue-long local protein structures and ensuring a good quality of structural approximation. A local structure prediction method was also proposed. Here, overlapping properties of local protein structures in global ones are taken into account to characterize frequent local networks. At the same time, we propose a new long local structure prediction strategy which involves the use of evolutionary information coupled with Support Vector Machines (SVMs). Our prediction is evaluated by a stringent geometrical assessment: every local structure prediction with a Cα RMSD of less than 2.5 Å from the true local structure is considered correct. A global prediction rate of 63.1% is then reached, corresponding to an improvement of 7.7 points compared with the previous strategy. In the same way, the prediction of 88.33% of the 120 structural classes is improved, with an 8.65% mean gain. 85.33% of proteins have better prediction results, with a 9.43% average gain. An analysis of prediction rate per local network also supports the global improvement and gives insights into the potential of our method for predicting super local structures. Moreover, a confidence index for the direct estimation of prediction quality is proposed. Finally, our method is shown to be very competitive with cutting-edge strategies encompassing three categories of local structure predictions. Proteins 2009. © 2009 Wiley-Liss, Inc.
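The 2.5 Å correctness criterion can be illustrated with a short sketch. It assumes the predicted and true fragments have already been superimposed (no optimal superposition step is performed here), and the function names and toy coordinates are hypothetical.

```python
import math

RMSD_THRESHOLD = 2.5  # angstroms, the cutoff stated in the abstract

def ca_rmsd(predicted, reference):
    """Root-mean-square deviation between two equal-length lists of
    (x, y, z) C-alpha coordinates. Assumes the two fragments are
    already superimposed; no fitting is done."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("fragments must be non-empty and of equal length")
    squared = sum((p - r) ** 2
                  for atom_p, atom_r in zip(predicted, reference)
                  for p, r in zip(atom_p, atom_r))
    return math.sqrt(squared / len(predicted))

def is_correct(predicted, reference, threshold=RMSD_THRESHOLD):
    """A prediction counts as correct when its RMSD is below the threshold."""
    return ca_rmsd(predicted, reference) < threshold

# Toy 11-residue fragment laid out along the x axis, and two shifted copies.
reference = [(3.8 * i, 0.0, 0.0) for i in range(11)]
close_prediction = [(x, y + 1.0, z) for x, y, z in reference]
far_prediction = [(x, y + 3.0, z) for x, y, z in reference]
```

With these toy coordinates, the copy shifted by 1 Å passes the criterion and the copy shifted by 3 Å does not.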
HH-suite3 for fast remote homology detection and deep protein annotation.
BACKGROUND: HH-suite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition. It is based on pairwise alignment of profile Hidden Markov models (HMMs), which represent multiple sequence alignments of homologous proteins. RESULTS: We developed a single-instruction multiple-data (SIMD) vectorized implementation of the Viterbi algorithm for profile HMM alignment and introduced various other speed-ups. These accelerated the search methods HHsearch by a factor of 4 and HHblits by a factor of 2 over the previous version 2.0.16. HHblits3 is ∼10× faster than PSI-BLAST and ∼20× faster than HMMER3. Jobs to perform HHsearch and HHblits searches with many query profile HMMs can be parallelized over cores and over cluster servers using OpenMP and message passing interface (MPI). The free, open-source, GPLv3-licensed software is available at https://github.com/soedinglab/hh-suite. CONCLUSION: The added functionalities and increased speed of HHsearch and HHblits should facilitate their use in large-scale protein structure and function prediction, e.g. in metagenomics and genomics projects.
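For readers unfamiliar with the Viterbi recursion mentioned above, here is a minimal scalar sketch for a generic HMM in log space. It is not HH-suite code and does not implement pairwise profile HMM alignment or SIMD vectorization; the two-state toy model and all names are assumptions for illustration.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable state path and its log probability."""
    # best[t][s]: log probability of the best path ending in state s at time t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Pick the predecessor state that maximizes the path score into s.
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

# Hypothetical two-state model: state H prefers symbol "a", state L prefers "b".
states = ("H", "L")
log_start = {s: math.log(0.5) for s in states}
log_trans = {
    "H": {"H": math.log(0.8), "L": math.log(0.2)},
    "L": {"H": math.log(0.2), "L": math.log(0.8)},
}
log_emit = {
    "H": {"a": math.log(0.9), "b": math.log(0.1)},
    "L": {"a": math.log(0.1), "b": math.log(0.9)},
}
path, log_prob = viterbi("aabb", states, log_start, log_trans, log_emit)
```

The inner maximization over predecessor states is the part that a vectorized implementation would evaluate for many cells at once.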