Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints
The inapplicability of amino acid covariation methods to small protein
families has limited their use for structural annotation of whole genomes.
Recently, deep learning has shown promise in allowing accurate residue-residue
contact prediction even for shallow sequence alignments. Here we introduce
DMPfold, which uses deep learning to predict inter-atomic distance bounds, the
main chain hydrogen bond network, and torsion angles, which it uses to build
models in an iterative fashion. DMPfold produces more accurate models than two
popular methods for a test set of CASP12 domains, and works just as well for
transmembrane proteins. Applied to all Pfam domains without known structures,
confident models for 25% of these so-called dark families were produced in
under a week on a small 200 core cluster. DMPfold provides models for 16% of
human proteome UniProt entries without structures, generates accurate models
with fewer than 100 sequences in some cases, and is freely available.
Comment: JGG and SMK contributed equally to the wor
Accelerated Profile HMM Searches
Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the “multiple segment Viterbi” (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call “sparse rescaling”. These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches.
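As an illustration of the ungapped local alignment segments mentioned in the abstract above, the following minimal sketch scores the single best ungapped segment between a position-specific score table and a sequence. It is not the MSV algorithm and not HMMER3 code: it finds one segment rather than an optimal sum of several, uses no vector parallelism, and the function name, the toy profile, and the default mismatch score are all assumptions made for the example.

```python
def best_ungapped_segment(profile, seq, default_score=-1):
    """Score of the best single ungapped local segment (illustrative only).

    profile: list of dicts, one per profile position, mapping residue -> score.
    seq: target sequence as a string.
    default_score: assumed score for residues missing from a profile column.
    """
    best = 0
    # prev[j] = best score of a segment ending at the previous profile
    # position and at sequence position j (1-based; index 0 is a boundary).
    prev = [0] * (len(seq) + 1)
    for column in profile:
        cur = [0] * (len(seq) + 1)
        for j, residue in enumerate(seq, start=1):
            # Only the diagonal move is allowed (no gaps); a segment may
            # also restart from zero, which makes the alignment local.
            cur[j] = max(0, prev[j - 1] + column.get(residue, default_score))
            best = max(best, cur[j])
        prev = cur
    return best


# Toy profile that rewards the motif A-C-D (hypothetical scores).
toy_profile = [{"A": 2}, {"C": 2}, {"D": 2}]
print(best_ungapped_segment(toy_profile, "GGACDGG"))
```

Because no gap moves are considered, each cell depends only on its diagonal neighbour, which is the property that makes this kind of recurrence cheap enough to serve as a first-pass filter.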
Accurate multiple sequence-structure alignment of RNA sequences using combinatorial optimization
Background: The discovery of functional non-coding RNA sequences has led to an increasing interest in algorithms related to RNA analysis. Traditional sequence alignment algorithms, however, fail at computing reliable alignments of low-homology RNA sequences. The spatial conformation of RNA sequences largely determines their function, and therefore RNA alignment algorithms have to take structural information into account. Results: We present a graph-based representation for sequence-structure alignments, which we model as an integer linear program (ILP). We sketch how we compute an optimal or near-optimal solution to the ILP using methods from combinatorial optimization, and present results on a recently published benchmark set for RNA alignments. Conclusions: The implementation of our algorithm yields better alignments in terms of two published scores than the other programs that we tested: This is especially the case with an increasing number of inpu
A conditional neural fields model for protein threading
Motivation: Alignment errors are still the main bottleneck for current template-based protein modeling (TM) methods, including protein threading and homology modeling, especially when the sequence identity between two proteins under consideration is low (<30%).
Parameters for accurate genome alignment
Background: Genome sequence alignments form the basis of much research. Genome alignment depends on various mundane but critical choices, such as how to mask repeats and which score parameters to use. Surprisingly, there has been no large-scale assessment of these choices using real genomic data. Moreover, rigorous procedures to control the rate of spurious alignment have not been employed. Results: We have assessed 495 combinations of score parameters for alignment of animal, plant, and fungal genomes. As our gold-standard of accuracy, we used genome alignments implied by multiple alignments of proteins and of structural RNAs. We found the HOXD scoring schemes underlying alignments in the UCSC genome database to be far from optimal, and suggest better parameters. Higher values of the X-drop parameter are not always better. E-values accurately indicate the rate of spurious alignment, but only if tandem repeats are masked in a non-standard way. Finally, we show that γ-centroid (probabilistic) alignment can find highly reliable subsets of aligned bases. Conclusions: These results enable more accurate genome alignment, with reliability measures for local alignments and for individual aligned bases. This study was made possible by our new software, LAST, which can align vertebrate genomes in a few hours (http://last.cbrc.jp/).
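For readers unfamiliar with the X-drop parameter named in the abstract above, the sketch below shows the general idea in its simplest, ungapped form: an extension continues until its running score falls more than X below the best score seen so far. This is a generic illustration and not LAST's implementation; the function name, the match and mismatch scores, and the restriction to ungapped extension are assumptions made for the example.

```python
def xdrop_extend(a, b, i, j, x_drop, match=1, mismatch=-1):
    """Extend an ungapped alignment rightwards from a[i] and b[j].

    Stops once the running score drops more than x_drop below the best
    score seen so far. Returns (best_score, best_length).
    Scoring values are hypothetical defaults, not taken from any tool.
    """
    score = 0
    best_score = 0
    best_length = 0
    k = 0
    while i + k < len(a) and j + k < len(b):
        score += match if a[i + k] == b[j + k] else mismatch
        k += 1
        if score > best_score:
            # New best: remember how far the extension has reached.
            best_score, best_length = score, k
        elif best_score - score > x_drop:
            # The score has fallen too far below the best; give up.
            break
    return best_score, best_length


seq_a = "AAAATTAAAAAA"
seq_b = "AAAACCAAAAAA"
print(xdrop_extend(seq_a, seq_b, 0, 0, x_drop=1))
print(xdrop_extend(seq_a, seq_b, 0, 0, x_drop=5))
```

With a small X the extension stops at the two mismatches, while a larger X lets it cross them and reach the matching run beyond. A larger X therefore finds longer alignments at higher cost, and it can also bridge unrelated regions, which is one way a higher value may fail to be better.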