Counting, generating and sampling tree alignments
Pairwise ordered tree alignments are combinatorial objects that appear in RNA
secondary structure comparison. However, the usual representation of tree
alignments as supertrees is ambiguous, i.e. two distinct supertrees may induce
identical sets of matches between identical pairs of trees. This ambiguity is
uninformative, and detrimental to any probabilistic analysis. In this work, we
consider tree alignments up to equivalence. Our first result is a precise
asymptotic enumeration of tree alignments, obtained from a context-free grammar
by means of basic analytic combinatorics. Our second result focuses on
alignments between two given ordered trees. By refining our grammar
to align specific trees, we obtain a decomposition scheme for the space of
alignments, and use it to design an efficient dynamic programming algorithm for
sampling alignments under the Gibbs-Boltzmann probability distribution. This
generalizes existing tree alignment algorithms, and opens the door for a
probabilistic analysis of the space of suboptimal RNA secondary structure
alignments.
Comment: ALCOB - 3rd International Conference on Algorithms for Computational
Biology - 2016, Jun 2016, Trujillo, Spain. 201
A simple, practical and complete O(n^3/log n)-time Algorithm for RNA folding using the Four-Russians Speedup
Background: The problem of computationally predicting the secondary structure (or folding) of RNA molecules was first introduced more than thirty years ago and yet continues to be an area of active research and development. The basic RNA-folding problem of finding a maximum cardinality, non-crossing matching of complementary nucleotides in an RNA sequence of length n has an O(n^3)-time dynamic programming solution that is widely applied. It is known that an o(n^3) worst-case time solution is possible, but the published and suggested methods are complex and have not been established to be practical. Significant practical improvements to the original dynamic programming method have been introduced, but they retain the O(n^3) worst-case time bound when n is the only problem parameter used in the bound. Surprisingly, the most widely used general technique to achieve a worst-case (and often practical) speed-up of dynamic programming, the Four-Russians technique, has not been previously applied to the RNA-folding problem. This is perhaps due to technical issues in adapting the technique to RNA folding.
Results: In this paper, we give a simple, complete, and practical Four-Russians algorithm for the basic RNA-folding problem, achieving a worst-case time bound of O(n^3/log(n)).
Conclusions: We show that this time bound can also be obtained for richer nucleotide-matching scoring schemes, and that the method achieves consistent speed-ups in practice. The contribution is both theoretical and practical, since the basic RNA-folding problem is often solved multiple times in the inner loop of more complex algorithms, and for long RNA molecules in the study of RNA virus genomes.
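For orientation, the cubic-time baseline that the abstract refers to can be sketched as follows. This is a minimal illustration of the classic dynamic program for the basic RNA-folding problem (maximum non-crossing matching of complementary nucleotides), not the Four-Russians algorithm from the paper. The function name `max_pairs`, the restriction to Watson-Crick pairs, and the absence of a minimum hairpin-loop length are assumptions made for this example.

```python
# Baseline O(n^3) dynamic program for the basic RNA-folding problem.
# Assumptions for this sketch: only A-U and G-C pairs count as
# complementary, and no minimum loop length is enforced.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def max_pairs(seq: str) -> int:
    """Return the maximum number of non-crossing complementary pairs."""
    n = len(seq)
    if n == 0:
        return 0
    # S[i][j] = best score for the substring seq[i..j] (inclusive).
    S = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = S[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j):  # case 2: j pairs with some k in [i, j)
                if (seq[k], seq[j]) in PAIRS:
                    left = S[i][k - 1] if k > i else 0
                    inner = S[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    best = max(best, left + inner + 1)
            S[i][j] = best
    return S[0][n - 1]
```

The three nested loops over `span`, `i` and `k` are what give the O(n^3) bound; the inner loop over `k` is the part that a Four-Russians style speed-up would target.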
Fr-TM-align: a new protein structural alignment method based on fragment alignments and the TM-score
©2008 Pandit and Skolnick; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. This article is available from: http://www.biomedcentral.com/1471-2105/9/531 (doi:10.1186/1471-2105-9-531).
Background: Protein tertiary structure comparisons are employed in various fields of
contemporary structural biology. Most structure comparison methods involve generation of an
initial seed alignment, which is extended and/or refined to provide the best structural superposition
between a pair of protein structures as assessed by a structure comparison metric. One such
metric, the TM-score, was recently introduced to provide a combined structure quality measure
of the coordinate root mean square deviation between a pair of structures and coverage. Using the
TM-score, the TM-align structure alignment algorithm was developed that was often found to have
better accuracy and coverage than the most commonly used structural alignment programs;
however, there were a number of situations when this was not true.
Results: To further improve structure alignment quality, the Fr-TM-align algorithm has been
developed where aligned fragment pairs are used to generate the initial seed alignments that are
then refined using dynamic programming to maximize the TM-score. For the assessment of the
structural alignment quality from Fr-TM-align in comparison to other programs such as CE and TM-align,
we examined various alignment quality assessment scores such as PSI and TM-score. The
assessment showed that the structural alignment quality from Fr-TM-align is better in comparison
to both CE and TM-align. On average, the structural alignments generated using Fr-TM-align have
a higher TM-score (~9%) and coverage (~7%) in comparison to those generated by TM-align.
Fr-TM-align uses an exhaustive procedure to generate initial seed alignments. Hence, the algorithm is
computationally more expensive than TM-align.
Conclusion: Fr-TM-align, a new algorithm that employs fragment alignment and assembly, provides
better structural alignments in comparison to TM-align. The source code and executables of
Fr-TM-align are freely downloadable at: http://cssb.biology.gatech.edu/skolnick/files/FrTMalign/
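The abstract names the TM-score but does not state it. As a hedged illustration, the sketch below evaluates the commonly cited definition, TM = (1/L) * sum over aligned pairs of 1 / (1 + (d_i/d0)^2) with d0 = 1.24 * (L - 15)^(1/3) - 1.8, for a fixed superposition. It does not perform the superposition search or the fragment-based seeding that Fr-TM-align describes. The function names and the 0.5 Å floor on d0 for short targets are assumptions of this example.

```python
# TM-score for a fixed superposition, given per-residue distances (in
# angstroms) between aligned residue pairs. Illustrative only: the
# search for the best superposition is not part of this sketch.
from typing import Iterable


def tm_d0(l_target: int) -> float:
    """Length-dependent distance scale d0 (floor of 0.5 is an assumption)."""
    if l_target <= 21:
        return 0.5
    return max(0.5, 1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances: Iterable[float], l_target: int) -> float:
    """Sum 1 / (1 + (d/d0)^2) over aligned pairs, normalised by l_target."""
    if l_target <= 0:
        raise ValueError("l_target must be positive")
    d0 = tm_d0(l_target)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_target
```

Because the sum runs only over aligned pairs while the normalisation uses the full target length, the score rewards both small deviations and high coverage, which matches the abstract's description of a combined measure.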
Island method for estimating the statistical significance of profile-profile alignment scores
Background: In the last decade, a significant improvement in detecting remote similarity between protein sequences has been made by utilizing alignment profiles in place of amino-acid strings. Unfortunately, no analytical theory is available for estimating the significance of a gapped alignment of two profiles. Many experiments suggest that the distribution of local profile-profile alignment scores is of the Gumbel form. However, estimating distribution parameters by random simulations turns out to be computationally very expensive.
Results: We demonstrate that the background distribution of profile-profile alignment scores heavily depends on the profiles' composition, and thus the distribution parameters must be estimated independently for each pair of profiles of interest. We also show that accurate estimates of statistical parameters can be obtained using the "island statistics" for profile-profile alignments.
Conclusion: The island statistics can be generalized to profile-profile alignments to provide an efficient method for alignment score normalization. Since multiple island scores can be extracted from a single comparison of two profiles, the island method has a clear speed advantage over the direct shuffling method for comparable accuracy in parameter estimates.
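To make the role of the distribution parameters concrete, the sketch below converts an alignment score into a P-value under a Gumbel-form tail. It assumes the usual Karlin-Altschul parameterization, in which the expected number of chance alignments scoring at least S is E = K * m * n * exp(-lambda * S) and P = 1 - exp(-E). The abstract states only that the scores follow a Gumbel form, so this parameterization, the function name, and the argument names are assumptions. The island-based estimation of lambda and K is not shown.

```python
# P-value of a local alignment score under a Gumbel-form tail, assuming
# the Karlin-Altschul parameterization (an assumption of this sketch).
import math


def gumbel_pvalue(score: float, lam: float, k: float, m: int, n: int) -> float:
    """Probability of a chance score >= `score` for profiles of length m and n."""
    expected = k * m * n * math.exp(-lam * score)  # E-value
    # 1 - exp(-E), computed with expm1 to stay accurate for tiny E.
    return -math.expm1(-expected)
```

Since lambda and K depend on the composition of each profile pair, as the abstract argues, they would have to be re-estimated for every pair before a call such as `gumbel_pvalue(score, lam, k, m, n)` is meaningful.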
Evolutionary distances in the twilight zone -- a rational kernel approach
Phylogenetic tree reconstruction is traditionally based on multiple sequence
alignments (MSAs) and heavily depends on the validity of this information
bottleneck. With increasing sequence divergence, the quality of MSAs decays
quickly. Alignment-free methods, on the other hand, are based on abstract
string comparisons and avoid potential alignment problems. However, in general
they are not biologically motivated and ignore our knowledge about the
evolution of sequences. Thus, it is still a major open question how to define
an evolutionary distance metric between divergent sequences that makes use of
indel information and known substitution models without the need for a multiple
alignment. Here we propose a new evolutionary distance metric to close this
gap. It uses finite-state transducers to create a biologically motivated
similarity score which models substitutions and indels, and does not depend on
a multiple sequence alignment. The sequence similarity score is defined in
analogy to pairwise alignments and additionally has the positive semi-definite
property. We describe its derivation and show in simulation studies and
real-world examples that it is more accurate in reconstructing phylogenies than
competing methods. The result is a new and accurate way of determining
evolutionary distances in and beyond the twilight zone of sequence alignments
that is suitable for large datasets.
Comment: to appear in PLoS ON
Autonomy support, basic need satisfaction and the optimal functioning of adult male and female sport participants: A test of basic needs theory
Grounded in Basic Needs Theory (BNT; Ryan and Deci, American Psychologist, 55, 68–78, 2000a), the present study aimed to: (a) test a theoretically based model of coach autonomy support, motivational processes and well-/ill-being among a sample of adult sport participants, (b) discern which basic psychological need(s) mediate the link between autonomy support and well-/ill-being, and (c) explore gender invariance in the hypothesized model. Five hundred and thirty-nine participants (Male = 271; Female = 268; mean age = 22.75) completed a multi-section questionnaire tapping the targeted variables. Structural Equation Modeling (SEM) analysis revealed that coach autonomy support predicted participants’ basic need satisfaction for autonomy, competence and relatedness. In turn, basic need satisfaction predicted greater subjective vitality when engaged in sport. Participants with low levels of autonomy were more susceptible to feeling emotionally and physically exhausted from their sport investment. Autonomy and competence partially mediated the path from autonomy support to subjective vitality. Lastly, the results supported partial invariance of the model with respect to gender.
Using Structure to Explore the Sequence Alignment Space of Remote Homologs
Protein structure modeling by homology requires an accurate sequence alignment between the query protein and its structural template. However, sequence alignment methods based on dynamic programming (DP) are typically unable to generate accurate alignments for remote sequence homologs, thus limiting the applicability of modeling methods. A central problem is that the alignment that is “optimal” in terms of the DP score does not necessarily correspond to the alignment that produces the most accurate structural model. That is, the correct alignment based on structural superposition will generally have a lower score than the optimal alignment obtained from sequence. Variations of the DP algorithm have been developed that generate alternative alignments that are “suboptimal” in terms of the DP score, but these still encounter difficulties in detecting the correct structural alignment. We present here a new alternative sequence alignment method that relies heavily on the structure of the template. By initially aligning the query sequence to individual fragments in secondary structure elements and combining high-scoring fragments that pass basic tests for “modelability”, we can generate accurate alignments within a small ensemble. Our results suggest that the set of sequences that can currently be modeled by homology can be greatly extended
Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences
BACKGROUND: The number of k-words shared between two sequences is a simple and efficient alignment-free sequence comparison method. This statistic, D(2), has been used for the clustering of EST sequences. Sequence comparison based on D(2) is extremely fast: its run time is proportional to the size of the sequences under scrutiny, whereas alignment-based comparisons have a worst-case run time proportional to the square of the size. Recent studies have tackled the rigorous study of the statistical distribution of D(2), and asymptotic regimes have been derived. The distribution of approximate k-word matches has also been studied. RESULTS: We have computed the D(2) optimal word size for various sequence lengths, and for both perfect and approximate word matches. Kolmogorov-Smirnov tests show D(2) to have a compound Poisson distribution at the optimal word size for small sequence lengths (below 400 letters) and a normal distribution at the optimal word size for large sequence lengths (above 1600 letters). We find that the D(2) statistic outperforms BLAST in the comparison of artificially evolved sequences, and performs similarly to other methods based on exact word matches. These results obtained with randomly generated sequences are also valid for sequences derived from human genomic DNA. CONCLUSION: We have characterized the distribution of the D(2) statistic at optimal word sizes. We find that the best trade-off between computational efficiency and accuracy is obtained with exact word matches. Given that our numerical tests have not included sequence shuffling, transposition or splicing, the improvements over existing methods reported here underestimate those expected in real sequences. Because of the linear run time and the known normal asymptotic behavior, D(2)-based methods are most appropriate for large genomic sequences.
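As a small illustration of the exact-match statistic described above, the sketch below counts k-word matches between two sequences by summing, over every word w, the product of its occurrence counts in each sequence. The function names are hypothetical, and the approximate-match variant and the choice of optimal word size discussed in the abstract are not covered.

```python
# Exact k-word match count between two sequences (the D(2) statistic):
# D2 = sum over words w of count_a(w) * count_b(w).
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-word in seq (empty if k > len(seq))."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(a: str, b: str, k: int) -> int:
    """Number of (position in a, position in b) pairs with equal k-words."""
    if k <= 0:
        raise ValueError("k must be positive")
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    # Iterate over the smaller table; a missing word contributes 0.
    if len(counts_a) > len(counts_b):
        counts_a, counts_b = counts_b, counts_a
    return sum(c * counts_b[w] for w, c in counts_a.items())
```

Building each count table takes one pass over its sequence, which is consistent with the linear run time the abstract attributes to D(2)-based comparison.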
Optical map guided genome assembly
Background: The long reads produced by third-generation sequencing technologies have significantly boosted the results of genome assembly, but genome-wide assemblies based solely on read data still cannot be produced. Thus, for example, optical mapping data has been used to further improve genome assemblies, but it has mostly been applied in a post-processing stage after contig assembly. Results: We propose OpticalKermit, which directly integrates genome-wide optical maps into contig assembly. We show how genome-wide optical maps can be used to localize reads on the genome, and then we adapt the Kermit method, which originally incorporated genetic linkage maps into the miniasm assembler, to use this information in contig assembly. Our experimental results show that incorporating genome-wide optical maps into the contig assembly of miniasm increases NGA50 while the number of misassemblies decreases or stays the same. Furthermore, when compared to the Canu assembler, OpticalKermit produces an assembly with almost three times higher NGA50 and a lower number of misassemblies on real A. thaliana reads. Conclusions: OpticalKermit successfully incorporates optical mapping data directly into contig assembly of eukaryotic genomes. Our results show that this is a promising approach to improve the contiguity of genome assemblies. Peer reviewe