Computational Molecular Coevolution
A major goal in computational biochemistry is to obtain three-dimensional structure information from protein sequence. Coevolution represents a biological mechanism through which structural information can be obtained from a family of protein sequences. Evolutionary relationships within a family of protein sequences are revealed through sequence alignment. Statistical analyses of these sequence alignments reveal positions in the protein family that covary, and thus appear to be dependent on one another throughout the evolution of the protein family. These covarying positions are inferred to be coevolving via one of two biological mechanisms, both of which imply that coevolution is facilitated by inter-residue contact. Thus, high-quality multiple sequence alignments and robust coevolution-inferring statistics can produce structural information from sequence alone. This work characterizes the relationship between coevolution statistics and sequence alignments and highlights the implicit assumptions and caveats associated with coevolutionary inference. An investigation of sequence alignment quality and coevolutionary-inference methods revealed that such methods are very sensitive to the systematic misalignments discovered in public databases. However, repairing the misalignments in such alignments restores the predictive power of coevolution statistics. To overcome the sensitivity to misalignments, two novel coevolution-inferring statistics were developed that show increased contact prediction accuracy, especially in alignments that contain misalignments. These new statistics were developed into a suite of coevolution tools, the MIpToolset. Because systematic misalignments produce a distinctive pattern when analyzed by coevolution-inferring statistics, a new method for detecting systematic misalignments was created to exploit this phenomenon.
This new method, called "local covariation", was used to analyze publicly available multiple sequence alignment databases. Local covariation detected putative misalignments in a database designed to benchmark sequence alignment software accuracy. Local covariation was incorporated into a new software tool, LoCo, which displays regions of potential misalignment during alignment editing and assists in their correction. This work represents advances in multiple sequence alignment creation and coevolutionary inference.
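As a rough illustration of what a covariation statistic over alignment columns can look like, the sketch below computes plain mutual information between two columns of a toy alignment. Mutual information is used here only as a commonly used baseline. The abstract does not specify the statistics implemented in the MIpToolset, and the function name, the gap handling, and the toy sequences are all assumptions made for this example.

```python
from collections import Counter
from math import log2


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `alignment` is a list of equal-length strings. Rows with a gap ('-') in
    either column are skipped; that is an assumption of this sketch, not a
    rule taken from the abstract.
    """
    pairs = [(seq[i], seq[j]) for seq in alignment
             if seq[i] != "-" and seq[j] != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                 # counts of residue pairs (a, b)
    left = Counter(a for a, _ in pairs)    # counts of residues in column i
    right = Counter(b for _, b in pairs)   # counts of residues in column j
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts to avoid extra divisions
        mi += p_ab * log2(p_ab * n * n / (left[a] * right[b]))
    return mi


# Hypothetical toy alignment: columns 0 and 1 always change together,
# column 2 is fully conserved, column 3 varies independently of column 0.
toy_alignment = ["AKLE", "AKLD", "GRLE", "GRLD"]
covarying = column_mutual_information(toy_alignment, 0, 1)
independent = column_mutual_information(toy_alignment, 0, 3)
```

In this toy case the perfectly covarying pair scores higher than the independent pair. A conserved column carries no mutual information with any other column, which is one reason raw mutual information is usually corrected or normalized before it is used for contact prediction.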
The art of sequence alignment
Sequence similarity: why are we interested in it, and how do you define it?
Scoring metrics: what is important in similarity
Scoring with DNA: integrating biological knowledge
Scoring with proteins: integrating biological knowledge
PAMs and BLOSUMs: the marriage of statistics and biology
Gaps: how strongly do they count?
Constant, affine and concave gap penalties
The dynamic programming trick
Global versus local
Heuristics to get some speed
Can we trust our alignment?
PSI-BLAST & Co: highlights and traps
Multiple Sequence Alignment: finding and scoring them
The third dimension
From alignments to trees
Beyond alignments
Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
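The dynamic programming and global-versus-local items in the outline above can be illustrated with a minimal scoring sketch. It assumes a constant per-residue gap penalty and simple match/mismatch scores, all chosen arbitrarily for illustration. The outline does not say which scoring scheme the course uses, and affine or concave gap penalties would need a more elaborate recurrence.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal pairwise alignment score by dynamic programming.

    With local=False this is the Needleman-Wunsch (global) recurrence.
    With local=True it is the Smith-Waterman (local) variant, where a cell
    may restart at 0 and the best cell anywhere in the table is returned.
    The scoring parameters are illustrative defaults, not recommended values.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are charged in the first row/column.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


global_score = alignment_score("GATTACA", "GATACA")
local_score = alignment_score("TTGATTACATT", "CCGATTACACC", local=True)
```

The table has one cell per pair of prefixes, so time and memory both grow with the product of the two sequence lengths. That cost is what motivates the heuristics mentioned later in the outline.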
Island method for estimating the statistical significance of profile-profile alignment scores
Background: In the last decade, a significant improvement in detecting remote similarity between protein sequences has been made by utilizing alignment profiles in place of amino-acid strings. Unfortunately, no analytical theory is available for estimating the significance of a gapped alignment of two profiles. Many experiments suggest that the distribution of local profile-profile alignment scores is of the Gumbel form. However, estimating distribution parameters by random simulations turns out to be computationally very expensive.
Results: We demonstrate that the background distribution of profile-profile alignment scores heavily depends on profiles' composition and thus the distribution parameters must be estimated independently, for each pair of profiles of interest. We also show that accurate estimates of statistical parameters can be obtained using the "island statistics" for profile-profile alignments.
Conclusion: The island statistics can be generalized to profile-profile alignments to provide an efficient method for the alignment score normalization. Since multiple island scores can be extracted from a single comparison of two profiles, the island method has a clear speed advantage over the direct shuffling method for comparable accuracy in parameter estimates.
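To make the role of the Gumbel parameters concrete, the sketch below fits a Gumbel distribution to a sample of background scores and converts an observed score into a P-value. This is not the island method described in the abstract. It is a simple method-of-moments fit of the kind one might apply to scores from shuffled profiles, and the function names and the example scores are hypothetical.

```python
from math import exp, expm1, pi, sqrt
from statistics import mean, stdev

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores):
    """Estimate Gumbel (location, scale) from background scores by moments.

    For a Gumbel distribution, variance = (pi * scale)**2 / 6 and
    mean = location + EULER_GAMMA * scale. Needs at least two scores.
    """
    scale = stdev(scores) * sqrt(6) / pi
    location = mean(scores) - EULER_GAMMA * scale
    return location, scale


def gumbel_p_value(score, location, scale):
    """P(S >= score) under a Gumbel distribution with the given parameters."""
    z = (score - location) / scale
    # 1 - exp(-exp(-z)), computed with expm1 to stay accurate for tiny P-values
    return -expm1(-exp(-z))


# Hypothetical background scores, e.g. from alignments of shuffled profiles.
background = [10.0, 12.0, 11.0, 15.0, 13.0, 14.0, 9.0, 16.0]
location, scale = fit_gumbel_moments(background)
p_observed = gumbel_p_value(25.0, location, scale)
```

Each background score in a fit like this costs one full alignment of shuffled profiles. That is the expense the abstract attributes to direct simulation, and it contrasts this with extracting multiple island scores from a single comparison.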
Pair HMM based gap statistics for re-evaluation of indels in alignments with affine gap penalties: Extended Version
Although computationally aligning sequences is a crucial step in the vast
majority of comparative genomics studies, our understanding of alignment biases
still needs to be improved. To infer true structural or homologous regions,
computational alignments need further evaluation. It has been shown that the
accuracy of aligned positions can drop substantially, in particular around gaps.
Here we focus on re-evaluation of score-based alignments with affine gap
penalty costs. We exploit their relationships with pair hidden Markov models
and develop efficient algorithms by which to identify gaps which are
significant in terms of length and multiplicity. We evaluate our statistics
with respect to the well-established structural alignments from SABmark and
find that indel reliability substantially increases with their significance, in
particular in worst-case twilight zone alignments. This indicates that our
statistics can reliably complement other methods which mostly focus on the
reliability of match positions.
Comment: 17 pages, 7 figures
Polarization alignments of radio quasars in JVAS/CLASS surveys
We test the hypothesis that the polarization vectors of flat-spectrum radio
sources (FSRS) in the JVAS/CLASS 8.4-GHz surveys are randomly oriented on the
sky. The sample with robust polarization measurements is made of objects
and redshift information is known for of them. We performed two
statistical analyses: one in two dimensions and the other in three dimensions
when distance is available. We find significant large-scale alignments of
polarization vectors for samples containing only quasars (QSO) among the
varieties of FSRS's. While these correlations prove difficult to explain either
by a physical effect or by biases in the dataset, the fact that the QSO's which
have significantly aligned polarization vectors are found in regions of the sky
where optical polarization alignments were previously found is striking.
Comment: 13 pages, 9 figures, submitted to MNRAS
CLEVER: Clique-Enumerating Variant Finder
Next-generation sequencing techniques have facilitated a large-scale analysis
of human genetic variation. Despite the advances in sequencing speeds, the
computational discovery of structural variants is not yet standard. It is
likely that many variants have remained undiscovered in most sequenced
individuals. Here we present a novel approach based on internal segment size,
which organizes all reads, including concordant ones, into a read alignment
graph where max-cliques represent maximal contradiction-free groups of
alignments. A specifically engineered algorithm then enumerates all max-cliques
and statistically evaluates them for their potential to reflect insertions or
deletions (indels). For the first time in the literature, we compare a large
range of state-of-the-art approaches using simulated Illumina reads from a
fully annotated genome and present various relevant performance statistics. We
achieve superior performance rates, in particular on indels of sizes 20-100,
which have been exposed as a current major challenge in the SV discovery
literature and where prior insert size based approaches have limitations. In
that size range, we outperform even split-read aligners. We also achieve good
results on real data, where a substantial number of correct predictions are
made by our tool alone, complementing the predictions of split-read
aligners. CLEVER is open source (GPL) and available from
http://clever-sv.googlecode.com.
Comment: 30 pages, 8 figures
Evolutionary Inference via the Poisson Indel Process
We address the problem of the joint statistical inference of phylogenetic
trees and multiple sequence alignments from unaligned molecular sequences. This
problem is generally formulated in terms of string-valued evolutionary
processes along the branches of a phylogenetic tree. The classical evolutionary
process, the TKF91 model, is a continuous-time Markov chain model comprised of
insertion, deletion and substitution events. Unfortunately, this model gives
rise to an intractable computational problem: the computation of the marginal
likelihood under the TKF91 model is exponential in the number of taxa. In this
work, we present a new stochastic process, the Poisson Indel Process (PIP), in
which the complexity of this computation is reduced to linear. The new model is
closely related to the TKF91 model, differing only in its treatment of
insertions, but the new model has a global characterization as a Poisson
process on the phylogeny. Standard results for Poisson processes allow key
computations to be decoupled, which yields the favorable computational profile
of inference under the PIP model. We present illustrative experiments in which
Bayesian inference under the PIP model is compared to separate inference of
phylogenies and alignments.
Comment: 33 pages, 6 figures