Scaling Laws and Similarity Detection in Sequence Alignment with Gaps
We study the problem of similarity detection by sequence alignment with gaps,
using a recently established theoretical framework based on the morphology of
alignment paths. Alignments of sequences without mutual correlations are found
to have scale-invariant statistics. This is the basis for a scaling theory of
alignments of correlated sequences. Using a simple Markov model of evolution,
we generate sequences with well-defined mutual correlations and quantify the
fidelity of an alignment in an unambiguous way. The scaling theory predicts the
dependence of the fidelity on the alignment parameters and on the statistical
evolution parameters characterizing the sequence correlations. Specific
criteria for the optimal choice of alignment parameters emerge from this
theory. The results are verified by extensive numerical simulations.
Comment: 25 pages, 11 figures
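As a rough illustration of the setup this abstract describes, the sketch below generates a pair of correlated sequences with a simple site-by-site mutation process and scores their global alignment with a linear gap cost. It is a minimal sketch under assumptions, not the authors' model: the function names `evolve` and `global_score`, the parameters `sub_rate` and `indel_rate`, and the scoring values are all hypothetical, and the fidelity measure from the paper is not reproduced.

```python
import random

def evolve(seq, sub_rate, indel_rate, alphabet="ACGT", rng=None):
    """Copy seq site by site with random substitutions, deletions and insertions.

    Hypothetical toy process: each site is deleted with probability
    indel_rate/2, preceded by an inserted letter with probability
    indel_rate/2, and otherwise copied or substituted.
    """
    rng = rng or random.Random()
    out = []
    for ch in seq:
        r = rng.random()
        if r < indel_rate / 2:
            continue                          # deletion: drop this site
        if r < indel_rate:
            out.append(rng.choice(alphabet))  # insertion before this site
        if rng.random() < sub_rate:
            out.append(rng.choice(alphabet))  # substitution (may redraw the same letter)
        else:
            out.append(ch)                    # faithful copy
    return "".join(out)

def global_score(a, b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Needleman-Wunsch global alignment score with a linear gap cost."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                           prev[j] + gap,     # gap in b
                           cur[j - 1] + gap)) # gap in a
        prev = cur
    return prev[-1]
```

Scanning `gap` and `mismatch` over a grid for pairs produced by `evolve` would be one way to explore how the alignment score depends on the alignment parameters and on the mutation rates.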
Pair HMM based gap statistics for re-evaluation of indels in alignments with affine gap penalties: Extended Version
Although computational sequence alignment is a crucial step in the vast
majority of comparative genomics studies, our understanding of alignment biases
still needs to be improved. To infer true structural or homologous regions,
computational alignments need further evaluation. It has been shown that the
accuracy of aligned positions can drop substantially, in particular around gaps.
Here we focus on re-evaluation of score-based alignments with affine gap
penalties. We exploit their relationship with pair hidden Markov models
and develop efficient algorithms to identify gaps that are
significant in terms of length and multiplicity. We evaluate our statistics
against the well-established structural alignments from SABmark and
find that indel reliability increases substantially with indel significance,
particularly in worst-case twilight-zone alignments. This indicates that our
statistics can reliably complement other methods, which mostly focus on the
reliability of match positions.
Comment: 17 pages, 7 figures
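For readers unfamiliar with affine gap penalties, the following sketch computes a global alignment score with three dynamic-programming states (match, gap in one sequence, gap in the other), the same state layout a pair hidden Markov model uses. It is an illustrative Gotoh-style recursion under assumed scoring values, not the gap statistics developed in the paper; the name `affine_score` and all parameter defaults are hypothetical.

```python
NEG = float("-inf")

def affine_score(a, b, match=1.0, mismatch=-1.0, gap_open=-3.0, gap_extend=-1.0):
    """Global alignment score where a gap of length k costs
    gap_open + (k - 1) * gap_extend."""
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]
    # X: a[i-1] aligned to a gap (gap in b)
    # Y: b[j-1] aligned to a gap (gap in a)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] + gap_open,    # open a new gap
                          X[i - 1][j] + gap_extend,  # extend the current gap
                          Y[i - 1][j] + gap_open)    # switch gap direction
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

Because the gap score is linear in the gap length, the corresponding pair-HMM gap states emit geometrically distributed gap lengths, which is what makes a statistical treatment of gap length possible in this setting.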
Protein sectors: statistical coupling analysis versus conservation
Statistical coupling analysis (SCA) is a method for analyzing multiple
sequence alignments that was used to identify groups of coevolving residues
termed "sectors". The method applies spectral analysis to a matrix obtained by
combining correlation information with sequence conservation. It has been
asserted that the protein sectors identified by SCA are functionally
significant, with different sectors controlling different biochemical
properties of the protein. Here we reconsider the available experimental data
and note that it involves almost exclusively proteins with a single sector. We
show that in this case sequence conservation is the dominating factor in SCA,
and can alone be used to make statistically equivalent functional predictions.
Therefore, we suggest shifting the experimental focus to proteins for which SCA
identifies several sectors. Correlations in protein alignments, which have been
shown to be informative in a number of independent studies, would then be less
dominated by sequence conservation.
Comment: 36 pages, 17 figures
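To make the notion of per-position sequence conservation concrete, the sketch below scores each column of a multiple sequence alignment by one minus its normalized Shannon entropy. This is an assumed, simplified conservation measure chosen for illustration; it is not the conservation weighting that SCA itself uses, and the function name `column_conservation` and the gap convention (`-`) are hypothetical.

```python
import math
from collections import Counter

def column_conservation(alignment, alphabet_size=20):
    """Return one score per alignment column: 1.0 for a fully conserved
    column, lower values for more variable columns. Gaps ('-') are ignored;
    an all-gap column scores 0.0."""
    scores = []
    for col in zip(*alignment):               # iterate over columns
        residues = [c for c in col if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((k / total) * math.log(k / total) for k in counts.values())
        scores.append(1.0 - entropy / math.log(alphabet_size))
    return scores
```

For example, `column_conservation(["AC", "AD"])` gives 1.0 for the first column and a smaller value for the second, where two different residues appear.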
Similarity-Detection and Localization
The detection of similarities between long DNA and protein sequences is
studied using concepts of statistical physics. It is shown that mutual
similarities can be detected by sequence alignment methods only if their amount
exceeds a threshold value. The onset of detection is a continuous phase
transition which can be viewed as a localization-delocalization transition. The
"fidelity" of the alignment is the order parameter of that transition; it
leads to criteria for the selection of optimal alignment parameters.
Comment: 4 pages including 4 figures (308kb post-script file)