78,159 research outputs found

    Scaling Laws and Similarity Detection in Sequence Alignment with Gaps

    Get PDF
    We study the problem of similarity detection by sequence alignment with gaps, using a recently established theoretical framework based on the morphology of alignment paths. Alignments of sequences without mutual correlations are found to have scale-invariant statistics. This is the basis for a scaling theory of alignments of correlated sequences. Using a simple Markov model of evolution, we generate sequences with well-defined mutual correlations and quantify the fidelity of an alignment in an unambiguous way. The scaling theory predicts the dependence of the fidelity on the alignment parameters and on the statistical evolution parameters characterizing the sequence correlations. Specific criteria for the optimal choice of alignment parameters emerge from this theory. The results are verified by extensive numerical simulations.Comment: 25 pages, 11 figure

    Pair HMM based gap statistics for re-evaluation of indels in alignments with affine gap penalties: Extended Version

    Full text link
    Although computationally aligning sequence is a crucial step in the vast majority of comparative genomics studies our understanding of alignment biases still needs to be improved. To infer true structural or homologous regions computational alignments need further evaluation. It has been shown that the accuracy of aligned positions can drop substantially in particular around gaps. Here we focus on re-evaluation of score-based alignments with affine gap penalty costs. We exploit their relationships with pair hidden Markov models and develop efficient algorithms by which to identify gaps which are significant in terms of length and multiplicity. We evaluate our statistics with respect to the well-established structural alignments from SABmark and find that indel reliability substantially increases with their significance in particular in worst-case twilight zone alignments. This points out that our statistics can reliably complement other methods which mostly focus on the reliability of match positions.Comment: 17 pages, 7 figure

    Protein sectors: statistical coupling analysis versus conservation

    Full text link
    Statistical coupling analysis (SCA) is a method for analyzing multiple sequence alignments that was used to identify groups of coevolving residues termed "sectors". The method applies spectral analysis to a matrix obtained by combining correlation information with sequence conservation. It has been asserted that the protein sectors identified by SCA are functionally significant, with different sectors controlling different biochemical properties of the protein. Here we reconsider the available experimental data and note that it involves almost exclusively proteins with a single sector. We show that in this case sequence conservation is the dominating factor in SCA, and can alone be used to make statistically equivalent functional predictions. Therefore, we suggest shifting the experimental focus to proteins for which SCA identifies several sectors. Correlations in protein alignments, which have been shown to be informative in a number of independent studies, would then be less dominated by sequence conservation.Comment: 36 pages, 17 figure

    Similarity-Detection and Localization

    Full text link
    The detection of similarities between long DNA and protein sequences is studied using concepts of statistical physics. It is shown that mutual similarities can be detected by sequence alignment methods only if their amount exceeds a threshold value. The onset of detection is a continuous phase transition which can be viewed as a localization-delocalization transition. The ``fidelity'' of the alignment is the order parameter of that transition; it leads to criteria for the selection of optimal alignment parameters.Comment: 4 pages including 4 figures (308kb post-script file
    • …
    corecore