14 research outputs found

    The maximum of a random walk reflected at a general barrier

    We define the reflection of a random walk at a general barrier and derive, in case the increments are light tailed and have negative mean, a necessary and sufficient criterion for the global maximum of the reflected process to be finite a.s. If it is finite a.s., we show that the tail of the distribution of the global maximum decays exponentially fast and derive the precise rate of decay. Finally, we discuss an example from structural biology that motivated the interest in the reflection at a general barrier. Comment: Published at http://dx.doi.org/10.1214/105051605000000610 in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Summation test for gap penalties and strong law of the local alignment score

    A summation test is proposed to determine admissible types of gap penalties for logarithmic growth of the local alignment score. We also define a converging sequence of log moment generating functions that provide the constants associated with the large deviation rate and logarithmic strong law of the local alignment score and the asymptotic number of matches in the optimal local alignment. Comment: Published at http://dx.doi.org/10.1214/105051605000000061 in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Exact distribution for the local score of one i.i.d. random sequence

    Let $X_1, \ldots, X_n$ be a sequence of i.i.d. positive or negative integer-valued random variables and $H_n = \max_{1 \le i \le j \le n} (X_i + \cdots + X_j)$ the local score of the sequence. The exact distribution of $H_n$ is obtained using a simple Markov chain. This result is applied to the scoring of DNA and protein sequences in molecular biology.
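A minimal sketch of how a Markov chain can give the exact tail of the local score. It is not taken from the paper: it assumes the common convention that an empty segment scores 0 (so the local score is never negative), and the function names `local_score` and `prob_local_score_at_least` are hypothetical. The chain tracks the partial-sum walk reflected at 0 and absorbed once it reaches the threshold `a`.

```python
def local_score(xs):
    """Best sum over a contiguous segment of xs (an empty segment scores 0)."""
    best = w = 0
    for x in xs:
        w = max(0, w + x)      # walk reflected at 0 (Lindley recursion)
        best = max(best, w)
    return best


def prob_local_score_at_least(dist, n, a):
    """P(H_n >= a) for n i.i.d. integer scores with law dist = {score: prob}.

    Assumes a >= 1. States 0..a-1 hold the current value of the reflected
    walk; state a is absorbing and means "threshold already reached".
    """
    state = [0.0] * (a + 1)
    state[0] = 1.0                      # the walk starts at 0
    for _ in range(n):
        nxt = [0.0] * (a + 1)
        nxt[a] = state[a]               # absorbed mass stays absorbed
        for u in range(a):
            if state[u] == 0.0:
                continue
            for x, p in dist.items():
                v = min(a, max(0, u + x))   # reflect at 0, absorb at a
                nxt[v] += state[u] * p
        state = nxt
    return state[a]
```

Calling `prob_local_score_at_least({-1: 0.5, 1: 0.5}, n, a)` for a range of `a` gives the full tail of the local score for a fair plus/minus-one scoring scheme; the cost is of order `n * a * len(dist)`.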

    New approximate P-value of gapped local sequence alignments

    We propose a new method to approximate the significance of gapped local sequence alignments. We focus on short sequences, for which standard methods are known to be less accurate since they have been developed under asymptotics. Our approach combines an approximate distribution of the ungapped local score of two sequences and a special scoring scheme that allows the insertion of gaps. For a positive integer h, the scoring scheme is defined on h-tuples of the components of the sequences and corresponds to the gapped global score. The influence of h and the accuracy of the p-value are numerically studied.

    h-tuple Approach to Evaluate Statistical Significance of Biological Sequence Comparison with Gaps

    We propose an approximate distribution for the gapped local score of a two-sequence comparison. Our method combines an adapted scoring scheme that includes the gaps and an approximate distribution of the ungapped local score of two independent sequences of i.i.d. random variables. The new scoring scheme is defined on h-tuples of the sequences, using the gapped global score. The influence of h and the accuracy of the p-value are numerically studied and compared with the p-values obtained from BLAST. The numerical experiments emphasize that our approximate p-values outperform the BLAST ones, particularly for short sequences, both simulated and real.

    On expected score of cellwise alignments

    We consider certain suboptimal alignments of two independent i.i.d. random sequences from a finite alphabet A = {1, ..., K}, both sequences having length n. In particular, we focus on so-called cellwise alignments, where in the first step as many 1s as possible are aligned. These aligned 1s define cells, and the rest of the alignment is defined so that the already existing alignment of 1s remains unchanged. We show that as n grows, for any cellwise alignment, the average score of a cell tends to the expected score of a random cell, a.s. Moreover, we show that a large deviation inequality holds. The second part of the paper is devoted to calculating the expected score of a certain cellwise alignment referred to as priority letter alignment. In this alignment, inside every cell all 2s are aligned first. Then all 3s are aligned, but in such a way that the already existing alignment of 2s remains unchanged. Then we continue with 4s and so on. Although the alignment is easy to describe, for K bigger than 3 the exact formula for the expected score is not straightforward to find. We present a recursive formula for calculating the expected score.

    A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation

    Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty (“Forward” scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores (“Viterbi” scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments
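As a rough illustration of how the conjectured forms turn a bit score into a significance estimate, the sketch below evaluates a Gumbel tail and an exponential tail with λ = log 2. The location parameters `mu` and `tau`, the number of target sequences `n_targets`, and all function names are assumptions made for this example; the abstract does not say how they are fitted.

```python
import math

LAMBDA = math.log(2.0)  # conjectured slope for bit scores


def gumbel_pvalue(score, mu, lam=LAMBDA):
    """P(S > score) under a Gumbel law: 1 - exp(-exp(-lam * (score - mu)))."""
    # -expm1(-t) equals 1 - exp(-t) and stays accurate for tiny t.
    return -math.expm1(-math.exp(-lam * (score - mu)))


def exponential_tail_pvalue(score, tau, lam=LAMBDA):
    """P(S > score) for an exponential high-scoring tail starting at tau."""
    return min(1.0, math.exp(-lam * (score - tau)))


def evalue(pvalue, n_targets):
    """Expected number of chance hits among n_targets comparisons."""
    return pvalue * n_targets
```

With λ = log 2, each additional bit of score halves the exponential-tail p-value, which is why a fixed slope removes the need to simulate λ for every model.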

    Sequence-specific sequence comparison using pairwise statistical significance

    Sequence comparison is one of the most fundamental computational problems in bioinformatics, for which many approaches have been and are still being developed. In particular, pairwise sequence alignment forms the crux of both DNA and protein sequence comparison techniques, which in turn form the basis of many other applications in bioinformatics. Pairwise sequence alignment methods align two sequences using a substitution matrix consisting of pairwise scores for aligning different residues with each other (such as BLOSUM62), and give an alignment score for the given sequence pair. Biologists routinely use such pairwise alignment programs to identify similar, or more specifically, related sequences (those having a common ancestor). It is widely accepted that the relatedness of two sequences is better judged by the statistical significance of the alignment score than by the alignment score alone. This research addresses the problem of accurately estimating the statistical significance of pairwise alignment for the purpose of identifying related sequences, by making the sequence comparison process more sequence-specific. The major contributions of this research are as follows. First, sequence-specific strategies for pairwise sequence alignment are used in conjunction with sequence-specific strategies for statistical significance estimation, and accurate methods for pairwise statistical significance estimation using standard, sequence-specific, and position-specific substitution matrices are developed. Second, pairwise statistical significance is used to improve the performance of the most popular database search program, PSI-BLAST. Third, heuristics are designed and implemented to speed up pairwise statistical significance estimation by a factor of more than 200. The implementation of all the methods developed in this work is freely available online.
    With the all-pervasive application of sequence alignment methods in bioinformatics using the ever-increasing sequence data, this work is expected to offer useful contributions to the research community.
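For readers unfamiliar with the alignment score whose significance is being estimated, here is a small sketch of a local alignment score computed by the Smith-Waterman recursion. It is an illustration only, not the implementation described in the abstract: it uses a linear gap penalty and a toy match/mismatch scoring function in place of a substitution matrix such as BLOSUM62, and the names `local_alignment_score` and `match_mismatch` are hypothetical.

```python
def match_mismatch(x, y, match=1, mismatch=-1):
    """Toy substitution score (an assumption; a real matrix would go here)."""
    return match if x == y else mismatch


def local_alignment_score(a, b, score=match_mismatch, gap=-2):
    """Best local alignment score of sequences a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)     # previous row of the dynamic-programming table
    best = 0
    for ca in a:
        cur = [0]                 # column 0: nothing of b aligned yet
        for j, cb in enumerate(b, start=1):
            h = max(
                0,                            # start a fresh local alignment
                prev[j - 1] + score(ca, cb),  # align ca with cb
                prev[j] + gap,                # gap in b
                cur[j - 1] + gap,             # gap in a
            )
            cur.append(h)
            best = max(best, h)
        prev = cur
    return best
```

The raw score returned here is the quantity that a significance estimate (for example a p-value from a fitted extreme-value distribution) would then be attached to.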