10 research outputs found

    Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail

    Background: The optimal score for ungapped local alignments of infinitely long random sequences is known to follow a Gumbel extreme value distribution. Less is known about the important case where gaps are allowed. For this case, the distribution is known only empirically in the high-probability region, which is biologically less relevant. Results: We provide a method to obtain numerically the biologically relevant rare-event tail of the distribution. The method, which was outlined in an earlier work, generates the sequences from a parametrized probability distribution that is biased with respect to the original biological one, in the framework of Metropolis Coupled Markov Chain Monte Carlo. Here, we first present the approach in detail and evaluate the convergence of the algorithm by considering a simple test case. In the earlier work, the method was applied to only a single example case; here we therefore consider a large set of parameters. We study the distributions for protein alignment with different substitution matrices (BLOSUM62 and PAM250) and affine gap costs with different parameter values. In the logarithmic phase (large gap costs) it was previously assumed that the Gumbel form still holds, hence the Gumbel distribution is usually used when evaluating p-values in databases. Here we show that for all cases, provided that the sequences are not too long (L > 400), a "modified" Gumbel distribution, i.e. a Gumbel distribution with an additional Gaussian factor, is suitable to describe the data. We also provide a "scaling analysis" of the parameters used in the modified Gumbel distribution. Furthermore, via a comparison with BLAST parameters, we show that significance estimations change considerably when using the true distributions as presented here. Finally, we also study the distribution of the sum statistics of the k best alignments. Conclusion: Our results show that the statistics of gapped and ungapped local alignments deviate significantly from Gumbel in the rare-event tail. We provide a Gaussian correction to the distribution and an analysis of its scaling behavior for several different scoring parameter sets that are commonly used to search protein databases. The case of sum statistics of the k best alignments is included.
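The "modified" Gumbel distribution mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not state the functional form, so the code assumes the Gaussian factor is exp(-lam2 * (s - s0)^2) multiplying a standard Gumbel density; the function names and the parameter values (`lam`, `s0`, `lam2`) are illustrative assumptions, not fitted values from the paper.

```python
import math


def gumbel_log_density(s, lam, s0):
    """Log of the Gumbel density lam * exp(-z - exp(-z)), with z = lam * (s - s0)."""
    z = lam * (s - s0)
    return math.log(lam) - z - math.exp(-z)


def modified_gumbel_log_density(s, lam, s0, lam2):
    """Gumbel log-density plus an assumed Gaussian correction term.

    The correction -lam2 * (s - s0)**2 is an assumption about the form of the
    "additional Gaussian factor"; the result is left unnormalized.
    """
    return gumbel_log_density(s, lam, s0) - lam2 * (s - s0) ** 2


# Illustrative (hypothetical) parameters: the correction is invisible near the
# peak at s0 but suppresses the far tail relative to the pure Gumbel form.
lam, s0, lam2 = 0.27, 30.0, 1e-3
for s in (30.0, 50.0, 80.0):
    print(s, gumbel_log_density(s, lam, s0), modified_gumbel_log_density(s, lam, s0, lam2))
```

Under this assumed form, the two curves coincide near the peak, and the gap between them grows quadratically with the distance from `s0`, which is why such a deviation would matter mainly for very small p-values.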

    Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling

    Wolfsheimer S, Herms I, Rahmann S, Hartmann AK. Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling. BMC Bioinformatics. 2011;12(1): 47. Background: Molecular database search tools need statistical models to assess the significance of the resulting hits. In the classical approach one asks how probable it is that a certain score is observed by pure chance. Asymptotic theories for such questions are available for two random i.i.d. sequences. Some effort has been made to include effects of finite sequence lengths and to account for specific compositions of the sequences. In many applications, such as a large-scale database homology search for transmembrane proteins, these models are not the most appropriate ones. Search sensitivity and specificity benefit from position-dependent scoring schemes or the use of Hidden Markov Models. Additionally, one may wish to go beyond the assumption that the sequences are i.i.d. Despite their practical importance, the statistical properties of these settings have not yet been well investigated. Results: In this paper, we discuss an efficient and general method to compute the score distribution to any desired accuracy. The general approach may be applied to different sequence models and various similarity measures that satisfy a few weak assumptions. We have access to the low-probability region ("tail") of the distribution, where scores are larger than expected by pure chance and therefore relevant for practical applications. Our method uses recent ideas from rare-event simulations, combining Markov chain Monte Carlo simulations with importance sampling and generalized ensembles. We present results for the score statistics of fixed and random queries against random sequences. In a second step, we extend the approach to a model of transmembrane proteins, which can hardly be described as i.i.d. sequences. For this case, we compare the statistical properties of a fixed query model as well as a hidden Markov sequence model in connection with a position-based scoring scheme against the classical approach. Conclusions: The results illustrate that the sensitivity and specificity strongly depend on the underlying scoring and sequence model. A specific ROC analysis for the case of transmembrane proteins supports our observation.
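The importance-sampling idea named in this abstract can be shown on a toy problem. The sketch below is not the authors' method: it replaces alignment scores with a standard normal variable and replaces the Markov chain Monte Carlo sampler with direct draws from a shifted proposal, purely to show how samples from a biased distribution are reweighted to estimate a rare-event tail probability. All names and parameter values are illustrative assumptions.

```python
import math
import random


def tail_prob_importance_sampling(threshold, n_samples=100_000, seed=0):
    """Estimate P(X > threshold) for X ~ N(0, 1) using a biased proposal.

    Samples are drawn from N(threshold, 1), so the rare region is visited
    often; each hit is reweighted by the density ratio p(x) / q(x).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)  # draw from the biased distribution q
        if x > threshold:
            # p(x) / q(x) = exp(-x^2 / 2) / exp(-(x - t)^2 / 2) = exp(-t * x + t^2 / 2)
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n_samples


def exact_tail(threshold):
    """Exact Gaussian tail probability, used here only as a reference."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


estimate = tail_prob_importance_sampling(5.0)
print(estimate, exact_tail(5.0))
```

Direct sampling from N(0, 1) would almost never produce a value above 5 in 100,000 draws, whereas roughly half of the biased draws land in the tail. The same reweighting principle is what makes the low-probability region of a score distribution accessible.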

    Probability distribution P(s) for ungapped sequence alignment using BLOSUM62-matrices

    Copyright information: Taken from "Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail". http://www.almob.org/content/2/1/9. Algorithms for molecular biology : AMB 2007;2():9-9. Published online 11 Jul 2007. PMCID: PMC1945026. Deviations from the Gumbel distribution can only be observed for short sequences (< 250). The inset shows the same data with a linear ordinate.

    Relative error of the probability estimation using gapped sequence alignment and BLOSUM62 matrices

    Copyright information: Taken from "Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail". http://www.almob.org/content/2/1/9. Algorithms for molecular biology : AMB 2007;2():9-9. Published online 11 Jul 2007. PMCID: PMC1945026.

    Conotoxin protein classification using free scores of words and support vector machines

    Background: Conotoxin has been proven to be effective in drug design and could be used to treat various disorders such as schizophrenia, neuromuscular disorders and chronic pain. With the rapidly growing interest in conotoxin, accurate conotoxin superfamily classification tools are desirable to systematize the increasing number of newly discovered sequences and structures. However, despite the significance of conotoxin and the extensive experimental investigations of it, such tools have not been intensively explored. Results: In this paper, we propose to consider suboptimal alignments of words with restricted length. We developed a scoring system based on local alignment partition functions, called free score. The scoring system plays the key role in the feature extraction step of support vector machine classification. In the classification of conotoxin proteins, our method, SVM-Freescore, improves sensitivity and specificity by approximately 5.864% and 3.76%, respectively, over previously reported methods. To test generalization, SVM-Freescore was also applied to classify superfamilies from a curated, high-quality database such as ConoServer. The average computed sensitivity and specificity for the superfamily classification were found to be 0.9742 and 0.9917, respectively. Conclusions: The SVM-Freescore method is shown to be a useful sequence-based analysis tool for functional and structural characterization of conotoxin proteins. The datasets and the software are available at http://faculty.uaeu.ac.ae/nzaki/SVM-Freescore.htm.
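For readers unfamiliar with the two metrics reported in this abstract, the following minimal sketch shows the standard definitions of sensitivity and specificity for a single class. The confusion-matrix counts are hypothetical and are not taken from the paper.

```python
def sensitivity_specificity(tp, fp, tn, fn):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = TP / (TP + FN): fraction of true members that are recovered.
    specificity = TN / (TN + FP): fraction of non-members correctly rejected.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts for one superfamily treated as the positive class.
sens, spec = sensitivity_specificity(tp=45, fp=2, tn=98, fn=5)
print(sens, spec)
```

For a multi-class task such as superfamily classification, averages like those quoted in the abstract would typically be obtained by computing these values per class and then averaging; whether the authors averaged in exactly this way is not stated here.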