2,263 research outputs found

    Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences

    BACKGROUND: The number of k-words shared between two sequences is a simple and efficient alignment-free sequence comparison method. This statistic, D(2), has been used for the clustering of EST sequences. Sequence comparison based on D(2) is extremely fast: its runtime is proportional to the size of the sequences under scrutiny, whereas alignment-based comparisons have a worst-case run time proportional to the square of the size. Recent studies have tackled the rigorous study of the statistical distribution of D(2), and asymptotic regimes have been derived. The distribution of approximate k-word matches has also been studied. RESULTS: We have computed the D(2) optimal word size for various sequence lengths, and for both perfect and approximate word matches. Kolmogorov-Smirnov tests show D(2) to have a compound Poisson distribution at the optimal word size for small sequence lengths (below 400 letters) and a normal distribution at the optimal word size for large sequence lengths (above 1600 letters). We find that the D(2) statistic outperforms BLAST in the comparison of artificially evolved sequences, and performs similarly to other methods based on exact word matches. These results obtained with randomly generated sequences are also valid for sequences derived from human genomic DNA. CONCLUSION: We have characterized the distribution of the D(2) statistic at optimal word sizes. We find that the best trade-off between computational efficiency and accuracy is obtained with exact word matches. Given that our numerical tests have not included sequence shuffling, transposition or splicing, the improvements over existing methods reported here underestimate those expected in real sequences. Because of the linear run time and of the known normal asymptotic behavior, D(2)-based methods are most appropriate for large genomic sequences.
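
    The exact-match statistic described in this abstract can be illustrated with a minimal Python sketch. The function names `kmer_counts` and `d2` are hypothetical and chosen here for illustration; this is not the authors' implementation. It counts every overlapping k-word in each sequence and sums, over shared words, the product of the two counts, which takes time roughly proportional to the combined sequence length for a fixed k.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a: str, seq_b: str, k: int) -> int:
    """Number of exact k-word matches between two sequences.

    Every occurrence of a word in seq_a is paired with every
    occurrence of the same word in seq_b.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    return sum(n * counts_b[w] for w, n in counts_a.items() if w in counts_b)


if __name__ == "__main__":
    # Shared 2-words: AC (2 x 1), CG (2 x 1), GT (2 x 1)
    print(d2("ACGTACGT", "ACGTTT", 2))  # prints 6
```

    Hashing the words of each sequence once is what avoids the quadratic position-by-position comparison that an alignment would need.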

    Weighted k-word matches: a sequence comparison tool for proteins

    The use of k-word matches was developed as a fast alignment-free comparison method for DNA sequences in cases where long-range contiguity has been compromised, for example, by shuffling, duplication, deletion or inversion of extended blocks of sequence. Here we extend the algorithm to amino acid sequences. We define a new statistic, the weighted word match, which reflects the varying degrees of similarity between pairs of amino acids. We computed the mean and variance, and simulated the distribution function for various forms of this statistic for sequences of identically and independently distributed letters. We present these results and a method for choosing an optimal word size. The efficiency of the method is tested by using simulated evolutionary sequences, and the results are compared with BLAST.
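
    One way to picture a weighted word match is sketched below in Python. The abstract does not give the authors' formula, so everything here is an assumption made for illustration: the weight of a word pair is taken to be the product of per-letter similarities, and `TOY_GROUPS`, `toy_similarity` and `weighted_word_matches` are hypothetical names with made-up scores rather than a real substitution matrix.

```python
import math

# Hypothetical similarity groups with illustrative scores only;
# a real analysis would use a substitution matrix instead.
TOY_GROUPS = [set("ILV"), set("DE"), set("KR")]


def toy_similarity(a: str, b: str) -> float:
    """Return 1.0 for identical letters, 0.5 within a toy group, else 0.0."""
    if a == b:
        return 1.0
    if any(a in group and b in group for group in TOY_GROUPS):
        return 0.5
    return 0.0


def weighted_word_matches(seq_a: str, seq_b: str, k: int, sim=toy_similarity) -> float:
    """Sum, over all pairs of word positions, the product of letter similarities.

    Assumed form for illustration: with a 0/1 identity score this
    reduces to the count of exact k-word matches.
    """
    total = 0.0
    for i in range(len(seq_a) - k + 1):
        for j in range(len(seq_b) - k + 1):
            total += math.prod(sim(seq_a[i + t], seq_b[j + t]) for t in range(k))
    return total


if __name__ == "__main__":
    print(weighted_word_matches("IL", "IV", 2))  # prints 0.5
```

    Unlike the exact-match count, this direct double loop compares every pair of positions, so its cost grows with the product of the two sequence lengths.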

    The distribution of word matches between Markovian sequences with periodic boundary conditions

    Word match counts have traditionally been proposed as an alignment-free measure of similarity for biological sequences. The D2 statistic, which simply counts the number of exact word matches between two sequences, is a useful test bed for developing rigorous mathematical results, which can then be extended to more biologically useful measures. The distributional properties of the D2 statistic under the null hypothesis of identically and independently distributed letters have been studied extensively, but no comprehensive study of the D2 distribution for biologically more realistic higher-order Markovian sequences exists. Here we derive exact formulas for the mean and variance of the D2 statistic for Markovian sequences of any order, and demonstrate through Monte Carlo simulations that the entire distribution is accurately characterized by a Pólya-Aeppli distribution for sequence lengths of biological interest. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, and this enables exact analytic formulas for the mean and variance to be derived. We also carry out a preliminary comparison between the approximate D2 distribution computed with the theoretical mean and variance under a Markovian hypothesis and an empirical D2 distribution from the human genome.
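
    As a point of reference for the Markovian results, the mean of D2 in the simpler i.i.d. case follows from linearity of expectation over all pairs of word positions. The notation is assumed here rather than taken from the abstract: n and m are the sequence lengths, k is the word length, and p_a is the probability of letter a, shared by two independent sequences. The second line applies the same argument when periodic boundary conditions give every position a full word.

\[
\mathbb{E}[D_2] = (n-k+1)(m-k+1)\Bigl(\sum_{a} p_a^{2}\Bigr)^{k}
\qquad\text{(linear sequences)},
\]
\[
\mathbb{E}[D_2] = n\,m\,\Bigl(\sum_{a} p_a^{2}\Bigr)^{k}
\qquad\text{(periodic boundary conditions)}.
\]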

    Empirical distribution of k-word matches in biological sequences

    This study focuses on an alignment-free sequence comparison method: the number of words of length k shared between two sequences, also known as the D_2 statistic. The advantages of the use of this statistic over alignment-based methods are firstly that it does not assume that homologous segments are contiguous, and secondly that the algorithm is computationally extremely fast, the runtime being proportional to the size of the sequence under scrutiny. Existing applications of the D_2 statistic include the clustering of related sequences in large EST databases such as the STACK database. Such applications have typically relied on heuristics without any statistical basis. Rigorous statistical characterisations of the distribution of D_2 have subsequently been undertaken, but have focussed on the distribution's asymptotic behaviour, leaving the distribution of D_2 uncharacterised for most practical cases. The work presented here bridges these two worlds to give usable approximations of the distribution of D_2 for ranges of parameters most frequently encountered in the study of biological sequences. Comment: 23 pages, 10 figures

    Alignment-free sequence comparison for biologically realistic sequences of moderate length

    The D2 statistic, defined as the number of matches of words of some pre-specified length k, is a computationally fast alignment-free measure of biological sequence similarity. However, there is some debate about its suitability for this purpose, as the variability in D2 may be dominated by the terms that reflect the noise in each of the single sequences only. We examine the extent of the problem and the effectiveness of overcoming it by using two mean-centred variants of this statistic, D2* and D2c. We conclude that all three statistics are potentially useful measures of sequence similarity, for which reasonably accurate p-values can be estimated under a null hypothesis of sequences composed of identically and independently distributed letters. We show that D2 and D2c, and to a somewhat lesser extent D2*, perform well in tests to classify moderate length query sequences as putative cis-regulatory modules. This work was funded in part by ARC discovery grant DP098729

    Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts

    Motivation: The identity of cells and tissues is to a large degree governed by transcriptional regulation. A major part is accomplished by the combinatorial binding of transcription factors at regulatory sequences, such as enhancers. Even though binding of transcription factors is sequence-specific, estimating the sequence similarity of two functionally similar enhancers is very difficult. However, a similarity measure for regulatory sequences is crucial to detect and understand functional similarities between two enhancers and will facilitate large-scale analyses like clustering, prediction and classification of genome-wide datasets

    New algorithms and methods for protein and DNA sequence comparison
