
2,263 research outputs found

Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences

Author: A Barbour
A Christoffels
CJ Burden
Conrad J Burden
J Burke
JE Carpenter
L Florea
M Kimura
Miriam R Kantorovitz
MR Kantorovitz
MS Waterman
OM Melko
RA Lippert
S Vinga
SF Altschul
Sylvain Forêt
TJ Wu
W Hide
WJ Conover
WJ Kent
WR Pearson
Z Zhang
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: The number of k-words shared between two sequences is a simple and efficient alignment-free sequence comparison method. This statistic, D(2), has been used for the clustering of EST sequences. Sequence comparison based on D(2) is extremely fast: its runtime is proportional to the size of the sequences under scrutiny, whereas alignment-based comparisons have a worst-case run time proportional to the square of the size. Recent studies have tackled the rigorous study of the statistical distribution of D(2), and asymptotic regimes have been derived. The distribution of approximate k-word matches has also been studied. RESULTS: We have computed the D(2) optimal word size for various sequence lengths, and for both perfect and approximate word matches. Kolmogorov-Smirnov tests show D(2) to have a compound Poisson distribution at the optimal word size for small sequence lengths (below 400 letters) and a normal distribution at the optimal word size for large sequence lengths (above 1600 letters). We find that the D(2) statistic outperforms BLAST in the comparison of artificially evolved sequences, and performs similarly to other methods based on exact word matches. These results obtained with randomly generated sequences are also valid for sequences derived from human genomic DNA. CONCLUSION: We have characterized the distribution of the D(2) statistic at optimal word sizes. We find that the best trade-off between computational efficiency and accuracy is obtained with exact word matches. Given that our numerical tests have not included sequence shuffling, transposition or splicing, the improvements over existing methods reported here underestimate that expected in real sequences. Because of the linear run time and of the known normal asymptotic behavior, D(2)-based methods are most appropriate for large genomic sequences.
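As an illustration of the statistic described in this abstract, the following is a minimal Python sketch of an exact k-word match count. It is not the authors' implementation: the function names `kmer_counts` and `d2` are hypothetical, and the sketch assumes that overlapping words are counted and that every pair of matching positions contributes one match. For a fixed word size k, the work grows linearly with the combined sequence length, since each sequence is scanned once.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of exact k-word matches between seq_a and seq_b.

    Assumption: a match is any pair of positions (i, j) at which the
    k-words starting there are identical, so D2 equals the sum over
    words w of count_a(w) * count_b(w).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Iterate over the smaller table; a missing word counts as zero.
    if len(counts_a) > len(counts_b):
        counts_a, counts_b = counts_b, counts_a
    return sum(count * counts_b[word] for word, count in counts_a.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "CGTA", 2))
```

Counting words per sequence and then multiplying the counts avoids comparing every pair of positions directly, which is what keeps the cost from growing with the product of the two sequence lengths.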

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The Australian National University

Weighted k-word matches: a sequence comparison tool for proteins

Author: Burden Conrad John
Jing J.
Wilson S. R.
Publication venue: Australian Mathematical Society
Publication date: 04/06/2011

The use of k-word matches was developed as a fast alignment-free comparison method for DNA sequences in cases where long range contiguity has been compromised, for example, by shuffling, duplication, deletion or inversion of extended blocks of sequence. Here we extend the algorithm to amino acid sequences. We define a new statistic, the weighted word match, which reflects the varying degrees of similarity between pairs of amino acids. We computed the mean and variance, and simulated the distribution function for various forms of this statistic for sequences of identically and independently distributed letters. We present these results and a method for choosing an optimal word size. The efficiency of the method is tested by using simulated evolutionary sequences, and the results compared with BLAST.

Australian Mathematical Society (AustMS): E-Journals

The distribution of word matches between Markovian sequences with periodic boundary conditions

Author: Burden Conrad J
Foret Sylvain
Leopardi Paul
Publication venue: Mary Ann Liebert Inc
Publication date: 01/01/2014

Word match counts have traditionally been proposed as an alignment-free measure of similarity for biological sequences. The D2 statistic, which simply counts the number of exact word matches between two sequences, is a useful test bed for developing rigorous mathematical results, which can then be extended to more biologically useful measures. The distributional properties of the D2 statistic under the null hypothesis of identically and independently distributed letters have been studied extensively, but no comprehensive study of the D2 distribution for biologically more realistic higher-order Markovian sequences exists. Here we derive exact formulas for the mean and variance of the D2 statistic for Markovian sequences of any order, and demonstrate through Monte Carlo simulations that the entire distribution is accurately characterized by a Pólya-Aeppli distribution for sequence lengths of biological interest. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, and this enables exact analytic formulas for the mean and variance to be derived. We also carry out a preliminary comparison between the approximate D2 distribution computed with the theoretical mean and variance under a Markovian hypothesis and an empirical D2 distribution from the human genome.

PubMed Central

The Australian National University

Empirical distribution of k-word matches in biological sequences

Author: Burden
Conover
Conrad J. Burden
Forêt
Gumbel
Kantorovitz
Lippert
Susan R. Wilson
Sylvain Forêt
Waterman
Publication venue: Elsevier BV
Publication date: 14/03/2008

This study focuses on an alignment-free sequence comparison method: the number of words of length k shared between two sequences, also known as the D_2 statistic. The advantages of the use of this statistic over alignment-based methods are firstly that it does not assume that homologous segments are contiguous, and secondly that the algorithm is computationally extremely fast, the runtime being proportional to the size of the sequence under scrutiny. Existing applications of the D_2 statistic include the clustering of related sequences in large EST databases such as the STACK database. Such applications have typically relied on heuristics without any statistical basis. Rigorous statistical characterisations of the distribution of D_2 have subsequently been undertaken, but have focussed on the distribution's asymptotic behaviour, leaving the distribution of D_2 uncharacterised for most practical cases. The work presented here bridges these two worlds to give usable approximations of the distribution of D_2 for ranges of parameters most frequently encountered in the study of biological sequences. Comment: 23 pages, 10 figures
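For orientation, the statistic named in this abstract can be written out explicitly. The notation below is an assumption introduced here for illustration and is not taken from the listing: A and B are the two sequences, n and m their lengths, k the word size, and p_a the probability of letter a. The expression for the mean holds only under the simplest null model, in which the letters of both sequences are independent and identically distributed with the same letter probabilities.

```latex
% D_2 counts every pair of positions (i, j) whose k-words agree.
D_2 \;=\; \sum_{i=1}^{n-k+1} \sum_{j=1}^{m-k+1}
      \mathbf{1}\!\left[ A_i A_{i+1} \cdots A_{i+k-1} = B_j B_{j+1} \cdots B_{j+k-1} \right]

% Under i.i.d. letters, two given k-words match with probability
% \left( \sum_a p_a^2 \right)^k, so by linearity of expectation:
\mathbb{E}[D_2] \;=\; (n-k+1)\,(m-k+1) \left( \sum_{a} p_a^{2} \right)^{k}
```

The mean follows from linearity of expectation alone; the variance and the full distribution are harder because overlapping words are not independent, which is the regime the abstract addresses.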

arXiv.org e-Print Archive

ResearchOnline@JCU

Crossref

ResearchOnline at James Cook University

The Australian National University

Alignment-free sequence comparison for biologically realistic sequences of moderate length

Author: Burden Conrad J
Jing Junmei
Wilson Susan R
Publication venue: Walter de Gruyter GmbH
Publication date: 08/12/2015

The D2 statistic, defined as the number of matches of words of some pre-specified length k, is a computationally fast alignment-free measure of biological sequence similarity. However, there is some debate about its suitability for this purpose as the variability in D2 may be dominated by the terms that reflect the noise in each of the single sequences only. We examine the extent of the problem and the effectiveness of overcoming it by using two mean-centred variants of this statistic, D2* and D2c. We conclude that all three statistics are potentially useful measures of sequence similarity, for which reasonably accurate p-values can be estimated under a null hypothesis of sequences composed of identically and independently distributed letters. We show that D2 and D2c, and to a somewhat lesser extent D2*, perform well in tests to classify moderate length query sequences as putative cis-regulatory modules. This work was funded in part by ARC discovery grant DP098729.

The Australian National University

Estimation of pairwise sequence similarity of mammalian enhancers with word neighbourhood counts

Author: Benson
Blaisdell
Blow
Burden
Carpenter
Dai
Doering
Forêt S.
Gardiner-Garden
Gordân R.
Goto
Hide
Jonathan Göke
Julia Lasserre
Kantorovitz
Kunarso
Lee
Lippert
Marcel H. Schulz
Martin Vingron
Needleman
Reinert
Robin
Small
Smith
Thomas-Chollier
van Helden
Vinga
Visel
Wilson
Wu
Zemojtel
Zinzen
Publication venue: Oxford University Press
Publication date: 01/01/2012

Motivation: The identity of cells and tissues is to a large degree governed by transcriptional regulation. A major part is accomplished by the combinatorial binding of transcription factors at regulatory sequences, such as enhancers. Even though binding of transcription factors is sequence-specific, estimating the sequence similarity of two functionally similar enhancers is very difficult. However, a similarity measure for regulatory sequences is crucial to detect and understand functional similarities between two enhancers and will facilitate large-scale analyses like clustering, prediction and classification of genome-wide datasets.

New algorithms and methods for protein and DNA sequence comparison

Author: Crook James
Publication venue: The University of Edinburgh
Publication date: 01/01/1991

Edinburgh Research Archive