Search CORE

1 research outputs found

Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences

Author: A Barbour
A Christoffels
CJ Burden
Conrad J Burden
J Burke
JE Carpenter
L Florea
M Kimura
Miriam R Kantorovitz
MR Kantorovitz
MS Waterman
OM Melko
RA Lippert
S Vinga
SF Altschul
Sylvain Forêt
TJ Wu
W Hide
WJ Conover
WJ Kent
WR Pearson
Z Zhang
Publication venue: BioMed Central
Publication date: 01/01/2006
Field of study

BACKGROUND: The number of k-words shared between two sequences is a simple and effcient alignment-free sequence comparison method. This statistic, D(2), has been used for the clustering of EST sequences. Sequence comparison based on D(2 )is extremely fast, its runtime is proportional to the size of the sequences under scrutiny, whereas alignment-based comparisons have a worst-case run time proportional to the square of the size. Recent studies have tackled the rigorous study of the statistical distribution of D(2), and asymptotic regimes have been derived. The distribution of approximate k-word matches has also been studied. RESULTS: We have computed the D(2 )optimal word size for various sequence lengths, and for both perfect and approximate word matches. Kolmogorov-Smirnov tests show D(2 )to have a compound Poisson distribution at the optimal word size for small sequence lengths (below 400 letters) and a normal distribution at the optimal word size for large sequence lengths (above 1600 letters). We find that the D(2 )statistic outperforms BLAST in the comparison of artificially evolved sequences, and performs similarly to other methods based on exact word matches. These results obtained with randomly generated sequences are also valid for sequences derived from human genomic DNA. CONCLUSION: We have characterized the distribution of the D(2 )statistic at optimal word sizes. We find that the best trade-off between computational efficiency and accuracy is obtained with exact word matches. Given that our numerical tests have not included sequence shuffling, transposition or splicing, the improvements over existing methods reported here underestimate that expected in real sequences. Because of the linear run time and of the known normal asymptotic behavior, D(2)-based methods are most appropriate for large genomic sequences

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The Australian National University