This paper introduces an efficient implementation of approaches to alignment-free comparative genome analysis and genome-based phylogeny relying on substring composition. Distances derived from substring statistics have been proposed recently as a meaningful alternative to distances derived from sequence alignment. In particular, procaryote phylogenies based on comparative 5- and 6-mer analysis of whole proteomes have successfully been worked out. The present implementation extends the computation of composition-based distances so as to involve allk-mers for anyk up to any preset m aximum length K (including K = ∞). Remarkably, although there may be Θ(L2) distinct strings that occur in a given sequence of length L (and Θ(KL) of length k ≤ K), it is shown that composition-based distances as well as many other details of interest in comparative genome analysis can be computed in O(L) time and space (with a constant that is independent of the size of K, that is, the same constant works for all K).
A typical run with 2 sequences of altogether 1.5 million characters computes their composition-based distance in about 2 s, a performance to be contrasted with the several hours needed, even when restricting attention to substrings of length at most 6, by the direct method in use. This paper
• describes the details of this implementation—an implementation that allows the user to compute composition-based distances for a wide range of instances on data sets of unprecedented size which may be useful in assessing the validity of the approach and to fine-tune the identification of those values of k (or K) yielding the best separators and descriptors in correspondence with different inputs,
• indicates how the proposed algorithm can also be used for other tasks related to the identification and comparative analysis of highly over- or under-represented (sub)strings in given genomes, meta-genomes, or any other sequence families of interest (e.g., all proteins encoded by a given genome, all strings of non-coding or regulatory RNA, all introns, etc.),
• and thus conforms with the increasing need for radically new, fast, and massive techniques for comparative genome analysis