
Sequence alignment, mutual information, and dissimilarity measures for constructing phylogenies

By Orion Penner, Peter Grassberger and Maya Paczuski


Background: Existing sequence alignment algorithms use heuristic scoring schemes based on biological expertise, which cannot be used as objective distance metrics. As a result, one relies on crude measures, such as the p- or log-det distances, or makes explicit, and often too simplistic, a priori assumptions about sequence evolution. Information theory provides an alternative, in the form of mutual information (MI). MI is, in principle, an objective and model-independent similarity measure, but it is not widely used in this context, and no algorithm is known for extracting MI from a given alignment without assuming an evolutionary model. MI can be estimated without alignments, by concatenating and zipping sequences, but so far this has only produced estimates with uncontrolled errors, even though the normalized compression distance based on it has shown promising results.

Results: We describe a simple approach to get robust estimates of MI from global pairwise alignments. Our main result uses algorithmic (Kolmogorov) information theory, but we show that similar results can also be obtained from Shannon theory. For animal mitochondrial DNA, our approach uses the alignments made by popular global alignment algorithms to produce MI estimates that are strikingly close to estimates obtained from the alignment-free methods mentioned above. We point out that, because it is not additive, the normalized compression distance is not an optimal metric for phylogenetics, but we propose a simple modification that overcomes the issue of additivity. We test several versions of our MI-based distance measures on a large number of randomly chosen quartets and demonstrate that they all perform better than traditional measures like the Kimura or log-det (resp. paralinear) distances.

Conclusions: Several versions of MI-based distances outperform conventional distances in distance-based phylogeny. Even a simplified version based on single-letter Shannon entropies, which can be easily incorporated into existing software packages, gave superior results throughout the entire animal kingdom. But we see the main virtue of our approach in a more general way. For example, it can also help to judge the relative merits of different alignment algorithms, by estimating the significance of specific alignments. It strongly suggests that information theory concepts can be exploited further in sequence analysis.
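
To make the two kinds of estimate mentioned in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: zlib stands in for an ideal compressor, the function names (`compressed_size`, `ncd`, `compression_mi`, `alignment_mi`) are hypothetical, and skipping gapped columns is a simplification chosen here rather than something the abstract specifies. The first part estimates MI by "concatenating and zipping" and computes the standard normalized compression distance; the second computes a single-letter Shannon MI from the columns of a pairwise alignment.

```python
import math
import zlib
from collections import Counter


def compressed_size(data: bytes) -> int:
    """Bytes after zlib compression; a crude proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))


def compression_mi(x: bytes, y: bytes) -> int:
    """Alignment-free MI estimate in bytes: C(x) + C(y) - C(xy)."""
    return compressed_size(x) + compressed_size(y) - compressed_size(x + y)


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def entropy_bits(counts: Counter) -> float:
    """Shannon entropy in bits of the empirical distribution given by counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def alignment_mi(a: str, b: str, gap: str = "-") -> float:
    """Single-letter Shannon MI in bits per column: H(A) + H(B) - H(A, B).

    Assumption made for this sketch: columns containing a gap are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(p, q) for p, q in zip(a, b) if p != gap and q != gap]
    if not pairs:
        return 0.0
    h_a = entropy_bits(Counter(p for p, _ in pairs))
    h_b = entropy_bits(Counter(q for _, q in pairs))
    h_ab = entropy_bits(Counter(pairs))
    return h_a + h_b - h_ab
```

As a usage example, `alignment_mi("ACGT" * 10, "ACGT" * 10)` gives 2 bits per column (identical sequences with uniform base composition), while `alignment_mi("AACC", "ACAC")` gives 0 because the two rows are statistically independent. A general-purpose compressor such as zlib is far from optimal on DNA, so the compression-based numbers should be read as rough estimates only.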

Topics: QA Mathematics, QC Physics, QH301 Biology
Publisher: Public Library of Science
Year: 2011
DOI identifier: 10.1371/journal.pone.0014373

