Unsupervised discovery of microbial population structure within metagenomes using nucleotide base composition

Abstract

© 2011 Dr. Isaam Saeed

Tapping into the remarkable power of the uncultured majority of microbial organisms is the driving force of metagenomics. Metagenomics is the study of a microbial community's genetic content when sampled directly from the environment. Given that microbial genomes within an environmental sample are fragmented prior to sequencing, the association of a genomic DNA fragment with its original genome is not known. As a result, the underlying population structure of the sampled microbial community is also unknown. While it is still possible to analyse the overall function of a microbial community, the functional roles of individual populations and the interactions between them cannot be examined. One approach to inferring the underlying population structure of a metagenome is to group sequenced DNA fragments using common patterns in nucleotide base composition that are representative of a particular population (or a group of related populations). The primary challenges for any such method, however, are the taxonomic resolution and accuracy at which sequences are grouped. These depend on both the representation of patterns in DNA sequences and the method of grouping similar patterns. In this study, the oligonucleotide frequency derived error gradient (OFDEG), a novel representation of metagenomic sequences, is first proposed. In addition to grouping related metagenomic sequences, the OFDEG measure is also used to examine how patterns in base composition vary within a microbial genome. A model-based clustering framework is then developed to deal with the ambiguity and noise that affect the cluster distribution of patterns extracted from real-world metagenomic data. The concept of patterns in base composition is then extended to short metagenomic sequences (less than 1000 base pairs in length), with the proposal of two novel representations based on dinucleotide frequency.
The methods developed in this study are evaluated on simulated benchmark data sets and are shown to perform with greater accuracy and resolution than currently available methods. Further validation against publicly available metagenomes produced results that were in accordance with reported analyses of sample diversity. Finally, the proposed methods are applied to four pyrosequenced metagenomic libraries of samples taken from a mud volcano in southwestern Taiwan. The inferred population structure and function were found to be consistent with complementary marker gene analysis as well as the local geochemistry of the sampling site.
