Metagenomics Binning Using Assembly Graphs

Abstract

Metagenomics is the study of genetic material obtained directly from communities of microorganisms living in natural environments. The field of metagenomics has provided valuable insights into the structure, diversity and ecology of microbial communities. Recent developments in high-throughput sequencing technologies have enabled metagenomics to analyse environmental samples without having to rely on culture-based methods. Once an environmental sample is sequenced, a process called metagenomics binning is used to cluster the sequences into bins that represent different taxonomic groups such as species, genera or higher levels. Various approaches have been proposed to bin metagenomic sequences. One approach is to bin raw sequencing reads prior to assembly. However, reads are considered too short to produce accurate and reliable binning results for downstream analysis. Hence, the standard approach in metagenomics analysis is to assemble short reads into longer sequences called contigs and then bin the resulting contigs. Existing metagenomic contig-binning methods rely on the composition and abundance information of the contigs, and face challenges when binning short contigs and contigs with similar composition and abundance. Contigs are derived from the underlying assembly graph, which contains valuable connectivity information among contigs. However, existing metagenomic contig-binning methods do not consider the assembly graph in the binning process. Firstly, this thesis describes a bin refinement tool named GraphBin that improves existing metagenomic binning results using assembly graphs. GraphBin makes use of the assembly graph and a label propagation method to refine the binning results of existing contig-binning tools by correcting mis-binned contigs and recovering short contigs that are discarded.
Secondly, this thesis explains how to detect sequences shared among multiple species from assembly graphs and introduces a tool named GraphBin2, which can perform overlapped binning. GraphBin2 makes use of the assembly graph and the coverage information of contigs, which enables the detection of contigs that may belong to multiple species. Thirdly, this thesis introduces a stand-alone approach named MetaCoAG to bridge metagenomics binning and assembly by incorporating composition, coverage and assembly graphs. MetaCoAG uses single-copy marker genes to estimate the number of initial bins, assigns contigs to bins iteratively and adjusts the number of bins dynamically throughout the binning process. In summary, this thesis discusses the challenges in binning metagenomic contigs and the shortcomings of existing metagenomic contig-binning tools, and presents how the assembly graph can be incorporated to improve metagenomics binning.
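
To make the idea of refining bins through assembly-graph connectivity more concrete, the following is a minimal sketch of label propagation over a contig graph. It is not the GraphBin implementation: the function name `propagate_labels`, the edge-list input, the synchronous majority-vote rule and the tie handling are all assumptions chosen for illustration, and the contig and bin names are made up.

```python
from collections import Counter


def propagate_labels(edges, initial_bins, max_rounds=10):
    """Give unlabelled contigs the majority bin of their labelled neighbours.

    edges        -- iterable of (contig, contig) pairs from an assembly graph
    initial_bins -- dict mapping some contigs to a bin label
    Both inputs are hypothetical simplifications of real binning data.
    """
    # Build an undirected adjacency list from the contig-to-contig edges.
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    bins = dict(initial_bins)
    for _ in range(max_rounds):
        updates = {}
        for contig in sorted(neighbours):
            if contig in bins:
                continue  # already labelled; this sketch never relabels
            votes = Counter(bins[n] for n in neighbours[contig] if n in bins)
            if not votes:
                continue  # no labelled neighbour yet
            ranked = votes.most_common(2)
            # Leave the contig unlabelled when the top two bins are tied.
            if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
                continue
            updates[contig] = ranked[0][0]
        if not updates:
            break  # nothing changed, so further rounds cannot help
        bins.update(updates)
    return bins


# Toy graph: c2 and c3 hang off c1 (bin A), c5 hangs off c4 (bin B),
# and c6 touches one contig from each bin.
edges = [("c1", "c2"), ("c2", "c3"), ("c4", "c5"), ("c1", "c6"), ("c4", "c6")]
refined = propagate_labels(edges, {"c1": "A", "c4": "B"})
print(refined)
```

In this toy example, short contigs that an initial binner left unlabelled inherit the bin of the contigs they are connected to, while a contig with evenly split evidence stays unassigned. Ambiguous contigs of that kind are the ones for which an overlapped-binning approach would consider more than one bin.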
