6 research outputs found

    LRBinner: Binning Long Reads in Metagenomics Datasets

    Advancements in metagenomics sequencing allow the study of microbial communities directly from their environments. Metagenomics binning is a key step in the species characterisation of microbial communities. Next-generation sequencing reads are usually assembled into contigs for metagenomics binning, mainly due to the limited information within short reads. Third-generation sequencing provides much longer reads, with lengths similar to those of contigs assembled from short reads. However, existing contig-binning tools cannot be directly applied to long reads due to the absence of coverage information and the presence of high error rates. The few existing long-read binning tools either use only composition or use composition and coverage information separately. This may ignore bins that correspond to low-abundance species or erroneously split bins that correspond to species with non-uniform coverage. Here we present a reference-free binning approach, LRBinner, that combines composition and coverage information of complete long-read datasets. LRBinner also uses a distance-histogram-based clustering algorithm to extract clusters with varying sizes. The experimental results on both simulated and real datasets show that LRBinner achieves the best binning accuracy compared with the baselines. Moreover, we show that binning reads using LRBinner prior to assembly reduces the computational resources required for assembly while attaining satisfactory assembly quality.
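
    The composition signal mentioned in this abstract is commonly represented as a normalized k-mer frequency vector. The following is a minimal sketch of that general idea, not LRBinner's actual feature extraction; the function name `kmer_profile` and the default `k=3` are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=3):
    """Return the normalized k-mer frequency vector of a sequence.

    Entries follow the lexicographic order of all k-mers over A, C, G, T.
    k-mers containing other symbols (for example N) are ignored.
    Hypothetical helper for illustration only.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        # Sequence shorter than k, or no valid k-mers at all.
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


profile = kmer_profile("ACGTACGT", k=2)
```

    Reads from the same genome tend to have similar profiles, which is why such vectors can serve as one input to a binning method.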

    Models and Algorithms for Metagenomics Analysis and Plasmid Classification

    Metagenomics studies have provided key insights into the composition and structure of microbial communities found in different environments. Among the techniques used to analyze metagenomics data, binning is considered a crucial step to characterize the different species of microorganisms present. Metagenomics binning can be extended further towards the determination of plasmids and chromosomes to study environmental adaptations. Metagenomics binning is mostly performed on contigs from genome assemblies, and metagenomics studies are mostly performed with short-read sequencing. Direct binning of short reads suffers from insufficient species-specific signal, so short reads are usually assembled into longer contigs before binning. The emergence of long-read sequencing technologies gives us the opportunity to study the binning of long reads directly, although few such studies have been carried out. Firstly, this thesis presents the challenges in binning long reads compared to contigs assembled from short reads. One key challenge in binning long reads is the absence of coverage information, which is typically obtained from assembly. Moreover, the scale of long reads compared to contigs demands more computationally efficient methods for binning. Therefore, we develop MetaBCC-LR to address these challenges and perform metagenomics binning of long reads. We introduce the concept of a k-mer coverage histogram to estimate the coverage of long reads without alignments and use a sampling strategy to handle the immense number of long reads. Since MetaBCC-LR is limited by using coverage and composition information in a stepwise manner, we further develop LRBinner, which combines coverage and composition features and uses them simultaneously for binning. LRBinner also implements a novel clustering algorithm that performs better on binning long-read datasets from species with varying abundances. Moreover, we propose OBLR to improve the coverage estimation of long reads via a read-overlap graph instead of k-mers. The read-overlap graph also enables OBLR to perform probabilistic sampling to better recover low-abundance species. Secondly, we investigate opportunities to improve plasmid detection, which is treated as a binary plasmid-chromosome classification problem. We introduce PlasLR, which enables the adaptation of plasmid prediction tools designed for contigs to classify long and error-prone reads. We also develop GraphPlas, which uses the assembly graph to improve plasmid classification results for assembled contigs. In summary, this thesis presents the progressive development of models and algorithms for metagenomics binning and plasmid classification.
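
    The k-mer coverage histogram idea can be illustrated roughly as follows: count every k-mer across the whole dataset, then, for each read, histogram the dataset-wide counts of the k-mers it contains. This is a hedged sketch of the concept only; the function names, `bin_width`, and `n_bins` are assumptions and do not reflect the parameters used by MetaBCC-LR.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads in the dataset."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def coverage_histogram(read, counts, k, bin_width=2, n_bins=4):
    """Histogram the dataset-wide counts of the k-mers in one read.

    Counts at or above bin_width * (n_bins - 1) fall into the last bin.
    The result is normalized so that the bins sum to 1.
    """
    hist = [0] * n_bins
    for i in range(len(read) - k + 1):
        c = counts[read[i:i + k]]
        hist[min(c // bin_width, n_bins - 1)] += 1
    total = sum(hist)
    if total == 0:
        return [0.0] * n_bins
    return [h / total for h in hist]


reads = ["AAAA", "AAAA", "ACGT"]
counts = count_kmers(reads, k=2)
abundant = coverage_histogram("AAAA", counts, k=2)
rare = coverage_histogram("ACGT", counts, k=2)
```

    In this toy input the k-mers of the repeated read land in the highest bin while those of the singleton read land in the lowest, which is the kind of abundance signal such a histogram is meant to capture without alignment.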

    GraphBin2: Refined and Overlapped Binning of Metagenomic Contigs Using Assembly Graphs

    Metagenomic sequencing allows us to study structure, diversity and ecology in microbial communities without the necessity of obtaining pure cultures. In many metagenomics studies, the reads obtained from metagenomics sequencing are first assembled into longer contigs, and these contigs are then binned into clusters in which all contigs are expected to come from the same species. As different species may share common sequences in their genomes, one assembled contig may belong to multiple species. However, existing tools for contig binning only support non-overlapped binning, i.e., each contig is assigned to at most one bin (species). In this paper, we introduce GraphBin2, which refines the binning results obtained from existing tools and, more importantly, is able to assign contigs to multiple bins. GraphBin2 uses the connectivity and coverage information from assembly graphs to adjust existing binning results on contigs and to infer contigs shared by multiple species. Experimental results on both simulated and real datasets demonstrate that GraphBin2 not only improves the binning results of existing tools but also supports assigning contigs to multiple bins.
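
    To give a feel for how graph connectivity can refine an existing binning, the sketch below propagates bin labels to unbinned contigs by an unambiguous majority vote among their already-binned neighbours. It is a deliberately simplified stand-in: it ignores coverage and overlapped bins, and it is not GraphBin2's algorithm. The function name `propagate_bins` and the input layout are assumptions.

```python
from collections import Counter


def propagate_bins(edges, bins):
    """Spread bin labels over an undirected contig graph.

    edges: iterable of (contig, contig) pairs.
    bins:  dict mapping already-binned contigs to a bin label.
    An unbinned contig takes the label held by a strict majority of its
    labelled neighbours; ties leave it unbinned. Repeats until stable.
    """
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    labels = dict(bins)
    while True:
        updates = {}
        for node, adjacent in neighbours.items():
            if node in labels:
                continue
            votes = Counter(labels[n] for n in adjacent if n in labels)
            top = votes.most_common(2)
            if len(top) == 1 or (len(top) == 2 and top[0][1] > top[1][1]):
                updates[node] = top[0][0]
        if not updates:
            return labels
        labels.update(updates)


edges = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
result = propagate_bins(edges, {"c1": "A", "c4": "B"})
```

    Here `c2` and `c3` inherit bin `A` through the chain from `c1`, and `c5` inherits bin `B` from `c4`.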

    Binning long reads in metagenomics datasets using composition and coverage information

    Background: Advancements in metagenomics sequencing allow the study of microbial communities directly from their environments. Metagenomics binning is a key step in the species characterisation of microbial communities. Next-generation sequencing reads are usually assembled into contigs for metagenomics binning, mainly due to the limited information within short reads. Third-generation sequencing provides much longer reads, with lengths similar to those of contigs assembled from short reads. However, existing contig-binning tools cannot be directly applied to long reads due to the absence of coverage information and the presence of high error rates. The few existing long-read binning tools either use only composition or use composition and coverage information separately. This may ignore bins that correspond to low-abundance species or erroneously split bins that correspond to species with non-uniform coverage. Here we present a reference-free binning approach, LRBinner, that combines composition and coverage information of complete long-read datasets. LRBinner also uses a distance-histogram-based clustering algorithm to extract clusters with varying sizes. Results: The experimental results on both simulated and real datasets show that LRBinner achieves the best binning accuracy in most cases while handling the complete datasets without any sampling. Moreover, we show that binning reads using LRBinner prior to assembly reduces the computational resources required for assembly while attaining satisfactory assembly quality. Conclusion: LRBinner shows that deep-learning techniques can be used for effective feature aggregation to support the metagenomics binning of long reads. Furthermore, accurate binning of long reads supports improvements in metagenomics assembly, especially in complex datasets. Binning also helps to reduce the resources required for assembly. Source code for LRBinner is freely available at https://github.com/anuradhawick/LRBinner

    Phylogenetic Tree Construction Using K-Mer Forest-Based Distance Calculation

    Phylogenetics is one of the dominant data engineering research disciplines based on biological information. Here, we consider raw DNA sequences and perform comparative analysis in order to draw important conclusions. The phylogenetic tree is a significant aid for representing evolutionary relationships among different organisms in a concise manner. When constructing phylogenetic trees, the elementary step is to calculate the genetic distance among species. Alignment-based and alignment-free approaches are the two main distance computation methods used to estimate the genetic relatedness of different species. In this paper we propose a novel alignment-free, pairwise distance calculation method based on k-mers and a state-of-the-art machine learning-based phylogenetic tree construction mechanism. With the proposed approach we can convert longer DNA sequences into compact k-mer forests, which speed up comparison. We then construct the phylogenetic tree from the calculated distances with the help of an algorithm built upon k-medoid clustering, which offers improved efficiency and accuracy compared to traditional phylogenetic tree construction methods.
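
    The abstract does not describe the k-mer forest structure, so the sketch below substitutes a well-known alignment-free measure, the Jaccard distance between k-mer sets, simply to show what a pairwise k-mer-based distance matrix looks like. All function names and the default `k=3` are assumptions, and this is not the method proposed in the paper.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=3):
    """Alignment-free distance: 1 - |A & B| / |A | B| over k-mer sets."""
    set_a, set_b = kmer_set(a, k), kmer_set(b, k)
    union = set_a | set_b
    if not union:
        # Both sequences are shorter than k; treat them as identical.
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def distance_matrix(seqs, k=3):
    """Symmetric matrix of pairwise distances, as nested lists."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = jaccard_distance(seqs[i], seqs[j], k)
            matrix[i][j] = matrix[j][i] = d
    return matrix


matrix = distance_matrix(["ACGT", "ACGA", "TTTT"], k=2)
```

    A matrix of this form is the usual input to distance-based tree construction or to a clustering step such as k-medoids.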
