18 research outputs found

    The oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments

    Background: The characterisation, or binning, of metagenome fragments is an important first step to further downstream analysis of microbial consortia. Here, we propose a one-dimensional signature, OFDEG, derived from the oligonucleotide frequency profile of a DNA sequence, and show that it is possible to obtain a meaningful phylogenetic signal for relatively short DNA sequences. The one-dimensional signal is essentially a compact representation of higher dimensional feature spaces of greater complexity and is intended to improve on the tetranucleotide frequency feature space preferred by current compositional binning methods. Results: We compare the fidelity of OFDEG against tetranucleotide frequency in both an unsupervised and semi-supervised setting on simulated metagenome benchmark data. Four tests were conducted using assembler output of Arachne and phrap, and for each, performance was evaluated on contigs which are greater than or equal to 8 kbp in length and contigs which are composed of at least 10 reads. Using G-C content in conjunction with OFDEG gave an average accuracy of 96.75% (semi-supervised) and 95.19% (unsupervised), versus 94.25% (semi-supervised) and 82.35% (unsupervised) for tetranucleotide frequency. Conclusion: We have presented an observation of an alternative characteristic of DNA sequences. The proposed feature representation has proven to be more beneficial than the existing tetranucleotide frequency space to the metagenome binning problem. We do note, however, that our observation of OFDEG deserves further analysis and investigation. Unsupervised clustering revealed OFDEG related features performed better than standard tetranucleotide frequency in representing a relevant organism specific signal. Further improvement in binning accuracy is given by semi-supervised classification using OFDEG.
The emphasis on a feature-driven, bottom-up approach to the problem of binning reveals promising avenues for future development of techniques to characterise short environmental sequences without bias toward cultivable organisms.
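
The abstract above compares OFDEG against two standard compositional features, G-C content and tetranucleotide frequency. As a rough illustration of those two baseline features only (this is not OFDEG, whose definition is not given in the abstract), the following Python sketch computes them for a single sequence. The function names are hypothetical, and the choice to skip windows containing ambiguous bases and to leave reverse complements unmerged is an assumption made for brevity.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def tetranucleotide_frequency(seq, k=4):
    """Relative frequency of every k-mer over A, C, G, T (256 entries for k=4).

    Windows containing any other character (for example N) are ignored.
    Reverse-complement k-mers are NOT merged in this sketch.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: counts[kmer] / total for kmer in kmers}


if __name__ == "__main__":
    contig = "ACGTACGT"  # toy sequence, not real data
    print(gc_content(contig))
    print(tetranucleotide_frequency(contig)["ACGT"])
```

In practice a binning method would compute such a vector per contig and pass the vectors to a clustering or classification step; that step is outside the scope of this sketch.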

    Unsupervised discovery of microbial population structure within metagenomes using nucleotide base composition

    An approach to infer the unknown microbial population structure within a metagenome is to cluster nucleotide sequences based on common patterns in base composition, otherwise referred to as binning. When functional roles are assigned to the identified populations, a deeper understanding of microbial communities can be attained, more so than gene-centric approaches that explore overall functionality. In this study, we propose an unsupervised, model-based binning method with two clustering tiers, which uses a novel transformation of the oligonucleotide frequency-derived error gradient and GC content to generate coarse groups at the first tier of clustering; and tetranucleotide frequency to refine these groups at the secondary clustering tier. The proposed method has a demonstrated improvement over PhyloPythia, S-GSOM, TACOA and TaxSOM on all three benchmarks that were used for evaluation in this study. The proposed method is then applied to a pyrosequenced metagenomic library of mud volcano sediment sampled in southwestern Taiwan, with the inferred population structure validated against complementary sequencing of 16S ribosomal RNA marker genes. Finally, the proposed method was further validated against four publicly available metagenomes, including a highly complex Antarctic whale-fall bone sample, which was previously assumed to be too complex for binning prior to functional analysis

    Accurate reconstruction of viral quasispecies spectra through improved estimation of strain richness

    Background: Estimating the number of different species (richness) in a mixed microbial population has been a main focus in metagenomic research. Existing methods of species richness estimation rely on the assumption that the reads in each assembled contig correspond to only one of the microbial genomes in the population. This assumption and the underlying probabilistic formulations of existing methods are not useful for quasispecies populations where the strains are highly genetically related. The lack of knowledge on the number of different strains in a quasispecies population is observed to hinder the precision of existing Viral Quasispecies Spectrum Reconstruction (QSR) methods due to the uncontrolled reconstruction of a large number of in silico false positives. In this work, we formulated a novel probabilistic method for strain richness estimation specifically targeting viral quasispecies. By using this approach we improved our recently proposed spectrum reconstruction pipeline ViQuaS to achieve higher levels of precision in reconstructed quasispecies spectra without compromising the recall rates. We also discuss how one other existing popular QSR method named ShoRAH can be improved using this new approach. Results: On benchmark data sets, our estimation method provided accurate richness estimates (< 0.2 median estimation error) and improved the precision of ViQuaS by 2%-13% and F-score by 1%-9% without compromising the recall rates. We also demonstrate that our estimation method can be used to improve the precision and F-score of ShoRAH by 0%-7% and 0%-5% respectively. Conclusions: The proposed probabilistic estimation method can be used to estimate the richness of viral populations with a quasispecies behavior and to improve the accuracy of the quasispecies spectra reconstructed by the existing methods ViQuaS and ShoRAH in the presence of a moderate level of technical sequencing errors
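
The results above are reported as precision, recall and F-score. As a reminder of how these three quantities relate, here is a minimal Python sketch using the standard definitions. The function name and the counts in the usage example are hypothetical, and the abstract does not state how a reconstructed strain is matched to a true strain, so the counting of true and false positives is assumed to have been done beforehand.

```python
def precision_recall_fscore(true_pos, false_pos, false_neg):
    """Standard precision, recall and F1 from raw counts.

    true_pos:  reconstructed strains that match a true strain
    false_pos: reconstructed strains with no true counterpart
    false_neg: true strains that were not reconstructed
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = precision + recall
    fscore = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, fscore


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(precision_recall_fscore(true_pos=8, false_pos=2, false_neg=2))
```

Because the F-score is the harmonic mean of precision and recall, removing false positives (as a better richness estimate is intended to do) raises precision and the F-score while leaving recall unchanged.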

    A new peak detection algorithm for MALDI mass spectrometry data based on a modified Asymmetric Pseudo-Voigt model

    Background: Mass Spectrometry (MS) is a ubiquitous analytical tool in biological research and is used to measure the mass-to-charge ratio of bio-molecules. Peak detection is the essential first step in MS data analysis. Precise estimation of peak parameters such as peak summit location and peak area is critical to identify underlying bio-molecules and to estimate their abundances accurately. We propose a new method to detect and quantify peaks in mass spectra. It uses dual-tree complex wavelet transformation along with Stein's unbiased risk estimator for spectra smoothing. Then, a new method, based on the modified Asymmetric Pseudo-Voigt (mAPV) model and hierarchical particle swarm optimization, is used for peak parameter estimation. Results: Using simulated data, we demonstrated the benefit of using the mAPV model over Gaussian, Lorentz and Bi-Gaussian functions for MS peak modelling. The proposed mAPV model achieved the best fitting accuracy for asymmetric peaks, with lower percentage errors in peak summit location estimation, which were 0.17% to 4.46% less than those of the other models. It also outperformed the other models in peak area estimation, delivering lower percentage errors, which were about 0.7% less than its closest competitor - the Bi-Gaussian model. In addition, using data generated from a MALDI-TOF computer model, we showed that the proposed overall algorithm outperformed the existing methods mainly in terms of sensitivity. It achieved a sensitivity of 85%, compared to 77% and 71% of the two benchmark algorithms, continuous wavelet transformation based method and Cromwell respectively. Conclusions: The proposed algorithm is particularly useful for peak detection and parameter estimation in MS data with overlapping peak distributions and asymmetric peaks. The algorithm is implemented using MATLAB and the source code is freely available at http://mapv.sourceforge.net

    ENVirT: inference of ecological characteristics of viruses from metagenomic data

    Background: Estimating the parameters that describe the ecology of viruses, particularly those that are novel, can be made possible using metagenomic approaches. However, the best-performing existing methods require databases to first estimate an average genome length of a viral community before being able to estimate other parameters, such as viral richness. Although this approach has been widely used, it can adversely skew results since the majority of viruses are yet to be catalogued in databases. Results: In this paper, we present ENVirT, a method for estimating the richness of novel viral mixtures, and for the first time we also show that it is possible to simultaneously estimate the average genome length without a priori information. This is shown to be a significant improvement over database-dependent methods, since we can now robustly analyze samples that may include novel viral types under-represented in current databases. We demonstrate that the viral richness estimates produced by ENVirT are several orders of magnitude higher in accuracy than the estimates produced by existing methods named PHACCS and CatchAll when benchmarked against simulated data. We repeated the analysis of 20 metavirome samples using ENVirT, which produced results in close agreement with complementary in vitro analyses. Conclusions: These insights were previously not captured by existing computational methods. As such, ENVirT is shown to be an essential tool for enhancing our understanding of novel viral populations. This work was supported partially by Australian Research Council [grant numbers LP140100670 and DP150103512] and the Biodiversity Research Center, Academia Sinica, Taiwan. DJ, DH, DS and YS were funded by the MIFRS and MIRS scholarships of The University of Melbourne. Publication costs were funded by The Australian National University

    Prokaryotic assemblages and metagenomes in pelagic zones of the South China Sea

    Background: Prokaryotic microbes, the most abundant organisms in the ocean, are remarkably diverse. Despite numerous studies of marine prokaryotes, the zonation of their communities in pelagic zones has been poorly delineated. By exploiting the persistent stratification of the South China Sea (SCS), we performed a 2-year, large spatial scale (10, 100, 1000, and 3000 m) survey, which included a pilot study in 2006 and comprehensive sampling in 2007, to investigate the biological zonation of bacteria and archaea using 16S rRNA tag and shotgun metagenome sequencing. Results: Alphaproteobacteria dominated the bacterial community in the surface SCS, where the abundance of Betaproteobacteria was seemingly associated with climatic activity. Gammaproteobacteria thrived in the deep SCS, where a noticeable amount of Cyanobacteria were also detected. Marine Groups II and III Euryarchaeota were predominant in the archaeal communities in the surface and deep SCS, respectively. Bacterial diversity was higher than archaeal diversity at all sampling depths in the SCS, and peaked at mid-depths, agreeing with the diversity pattern found in global water columns. Metagenomic analysis not only showed differential %GC values and genome sizes between the surface and deep SCS, but also demonstrated depth-dependent metabolic potentials, such as cobalamin biosynthesis at 10 m, osmoregulation at 100 m, signal transduction at 1000 m, and plasmid and phage replication at 3000 m. When compared with other oceans, urease at 10 m and both exonuclease and permease at 3000 m were more abundant in the SCS. Finally, enriched genes associated with nutrient assimilation in the sea surface and transposase in the deep-sea metagenomes exemplified the functional zonation in global oceans. Conclusions: Prokaryotic communities in the SCS stratified with depth, with maximal bacterial diversity at mid-depth, in accordance with global water columns. 
The SCS had functional zonation among depths and endemically enriched metabolic potentials at the study site, in contrast to other oceans

    Comprehensive Insights Into Composition, Metabolic Potentials, and Interactions Among Archaeal, Bacterial, and Viral Assemblages in Meromictic Lake Shunet in Siberia

    Microorganisms are critical to maintaining stratified biogeochemical characteristics in meromictic lakes; however, their community composition and potential roles in nutrient cycling are not thoroughly described. Both metagenomics and metaviromics were used to determine the composition and capacity of archaea, bacteria, and viruses along the water column in the landlocked meromictic Lake Shunet in Siberia. Deep sequencing of 265 Gb and high-quality assembly revealed a near-complete genome corresponding to Nonlabens sp. sh3vir. in a viral sample and 38 bacterial bins (0.2–5.3 Mb each). The mixolimnion (3.0 m) had the most diverse archaeal, bacterial, and viral communities, followed by the monimolimnion (5.5 m) and chemocline (5.0 m). The bacterial and archaeal communities were dominated by Thiocapsa and Methanococcoides, respectively, whereas the viral community was dominated by Siphoviridae. The archaeal and bacterial assemblages and the associated energy metabolism were significantly related to the various depths, in accordance with the stratification of physicochemical parameters. Reconstructed elemental nutrient cycles of the three layers were interconnected, including co-occurrence of denitrification and nitrogen fixation in each layer and involved unique processes due to specific biogeochemical properties at the respective depths. According to the gene annotation, several predominant yet unknown and uncultured bacteria also play potentially important roles in nutrient cycling. Reciprocal BLAST analysis revealed that the viruses were specific to the host archaea and bacteria in the mixolimnion. This study provides insights into the bacterial, archaeal, and viral assemblages and the corresponding capacity potentials in Lake Shunet, one of the three meromictic lakes in central Asia. 
Lake Shunet was determined to harbor specific and diverse viral, bacterial, and archaeal communities that intimately interacted, revealing patterns shaped by indigenous physicochemical parameters

    Unsupervised discovery of microbial population structure within metagenomes using nucleotide base composition

    © 2011 Dr. Isaam Saeed. Tapping into the remarkable power of the uncultured majority of microbial organisms is the driving force of metagenomics. Metagenomics is the study of a microbial community’s genetic content when sampled directly from the environment. Given that microbial genomes within an environmental sample are fragmented prior to sequencing, the association of a genomic DNA fragment to its original genome is not known. As a result, the underlying population structure of the sampled microbial community is also unknown. While it is still possible to analyse the overall function of a microbial community, the functional roles of individual populations and the interactions between them cannot be examined. An approach to infer the underlying population structure of a metagenome is to group sequenced DNA fragments using common patterns in nucleotide base composition that are representative of a particular population (or a group of related populations). The primary challenges for any such method however are the taxonomic resolution and accuracy at which sequences are grouped. These are dependent on both the representation of patterns in DNA sequences and the method of grouping similar patterns. In this study, the oligonucleotide frequency derived error gradient (OFDEG), a novel representation of metagenomic sequences, is first proposed. In addition to grouping related metagenomic sequences, the OFDEG measure is also used to examine how patterns in base composition vary within a microbial genome. A model-based clustering framework is then developed to deal with the ambiguity and noise that affect the cluster distribution of patterns extracted from real-world metagenomic data. The concept of patterns in base composition is then extended to short metagenomic sequences (less than 1000 base-pairs in length), with the proposal of two novel representations based on dinucleotide frequency. 
The methods developed in this study are evaluated on simulated benchmark data sets and are shown to perform with greater accuracy and resolution than currently available methods. Further validation against publicly available metagenomes produced results which were in accordance with reported analyses of sample diversity. Finally, the proposed methods are applied to four pyrosequenced metagenomic libraries of samples taken from a mud volcano in southwestern Taiwan. The inferred population structure and function were found to be consistent with complementary marker gene analysis as well as the local geochemistry of the sampling site