131 research outputs found

    CAMISIM: Simulating metagenomes and microbial communities

    Get PDF
    © 2019 The Author(s). Background: Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets are required. Results: We describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series, and differential abundance studies, includes real and simulated strain-level diversity, and generates second- and third-generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMSIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes, we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT, and metaSPAdes, on several thousand small data sets generated with CAMISIM. Conclusions: CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with standards of truth for method evaluation

    Species-level functional profiling of metagenomes and metatranscriptomes.

    Get PDF
    Functional profiles of microbial communities are typically generated using comprehensive metagenomic or metatranscriptomic sequence read searches, which are time-consuming, prone to spurious mapping, and often limited to community-level quantification. We developed HUMAnN2, a tiered search strategy that enables fast, accurate, and species-resolved functional profiling of host-associated and environmental communities. HUMAnN2 identifies a community's known species, aligns reads to their pangenomes, performs translated search on unclassified reads, and finally quantifies gene families and pathways. Relative to pure translated search, HUMAnN2 is faster and produces more accurate gene family profiles. We applied HUMAnN2 to study clinal variation in marine metabolism, ecological contribution patterns among human microbiome pathways, variation in species' genomic versus transcriptional contributions, and strain profiling. Further, we introduce 'contributional diversity' to explain patterns of ecological assembly across different microbial community types

    Decontamination of MDA Reagents for Single Cell Whole Genome Amplification

    Get PDF
    Single cell genomics is a powerful and increasingly popular tool for studying the genetic make-up of uncultured microbes. A key challenge for successful single cell sequencing and analysis is the removal of exogenous DNA from whole genome amplification reagents. We found that UV irradiation of the multiple displacement amplification (MDA) reagents, including the Phi29 polymerase and random hexamer primers, effectively eliminates the amplification of contaminating DNA. The methodology is quick, simple, and highly effective, thus significantly improving whole genome amplification from single cells

    Streaming histogram sketching for rapid microbiome analytics

    Get PDF
    Background: The growth in publically available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time. Results: We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed ‘histosketch’ that can efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using the pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality sensitive hashing indexing scheme. Furthermore, we use a ‘real life’ example to show that histosketches can train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a random forest classifier that could accurately predict whether the neonate had received antibiotic treatment (97% accuracy, 96% precision) and could subsequently be used to classify microbiome data streams in less than 3 s. Conclusions: Our method offers a new approach to rapidly process microbiome data streams, allowing samples to be rapidly clustered, indexed and classified. We also provide our implementation, Histosketching Using Little K-mers (HULK), which can histosketch a typical 2 GB microbiome in 50 s on a standard laptop using four cores, with the sketch occupying 3000 bytes of disk space

    Exploring neighborhoods in large metagenome assembly graphs reveals hidden sequence diversity

    Get PDF
    Genomes computationally inferred from large metagenomic data sets are often incomplete and may be missing functionally important content and strain variation. We introduce an information retrieval system for large metagenomic data sets that exploits the sparsity of DNA assembly graphs to efficiently extract subgraphs surround- ing an inferred genome. We apply this system to recover missing content from genome bins and show that substantial genomic se- quence variation is present in a real metagenome. Our software implementation is available at https://github.com/spacegraphcats/ spacegraphcats under the 3-Clause BSD License

    A genomic catalog of Earth’s microbiomes

    Get PDF
    The reconstruction of bacterial and archaeal genomes from shotgun metagenomes has enabled insights into the ecology and evolution of environmental and host-associated microbiomes. Here we applied this approach to >10,000 metagenomes collected from diverse habitats covering all of Earth’s continents and oceans, including metagenomes from human and animal hosts, engineered environments, and natural and agricultural soils, to capture extant microbial, metabolic and functional potential. This comprehensive catalog includes 52,515 metagenome-assembled genomes representing 12,556 novel candidate species-level operational taxonomic units spanning 135 phyla. The catalog expands the known phylogenetic diversity of bacteria and archaea by 44% and is broadly available for streamlined comparative analyses, interactive exploration, metabolic modeling and bulk download. We demonstrate the utility of this collection for understanding secondary-metabolite biosynthetic potential and for resolving thousands of new host linkages to uncultivated viruses. This resource underscores the value of genome-centric approaches for revealing genomic properties of uncultivated microorganisms that affect ecosystem processes

    ViennaRNA Package 2.0

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Secondary structure forms an important intermediate level of description of nucleic acids that encapsulates the dominating part of the folding energy, is often well conserved in evolution, and is routinely used as a basis to explain experimental findings. Based on carefully measured thermodynamic parameters, exact dynamic programming algorithms can be used to compute ground states, base pairing probabilities, as well as thermodynamic properties.</p> <p>Results</p> <p>The <monospace>ViennaRNA</monospace> Package has been a widely used compilation of RNA secondary structure related computer programs for nearly two decades. Major changes in the structure of the standard energy model, the <it>Turner 2004 </it>parameters, the pervasive use of multi-core CPUs, and an increasing number of algorithmic variants prompted a major technical overhaul of both the underlying <monospace>RNAlib</monospace> and the interactive user programs. New features include an expanded repertoire of tools to assess RNA-RNA interactions and restricted ensembles of structures, additional output information such as <it>centroid </it>structures and <it>maximum expected accuracy </it>structures derived from base pairing probabilities, or <it>z</it>-<it>scores </it>for locally stable secondary structures, and support for input in <monospace>fasta</monospace> format. Updates were implemented without compromising the computational efficiency of the core algorithms and ensuring compatibility with earlier versions.</p> <p>Conclusions</p> <p>The <monospace>ViennaRNA Package 2.0</monospace>, supporting concurrent computations <monospace>via OpenMP</monospace>, can be downloaded from <url>http://www.tbi.univie.ac.at/RNA</url>.</p

    Critical Assessment of Metagenome Interpretation:A benchmark of metagenomics software

    Get PDF
    International audienceIn metagenome analysis, computational methods for assembly, taxonomic profilingand binning are key components facilitating downstream biological datainterpretation. However, a lack of consensus about benchmarking datasets andevaluation metrics complicates proper performance assessment. The CriticalAssessment of Metagenome Interpretation (CAMI) challenge has engaged the globaldeveloper community to benchmark their programs on datasets of unprecedentedcomplexity and realism. Benchmark metagenomes were generated from newlysequenced ~700 microorganisms and ~600 novel viruses and plasmids, includinggenomes with varying degrees of relatedness to each other and to publicly availableones and representing common experimental setups. Across all datasets, assemblyand genome binning programs performed well for species represented by individualgenomes, while performance was substantially affected by the presence of relatedstrains. Taxonomic profiling and binning programs were proficient at high taxonomicranks, with a notable performance decrease below the family level. Parametersettings substantially impacted performances, underscoring the importance ofprogram reproducibility. While highlighting current challenges in computationalmetagenomics, the CAMI results provide a roadmap for software selection to answerspecific research questions

    Large expert-curated database for benchmarking document similarity detection in biomedical literature search

    Get PDF
    Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that cover a variety of research fields such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article/s. The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency–Inverse Document Frequency and PubMed Related Articles) had similar overall performances. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for the downloading of annotation data and the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new powerful techniques for title and title/abstract-based search engines for relevant articles in biomedical research
    corecore