177 research outputs found

    Deciphering ancient microbes with modern population genomic databases

    Metagenomics reveals the unprecedented genetic variation of microbial communities, including those from ancient human remains. The analysis of metagenomic data begins with taxonomic prediction of all microbes in the sample. Recent evaluation studies (1) demonstrate that current methods for taxonomic prediction either lack sufficient sensitivity for species-level assignments or suffer from false positives, overestimating the number of species in the metagenome. Both problems are especially serious for the identification of endogenous pathogens at low abundance, which is common in ancient metagenomic samples. In addition, the reference genomes used in the predictions are limited and biased towards pathogens over environmental species. Reads from unknown sources, e.g. unknown environmental strains, can map by chance onto distantly related pathogens. We designed a new method, SPARSE, which improves the taxonomic prediction of metagenomic data. SPARSE normalizes existing biased databases by grouping reference genomes into similarity-based hierarchical clusters (Fig. 1). SPARSE also filters out reads from unknown sources using a probabilistic model, hence avoiding over-enthusiastic matches to known pathogens. Our evaluation using both simulations and real ancient samples demonstrated SPARSE's improved precision in comparison to other methods. We have also integrated SPARSE into EnteroBase. EnteroBase is a centralized database that gives users free access, through a graphical web interface, to the genomes and molecular typing of ≥200K bacterial strains from several important pathogens. EnteroBase includes automatic pipelines that characterize bacterial strains based on short reads from public databases or uploaded by registered users. Here we demonstrate the utility of SPARSE in EnteroBase using 22 previously published ancient plague samples (2-7). The Yersinia pestis-specific reads were extracted by SPARSE and compared with 714 modern relatives in the EnteroBase Yersinia database (Fig. 2). The combination of SPARSE and EnteroBase allows reliable placement of aDNA within the entire evolutionary history of Y. pestis.
    1. A. Sczyrba et al., BioRxiv (2017).
    2. K. I. Bos et al., Nature 478, 506 (2011).
    3. K. I. Bos et al., eLife 5 (2016).
    4. M. A. Spyrou et al., Cell Host Microbe 19, 874 (2016).
    5. V. A. Andrades et al., Curr. Biol. 27, 3683 (2017).
    6. M. Feldman et al., Mol. Biol. Evol. 33, 2911 (2016).
    7. S. Rasmussen et al., Cell 163, 571 (2015).

    Population structure of E. coli according to MLST and core genome sequences.

    MLST provides much lower resolution than do genomic sequences, but both types of data indicate that much of the general population structure consists of clusters of related bacterial isolates that are more distantly related to those in other discrete clusters. In both approaches, genetic distances are calculated on genes within the core genome and exclude genes on mobile genetic elements in the accessory genome (plasmids, bacteriophages, ICEs, transposons, and IS elements), which are readily transmitted between unrelated bacterial clusters and are also frequently lost. (A) Minimal spanning tree of allelic differences at seven MLST gene fragments for 540 bacterial isolates that are in the related ST95 (267 isolates), ST131 (193), and ST648 (80) complexes. The data are from the E. coli MLST website (http://mlst.warwick.ac.uk), and color-coding reflects pathogen type. (B) Minimal spanning tree of pairwise differences at core genome SNPs from 91 Shiga toxin-producing E. coli (STEC) [21] (O6:H16: 2 isolates; O121:H19: 26; O145:NM: 7; O157:H7/H-: 56), color-coded by serotype. The genomic analysis was performed by Hannes Pouseele (Applied Maths, Belgium) with the permission of Rebecca Lindsey, Eija Trees, Nancy Strockbine, and Peter Gerner-Smidt (Centers for Disease Control and Prevention (CDC), Atlanta, Georgia). Minimal spanning trees were calculated with Bionumerics (Applied Maths).
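As a rough illustration of panel (A), the sketch below builds a minimal spanning tree from allelic differences between seven-locus MLST profiles using Prim's algorithm. It is not the Bionumerics procedure used for the figure; the ST names, the allele numbers and the function names are all hypothetical.

```python
def allelic_distance(a, b):
    """Number of loci at which two allelic profiles differ."""
    return sum(x != y for x, y in zip(a, b))

def minimum_spanning_tree(profiles):
    """Prim's algorithm; returns (st_in_tree, st_added, distance) edges."""
    names = list(profiles)
    if not names:
        return []
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Cheapest edge that links a node in the tree to one outside it.
        d, u, v = min(
            (allelic_distance(profiles[u], profiles[v]), u, v)
            for u in in_tree
            for v in names
            if v not in in_tree
        )
        edges.append((u, v, d))
        in_tree.add(v)
    return edges

# Hypothetical seven-locus profiles (allele numbers are made up).
profiles = {
    "ST_A": (1, 1, 1, 1, 1, 1, 1),
    "ST_B": (1, 1, 1, 1, 1, 1, 2),
    "ST_C": (1, 1, 1, 1, 1, 3, 2),
    "ST_D": (4, 4, 4, 4, 1, 3, 2),
}
mst_edges = minimum_spanning_tree(profiles)
for u, v, d in mst_edges:
    print(f"{u} -- {v}: {d} allelic difference(s)")
```

Real MLST data usually contain many equally short edges, so production tools apply additional tie-breaking rules that this sketch omits.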

    Age estimates and Bayes Factors from BEAST analyses of 864 non-repetitive, non-recombinant, non-homoplastic core SNPs from 73 eBG54 (Agona) genomes.

    Note: Highest Bayes factors are indicated by bold, italic fonts. Path sampling and stepping-stone analyses were performed over a series of 100 steps along the path, with a chain of 1M samples per step.

    Comparisons of treeModel.rootHeight estimates by BEAST with different SNP calls and different numbers of genomes.

    A. Distribution of numbers of estimates of rootHeight as a percentage of all estimates in BEAST analyses according to the best model in Table 1. The numbers were from samples taken every 1000 steps over a total of 200 million steps (4 genomes) or 50 million steps (73 genomes), after excluding the first 10 million steps as burn-in. Mean values of rootHeight are indicated next to arrows. Inset, different scale for values of rootHeight over 500 years. B. Representation of the individual rootHeight values for each sample over the last 40 million steps. Pettengill, 4 genomes: uses the SNP calls calculated by Pettengill [15]; Zhou, 4 genomes: uses the SNP calls for the same four genomes extracted from the core genomes in Zhou et al. [9]; Zhou, 73 genomes: uses the core genome SNPs from all 73 genomes in [9].
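The sampling arithmetic in this caption can be checked with a small sketch. It assumes one logged value per 1000 steps and that the stated totals include the burn-in; the function names and the toy rootHeight values are hypothetical and are not taken from the BEAST logs.

```python
from statistics import mean

def discard_burnin(values, sample_every, burnin_steps):
    """Drop the logged values that fall inside the burn-in period."""
    n_discard = burnin_steps // sample_every
    return values[n_discard:]

def retained_sample_count(total_steps, sample_every, burnin_steps):
    """Number of logged values left after burn-in (assumes total includes burn-in)."""
    return (total_steps - burnin_steps) // sample_every

# Toy chain: early values are far from the region the chain settles into.
toy_root_heights = [900.0, 400.0, 120.0, 80.0, 90.0, 85.0, 75.0]
kept = discard_burnin(toy_root_heights, sample_every=1000, burnin_steps=3000)
kept_mean = mean(kept)

# Chain lengths stated in the caption.
n_73_genomes = retained_sample_count(50_000_000, 1000, 10_000_000)
n_4_genomes = retained_sample_count(200_000_000, 1000, 10_000_000)
print(kept_mean, n_73_genomes, n_4_genomes)
```

Under these assumptions the 73-genome run retains samples from 40 million post-burn-in steps, which agrees with the "last 40 million steps" shown in panel B.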

    Minimal spanning tree of 150 eBGs and 1,368 STs within 6,309 isolates of S. enterica subspecies enterica.

    Each circle is one ST, whose radius is proportional to the number of entries of that ST at the S. enterica MLST website (http://mlst.warwick.ac.uk/, May 2015), and presented as a pie-chart colored according to source of isolates, or white for isolates from other sources or with missing data. STs that differ by 1/7 MLST loci are connected by a thick line and STs that differ by 2/7 are connected by a thin line. eBGs (groups of STs linked by thick lines) are emphasized by gray shading outside the circles. eBGs and STs referred to explicitly in the Introduction are designated by arrows plus information about their eBG/ST designation and serovar. Lineage 3 is the set of STs and eBGs radiating towards 08:00.

    Table S9 from Enterobase: hierarchical clustering of 100,000s of bacterial genomes into species/sub-species and populations

    The definition of bacterial species is traditionally a taxonomic issue, while bacterial populations are identified by population genetics. These assignments are species specific and depend on the practitioner. Legacy multilocus sequence typing is commonly used to identify sequence types (STs) and clusters (ST Complexes). However, these approaches are not adequate for the millions of genomic sequences from bacterial pathogens that have been generated since 2012. EnteroBase (http://enterobase.warwick.ac.uk) automatically clusters core genome MLST allelic profiles into hierarchical clusters (HierCC) after assembling annotated draft genomes from short-read sequences. HierCC clusters span core sequence diversity from the species level down to individual transmission chains. Here we evaluate HierCC's ability to correctly assign 100,000s of genomes to the species/subspecies and population levels for Salmonella, Escherichia, Clostridioides, Yersinia, Vibrio and Streptococcus. HierCC assignments were more consistent with maximum-likelihood super-trees of core SNPs or presence/absence of accessory genes than classical taxonomic assignments or 95% ANI. However, neither HierCC nor ANI was uniformly consistent with classical taxonomy of Streptococcus. HierCC was also consistent with legacy eBGs/ST Complexes in Salmonella or Escherichia and with O serogroups in Salmonella. Thus, EnteroBase HierCC supports the automated identification of and assignment to species/subspecies and populations for multiple genera. This article is part of a discussion meeting issue ‘Genomic population structures of microbial pathogens’.
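To make the idea of clustering allelic profiles at several levels concrete, here is a minimal sketch that assumes single-linkage clustering on the number of differing alleles, repeated at a few distance thresholds. It is an illustration only and not the EnteroBase implementation, which must also handle missing alleles and very large datasets; the genome names, profiles, thresholds and function names are hypothetical.

```python
from itertools import combinations

def allelic_distance(a, b):
    """Number of loci at which two allelic profiles differ."""
    return sum(x != y for x, y in zip(a, b))

def single_linkage(profiles, threshold):
    """Map each genome to a cluster label (smallest member name) at one threshold."""
    parent = {name: name for name in profiles}

    def find(name):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(profiles, 2):
        if allelic_distance(profiles[a], profiles[b]) <= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Keep the smaller name as root so labels are deterministic.
                parent[max(root_a, root_b)] = min(root_a, root_b)
    return {name: find(name) for name in profiles}

def hierarchical_clusters(profiles, thresholds):
    """Cluster assignments at every threshold, from fine to coarse."""
    return {t: single_linkage(profiles, t) for t in thresholds}

# Hypothetical five-locus profiles; real cgMLST schemes use thousands of loci.
profiles = {
    "g1": (1, 1, 1, 1, 1),
    "g2": (1, 1, 1, 1, 2),
    "g3": (1, 1, 1, 3, 2),
    "g4": (5, 5, 5, 3, 2),
}
levels = hierarchical_clusters(profiles, thresholds=[0, 1, 3])
for t, assignment in levels.items():
    print(t, assignment)
```

Because single-linkage clusters at a larger threshold always contain the clusters formed at a smaller one, the assignments nest into a hierarchy, so one genome carries a cluster label at every level.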

    Table S3 from Enterobase: hierarchical clustering of 100,000s of bacterial genomes into species/sub-species and populations


    Figure S1 from Enterobase: hierarchical clustering of 100,000s of bacterial genomes into species/sub-species and populations


    Supplementary Text from Enterobase: hierarchical clustering of 100,000s of bacterial genomes into species/sub-species and populations
