
    Original large DBs.

    DBs were assembled based on taxonomy as assigned at NCBI, aiming for a roughly even distribution of all genomes collected. After assignment to GTDB r202 species, small numbers of genomes were found to have been placed in a different large DB than the bulk of their species’ genomes; these misplaced genomes were excluded from further analysis and are not counted in the Species or Genomes columns. The two DBs limited to a single genus (Salmonella and Campylobacter) were not treated further because taxonomic outreach could not be studied, leaving 11 DBs studied herein, with totals at bottom. For these 11 DBs, the genomes treated and species composition are reported in S1 and S2 Files in S1 Data, respectively.
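The exclusion of misplaced genomes described in this caption can be illustrated with a minimal Python sketch. The function name, the input mappings and the toy genome and species IDs are all hypothetical; the tie-breaking rule (first DB encountered wins) is an assumption, since the source does not say how ties were handled.

```python
from collections import Counter, defaultdict


def split_misplaced(genome_db, genome_species):
    """Return (kept, excluded) lists of genome IDs.

    genome_db:      genome ID -> name of the large DB it was placed in (hypothetical input)
    genome_species: genome ID -> GTDB species assignment (hypothetical input)
    A genome is excluded when its DB differs from the DB holding the bulk
    (most genomes) of its species. Ties go to the DB seen first (assumption).
    """
    db_counts = defaultdict(Counter)
    for genome, db in genome_db.items():
        db_counts[genome_species[genome]][db] += 1

    kept, excluded = [], []
    for genome, db in genome_db.items():
        bulk_db = db_counts[genome_species[genome]].most_common(1)[0][0]
        (kept if db == bulk_db else excluded).append(genome)
    return kept, excluded


# Toy data: three genomes of one species, one of them placed in another DB.
genome_db = {
    "g1": "Enterobacteriaceae",
    "g2": "Enterobacteriaceae",
    "g3": "EnterobacteralesOther",
}
genome_species = {"g1": "sp_A", "g2": "sp_A", "g3": "sp_A"}
kept, excluded = split_misplaced(genome_db, genome_species)
print(kept, excluded)
```

In this toy case the single genome sitting outside the species' main DB is the one excluded.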

    Utility of higher taxa for GI detection.

    For each GI in the full set, the list of supporting genomes from the large DB was filtered by removing all genomes from the same species, or from the same genus, family or order. This simulates small species that may have no other genomes available from the same species, or the same genus, etc. For the three large DBs that are limited to a single family, all GIs are lost when omitting same-family genomes, so the GIs from these DBs were excluded from the denominator for the family-omission and order-omission treatments. GIs from the single-order DB EnterobacteralesOther were similarly excluded from the denominator of the order-omission treatment.
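A minimal Python sketch of the omission treatment follows. It assumes each genome carries a taxonomy dictionary keyed by rank; the function name and the placeholder taxon labels (o1, f1, and so on) are hypothetical and not taken from the study. Because taxonomy is hierarchical, omitting the same genus also removes same-species genomes, and likewise for higher ranks.

```python
def filter_support(source_taxonomy, supporters, omit_rank):
    """Keep supporting genomes that do NOT share the source genome's taxon at omit_rank.

    source_taxonomy: rank -> taxon for the GI's source genome
    supporters:      genome ID -> (rank -> taxon) for each supporting genome
    omit_rank:       one of "species", "genus", "family", "order"
    """
    return [
        genome
        for genome, taxonomy in supporters.items()
        if taxonomy[omit_rank] != source_taxonomy[omit_rank]
    ]


# Toy taxonomy with placeholder labels (hypothetical).
source = {"order": "o1", "family": "f1", "genus": "g1", "species": "s1"}
supporters = {
    "a": {"order": "o1", "family": "f1", "genus": "g1", "species": "s1"},  # same species
    "b": {"order": "o1", "family": "f1", "genus": "g1", "species": "s2"},  # same genus
    "c": {"order": "o1", "family": "f1", "genus": "g2", "species": "s3"},  # same family
    "d": {"order": "o1", "family": "f2", "genus": "g3", "species": "s4"},  # same order
    "e": {"order": "o2", "family": "f3", "genus": "g4", "species": "s5"},  # outside the order
}

for rank in ("species", "genus", "family", "order"):
    remaining = filter_support(source, supporters, rank)
    # A GI counts as lost under this treatment when no supporters remain.
    print(rank, remaining, "lost" if not remaining else "retained")
```

A GI whose filtered list is empty would be counted as lost for that treatment, which is why single-family and single-order DBs drop out of the corresponding denominators.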

    S2 Text -

    Background: Genomic islands (GIs) are mobile genetic elements that integrate site-specifically into bacterial chromosomes, bearing genes that affect phenotypes such as pathogenicity and metabolism. GIs typically occur sporadically among related bacterial strains, enabling comparative genomic approaches to GI identification. For a candidate GI in a query genome, the number of reference genomes with a precise deletion of the GI serves as a support value for the GI. Our comparative software for GI identification was slowed by our original use of large reference genome databases (DBs). Here we explore smaller species-focused DBs. Results: With increasing DB size, recovery of our reliable prophage GI calls reached a plateau, while recovery of less reliable GI calls (FPs) increased rapidly as DB sizes exceeded ~500 genomes; i.e., overlarge DBs can increase FP rates. Paradoxically, relative to prophages, FPs were both more frequently supported only by genomes outside the species and more frequently supported only by genomes inside the species; this may be due to their generally lower support values. Setting a DB size limit for our SMAll Ranked Tailored (SMART) DB design sped up runtime ~65-fold. Strictly intra-species DBs would tend to lower yields of prophages for small species (with few genomes available); simulations with large species showed that this could be partially overcome by reaching outside the species to closely related taxa, without an FP burden. Employing such taxonomic outreach in DB design generated redundancy in the DB set; as few as 2984 DBs were needed to cover all 47894 prokaryotic species. Conclusions: Runtime decreased dramatically with SMART DB design, with only minor losses of prophages. We also describe potential utility in other comparative genomics projects.

    Large species studied herein.

    The bacterial species representative tree from GTDB release 202 was pared to the 29 large species of interest using our wrapper (pare_tree_gtdb) for PareTree (http://emmahodcroft.com/PareTree.html) and visualized in FigTree (http://tree.bio.ed.ac.uk/software/figtree). GTDB has subsumed the standard class for Burkholderia (Betaproteobacteria) into the Gammaproteobacteria.

    Actual/possible support.

    Main panel C: For each GI (separately treating the three main types: Phage1, NonPI and Reject), there was a list of possible supporting genomes, i.e. all the genomes in the large DB used to evaluate it. The tree distance (substitutions per site) was taken from the GI’s source genome to all possible supporting genomes based on the multi-protein species trees of GTDB release 202; intra-species genome pairs always receive a distance score of zero. The possible support distance counts were placed into 50 bins from 0 to 3.7. TIGER also reports a list of actual supporting genomes for each GI, whose distances to the GI source genome were likewise taken. To aggregate data for all islands, the actual and possible support counts were summed (by type) in each bin. Finally, actual support totals were divided by possible support totals for each bin. Note the logarithmic y-axis. Panel B: For each genome pair in every large DB, the shared taxonomic rank was taken, and species distances tallied (middle panel, with counts for each rank’s trace normalized to the maximum count in the trace). Panel A: For each genome pair, Mash distances at or below the reliability threshold (0.2) were taken and binned by species distance; by tree distance 0.2, Mash distance has plateaued at its maximum, after which percentages of measurable genome pairs decrease until cutting off reporting after tree distance 0.5.
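The binning and ratio for panel C can be sketched in Python as follows. The 50 bins over the range 0 to 3.7 come from the caption; the function names, the handling of the upper edge (a distance of exactly 3.7 is clipped into the last bin) and the toy distances are assumptions. The input lists stand for distances already pooled over all GIs of one type, with actual supporters being a subset of possible supporters.

```python
N_BINS = 50      # from the caption
MAX_DIST = 3.7   # from the caption, substitutions per site


def bin_index(distance):
    """Map a tree distance to one of N_BINS equal-width bins over [0, MAX_DIST]."""
    index = int(distance / MAX_DIST * N_BINS)
    return min(index, N_BINS - 1)  # clip the upper edge into the last bin (assumption)


def support_ratio(possible_distances, actual_distances):
    """Per-bin actual/possible support; None where a bin has no possible supporters."""
    possible = [0] * N_BINS
    actual = [0] * N_BINS
    for d in possible_distances:
        possible[bin_index(d)] += 1
    for d in actual_distances:
        actual[bin_index(d)] += 1
    return [a / p if p else None for a, p in zip(actual, possible)]


# Toy distances for one GI type (hypothetical values).
possible_distances = [0.0, 0.0, 0.0, 0.0, 0.1, 0.1, 3.7]
actual_distances = [0.0, 0.0, 0.1]
ratios = support_ratio(possible_distances, actual_distances)
print(ratios[0], ratios[1], ratios[2], ratios[49])
```

Empty bins are reported as None rather than zero so that they can be skipped when plotting on a logarithmic axis.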

    SMART DB sets for two GTDB releases.

    Submaximal DBs were those containing fewer genomes than the maximum. Taxonomic filling was stopped at the rank of order.
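One way to picture taxonomic filling is the hedged Python sketch below: a DB starts with genomes of the target species and, while below a size limit, reaches out to the same genus, then family, then order, and never beyond. This is an illustration only, not the SMART DB pipeline itself; the function name, the max_size parameter, the use of input order as the within-rank preference and the placeholder taxon labels are all assumptions.

```python
FILL_RANKS = ("species", "genus", "family", "order")  # filling stops at order


def fill_db(target, genomes, max_size):
    """Choose up to max_size genome IDs for the target species' DB.

    target:   rank -> taxon for the target species
    genomes:  list of (genome ID, rank -> taxon), in preference order (assumption)
    max_size: DB size limit (hypothetical parameter)
    """
    chosen, seen = [], set()
    for rank in FILL_RANKS:
        for genome_id, taxonomy in genomes:
            if len(chosen) >= max_size:
                return chosen
            if genome_id not in seen and taxonomy[rank] == target[rank]:
                chosen.append(genome_id)
                seen.add(genome_id)
    return chosen


# Toy candidates with placeholder taxon labels (hypothetical).
target = {"species": "s1", "genus": "g1", "family": "f1", "order": "o1"}
genomes = [
    ("a", {"species": "s1", "genus": "g1", "family": "f1", "order": "o1"}),
    ("b", {"species": "s2", "genus": "g1", "family": "f1", "order": "o1"}),
    ("c", {"species": "s3", "genus": "g2", "family": "f1", "order": "o1"}),
    ("d", {"species": "s4", "genus": "g3", "family": "f2", "order": "o1"}),
    ("e", {"species": "s5", "genus": "g4", "family": "f3", "order": "o2"}),
]
full_db = fill_db(target, genomes, max_size=3)
submaximal_db = fill_db(target, genomes, max_size=10)
print(full_db, submaximal_db)
```

With a limit of 10 the toy DB stays submaximal: only four genomes fall within the target's order, and the genome from another order is never added.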

    S1 Text -


    SMART DB software pipeline.

    As described in Methods, the Design mode (blue) operates on an initial GTDB release or update, collects needed genomes, and designs and builds the DB set. The Quick Update mode starts with a precalculated DB design file (and an optional list of desired species) and builds DBs. Scripts employed at each step (and a potential manual genome collection phase) are in parentheses.

    E. flexneri GI calls recovered with various DB sizes.

    The GI identification program TIGER was run on 9932 E. flexneri query genomes using the large reference DB Enterobacteriaceae, which contained a set (“All”) of 9089 reference E. flexneri genomes. GI calls were typed, and those calls that either had no support from All or were in tandem arrays were discarded. DBs of various sizes that were subsets of All were designed using the random or ranked protocols (Methods). The average count of GIs recovered per query genome (supported by at least one genome in the test DB) was taken for each GI type. Here, the PhageFil lines are obscured by the ICE1 lines.
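The recovery measure in this caption can be sketched in Python: a GI call counts as recovered by a test DB when at least one of its supporting genomes is in that DB, and counts are averaged over query genomes for each GI type. The function name, the input layout and the toy reference IDs are hypothetical; how the random and ranked subsets were drawn is described in Methods and is not reproduced here.

```python
from collections import defaultdict


def mean_recovered(gi_calls, test_db, n_queries):
    """Average number of recovered GI calls per query genome, by GI type.

    gi_calls:  iterable of (GI type, set of supporting reference genome IDs from All)
    test_db:   reference genome IDs in the test DB (a subset of All)
    n_queries: number of query genomes
    """
    test_db = set(test_db)
    counts = defaultdict(int)
    for gi_type, supporters in gi_calls:
        if test_db & set(supporters):  # supported by at least one genome in the test DB
            counts[gi_type] += 1
    return {gi_type: count / n_queries for gi_type, count in counts.items()}


# Toy calls pooled over two query genomes (hypothetical IDs).
gi_calls = [
    ("Phage1", {"r1", "r2"}),
    ("Phage1", {"r3"}),
    ("Reject", {"r9"}),
]
averages = mean_recovered(gi_calls, test_db=["r1", "r3"], n_queries=2)
print(averages)
```

Types with no recovered calls are simply absent from the result, so a lookup such as averages.get("Reject", 0.0) is the safe way to read them.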

    S1 Data -
