24 research outputs found

    ML tree estimated by MEGA 5.

    <p>The <i>predicted</i> phenotypes are indicated by the text color of the taxa: Red = pathogenic, Blue = commensal. The ancestral phenotypes are indicated by P (pathogenic) or C (commensal) at the internal nodes. Branches along which the phenotype changed are indicated in magenta. Numbers above some branches indicate the branch number given in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0090490#pone.0090490.s007" target="_blank">Table S7</a>. The dashed line indicates the virtual rooting with <i>Escherichia fergusonii</i>.</p>

    SNP-Associations and Phenotype Predictions from Hundreds of Microbial Genomes without Genome Alignments

    <div><p>SNP-association studies are a starting point for identifying genes that may be responsible for specific phenotypes, such as disease traits. The vast bulk of tools for SNP-association studies are directed toward SNPs in the human genome, and I am unaware of any tools designed specifically for such studies in bacterial or viral genomes. The PPFS (Predict Phenotypes From SNPs) package described here is an add-on to <b><i>kSNP</i></b>, a program that can identify SNPs in a data set of hundreds of microbial genomes. PPFS identifies those SNPs that are non-randomly associated with a phenotype based on the χ<sup>2</sup> probability, then uses those diagnostic SNPs for two distinct, but related, purposes: (1) to predict the phenotypes of strains whose phenotypes are unknown, and (2) to identify those diagnostic SNPs that are most likely to be causally related to the phenotype. In the example illustrated here, from a set of 68 <i>E. coli</i> genomes, for 67 of which the pathogenicity phenotype was known, there were 418,500 SNPs. Using the phenotypes of 36 of those strains, PPFS identified 207 diagnostic SNPs. The diagnostic SNPs predicted the phenotypes of all of the genomes with 97% accuracy. PPFS then identified 97 SNPs whose probability of being causally related to the pathogenic phenotype was >0.999. In a second example, from a set of 116 <i>E. coli</i> genome sequences, using the phenotypes of 65 strains, PPFS identified 101 SNPs that predicted the source host (human or non-human) with 90% accuracy.</p></div>
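The abstract above says diagnostic SNPs are selected by their χ2 probability of association with a phenotype. As a minimal sketch of that kind of test, and not the PPFS implementation itself, the following computes a Pearson χ2 statistic and its p-value for one SNP. It assumes a biallelic SNP and a binary phenotype (a 2x2 table, 1 degree of freedom); the function name `chi2_2x2` and the input layout are hypothetical.

```python
import math
from collections import Counter


def chi2_2x2(alleles, phenotypes):
    """Pearson chi-square statistic and p-value (1 d.f.) for one SNP.

    Assumption: exactly two alleles and two phenotype classes are present.
    alleles[i] and phenotypes[i] belong to the same strain.
    """
    a_vals = sorted(set(alleles))
    p_vals = sorted(set(phenotypes))
    if len(a_vals) != 2 or len(p_vals) != 2:
        raise ValueError("sketch handles exactly two alleles and two phenotypes")

    counts = Counter(zip(alleles, phenotypes))  # observed 2x2 table
    n = len(alleles)
    row = {a: sum(counts[(a, p)] for p in p_vals) for a in a_vals}
    col = {p: sum(counts[(a, p)] for a in a_vals) for p in p_vals}

    chi2 = 0.0
    for a in a_vals:
        for p in p_vals:
            expected = row[a] * col[p] / n  # count expected under independence
            chi2 += (counts[(a, p)] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X >= chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

A SNP whose p-value falls below a chosen threshold would count as non-randomly associated with the phenotype; the threshold, and any correction for testing many SNPs at once, are left to the reader.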

    When Whole-Genome Alignments Just Won't Work: kSNP v2 Software for Alignment-Free SNP Discovery and Phylogenetics of Hundreds of Microbial Genomes

    <div><p>Effective use of rapid and inexpensive whole genome sequencing for microbes requires fast, memory efficient bioinformatics tools for sequence comparison. The kSNP v2 software finds single nucleotide polymorphisms (SNPs) in whole genome data. kSNP v2 has numerous improvements over kSNP v1 including SNP gene annotation; better scaling for draft genomes available as assembled contigs or raw, unassembled reads; a tool to identify the optimal value of k; distribution of packages of executables for Linux and Mac OS X for ease of installation and user-friendly use; and a detailed User Guide. SNP discovery is based on k-mer analysis, and requires no multiple sequence alignment or the selection of a single reference genome. Most target sets with hundreds of genomes complete in minutes to hours. SNP phylogenies are built by maximum likelihood, parsimony, and distance, based on all SNPs, only core SNPs, or SNPs present in some intermediate user-specified fraction of targets. The SNP-based trees that result are consistent with known taxonomy. kSNP v2 can handle many gigabases of sequence in a single run, and if one or more annotated genomes are included in the target set, SNPs are annotated with protein coding and other information (UTRs, etc.) from Genbank file(s). We demonstrate application of kSNP v2 on sets of viral and bacterial genomes, and discuss in detail analysis of a set of 68 finished <i>E. coli</i> and <i>Shigella</i> genomes and a set of the same genomes to which have been added 47 assemblies and four “raw read” genomes of O104:H4 strains from the recent European <i>E. coli</i> outbreak that resulted in both bloody diarrhea and hemolytic uremic syndrome (HUS), and caused at least 50 deaths.</p></div>
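The abstract above describes SNP discovery from k-mers without alignment or a reference genome. The toy sketch below illustrates one way such a scheme can work, under assumptions that are not taken from the abstract: k is odd, a SNP locus is a k-mer whose flanking bases are identical across genomes while the central base differs, and reverse complements are ignored. The function name `kmer_snps` and the input layout are hypothetical, and this is not the kSNP code.

```python
from collections import defaultdict


def kmer_snps(genomes, k):
    """Return {(left_flank, right_flank): {genome_name: central_base}}
    for flanking contexts whose central base varies between genomes.

    Assumption: genomes maps a name to a single nucleotide string.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so that there is a central base")
    half = k // 2

    # context -> genome name -> set of central bases seen in that genome
    contexts = defaultdict(lambda: defaultdict(set))
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            context = (kmer[:half], kmer[half + 1:])
            contexts[context][name].add(kmer[half])

    snps = {}
    for context, per_genome in contexts.items():
        # Skip contexts that are ambiguous (several central bases) within one genome.
        if any(len(bases) != 1 for bases in per_genome.values()):
            continue
        alleles = {name: next(iter(bases)) for name, bases in per_genome.items()}
        # Keep the locus only if at least two genomes disagree on the central base.
        if len(set(alleles.values())) > 1:
            snps[context] = alleles
    return snps
```

Because every genome is scanned independently and loci are matched by their flanking sequence alone, no genome has to serve as a reference in this sketch.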

    Minimum Spanning Tree of 68 finished <i>E. coli</i> genomes. Nodes are colored according to pathogenicity phenotype.


    kSNP v2 timings for some examples.

    <p><sup>1</sup> kSNP was run at the optimum setting of k as determined by <b>Kchooser</b>. See <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0081760#pone-0081760-t001" target="_blank">Table 1</a>.</p><p><sup>2</sup> Linux cluster: Linux OS TOSS 2.0, 2.8 GHz Xeon EP X5660 processor, 12 cores, 48 GB RAM.</p><p><sup>3</sup> iMac Desktop: OS X 7.5.3, 3.4 GHz Intel Core i7 processor, 4 cores, 16 GB RAM.</p><p><sup>4</sup> Example 1 data set (provided with kSNP) consists of 11 equine encephalitis virus finished genomes.</p><p><sup>5</sup> Example 2 data set provided with kSNP consists of 7 finished, 5 assembled and 2 raw read <i>E. coli</i> genomes.</p>

    Maximum Likelihood tree of 119 <i>E. coli</i> strains.

    <p>Tree is shown in the rectangular cladogram format, but readers are reminded that this is an unrooted tree. Genomes consisting of raw reads are labeled in blue. Numbers at the internal nodes indicate the number of alleles that are shared exclusively by the descendants of each node. Zeros are not shown. Numbers in parentheses following the genome names are the number of alleles exclusive to that genome. Node A, leading to the 2011-12 European outbreak strains, and nodes B and C, also leading to particularly pathogenic strains, are discussed in the text.</p>

    kSNP efficiency vs mean branch lengths of true trees from simulated data sets.


    Diagram of the kSNP v2 process.


    Optimum values of k for the examples in Table 2.

    <p><sup>1</sup> Example 1 data set (provided with kSNP) consists of 11 equine encephalitis virus finished genomes.</p><p><sup>2</sup> Example 2 data set provided with kSNP consists of 7 finished, 5 assembled and 2 raw read <i>E. coli</i> genomes.</p>

    Maximum Likelihood tree of O104:H4 <i>E. coli</i> strains.

    <p>Tree is shown in the rectangular cladogram format and has been rooted with the outgroup consisting of two commensal strains (labeled in magenta). Genomes consisting of raw reads are labeled in blue. Colored dots indicate country of origin where known. Numbers at the internal nodes indicate the number of alleles that are shared exclusively by the descendants of each node. Zeros are not shown. Numbers in parentheses following the genome names are the number of alleles exclusive to that genome.</p>