13 research outputs found
Random Mutations.
<p>This figure shows how a one-million-base-pair DNA sequence responds to random mutations. Euclidean distance from the initial sequence is plotted for tetranucleotide (A) and heptanucleotide (B) signatures versus iteration number. <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0067337#pone-0067337-g007" target="_blank">Figure 7C</a> shows tetranucleotide versus heptanucleotide Euclidean distance by iteration, with a 1∶1 line (red) to show equivalence. These plots show that heptanucleotide signatures increase faster in Euclidean distance in response to small changes in the DNA sequence than tetranucleotide signatures do, but level off and respond little to changes beyond approximately 600,000 iterations. Conversely, tetranucleotide signatures show smaller increases in Euclidean distance in response to small perturbations in the DNA sequence, but continue to fluctuate up to one million iterations.</p>
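The mutation experiment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the signature is taken to be the relative frequency of every k-letter word, each iteration is taken to be a single random base substitution, and the names `kmer_signature`, `euclidean` and `mutation_drift` are hypothetical. The original study may have counted words or applied mutations differently.

```python
import itertools
import math
import random


def kmer_signature(seq, k):
    """Return relative frequencies of all 4**k words, in a fixed word order."""
    words = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows containing non-ACGT characters are skipped
            counts[word] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on very short input
    return [counts[w] / total for w in words]


def euclidean(a, b):
    """Euclidean distance between two equal-length signatures."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mutation_drift(seq, k, iterations, step, seed=0):
    """Apply random point substitutions (an assumed mutation model) and record
    the distance from the initial signature every `step` iterations."""
    rng = random.Random(seed)
    original = kmer_signature(seq, k)
    current = list(seq)
    trace = []
    for i in range(1, iterations + 1):
        pos = rng.randrange(len(current))
        # Always substitute a base that differs from the one currently there.
        current[pos] = rng.choice([b for b in "ACGT" if b != current[pos]])
        if i % step == 0:
            signature = kmer_signature("".join(current), k)
            trace.append((i, euclidean(original, signature)))
    return trace
```

Calling `mutation_drift(sequence, 4, ...)` and `mutation_drift(sequence, 7, ...)` on the same sequence and seed gives two traces that could be plotted against iteration number, in the spirit of panels A and B.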
Oligonucleotide vs. 16S rRNA Comparisons.
<p>The ability to place phylogenetically similar organisms together on a cladogram using mononucleotide through nonanucleotide signatures was tested against a cladogram generated using 16S rRNA for 1,424 completed prokaryotic genomes. This figure shows the percentage of correct cladogram placement for oligonucleotide signatures (x-axis) versus the percentage of correct cladogram placement for 16S rRNA (y-axis). Taxonomic level is shown along the top axis as: same species (S), same genus (G), same family (F), same phylum (P) and same domain (D). Mononucleotide through nonanucleotide signature trend lines are color-coded (see figure legend).</p>
Leave-one-out Histograms.
<p>Histograms show the results of a leave-one-out analysis in which the oligonucleotide-based Euclidean distance was calculated between all organisms (excluding self-comparisons). The percentage of organism matches with identical taxonomy was binned by genus-normalized Euclidean distance for tetranucleotide (A) and heptanucleotide (B) signatures. Plots are colored by the highest taxonomic level shared by the two organisms being compared: same species (orange), same genus (purple), same family (green), same order (red), same phylum (blue), same domain (yellow) and different domain (black). These plots are useful for estimating the likelihood of a taxonomic match between unknown sequences: once the Euclidean distance between two unknown sequences has been calculated, the binned percentages indicate how likely a match is at each taxonomic level.</p>
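A minimal sketch of the binning step behind such histograms follows. It assumes each organism has a precomputed signature vector and a lineage tuple ordered from domain down to species. The function names, the rank list and the fixed `bin_width` are illustrative choices, and the genus normalization of distances mentioned in the caption is not reproduced here.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

# Ranks ordered from broadest to most specific (matching the legend above).
RANKS = ["domain", "phylum", "order", "family", "genus", "species"]


def highest_shared_rank(lineage_a, lineage_b):
    """Return the most specific rank at which two lineages still agree."""
    shared = "different domain"
    for rank, a, b in zip(RANKS, lineage_a, lineage_b):
        if a != b:
            break
        shared = rank
    return shared


def binned_rank_percentages(signatures, lineages, bin_width):
    """Bin every pair of organisms by Euclidean distance and report, per bin,
    the percentage of pairs sharing each highest taxonomic rank."""
    bins = defaultdict(Counter)
    for a, b in combinations(sorted(signatures), 2):  # no self-comparisons
        distance = math.dist(signatures[a], signatures[b])
        rank = highest_shared_rank(lineages[a], lineages[b])
        bins[int(distance // bin_width)][rank] += 1
    return {
        index: {rank: 100.0 * n / sum(counter.values()) for rank, n in counter.items()}
        for index, counter in sorted(bins.items())
    }
```

Given the distance between two unknown sequences, the percentages in the matching bin can then be read as a rough estimate of how often pairs at that distance share each taxonomic level.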
Resolving Prokaryotic Taxonomy without rRNA: Longer Oligonucleotide Word Lengths Improve Genome and Metagenome Taxonomic Classification
<div><p>Oligonucleotide signatures, especially tetranucleotide signatures, have been used as a method for homology binning by exploiting an organism’s inherent biases towards the use of specific oligonucleotide words. Tetranucleotide signatures have been especially useful in environmental metagenomics samples, as many of these samples contain organisms from poorly classified phyla which cannot be easily identified using traditional homology methods, including NCBI BLAST. This study examines oligonucleotide signatures across 1,424 completed genomes from across the tree of life, substantially expanding upon previous work. A comprehensive analysis of mononucleotide through nonanucleotide word lengths suggests that longer word lengths substantially improve the classification of DNA fragments across a range of sizes relevant to high-throughput sequencing. We find that, at present, heptanucleotide signatures represent an optimal balance between prediction accuracy and computational time for resolving taxonomy using both genomic and metagenomic fragments. We directly compare the ability of tetranucleotide and heptanucleotide word lengths to taxonomically bin metagenome reads (tetranucleotide signatures are the current standard for oligonucleotide word usage analyses). We present evidence that heptanucleotide word lengths consistently provide more taxonomic resolving power, particularly in distinguishing between closely related organisms that are often present in metagenomic samples. This implies that longer oligonucleotide word lengths should replace tetranucleotide signatures for most analyses. Finally, we show that the application of longer word lengths to metagenomic datasets leads to more accurate taxonomic binning of DNA scaffolds and has the potential to substantially improve taxonomic assignment and assembly of metagenomic data.</p></div>
Heptanucleotide Signature Based Cladogram.
<p>Cladogram derived from heptanucleotide signatures using Euclidean distances between 1,424 sequenced microbes. Terminal branches are color-coded to depict nearest-neighbor taxonomic relationships: strong relationships (same species or same genus) in red, good relationships (phylum or better) in blue, same domain in yellow and different domain in black. This figure shows that heptanucleotide signatures are conserved amongst phylogenetically similar organisms across the tree of life. The tendency for phylogenetically similar organisms to maintain similar oligonucleotide biases is the basis of oligonucleotide-based clustering techniques.</p>
Metagenomic Sized Fragments.
<p>Completed prokaryotic genomes were broken into metagenomically relevant fragment sizes (1,000 bp, 2,500 bp, 5,000 bp, 10,000 bp, 15,000 bp, 25,000 bp and 50,000 bp) by extracting a random fragment of each length from each of the 1,424 genomes. The tetranucleotide- and heptanucleotide-based Euclidean distances were calculated between all fragments, and these distances were used to construct cladograms. Each cladogram was analyzed for the percentage of organisms with a nearest neighbor belonging to the same genus, and this percentage is plotted versus fragment length. Improvement is seen as fragment length increases, but it levels off at approximately 10,000 bp for tetranucleotide signatures and approximately 5,000 bp for heptanucleotide signatures, with heptanucleotide signatures performing better at all fragment lengths.</p>
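The fragment test described above can be approximated as follows. This sketch assumes a simplification: instead of building a cladogram, it takes the nearest neighbor directly from Euclidean distances between precomputed signatures, which may not match a tree-based nearest neighbor. `random_fragment` and `percent_same_genus_nearest` are hypothetical helper names.

```python
import math
import random


def random_fragment(genome, length, rng):
    """Return one random substring of `length` bp (the whole genome if shorter)."""
    if len(genome) <= length:
        return genome
    start = rng.randrange(len(genome) - length + 1)
    return genome[start:start + length]


def percent_same_genus_nearest(signatures, genus_of):
    """Percentage of entries whose closest other entry (by Euclidean distance)
    belongs to the same genus. Requires at least two entries."""
    names = list(signatures)
    hits = 0
    for name in names:
        nearest = min(
            (other for other in names if other != name),
            key=lambda other: math.dist(signatures[name], signatures[other]),
        )
        hits += genus_of[nearest] == genus_of[name]
    return 100.0 * hits / len(names)
```

Repeating the calculation for each fragment length, with signatures computed from the fragments rather than the whole genomes, would give one percentage per length to plot.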
BPEG binning and consensus genome statistics.
*<p>Read assignment. <sup>1</sup>With the exception of homology binning information, all other statistics shown use phylum-level tetranucleotide binning data.</p>
Total counts of genes associated with carbon-fixation via five cycles/pathways.
<p>Shown are genes associated with the reductive tricarboxylic acid cycle (rTCA), including citryl-CoA synthase/lyase and pyruvate ferredoxin oxidoreductase (<i>oorABCD</i>). The Calvin cycle (CBB) is represented by the gene for ribulose-1,5-bisphosphate carboxylase, and the reductive acetyl-CoA pathway (rACP) is estimated by CO dehydrogenase. Malonate semialdehyde reductase and 4-hydroxybutyryl-CoA dehydratase are used as proxies for the 3-hydroxypropionate cycle (3-HP) and the 3-hydroxypropionate/4-hydroxybutyrate cycle (3-4HP), respectively. All columns are normalized to the smallest total dataset.</p>
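One plausible reading of "normalized to the smallest total dataset" is that each dataset's gene counts are scaled by the ratio of the smallest dataset size to that dataset's own size. The sketch below illustrates that reading only. The exact normalization used for the table is not stated here, and the function name, the sample names in the usage note and the notion of "dataset size" (for example, total reads or total genes) are assumptions.

```python
def normalize_to_smallest(gene_counts, dataset_sizes):
    """Scale each dataset's counts as if every dataset had the size of the
    smallest one (an assumed interpretation of the normalization)."""
    smallest = min(dataset_sizes.values())
    return {
        sample: {
            pathway: count * smallest / dataset_sizes[sample]
            for pathway, count in counts.items()
        }
        for sample, counts in gene_counts.items()
    }
```

For example, with hypothetical samples `site_1` (size 1,000, 10 CBB genes) and `site_2` (size 3,000, 30 CBB genes), both normalize to 10 CBB genes, so the columns become directly comparable.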
Selected geochemical trends moving downstream at BP.
<p>Top and middle: chloride and oxygen isotope (of water) measurements, respectively, with calculated evaporation trendlines superimposed on the data; the slopes of the lines are set by the extent of evaporation required to account for the temperature decrease. Bottom: dissolved oxygen concentrations, representing redox processes in BP. All plots show the chemosynthetic (far right), transition “fringe” (grey bar), and photosynthetic (far left) zones.</p>