
    GOS-Only Clusters Are Enriched for Sequences of Viral Origin Independently of the Kingdom Assignment Method Employed

    <p>For each panel, clusters are as in <a href="http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050016#pbio-0050016-g004" target="_blank">Figure 4</a>. For (A–C), a kingdom is assigned to each neighboring ORF within each cluster set; the percentage of all neighboring ORFs with a given kingdom assignment is plotted. For (D–F), a kingdom is assigned to each cluster if more than 50% of all that cluster's neighbors with a kingdom assignment share the same assignment; the percentage of clusters in each set with a given assignment is plotted. In (A) and (D), a kingdom is assigned to a neighboring ORF by a majority vote of the top four BLAST matches to a protein in NCBI-nr (<a href="http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050016#s3" target="_blank">Materials and Methods</a>). In (B) and (E), a kingdom is assigned if all eight highest-scoring BLAST matches agree in kingdom. In (C) and (F), all ORFs on a scaffold are assigned the same kingdom by voting among all ORFs with BLAST matches to NCBI-nr on that scaffold (<a href="http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050016#s3" target="_blank">Materials and Methods</a>). In all graphs, only clusters with at least one assignable neighbor are considered. When compared to the size-matched controls, in all cases the GOS-only clusters show enrichment for viral sequences.</p>
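The cluster-level rule used for panels (D–F) can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the use of `None` for a neighbor without a kingdom assignment, and the `"unassigned"` label for clusters where no kingdom exceeds 50% are assumptions made for illustration only.

```python
from collections import Counter


def assign_cluster_kingdom(neighbor_kingdoms):
    """Return the kingdom shared by more than 50% of the assigned neighbors.

    neighbor_kingdoms: list of kingdom labels, with None marking a neighbor
    that has no kingdom assignment (hypothetical encoding).
    Returns None when no neighbor is assigned or no kingdom exceeds 50%.
    """
    assigned = [k for k in neighbor_kingdoms if k is not None]
    if not assigned:
        return None
    kingdom, count = Counter(assigned).most_common(1)[0]
    # Strictly more than half of the assigned neighbors must agree.
    return kingdom if 2 * count > len(assigned) else None


def percent_clusters_by_kingdom(clusters):
    """Percentage of clusters per assignment within one cluster set.

    Only clusters with at least one assignable neighbor are counted.
    Clusters without a >50% kingdom are reported as "unassigned"
    (an assumption; the caption does not say how they are shown).
    """
    considered = [n for n in clusters if any(k is not None for k in n)]
    if not considered:
        return {}
    counts = Counter(assign_cluster_kingdom(n) or "unassigned" for n in considered)
    return {k: 100.0 * c / len(considered) for k, c in counts.items()}
```

Calling `percent_clusters_by_kingdom` once per cluster set (for example, a GOS-only set and its size-matched control) would give the bar heights that such a panel compares.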

    Enrichment in the GOS-Only Set of Clusters for Viral Neighbors

    <p>Cluster sets from left to right are: I, GOS-only clusters with detectable BLAST, HMM, or profile-profile homology (Group I); II, GOS-only clusters with no detectable homology (Group II); I-S, a sample from all clusters chosen to have the same size distribution as Group I; II-S, a sample from all clusters chosen to have the same size distribution as Group II; I-V, a subset of clusters in Group I containing sequences collected from the viral size fraction; II-V, a subset of clusters in Group II from the viral size fraction; and all clusters. Notice that although predominantly bacterial, GOS-only clusters are assigned as viral based on their neighbors more often than the size-matched samples and the set of all clusters.</p>
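The caption does not say how the size-matched samples (I-S, II-S) were drawn. One plausible way to build such a control, sketched below purely as an assumption, is to draw, for every cluster in the target group, one cluster of exactly the same size from the pool of all clusters, without replacement. The function name, the `pool` mapping, and the exact-size matching are hypothetical.

```python
import random
from collections import Counter, defaultdict


def size_matched_sample(target_sizes, pool, seed=0):
    """Draw cluster ids from `pool` whose sizes reproduce `target_sizes`.

    target_sizes: list of cluster sizes in the group to be matched.
    pool: dict mapping cluster id -> cluster size (all clusters).
    Sampling is without replacement within each size class.
    """
    rng = random.Random(seed)

    # Group the pool's cluster ids by their size.
    by_size = defaultdict(list)
    for cluster_id, size in pool.items():
        by_size[size].append(cluster_id)

    sample = []
    for size, needed in Counter(target_sizes).items():
        candidates = by_size.get(size, [])
        if len(candidates) < needed:
            # Exact matching is impossible; a real pipeline might bin sizes instead.
            raise ValueError(f"only {len(candidates)} clusters of size {size}, need {needed}")
        sample.extend(rng.sample(candidates, needed))
    return sample
```

Exact matching keeps the sketch simple; with sparse large sizes, matching within size bins would be a natural relaxation.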

    Log–Log Plots of Cluster Size Distributions

    <div><p>The <i>x</i>-axis is the logarithm of the cluster size <i>X</i> and the <i>y</i>-axis is the logarithm of the number of clusters of size at least <i>X</i>; logarithms are base 10.</p> <p>(A) Plot comparing the sizes of clusters produced by our clustering approach (red) to those of clusters produced by Pfams (green). The curves track each other quite well, with both of them having an inflection point around cluster size 2,500 (approximately 3.4 on the <i>x</i>-axis). Each sequence is assigned to the highest scoring Pfam that it matches. Two sequences that are assigned to the same Pfam can nevertheless be assigned to different clusters by the full-sequence–based clustering approach if they differ in the remaining portion. This is especially true for commonly occurring domains that are present in different multidomain proteins. Thus, there tends to be a larger number of big clusters in the Pfam approach as compared to the full-sequence–based approach. Hence, the green curve is above the red curve at the higher sizes.</p> <p>(B) Plot of the cluster size distributions for core sets (green) and for final clusters (red). Both curves have an inflection point around cluster size 2,500 (approximately 3.4 on the <i>x</i>-axis). Note that these plots give the cumulative distribution function (cdf), while the power law exponents reported in the text are for the number of clusters of size <i>X</i> (i.e., the probability density function [pdf]). The relationship between these exponents is β<sub>pdf</sub> = 1 + β<sub>cdf</sub>.</p></div>
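The relationship between the two exponents can be checked with a short derivation. This is a sketch under two assumptions that the caption does not state: that the number of clusters of size X follows a pure power law, and that the sum over sizes can be approximated by an integral. The symbols n(X), N(X), and C are introduced here for illustration.

```latex
\begin{align*}
n(X) &\approx C\,X^{-\beta_{\mathrm{pdf}}}, \qquad \beta_{\mathrm{pdf}} > 1 \\
N(X) = \sum_{x \ge X} n(x) &\approx \int_{X}^{\infty} C\,x^{-\beta_{\mathrm{pdf}}}\,dx
  = \frac{C}{\beta_{\mathrm{pdf}} - 1}\,X^{-(\beta_{\mathrm{pdf}} - 1)} \\
\Rightarrow\quad \beta_{\mathrm{cdf}} &= \beta_{\mathrm{pdf}} - 1
  \quad\Longleftrightarrow\quad \beta_{\mathrm{pdf}} = 1 + \beta_{\mathrm{cdf}}
\end{align*}
```

Here n(X) is the number of clusters of size X and N(X) is the number of clusters of size at least X, which is the quantity plotted on the y-axis.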

    Coverage of GOS-100 and Public-100 by Pfam and Relative Sizes of Pfam Families by Kingdom, Sorted by Size

    <p>The public-100 sequences are annotated using the NCBI taxonomy and the source public database annotations. GOS-100 sequences were given kingdom weights as described in <a href="http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050016#s3" target="_blank">Materials and Methods</a>. For each kingdom, the fraction of sequences with ≥1 Pfam match is shown, and the ten largest Pfam families are shown as discrete sections whose size is proportional to the number of matches between that family and GOS-100 or public-100 sequences. Pfam families smaller than the ten largest are binned together in each column's bottom section. Pfam covers public-100 better than GOS-100 in all kingdoms, with the greatest difference occurring in the viral kingdom, where 89.1% of public-100 viral sequences match a Pfam domain, while only 27.5% of GOS-100 viral sequences do.</p>