Novel GOS-Only Clusters Are More Interconnected Than a Size-Matched Sample of Clusters
<p>Red line, novel clusters; green line, size-matched sample; blue line (right axis),
log<sub>2</sub> ratio of the fraction of novel clusters recovered to the fraction of
sample clusters recovered.</p>
GOS-Only Clusters Are Enriched for Sequences of Viral Origin Independently of the Kingdom Assignment Method Employed
<p>For each panel, clusters are as in <a href="http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050016#pbio-0050016-g004" target="_blank">Figure 4</a>. For (A–C), a kingdom is assigned to each neighboring ORF within
each cluster set; the percentage of all neighboring ORFs with a given kingdom
assignment is plotted. For (D–F), a kingdom is assigned to each cluster if more than
50% of all that cluster's neighbors with a kingdom assignment share the same
assignment; the percentage of clusters in each set with a given assignment is plotted.
In (A) and (D), a kingdom is assigned to a neighboring ORF by a majority vote of the
top four BLAST matches to a protein in NCBI-nr (<a href="http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050016#s3" target="_blank">Materials and Methods</a>). In (B) and (E), a kingdom is assigned if all eight
highest-scoring BLAST matches agree in kingdom. In (C) and (F), all ORFs on a scaffold
are assigned the same kingdom by voting among all ORFs with BLAST matches to NCBI-nr
on that scaffold (<a href="http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050016#s3" target="_blank">Materials and Methods</a>). In all
graphs, only clusters with at least one assignable neighbor are considered. When
compared to the size-matched controls, in all cases the GOS-only clusters show
enrichment for viral sequences.</p>
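The three assignment rules described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, "majority vote" is assumed to mean a strict majority (ties give no assignment), and the unanimous rule is assumed to require that all eight matches exist.

```python
from collections import Counter


def majority_kingdom(hit_kingdoms, top_n=4):
    """Kingdom held by a strict majority of the top_n BLAST hits, else None.

    hit_kingdoms is assumed to be ordered from best to worst hit.
    """
    top = hit_kingdoms[:top_n]
    if not top:
        return None
    kingdom, votes = Counter(top).most_common(1)[0]
    # Assumption: a tie (e.g. 2 vs 2) leaves the ORF unassigned.
    return kingdom if votes > len(top) / 2 else None


def unanimous_kingdom(hit_kingdoms, top_n=8):
    """Kingdom shared by all top_n highest-scoring hits, else None."""
    top = hit_kingdoms[:top_n]
    # Assumption: an ORF with fewer than top_n hits is left unassigned.
    if len(top) < top_n:
        return None
    return top[0] if len(set(top)) == 1 else None


def cluster_kingdom(neighbor_kingdoms):
    """Kingdom shared by more than 50% of the assigned neighbors, else None.

    Neighbors without a kingdom assignment are passed as None and ignored.
    """
    assigned = [k for k in neighbor_kingdoms if k is not None]
    if not assigned:
        return None
    kingdom, votes = Counter(assigned).most_common(1)[0]
    return kingdom if votes > len(assigned) / 2 else None
```

A cluster whose neighbors are all unassigned returns `None`, which corresponds to the clusters excluded from the graphs for lacking an assignable neighbor.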
Enrichment in the GOS-Only Set of Clusters for Viral Neighbors
<p>Cluster sets from left to right are: I, GOS-only clusters with detectable BLAST, HMM,
or profile-profile homology (Group I); II, GOS-only clusters with no detectable
homology (Group II); I-S, a sample from all clusters chosen to have the same size
distribution as Group I; II-S, a sample from all clusters chosen to have the same size
distribution as Group II; I-V, a subset of clusters in Group I containing sequences
collected from the viral size fraction; II-V, a subset of clusters in Group II from
the viral size fraction; and all clusters. Notice that although predominantly
bacterial, GOS-only clusters are assigned as viral based on their neighbors more often
than the size-matched samples and the set of all clusters.</p>
Log–Log Plots of Cluster Size Distributions
<div><p>The <i>x</i>-axis is the logarithm of the cluster size <i>X</i> and
the <i>y</i>-axis is the logarithm of the number of clusters of size at
least <i>X</i>; logarithms are base 10.</p>
<p>(A) Plot comparing the sizes of clusters produced by our clustering approach (red) to
those of clusters produced by Pfams (green). The curves track each other quite well,
with both of them having an inflection point around cluster size 2,500 (approximately
3.4 on the <i>x</i>-axis). Each sequence is assigned to the highest scoring
Pfam that it matches. Two sequences that are assigned to the same Pfam can
nevertheless be assigned to different clusters by the full-sequence–based clustering
approach if they differ in the remaining portion. This is especially true for commonly
occurring domains that are present in different multidomain proteins. Thus, there
tends to be a larger number of big clusters in the Pfam approach as compared to the
full-sequence–based approach. Hence, the green curve is above the red curve at the
higher sizes.</p>
<p>(B) Plot of the cluster size distributions for core sets (green) and for final
clusters (red). Both curves have an inflection point around cluster size 2,500
(approximately 3.4 on the <i>x</i>-axis). Note that these plots give the
cumulative distribution function (cdf), while the power law exponents reported in the
text are for the number of clusters of size <i>X</i> (i.e., the probability
density function [pdf]). The relationship between these exponents is β<sub>pdf</sub> =
1 + β<sub>cdf</sub>.</p></div>
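The relationship between the two exponents can be checked with a short derivation. This sketch treats cluster size as a continuous variable and assumes β<sub>pdf</sub> > 1 so that the integral converges; the symbols n(X), N(X), and C are introduced here for illustration only.

```latex
% n(X): number of clusters of size X (pdf-like); N(X): number of size at least X (cdf-like)
\begin{align*}
n(X) &= C\,X^{-\beta_{\mathrm{pdf}}}, \qquad \beta_{\mathrm{pdf}} > 1,\\
N(X) &= \int_{X}^{\infty} n(s)\,ds
      = \frac{C}{\beta_{\mathrm{pdf}} - 1}\,X^{-(\beta_{\mathrm{pdf}} - 1)}
      \;\propto\; X^{-\beta_{\mathrm{cdf}}},\\
\beta_{\mathrm{cdf}} &= \beta_{\mathrm{pdf}} - 1
      \quad\Longleftrightarrow\quad
      \beta_{\mathrm{pdf}} = 1 + \beta_{\mathrm{cdf}}.
\end{align*}
```

On a log–log plot of the cumulative counts, the fitted slope therefore corresponds to −β<sub>cdf</sub>, one less in magnitude than the exponent for the number of clusters of size exactly <i>X</i>.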
Distribution of Average HMM Score Difference between GOS and Public (NCBI-nr, MG, TGI-EST, and ENS)
<p>Only matches to the full length of an HMM are considered, and only HMMs with at
least 100 matches to each of the GOS and public databases are included. This yields
1,686 HMMs whose average scores against GOS and public databases are compared. The mean
of the distribution is −50, showing that GOS sequences tend to score lower than
public sequences, reflecting the diversity of GOS sequences relative to those in public databases.</p>
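The filtering and differencing described in this caption might look like the sketch below. It is illustrative only: the function name and input layout are hypothetical, the scores are assumed to come from full-length matches already, and the difference is taken as GOS minus public, which is the sign convention implied by a negative mean indicating lower GOS scores.

```python
from statistics import mean


def score_differences(gos_scores, public_scores, min_matches=100):
    """Average HMM score difference (GOS minus public) per HMM.

    gos_scores and public_scores map an HMM name to a list of scores
    from full-length matches. Only HMMs with at least min_matches
    matches in BOTH sets are kept.
    """
    diffs = {}
    for hmm in gos_scores.keys() & public_scores.keys():
        gos, public = gos_scores[hmm], public_scores[hmm]
        if len(gos) >= min_matches and len(public) >= min_matches:
            diffs[hmm] = mean(gos) - mean(public)
    return diffs


# Toy data with made-up HMM names and scores, for illustration only.
toy_gos = {"hmm_a": [100.0] * 100, "hmm_b": [50.0] * 99}
toy_public = {"hmm_a": [150.0] * 100, "hmm_b": [60.0] * 100}
toy_diffs = score_differences(toy_gos, toy_public)
```

In the toy data, `hmm_b` is dropped because it has only 99 GOS matches, so `toy_diffs` contains a single entry for `hmm_a`.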
Coverage of GOS-100 and Public-100 by Pfam and Relative Sizes of Pfam Families by Kingdom, Sorted by Size
<p>The public-100 sequences are annotated using the NCBI taxonomy and the source public
database annotations. GOS-100 sequences were given kingdom weights as described in
<a href="http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050016#s3" target="_blank">Materials and Methods</a>. For each kingdom, the
fraction of sequences with ≥1 Pfam match is shown, while the ten largest Pfam
families are shown as discrete sections whose size is proportional to the number of
matches between that family and GOS-100 or public-100 sequences. Pfam families that
are smaller than the ten largest are binned together in each column's bottom section.
Pfam covers public-100 better than GOS-100 in all kingdoms, with the greatest
difference occurring in the viral kingdom, where 89.1% of public-100 viral sequences
match a Pfam domain, while only 27.5% of GOS-100 viral sequences do.</p>
Log–Log Plot of Slopes <i>m</i>(<i>d</i>) of Linear Regression Fit to the Rate of Growth in Figure 2 for Different Values of Cluster Size <i>d</i>
<p>According to the equation derived in the text, <i>m</i>(<i>d</i>) =
<i>md</i><sup>1−β</sup> for some constant <i>m</i>.
The best linear fit to log [<i>m</i>(<i>d</i>)] gives a line with
slope −0.91 (<i>R</i><sup>2</sup> = 0.98) that is close to the predicted
value 1 − β = −0.99.</p>
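A log–log fit of this kind can be sketched with ordinary least squares on the base-10 logarithms. The function name is hypothetical and the data below are synthetic, generated from the predicted form with exponent −0.99 and an arbitrary constant; they are not the values behind the figure.

```python
import math


def loglog_fit(ds, ms):
    """Least-squares line through (log10 d, log10 m(d)).

    Returns (slope, intercept, r_squared). If m(d) = m * d**(1 - beta),
    the slope estimates 1 - beta and the intercept estimates log10(m).
    """
    xs = [math.log10(d) for d in ds]
    ys = [math.log10(m) for m in ms]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Synthetic check: exact power law with exponent -0.99 (made-up constant 3.0).
sizes = [1, 2, 5, 10, 20, 50, 100]
slopes = [3.0 * d ** -0.99 for d in sizes]
fit_slope, fit_intercept, fit_r2 = loglog_fit(sizes, slopes)
```

On noise-free synthetic data the fit recovers the exponent essentially exactly; the reported slope of −0.91 against a predicted −0.99 reflects the scatter in the measured values.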
Structure and GOS Homologs of Hypothetical Protein AF1548
<p>Yellow bars represent β-strands. Highlighted are predicted catalytic residues: 38D,
51E, and 53K.</p>
Venn Diagram Showing Breakdown of the 17,067 Medium and Large Clusters by Three Categories—GOS, Known Prokaryotic, and Known Nonprokaryotic
Content of Protease Types in NCBI-nr and GOS, and Kingdom Distribution of All Proteases
<p>Due to the highly redundant nature of some NCBI-nr protease groups, nonredundant sets
for both NCBI-nr and GOS are computed; these nonredundant sets are referred to as
NCBI-nr60 and GOS60.</p>