20 research outputs found

Novel GOS-Only Clusters Are More Interconnected Than a Size-Matched Sample of Clusters

Red line, novel clusters; green line, size-matched sample; blue line (right axis), log2 ratio of fraction novel clusters recovered divided by fraction sample clusters recovered.

FigShare

GOS-Only Clusters Are Enriched for Sequences of Viral Origin Independently of the Kingdom Assignment Method Employed

For each panel, clusters are as in <a href="http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050016#pbio-0050016-g004" target="_blank">Figure 4</a>. For (A–C), a kingdom is assigned to each neighboring ORF within each cluster set; the percentage of all neighboring ORFs with a given kingdom assignment is plotted. For (D–F), a kingdom is assigned to each cluster if more than 50% of all that cluster's neighbors with a kingdom assignment share the same assignment; the percentage of clusters in each set with a given assignment is plotted. In (A) and (D), a kingdom is assigned to a neighboring ORF by a majority vote of the top four BLAST matches to a protein in NCBI-nr (<a href="http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050016#s3" target="_blank">Materials and Methods</a>). In (B) and (E), a kingdom is assigned if all eight highest-scoring BLAST matches agree in kingdom. In (C) and (F), all ORFs on a scaffold are assigned the same kingdom by voting among all ORFs with BLAST matches to NCBI-nr on that scaffold (<a href="http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050016#s3" target="_blank">Materials and Methods</a>). In all graphs, only clusters with at least one assignable neighbor are considered. When compared to the size-matched controls, in all cases the GOS-only clusters show enrichment for viral sequences.
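The voting rules in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, each ORF's BLAST hits are assumed to arrive as a list of kingdom labels sorted by descending score, "majority" is read as a strict majority, and an ORF with fewer than eight hits is left unassigned under the unanimity rule.

```python
from collections import Counter

def majority_vote(kingdoms, top_n=4):
    """Panels A/D: kingdom held by a strict majority of the top_n hits, else None.

    Assumption: ties (e.g. 2 vs 2) leave the ORF unassigned.
    """
    top = kingdoms[:top_n]
    if not top:
        return None
    kingdom, count = Counter(top).most_common(1)[0]
    return kingdom if count > len(top) / 2 else None

def unanimous_vote(kingdoms, top_n=8):
    """Panels B/E: assign only if all top_n highest-scoring hits agree.

    Assumption: fewer than top_n hits means no assignment.
    """
    top = kingdoms[:top_n]
    if len(top) < top_n:
        return None
    return top[0] if len(set(top)) == 1 else None

def assign_cluster(neighbor_kingdoms):
    """Panels D-F: cluster gets a kingdom if more than 50% of its
    assigned neighbors share it; unassigned neighbors (None) are ignored."""
    assigned = [k for k in neighbor_kingdoms if k is not None]
    if not assigned:
        return None  # clusters with no assignable neighbor are not considered
    kingdom, count = Counter(assigned).most_common(1)[0]
    return kingdom if count > 0.5 * len(assigned) else None
```

For example, `assign_cluster(["Viruses", None, "Viruses", "Bacteria"])` counts two of three assigned neighbors as viral and so labels the cluster viral.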

FigShare

Enrichment in the GOS-Only Set of Clusters for Viral Neighbors

Cluster sets from left to right are: I, GOS-only clusters with detectable BLAST, HMM, or profile-profile homology (Group I); II, GOS-only clusters with no detectable homology (Group II); I-S, a sample from all clusters chosen to have the same size distribution as Group I; II-S, a sample from all clusters chosen to have the same size distribution as Group II; I-V, a subset of clusters in Group I containing sequences collected from the viral size fraction; II-V, a subset of clusters in Group II from the viral size fraction; and all clusters. Notice that although predominantly bacterial, GOS-only clusters are assigned as viral based on their neighbors more often than the size-matched samples and the set of all clusters.

FigShare

Log–Log Plots of Cluster Size Distributions

The x-axis is the logarithm of the cluster size X and the y-axis is the logarithm of the number of clusters of size at least X; logarithms are base 10. (A) Plot comparing the sizes of clusters produced by our clustering approach (red) to those of clusters produced by Pfam (green). The curves track each other quite well, with both of them having an inflection point around cluster size 2,500 (approximately 3.4 on the x-axis). Each sequence is assigned to the highest-scoring Pfam that it matches. Two sequences that are assigned to the same Pfam can nevertheless be assigned to different clusters by the full-sequence–based clustering approach if they differ in the remaining portion. This is especially true for commonly occurring domains that are present in different multidomain proteins. Thus, there tends to be a larger number of big clusters in the Pfam approach than in the full-sequence–based approach, and the green curve lies above the red curve at the higher sizes. (B) Plot of the cluster size distributions for core sets (green) and for final clusters (red). Both curves have an inflection point around cluster size 2,500 (approximately 3.4 on the x-axis). Note that these plots give the cumulative distribution function (cdf), while the power law exponents reported in the text are for the number of clusters of size X (i.e., the probability density function [pdf]). The relationship between these exponents is β_pdf = 1 + β_cdf.
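The relationship between the two exponents follows from integrating the pdf. The short derivation below is a sketch under two simplifying assumptions not stated in the caption: cluster size is treated as continuous, and the distribution is a pure power law with exponent greater than 1. The symbols n(x) and N(X) are introduced here for illustration only.

```latex
\begin{align*}
n(x) &\propto x^{-\beta_{\mathrm{pdf}}}
  && \text{(number of clusters of size } x\text{)} \\
N(X) &= \int_X^{\infty} n(x)\,dx
  \;\propto\; X^{\,1-\beta_{\mathrm{pdf}}}
  && \text{(number of clusters of size at least } X\text{)} \\
N(X) &\propto X^{-\beta_{\mathrm{cdf}}}
  \;\Longrightarrow\;
  \beta_{\mathrm{pdf}} = 1 + \beta_{\mathrm{cdf}}
\end{align*}
```

On the log–log axes used here, the slope read off the cumulative curve is therefore one unit shallower in magnitude than the exponent reported for the number of clusters of size exactly X.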

FigShare

Distribution of Average HMM Score Difference between GOS and Public (NCBI-nr, MG, TGI-EST, and ENS)

Only matches to the full length of an HMM are considered, and only HMMs that have at least 100 matches to each of GOS and public databases are considered. This results in 1,686 HMMs whose average scores to GOS and public databases are compared. The mean of the distribution is −50, showing that GOS sequences tend to score lower than sequences in the public databases, reflecting their diversity relative to the public sequences.
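The filtering and averaging described in this caption might look like the sketch below. It is illustrative only: the function name, the input layout (a dict mapping an HMM identifier to a pair of score lists), and the toy scores are all hypothetical, and full-length matching is assumed to have been applied before the scores reach this function. The difference is taken as GOS minus public, which is the sign convention implied by a negative mean indicating lower GOS scores.

```python
from statistics import mean

def score_differences(scores_by_hmm, min_matches=100):
    """Per-HMM difference between mean GOS score and mean public score.

    scores_by_hmm: {hmm_id: (gos_scores, public_scores)} (hypothetical layout).
    Only HMMs with at least min_matches scores on BOTH sides are kept.
    """
    diffs = {}
    for hmm, (gos, public) in scores_by_hmm.items():
        if len(gos) >= min_matches and len(public) >= min_matches:
            diffs[hmm] = mean(gos) - mean(public)
    return diffs

# Toy data with a lowered threshold so the example stays small.
toy = {
    "HMM_A": ([100.0, 120.0], [150.0, 170.0]),  # kept: two matches on each side
    "HMM_B": ([90.0], [95.0, 99.0]),            # dropped: too few GOS matches
}
diffs = score_differences(toy, min_matches=2)
overall_mean = mean(diffs.values())  # mean of the per-HMM differences
```

The histogram in the figure would then be drawn over the values of `diffs`, with `overall_mean` corresponding to the reported mean of the distribution.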

FigShare

Coverage of GOS-100 and Public-100 by Pfam and Relative Sizes of Pfam Families by Kingdom, Sorted by Size

The public-100 sequences are annotated using the NCBI taxonomy and the source public database annotations. GOS-100 sequences were given kingdom weights as described in <a href="http://www.plosbiology.org/article/info:doi/10.1371/journal.pbio.0050016#s3" target="_blank">Materials and Methods</a>. For each kingdom, the fraction of sequences with ≥1 Pfam match is shown, with the ten largest Pfam families drawn as discrete sections whose size is proportional to the number of matches between that family and GOS-100 or public-100 sequences. Pfam families that are smaller than the ten largest are binned together in each column's bottom section. Pfam covers public-100 better than GOS-100 in all kingdoms, with the greatest difference occurring in the viral kingdom, where 89.1% of public-100 viral sequences match a Pfam domain, while only 27.5% of GOS-100 viral sequences do.

FigShare

Log–Log Plot of Slopes m(d) of Linear Regression Fit to the Rate of Growth in Figure 2 for Different Values of Cluster Size d

According to the equation derived in the text, m(d) = m·d^(1−β) for some constant m. The best linear fit to log [m(d)] gives a line with slope −0.91 (R² = 0.98) that is close to the predicted value 1 − β = −0.99.
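Taking logarithms of m(d) = m·d^(1−β) gives log m(d) = log m + (1 − β)·log d, so the slope of a straight-line fit on log–log axes estimates 1 − β. The sketch below shows such a fit with ordinary least squares in plain Python. The function name and the data are hypothetical: the points are synthetic, generated from an exact power law with β = 1.99 and an arbitrary constant, and are not the values behind the figure.

```python
import math

def loglog_slope(ds, ms):
    """Least-squares slope and R^2 of log10(m) against log10(d)."""
    xs = [math.log10(d) for d in ds]
    ys = [math.log10(m) for m in ms]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)  # squared correlation coefficient
    return slope, r_squared

# Synthetic, noise-free data: m(d) = 3.0 * d**(1 - beta) with beta = 1.99.
beta = 1.99
sizes = [1, 2, 5, 10, 20, 50]
slopes = [3.0 * d ** (1 - beta) for d in sizes]
fit_slope, fit_r2 = loglog_slope(sizes, slopes)
```

With noise-free input the fitted slope recovers 1 − β up to floating-point error; real data, as in the figure, would give a slope that only approximates the prediction.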

FigShare

Structure and GOS Homologs of Hypothetical Protein AF1548

Yellow bars represent β-strands. Highlighted are predicted catalytic residues: 38D, 51E, and 53K.

FigShare

Venn Diagram Showing Breakdown of the 17,067 Medium and Large Clusters by Three Categories—GOS, Known Prokaryotic, and Known Nonprokaryotic

FigShare

Content of Protease Types in NCBI-nr and GOS, and Kingdom Distribution of All Proteases

Due to the highly redundant nature of some NCBI-nr protease groups, nonredundant sets for both NCBI-nr and GOS are computed; these nonredundant sets are referred to as NCBI-nr60 and GOS60.

FigShare