44 research outputs found

    An Artificial Functional Family Filter in Homolog Searching in Next-generation Sequencing Metagenomics

    In functional metagenomics, BLAST homology search is commonly used to classify metagenomic reads into protein/domain sequence families such as Clusters of Orthologous Groups of proteins (COGs) in order to quantify the abundance of each COG in the community. The resulting functional profile of the community is then used in downstream analysis to correlate changes in abundance with environmental perturbation, clinical variation, and so on. However, the short read lengths produced by next-generation sequencing technologies pose a barrier to this approach, essentially because the significance of a similarity cannot be discerned when searching with short reads. Consequently, artificial functional families are produced, and those with a large number of assigned reads dramatically decrease the accuracy of the functional profile. No method has been available to address this problem, and this paper aims to fill that gap. We show that the BLAST similarity scores of homologues for short reads derived from the coding sequences of COG member proteins are distributed differently from the scores of reads derived elsewhere. We show that, by choosing an appropriate score cutoff, we can filter out most artificial families while preserving sufficient information to build the functional profile. We also show that, by applying BLAST and RPS-BLAST in combination, some artificial families with large read counts can be further identified after the score cutoff filtration. Evaluated on three experimental metagenomic datasets with different coverages, the proposed method is robust to read coverage and consistently outperforms the E-value cutoff methods currently used in the literature.
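The score cutoff filtration described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `functional_profile`, the `(read_id, cog_id, score)` tuple layout, the toy hits, and the cutoff of 50.0 are all assumptions made for illustration, and each read is assumed to carry a single best hit.

```python
from collections import Counter


def functional_profile(hits, score_cutoff):
    """Count reads per COG, keeping only hits whose score meets the cutoff.

    hits: iterable of (read_id, cog_id, score) tuples, assumed to hold
          one best hit per read (hypothetical input layout).
    score_cutoff: minimum similarity score for a hit to be counted.
    """
    profile = Counter()
    for _read_id, cog_id, score in hits:
        if score >= score_cutoff:
            profile[cog_id] += 1
    return profile


# Hypothetical toy hits, not data from the paper.
hits = [
    ("r1", "COG0001", 72.0),
    ("r2", "COG0001", 65.5),
    ("r3", "COG0002", 41.0),
    ("r4", "COG0003", 61.0),
    ("r5", "COG0002", 38.2),
]

# The cutoff value here is illustrative only.
profile = functional_profile(hits, score_cutoff=50.0)
```

In this toy input, every read assigned to `COG0002` scores below the cutoff, so that family drops out of the profile entirely, which is the intended fate of an artificial family.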

    The plot of the normalized penalty versus the score cutoff value.

    The bold dark green curve is for the simulated combined metagenome, and the other colored curves are for single genomes. On each curve, the filled black downward-pointing triangle marks the minimum normalized penalty.

    Comparison of p values obtained from normal and uniform approximations.

    The p values (negative base 10 logarithm) from the normal approximation are plotted against those from the uniform approximation. The red line is the identity line and the two blue lines represent the cut-off p value of 0.05 with Bonferroni correction. Panels (a) and (c) are for the two microarray datasets. Panels (b) and (d) are for the two RNA-Seq datasets.
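As a small worked example of where a Bonferroni-corrected cutoff falls on a negative base 10 logarithmic axis, the following Python sketch computes the threshold. The function name and the number of tests (1000) are hypothetical; the listing does not state how many tests were performed.

```python
import math


def bonferroni_neglog10_threshold(alpha, n_tests):
    """Return -log10 of the Bonferroni-corrected significance level alpha / n_tests."""
    return -math.log10(alpha / n_tests)


# Hypothetical number of tests; only alpha = 0.05 comes from the caption.
threshold = bonferroni_neglog10_threshold(alpha=0.05, n_tests=1000)

# A p value is significant after correction when its -log10 exceeds the threshold.
is_significant = -math.log10(1e-6) > threshold
```

Points lying above or to the right of such a line in the plot correspond to tests that remain significant after correction.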

    Influential artificial COGs in M3_2X detected by Step 1 and Step 2.

    Note: The two read-count columns are obtained before score filtration (Step 1) and after score filtration (Step 2).

    Influential artificial COGs identified in Step 2.


    Influential artificial COGs defined in the simulated ∼100 nt metagenome.


    Effect of local statistics on the comparison of the two approximation methods.

    The plots show the negative base 10 logarithm of the p value from the normal approximation versus that of the p value from the uniform approximation, using fold change (microarray) and log likelihood ratio (RNA-Seq) as local statistics. Panels (a) and (b) are for two different microarray datasets. Panels (c) and (d) are for the RNA-Seq datasets under the Poisson assumption. Panels (e) and (f) are for the RNA-Seq datasets under the Negative Binomial assumption.

    Empirical densities of similarity scores by BLAST and RPS-BLAST (left) and the normalized penalty plot by RPS-BLAST (right).

    The plot demonstrates that the penalty is minimized at a similarity score of 61.

    The plot of RRCs for the COGs with read counts at or above the 95th percentile.

    The RRC values for five influential artificial COGs range between 0 and 0.04. For the other COG families, the RRCs are farther away from 0, with only one RRC being less than 0.05.

    Influential artificial COGs in M3_1X detected by Step 1 and Step 2.

    Note: The two read-count columns are obtained before score filtration (Step 1) and after score filtration (Step 2).