21 research outputs found

    Report of the 13th Genomic Standards Consortium Meeting, Shenzhen, China, March 4–7, 2012

    This report details the outcome of the 13th Meeting of the Genomic Standards Consortium. The three-day conference was held at the Kingkey Palace Hotel, Shenzhen, China, on March 5–7, 2012, and was hosted by the Beijing Genomics Institute. The meeting, titled "From Genomes to Interactions to Communities to Models", highlighted the role of data standards associated with genomic, metagenomic, and amplicon sequence data and the contextual information associated with the sample. To this end, the meeting focused on genomic projects for animals, plants, fungi, and viruses; metagenomic studies in host-microbe interactions; and the dynamics of microbial communities. In addition, the meeting hosted a Genomic Observatories Network session, a Genomic Standards Consortium biodiversity working group session, and a Microbiology of the Built Environment session sponsored by the Alfred P. Sloan Foundation.

    A Platform-Independent Method for Detecting Errors in Metagenomic Sequencing Data: DRISEE

    We provide a novel method, DRISEE (duplicate read inferred sequencing error estimation), to assess sequencing quality (alternatively referred to as “noise” or “error”) within and/or between sequencing samples. DRISEE provides positional error estimates that can be used to inform read trimming within a sample. It also provides global (whole-sample) error estimates that can be used to identify samples with high or varying levels of sequencing error that may confound downstream analyses, particularly in studies that use data from multiple sequencing samples. For shotgun metagenomic data, we believe that DRISEE provides estimates of sequencing error that are more accurate and less constrained by technical limitations than existing methods that rely on reference genomes or on quality scores (e.g. Phred). Here, DRISEE is applied to non-amplicon data sets from both the 454 and Illumina platforms. The DRISEE error estimate is obtained by analyzing sets of artifactual duplicate reads (ADRs), a known by-product of both sequencing platforms. We present DRISEE as an open-source, platform-independent method to assess sequencing error in shotgun metagenomic data, and use it to discover previously uncharacterized error in de novo sequence data from the 454 and Illumina sequencing platforms.
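
    The idea of estimating positional error from duplicate reads can be illustrated with a minimal sketch. This is not the DRISEE implementation: it assumes that reads sharing an identical prefix are artifactual duplicates, it builds a simple majority consensus per position, and it counts substitutions only (no insertions or deletions). The function names and the `prefix_len` and `min_bin_size` parameters are hypothetical.

```python
from collections import Counter, defaultdict


def bin_by_prefix(reads, prefix_len=5, min_bin_size=3):
    """Group reads that share an identical prefix (assumed here to be duplicates)."""
    bins = defaultdict(list)
    for read in reads:
        if len(read) >= prefix_len:
            bins[read[:prefix_len]].append(read)
    # Keep only bins large enough to give a meaningful consensus.
    return [group for group in bins.values() if len(group) >= min_bin_size]


def positional_error(reads, prefix_len=5, min_bin_size=3):
    """Percent of bases that disagree with the bin consensus, per read position."""
    mismatches = Counter()
    totals = Counter()
    for group in bin_by_prefix(reads, prefix_len, min_bin_size):
        shared_length = min(len(read) for read in group)
        # Positions inside the prefix are identical by construction, so skip them.
        for pos in range(prefix_len, shared_length):
            column = [read[pos] for read in group]
            _, consensus_count = Counter(column).most_common(1)[0]
            mismatches[pos] += len(column) - consensus_count
            totals[pos] += len(column)
    return {pos: 100.0 * mismatches[pos] / totals[pos] for pos in sorted(totals)}
```

    Averaging the returned values over all positions would give a single whole-sample figure, in the spirit of the global estimate described above.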

    The M5nr: a novel non-redundant database containing protein sequences and annotations from multiple sources and associated tools

    Background: Computing of sequence similarity results is becoming a limiting factor in metagenome analysis. Sequence similarity search results encoded in an open, exchangeable format have the potential to limit the need for computational reanalysis of these data sets. A prerequisite for sharing of similarity results is a common reference. Description: We introduce a mechanism for automatically maintaining a comprehensive, non-redundant protein database and for creating a quarterly release of this resource. In addition, we present tools for translating similarity searches into many annotation namespaces, e.g. KEGG or NCBI's GenBank. Conclusions: The data and tools we present allow the creation of multiple result sets using a single computation, permitting computational results to be shared between groups for large sequence data sets.
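
    A non-redundant database with per-source annotations can be sketched as below. This is an illustration only, under the assumption that identical sequences are collapsed by keying them on an MD5 checksum of the normalized sequence; the abstract above does not specify the mechanism. The function names and the record layout `(source, accession, sequence)` are hypothetical.

```python
import hashlib
from collections import defaultdict


def build_nonredundant(records):
    """Collapse identical protein sequences from several sources.

    records: iterable of (source, accession, sequence) tuples.
    Returns (sequences, annotations), both keyed by sequence checksum.
    """
    sequences = {}
    annotations = defaultdict(dict)
    for source, accession, sequence in records:
        normalized = sequence.strip().upper()
        # Assumption: the checksum of the sequence serves as the shared identifier.
        key = hashlib.md5(normalized.encode("ascii")).hexdigest()
        sequences[key] = normalized
        annotations[key].setdefault(source, []).append(accession)
    return sequences, dict(annotations)


def translate_hits(hit_keys, annotations, namespace):
    """Map similarity-search hits (checksums) to accessions in one namespace."""
    return {key: annotations.get(key, {}).get(namespace, []) for key in hit_keys}
```

    With this layout a single similarity search against the checksum-keyed sequences can be reported in any of the stored namespaces without recomputation.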

    DRISEE error profiles for metagenomic sequencing data sets.

    Total (% substitutions + % insertions + % deletions) DRISEE error (Y-axis) as a function of read position (X-axis) for all considered reads. (a) and (b): Phred vs. DRISEE: Total DRISEE (red) and average Phred (blue) derived errors (Q values converted to percent error) for (a) 20 metagenomic 454 samples and (b) 12 metagenomic Illumina samples. (c): DRISEE total error of several Illumina-based sample sets: DRISEE total error profiles are displayed for 5 different Illumina experiments/sample sets. Parentheses indicate the number of samples in each experiment/sample set. (d): DRISEE total error of single samples: DRISEE total error profiles are displayed for two individual samples. The samples represent the lowest and highest averaged DRISEE total errors (averaged across all read positions) observed in Sample Set 3 (see Figure 4c above, http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002541#pcbi-1002541-g004). Pie charts indicate a summary of MG-RAST-based annotation of the two samples. The upper pie chart was produced from the data set that corresponds to the purple DRISEE profile (average DRISEE error = 45%). The lower pie chart corresponds to annotation of the data set that produced the green DRISEE profile (average DRISEE error = 1%).
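
    The conversion of Q values to percent error mentioned in this caption follows from the standard Phred definition, Q = -10 log10(P), where P is the probability that a base call is wrong. A minimal sketch (the function names are illustrative, not taken from the paper):

```python
def phred_to_percent_error(q):
    """Convert a Phred quality score to a percent error, using P = 10 ** (-Q / 10)."""
    return 100.0 * 10 ** (-q / 10.0)


def mean_percent_error(qualities):
    """Average the per-base percent errors for one read position or one read."""
    qualities = list(qualities)
    if not qualities:
        raise ValueError("at least one quality score is required")
    return sum(phred_to_percent_error(q) for q in qualities) / len(qualities)
```

    Note that the error probabilities are averaged, not the Q values themselves; averaging Q values first would understate the error because the scale is logarithmic.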

    (a) Error detection capabilities of Score, Reference-genome, and DRISEE methods.

    (1) Simplified procedural diagram of a typical sequencing protocol. Sample collection: First, the biological sample is collected. Extraction/Initial purification: Then the RNA/DNA undergoes extraction and initial purification procedures. Pre-sequencing amplification(s): Next, the extracted genetic material may undergo amplification (e.g. whole genome amplification – see main text) followed by additional purifications and/or other processing procedures. “Sequencing”: Genetic material is placed in the sequencer itself, and is sequenced. Note that sequencing itself frequently involves additional rounds of amplification. Analyses of sequencing output: Sequencer outputs are analyzed. (2) Given a procedure such as (1), the portion of the procedure over which score/Phred-based methods can detect error is indicated in red. (3) Given a procedure such as (1), the portion of the procedure over which reference-genome-based methods can detect error is indicated in green. Note that reference-genome-based methods are only applicable to single genome data; they cannot consider metagenomic data. (4) Given a procedure such as (1), the portion of the procedure over which DRISEE-based methods can detect error is indicated in blue. Note that DRISEE methods can be applied to metagenomic or genomic data, provided that certain requirements are met. See Methods. 1: BMC Bioinformatics. 2008 Sep 19;9:386. 2: Nat Methods. 2010 May;7(5):335–6. Epub 2010 Apr 11. (b) DRISEE workflow: The steps in a typical DRISEE workflow are depicted and briefly described (in figure captions). Please see Text S1 (Supplemental Methods, Typical DRISEE workflow; http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002541#pcbi.1002541.s002) for a much more detailed description of each depicted step.

    A RESTful API for Accessing Microbial Community Data for MG-RAST

    Metagenomic sequencing has produced significant amounts of data in recent years. For example, as of summer 2013, MG-RAST has been used to annotate over 110,000 data sets totaling over 43 Terabases. With metagenomic sequencing finding even wider adoption in the scientific community, the existing web-based analysis tools and infrastructure in MG-RAST provide limited capability for data retrieval and analysis, such as comparative analysis between multiple data sets. Moreover, although the system provides many analysis tools, it is not comprehensive. By opening MG-RAST up via a web services API (application programming interface) we have greatly expanded access to MG-RAST data, as well as provided a mechanism for the use of third-party analysis tools with MG-RAST data. This RESTful API makes all data and data objects created by the MG-RAST pipeline accessible as JSON objects. As part of the DOE Systems Biology Knowledgebase project (KBase, http://kbase.us) we have implemented a web services API for MG-RAST. This API complements the existing MG-RAST web interface and constitutes the basis of KBase's microbial community capabilities. In addition, the API exposes a comprehensive collection of data to programmers. This API, which uses a RESTful (Representational State Transfer) implementation, is compatible with most programming environments and should be easy to use for end users and third parties. It provides comprehensive access to sequence data, quality control results, annotations, and many other data types. Where feasible, we have used standards to expose data and metadata. Code examples are provided in a number of languages both to show the versatility of the API and to provide a starting point for users. We present an API that exposes the data in MG-RAST for consumption by our users, greatly enhancing the utility of the MG-RAST service.
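
    Consuming a RESTful API that returns JSON objects typically looks like the sketch below. It uses only the Python standard library. The base URL is a placeholder, and the resource name and query parameter in the usage note are hypothetical; none of them are taken from the MG-RAST API documentation, which should be consulted for the real endpoints.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Placeholder only: substitute the base URL given in the service's documentation.
BASE_URL = "https://api.example.org/1"


def build_url(resource, identifier=None, **params):
    """Compose a request URL of the form BASE/resource[/identifier][?key=value]."""
    parts = [BASE_URL.rstrip("/"), quote(resource)]
    if identifier is not None:
        parts.append(quote(identifier))
    url = "/".join(parts)
    if params:
        # Sort the parameters so that the generated URL is deterministic.
        url += "?" + urlencode(sorted(params.items()))
    return url


def fetch_json(url, timeout=30):
    """Send a GET request and decode the JSON body of the response."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

    For example, `fetch_json(build_url("metagenome", "some-id", verbosity="full"))` would request one object and return it as a Python dictionary, assuming the service offered such a resource and parameter.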

    Total DRISEE errors of genomic and metagenomic data produced by 454 and Illumina technologies.

    A boxplot (conventional five-number summary) presents the distribution of averaged total DRISEE errors observed among 476 sequencing samples. The average total DRISEE error is plotted on the Y-axis. X-axis labels indicate the technology (454 or Illumina), type of sample (shotgun genomic or shotgun metagenomic), and, in parentheses, the number of samples represented by each individual boxplot. Gray highlight indicates the range of values that have been previously reported for error on 454 and Illumina sequencing platforms (0.25–4%).

    DRISEE performance on simulated and real data.

    (a) Simulated data sets were generated from real whole genome sequences [12], taken from a single sequenced genome, and randomly fragmented into reads that exhibit length distributions consistent with different sequencing technologies (see Methods, http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002541#s4). Total DRISEE error rates for each sample (Y-axis) are plotted against the known, artificially introduced error rates (X-axis). The equation and R² values represent a linear regression of displayed data. (b) DRISEE and a conventional reference-genome-based error method were applied to a set of published genomic data sets [12] (see Methods). Cumulative DRISEE errors (Y-axis) are plotted against reference-genome errors determined for the same sample. The equations and R² values represent linear regressions of displayed data. The regression for all samples is plotted as a black line; red lines indicate this regression plus or minus one standard deviation. Red points indicate values further than one standard deviation from the “All Samples” regression. Orange indicates a single point that may disproportionately inflate the observed R². Equations and R² values for the “All Samples” regression are provided as well as for regressions that exclude only the red points or the red and orange points.