10 research outputs found

    Illumina error profiles : resolving fine-scale variation in metagenomic sequencing data

    Background: Illumina’s sequencing platforms are currently the most utilised sequencing systems worldwide. The technology has rapidly evolved over recent years and provides high throughput at low costs with increasing read-lengths and true paired-end reads. However, data from any sequencing technology contains noise and our understanding of the peculiarities and sequencing errors encountered in Illumina data has lagged behind this rapid development. Results: We conducted a systematic investigation of errors and biases in Illumina data based on the largest collection of in vitro metagenomic data sets to date. We evaluated the Genome Analyzer II, HiSeq and MiSeq and tested state-of-the-art low input library preparation methods. Analysing in vitro metagenomic sequencing data allowed us to determine biases directly associated with the actual sequencing process. The position- and nucleotide-specific analysis revealed a substantial bias related to motifs (3mers preceding errors) ending in “GG”. On average the top three motifs were linked to 16 % of all substitution errors. Furthermore, a preferential incorporation of ddGTPs was recorded. We hypothesise that all of these biases are related to the engineered polymerase and ddNTPs which are intrinsic to any sequencing-by-synthesis method. We show that quality-score-based error removal strategies can on average remove 69 % of the substitution errors; however, the motif bias remains. Conclusion: Single-nucleotide polymorphism changes in bacterial genomes can cause significant changes in phenotype, including antibiotic resistance and virulence; detecting them within metagenomes is therefore vital. Current error removal techniques are not designed to target the peculiarities encountered in Illumina sequencing data and other sequencing-by-synthesis methods, causing biases to persist and potentially affect any conclusions drawn from the data.
In order to develop effective diagnostic and therapeutic approaches, we need to be able to identify systematic sequencing errors and distinguish these errors from true genetic variation.
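The motif analysis described in this abstract (3-mers preceding substitution errors) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `motifs_preceding_errors`, the toy sequences, and the choice to read the motif from the reference rather than from the read are all assumptions made for illustration, and the read is assumed to be aligned to the reference without gaps.

```python
from collections import Counter


def motifs_preceding_errors(read, reference, k=3):
    """Count the k-mers that immediately precede each substitution error.

    Hypothetical helper: assumes an ungapped alignment of equal length and
    takes the motif from the reference sequence (an assumption).
    """
    if len(read) != len(reference):
        raise ValueError("read and reference must have the same length")
    counts = Counter()
    # Start at k so that a full k-mer always precedes the inspected base.
    for i in range(k, len(read)):
        if read[i] != reference[i]:
            counts[reference[i - k:i]] += 1
    return counts


# Toy data (made up): two substitutions, each preceded by "CGG".
reference = "ACGGTACGGA"
read = "ACGGAACGGC"
counts = motifs_preceding_errors(read, reference)
print(counts.most_common(3))
```

Ranking the counts with `most_common` gives the most frequent motifs, which is the kind of summary the abstract reports for its top three motifs.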

    A comprehensive benchmarking study of protocols and sequencing platforms for 16S rRNA community profiling

    Background: In the last 5 years, the rapid pace of innovations and improvements in sequencing technologies has completely changed the landscape of metagenomic and metagenetic experiments. Therefore, it is critical to benchmark the various methodologies for interrogating the composition of microbial communities, so that we can assess their strengths and limitations. The most common phylogenetic marker for microbial community diversity studies is the 16S ribosomal RNA gene and in the last 10 years the field has moved from sequencing a small number of amplicons and samples to more complex studies where thousands of samples and multiple different gene regions are interrogated. Results: We assembled 2 synthetic communities with an even (EM) and uneven (UM) distribution of archaeal and bacterial strains and species, as metagenomic control material, to assess performance of different experimental strategies. The 2 synthetic communities were used in this study, to highlight the limitations and the advantages of the leading sequencing platforms: MiSeq (Illumina), The Pacific Biosciences RSII, 454 GS-FLX/+ (Roche), and IonTorrent (Life Technologies). We describe an extensive survey based on synthetic communities using 3 experimental designs (fusion primers, universal tailed tag, ligated adaptors) across the 9 hypervariable 16S rDNA regions. We demonstrate that library preparation methodology can affect data interpretation due to different error and chimera rates generated during the procedure. The observed community composition was always biased, to a degree that depended on the platform, sequenced region and primer choice. However, crucially, our analysis suggests that 16S rRNA sequencing is still quantitative, in that relative changes in abundance of taxa between samples can be recovered, despite these biases. Conclusion: We have assessed a range of experimental conditions across several next generation sequencing platforms using the most up-to-date configurations. 
We propose that the choice of sequencing platform and experimental design needs to be taken into consideration in the early stage of a project by running a small trial consisting of several hypervariable regions to quantify the discriminatory power of each region. We also suggest that the use of a synthetic community as a positive control would be beneficial to identify the potential biases and procedural drawbacks that may lead to data misinterpretation. The results of this study will serve as a guideline for making decisions on which experimental condition and sequencing platform to consider to achieve the best microbial profiling.

    Comparison of the human gastric microbiota in hypochlorhydric states arising as a result of Helicobacter pylori-induced atrophic gastritis, autoimmune atrophic gastritis and proton pump inhibitor use - Fig 4

    Non-metric multidimensional scaling (NMDS) demonstrating clustering of patient groups using (A) unweighted UniFrac distance (pair-wise distance between samples is calculated as a normalised difference in cumulative branch lengths of the observed OTUs for each sample on the phylogenetic tree, without taking into account their abundances in samples), (B) Bray-Curtis distance (abundance of OTUs alone, not considering the phylogenetic distance) and (C) weighted UniFrac distance (unweighted UniFrac distance weighted by abundances of OTUs). Serum gastrin concentration is indicated by the size of each point. Ellipses represent the 95% CI of standard error for a given group. Dotted ellipses represent the 95% CI of standard error when H. pylori were removed from the analysis. Atrophy = H. pylori associated atrophic gastritis, Auto = autoimmune atrophic gastritis, Control = normal, HP Gastr = H. pylori associated gastritis and PPI = proton pump inhibitor. PERMANOVA (distances against groups) suggests significant differences (P < 0.001 for all three distances) in microbial community, explaining the following variations (R²) between groups: 10% (8.6% without H. pylori) when using unweighted UniFrac; 58% (14.5% without H. pylori) when using weighted UniFrac; and 15% when using Bray-Curtis distance. No significant explanation was observed (P > 0.05) for age, BMI, or serum gastrin concentration in the PERMANOVA test. (D) Data from betadisper plots (a means to compare the spread/variability of samples for different groups) representing difference in distances (Bray-Curtis, unweighted and weighted UniFrac) of group members from the centre/mean of individual groups after obtaining a reduced-order representation of the abundance table using Principal Coordinate Analysis. The pair-wise differences in distances from group centre/mean were then subjected to ANOVA and, if significant (P < 0.001), the p-values were drawn on top.
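Of the three distances named in this caption, Bray-Curtis is the simplest to show, since it uses OTU abundances alone and needs no phylogenetic tree. The sketch below uses the standard definition (sum of absolute abundance differences divided by the total abundance of both samples). The function name `bray_curtis` and the OTU counts are made up for illustration and do not come from the study.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two OTU abundance vectors.

    Returns 0.0 for identical samples and 1.0 for samples sharing no OTUs.
    Illustrative helper; assumes non-negative counts in matching OTU order.
    """
    if len(u) != len(v):
        raise ValueError("abundance vectors must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        raise ValueError("at least one sample must contain counts")
    return sum(abs(a - b) for a, b in zip(u, v)) / total


# Hypothetical OTU counts for two samples (three OTUs each).
sample_a = [10, 0, 5]
sample_b = [5, 5, 5]
distance = bray_curtis(sample_a, sample_b)
print(round(distance, 3))
```

The UniFrac variants additionally require a phylogenetic tree of the OTUs, so they are left out of this sketch.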

    Analysis of the bread wheat genome using whole-genome shotgun sequencing

    Citation: Brenchley, R., Spannagl, M., Pfeifer, M., . . . & Hall, N. (2012). Analysis of the bread wheat genome using whole-genome shotgun sequencing. Nature, 491(1), 705-709. https://doi.org/10.1038/nature11650
    Bread wheat (Triticum aestivum) is a globally important crop, accounting for 20 per cent of the calories consumed by humans. Major efforts are underway worldwide to increase wheat production by extending genetic diversity and analysing key traits, and genomic resources can accelerate progress. But so far the very large size and polyploid complexity of the bread wheat genome have been substantial barriers to genome analysis. Here we report the sequencing of its large, 17-gigabase-pair, hexaploid genome using 454 pyrosequencing, and comparison of this with the sequences of diploid ancestral and progenitor genomes. We identified between 94,000 and 96,000 genes, and assigned two-thirds to the three component genomes (A, B and D) of hexaploid wheat. High-resolution synteny maps identified many small disruptions to conserved gene order. We show that the hexaploid genome is highly dynamic, with significant loss of gene family members on polyploidization and domestication, and an abundance of gene fragments. Several classes of genes involved in energy harvesting, metabolism and growth are among expanded gene families that could be associated with crop productivity. Our analyses, coupled with the identification of extensive genetic variation, provide a resource for accelerating gene discovery and improving this major crop.