
    DUDE-Seq: Fast, Flexible, and Robust Denoising for Targeted Amplicon Sequencing

    We consider the correction of errors from nucleotide sequences produced by next-generation targeted amplicon sequencing. The next-generation sequencing (NGS) platforms can provide a great deal of sequencing data thanks to their high throughput, but the associated error rates often tend to be high. Denoising in high-throughput sequencing has thus become a crucial process for boosting the reliability of downstream analyses. Our methodology, named DUDE-Seq, is derived from a general setting of reconstructing finite-valued source data corrupted by a discrete memoryless channel and effectively corrects substitution and homopolymer indel errors, the two major types of sequencing errors in most high-throughput targeted amplicon sequencing platforms. Our experimental studies with real and simulated datasets suggest that the proposed DUDE-Seq not only outperforms existing alternatives in terms of error-correction capability and time efficiency, but also boosts the reliability of downstream analyses. Further, the flexibility of DUDE-Seq enables its robust application to different sequencing platforms and analysis pipelines by simple updates of the noise model. DUDE-Seq is available at http://data.snu.ac.kr/pub/dude-seq

    PCR biases distort bacterial and archaeal community structure in pyrosequencing datasets

    As 16S rRNA gene targeted massively parallel sequencing has become a common tool for microbial diversity investigations, numerous advances have been made to minimize the influence of sequencing and chimeric PCR artifacts through rigorous quality control measures. However, there has been little effort towards understanding the effect of multi-template PCR biases on microbial community structure. In this study, we used three bacterial and three archaeal mock communities consisting of, respectively, 33 bacterial and 24 archaeal 16S rRNA gene sequences combined in different proportions to compare the influences of (1) sequencing depth, (2) sequencing artifacts (sequencing errors and chimeric PCR artifacts), and (3) biases in multi-template PCR, towards the interpretation of community structure in pyrosequencing datasets. We also assessed the influence of each of these three variables on α- and β-diversity metrics that rely on the number of OTUs alone (richness) and those that include both membership and the relative abundance of detected OTUs (diversity). As part of this study, we redesigned bacterial and archaeal primer sets that target the V3–V5 region of the 16S rRNA gene, along with multiplexing barcodes, to permit simultaneous sequencing of PCR products from the two domains. We conclude that the benefits of deeper sequencing efforts extend beyond greater OTU detection and result in higher precision in β-diversity analyses by reducing the variability between replicate libraries, despite the presence of more sequencing artifacts. Additionally, spurious OTUs resulting from sequencing errors have a significant impact on richness or shared-richness based α- and β-diversity metrics, whereas metrics that utilize community structure (including both richness and relative abundance of OTUs) are minimally affected by spurious OTUs. 
However, the greatest obstacle to accurately evaluating community structure is the error in the estimated mean relative abundance of each detected OTU caused by biases associated with multi-template PCR.
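
The contrast drawn above between richness-based metrics and metrics that include relative abundance can be illustrated with a small sketch. The OTU counts below are invented for illustration, and the Shannon index is used only as one common example of an abundance-aware metric; the abstract does not say which indices the authors computed.

```python
import math

def richness(counts):
    """Number of OTUs with at least one read."""
    return sum(1 for c in counts if c > 0)

def shannon(counts):
    """Shannon index (natural log) computed from relative abundances."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Hypothetical community: three real OTUs (made-up read counts).
true_counts = [500, 300, 200]
# Same community plus ten singleton OTUs standing in for sequencing errors.
noisy_counts = true_counts + [1] * 10

richness_change = (richness(noisy_counts) - richness(true_counts)) / richness(true_counts)
shannon_change = (shannon(noisy_counts) - shannon(true_counts)) / shannon(true_counts)

print(f"richness: {richness(true_counts)} -> {richness(noisy_counts)}")
print(f"relative change in richness: {richness_change:.2f}")
print(f"relative change in Shannon index: {shannon_change:.2f}")
```

In this toy case the singletons more than quadruple the richness while shifting the Shannon index by only a few percent, which is consistent with the qualitative claim in the abstract.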

    Blind Biological Sequence Denoising with Self-Supervised Set Learning

    Biological sequence analysis relies on the ability to denoise the imprecise output of sequencing platforms. We consider a common setting where a short sequence is read out repeatedly using a high-throughput long-read platform to generate multiple subreads, or noisy observations of the same sequence. Denoising these subreads with alignment-based approaches often fails when too few subreads are available or error rates are too high. In this paper, we propose a novel method for blindly denoising sets of sequences without directly observing clean source sequence labels. Our method, Self-Supervised Set Learning (SSSL), gathers subreads together in an embedding space and estimates a single set embedding as the midpoint of the subreads in both the latent and sequence spaces. This set embedding represents the "average" of the subreads and can be decoded into a prediction of the clean sequence. In experiments on simulated long-read DNA data, SSSL methods denoise small reads of ≤6 subreads with 17% fewer errors and large reads of >6 subreads with 8% fewer errors compared to the best baseline. On a real dataset of antibody sequences, SSSL improves over baselines on two self-supervised metrics, with a significant improvement on difficult small reads that comprise over 60% of the test set. By accurately denoising these reads, SSSL promises to better realize the potential of high-throughput DNA sequencing data for downstream scientific applications.

    Metagenomics : tools and insights for analyzing next-generation sequencing data derived from biodiversity studies

    Advances in next-generation sequencing (NGS) have allowed significant breakthroughs in microbial ecology studies. This has led to the rapid expansion of research in the field and the establishment of “metagenomics”, often defined as the analysis of DNA from microbial communities in environmental samples without prior need for culturing. Many statistical and computational metagenomics tools and databases have been developed to exploit the huge influx of data. In this review article, we provide an overview of the sequencing technologies and how they are uniquely suited to various types of metagenomic studies. We focus on the currently available bioinformatics techniques, tools, and methodologies for performing each individual step of a typical metagenomic dataset analysis. We also outline future trends in the field with respect to tools and technologies currently under development. Moreover, we discuss data management, distribution, and integration tools that are capable of performing comparative metagenomic analyses of multiple datasets using well-established databases, as well as commonly used annotation standards.

    Viral Diversity by Deep Sequencing: Approaches to Analyzing Effects of Anti-HIV Treatments

    HIV is a deadly virus responsible for the AIDS pandemic, which has claimed countless lives since its origins in the early 1980s. A cure for HIV is still elusive - HIV can exist as a diverse and dynamic population that adapts quickly to immune and drug pressures, making elimination of infection difficult. Advances in antiretroviral (ARV) therapy have resulted in effective control of HIV for some but not all patients. This dissertation reports case studies of the response of viral populations to selection pressures exerted by emerging anti-HIV therapies. Deep sequencing technology was used to probe viral swarms at high-resolution, which helped make clinically relevant conclusions. Further, novel computational approaches were implemented to control procedural noise and carefully interpret signal. In one study, we examine HIV integrase inhibitors (INIs), which are among the latest ARV drugs. INIs act at a pre-integration level by aborting viral integration, which would normally lead to lasting infection. Raltegravir (RAL) is the only FDA-approved INI to date. Investigating drug resistance is crucial to informing future course of ARV therapy. We describe evolving HIV swarms in patients exhibiting a switch in RAL-resistance profiles. To understand implications of RAL administration, we analyzed the pre-therapy or treatment-naïve context for the viral populations in-depth. Our findings suggest that predominant mutations arise only in presence of RAL - in its absence, they do not constitute fit polymorphisms. For all their effectiveness, drugs have not eradicated HIV. A recent clinical case, however, involving transfer of HIV-resistant cells to an infected patient, resulted for the first time in possible cure. This emphasized the importance of gene-modification and cell-based therapies to treat HIV. One such strategy showing promise uses an antisense to target HIV. The approach has been safe although clinical efficacy has not been fully determined. 
In support of one such study, we deep-sequenced viral swarms in the presence of antisense-modified cells. Encouragingly, we observed minority strains harboring evidence of antisense pressure in vivo, demonstrating the potential of alternative therapy. Finally, this dissertation underscores the significance of rare signatures in HIV populations, and outlines methods to investigate them.

    Eukaryotic metabarcoding pipelines for biodiversity assessment of marine benthic communities affected by ocean acidification

    The development of high-throughput sequencing technologies has provided ecologists with an efficient approach to assess biodiversity in benthic communities, particularly with the recent advances in metabarcoding technologies using universal primers. However, analyzing such high-throughput data poses important computational challenges, requiring specialized bioinformatics solutions at different stages of the processing pipeline, such as assembly of paired-end reads, chimera removal, correction of sequencing errors, and clustering of obtained sequences into Molecular Operational Taxonomic Units (MOTUs). The inferred MOTUs can then be used to estimate species diversity, composition, and richness. Although a number of methods have been developed and are commonly used to cluster the sequences into MOTUs, relatively little guidance is available on their relative performance. We focused our study on the benthic community from a natural CO2 vent present in the Canary Islands, as it can be used as a natural laboratory in which to investigate the impacts of chronic ocean acidification. Here, we propose a pipeline for studying this community using a fragment of the mitochondrial cytochrome c oxidase I (COI) sequence. We compared two DNA extraction methods and two clustering methods, and validated a robust method to eliminate false positives. We found that we can obtain optimal results purifying DNA from 0.3 g of sample. Using the step-by-step aggregation algorithm implemented in SWARM for clustering yields results similar to those of the Bayesian clustering method of CROP in much less time. We introduced the new algorithm MINT (Multiple Intersection of N Tags) in order to eliminate false positives due to random errors produced before or after sequencing. Our results show that a fully automated analysis pipeline can be used for assessing biodiversity of marine benthic communities using COI as a metabarcoding marker in an objective, accurate and affordable manner.

    A Comparison of rpoB and 16S rRNA as Markers in Pyrosequencing Studies of Bacterial Diversity

    Background: The 16S rRNA gene is the gold standard in molecular surveys of bacterial and archaeal diversity, but it has the disadvantages that it is often multiple-copy, has little resolution below the species level and cannot be readily interpreted in an evolutionary framework. We compared the 16S rRNA marker with the single-copy, protein-coding rpoB marker by amplifying and sequencing both from a single soil sample. Because the higher genetic resolution of the rpoB gene prohibits its use as a universal marker, we employed consensus-degenerate primers targeting the Proteobacteria. Methodology/Principal Findings: Pyrosequencing can be problematic because of the poor resolution of homopolymer runs. As these erroneous runs disrupt the reading frame of protein-coding sequences, removal of sequences containing nonsense mutations was found to be a valuable filter in addition to flowgram-based denoising. Although both markers gave similar estimates of total diversity, the rpoB marker revealed more species, requiring an order of magnitude fewer reads to obtain 90% of the true diversity. The application of population genetic methods was demonstrated on a particularly abundant sequence cluster. Conclusions/Significance: The rpoB marker can be a complement to the 16S rRNA marker for high throughput microbial diversity studies focusing on specific taxonomic groups. Additional error filtering is possible and tests for recombination or selection can be employed.
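
The nonsense-mutation filter mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, the fixed reading frame, the stop codon set (TAA, TAG, TGA) and the toy reads are all assumptions made for the example.

```python
# Stop codons assumed for this sketch (standard and bacterial genetic codes).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def has_inframe_stop(seq, frame=0):
    """Return True if the read contains a stop codon in the given frame."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return any(codon in STOP_CODONS for codon in codons)

# Hypothetical reads: the second differs from the first only by an
# under-called "CC" homopolymer, which shifts the reading frame.
clean_read = "ATGGCCTTAGCAGGT"    # ATG GCC TTA GCA GGT
shifted_read = "ATGGCTTAGCAGGT"   # ATG GCT TAG CAG ...

reads = [clean_read, shifted_read]
kept = [r for r in reads if not has_inframe_stop(r)]
print(kept)
```

Here the homopolymer error introduces an in-frame TAG, so the second read is dropped while the first is kept. A real amplicon workflow would also need to establish the correct frame and orientation for each read before applying such a filter.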