64 research outputs found

    Aligning Flowgrams to DNA Sequences

    Get PDF
    A read from 454 or Ion Torrent sequencers is natively represented as a flowgram, which is a sequence of pairs of a nucleotide and its (fractional) intensity. Recent work has focused on improving the accuracy of base calling (conversion of flowgrams to DNA sequences) in order to facilitate read mapping and downstream analysis of sequence variants. However, base calling always incurs a loss of information by discarding fractional intensity information. We argue that base calling can be avoided entirely by directly aligning the flowgrams to DNA sequences. We introduce an algorithm for flowgram-string alignment based on dynamic programming, but covering more cases than standard local or global sequence alignment. We also propose a scoring scheme that takes into account sequence variations (from substitutions, insertions, deletions) and sequencing errors (flow intensities contradicting the homopolymer length) separately. This allows to resolve fractional intensities, ambiguous homopolymer lengths and editing events at alignment time by choosing the most likely read sequence given both the nucleotide intensities and the reference sequence. We provide a proof-of-concept implementation and demonstrate the advantages of flowgram-string alignment compared to base-called alignments

    An analysis of algorithms to estimate the characteristics of the underlying population in Massively Parallel Pyrosequencing data.

    Get PDF
    Masters Degree. University of KwaZulu-Natal, Pietermaritzburg.Massively Parallel Pyrosequencing (MPP) is a next generation DNA sequencing technique that is becoming ubiquitous because it is considerably faster, cheaper and produces a higher throughput than long established sequencing techniques like Sanger sequencing. The MPP methodology is also much less labor intensive than Sanger sequencing. Indeed, MPP has become a preferred technology in experiments that seek to determine the distinctive genetic variation present in homologous genomic regions. However there arises a problem in the interpretation of the reads derived from an MPP experiment. Specifically MPP reads are characteristically error prone. This means that it becomes difficult to separate the authentic genomic variation underlying a set of MPP reads from variation that is a consequence of sequencing error. The difficulty of inferring authentic variation is further compounded by the fact that MPP reads are also characteristically short. As a consequence of this, the correct alignment of an MPP read with respect to the genomic region from which it was derived may not be intuitive. To this end, several computational algorithms that seek to correctly align and remove the non-authentic genetic variation from MPP reads have been proposed in literature. We refer to the removal of non-authentic variation from a set of MPP reads as error correction. Computational algorithms that process MPP data are classified as sequence-space algorithms and flow-space algorithms. Sequence-space algorithms work with MPP sequencing reads as raw data, whereas flow-space algorithms work with MPP flowgrams as raw data. A flowgram is an intermediate product of MPP, which is subsequently converted into a sequencing read. In theory, flow-space computations should produce more accurate results than sequence-space computations. In this thesis, we make a qualitative comparison of the distinct solutions delivered by selected MPP read alignment algorithms. Further we make a qualitative comparison of the distinct solutions delivered by selected MPP error correction algorithms. Our comparisons between different algorithms with the same niche are facilitated by the design of a platform for MPP simulation, PyroSim. PyroSim is designed to encapsulate the error rate that is characteristic of MPP. We implement a selection of sequence-space and flow-space alignment algorithms in a software package, MPPAlign. We derive a quality ranking for the distinct algorithms implemented in MPPAlign through a series of qualitative comparisons. Further, we implement a selection of sequence-space and flow-space error correction algorithms in a software package, MPPErrorCorrect. Similarly, we derive a quality ranking for the distinct algorithms implemented in MPPErrorCorrect through a series of qualitative comparisons. Contrary to the view expressed in literature which postulates that flowspace computations are more accurate than sequence-space computations, we find that in general the sequence-space algorithms that we implement outperform the flow-space algorithms. We surmise that flow-space is a more sensitive domain for conducting computations and can only yield consistently good results under stringent quality control measures. In sequence-space, however, we find that base calling, the process that converts flowgrams (flow-space raw data) into sequencing reads (sequence-space raw data), leads to more reliable computations

    Groundtruthing next-gen sequencing for microbial ecology-biases and errors in community structure estimates from PCR amplicon pyrosequencing

    Get PDF
    Analysis of microbial communities by high-throughput pyrosequencing of SSU rRNA gene PCR amplicons has transformed microbial ecology research and led to the observation that many communities contain a diverse assortment of rare taxa-a phenomenon termed the Rare Biosphere. Multiple studies have investigated the effect of pyrosequencing read quality on operational taxonomic unit (OTU) richness for contrived communities, yet there is limited information on the fidelity of community structure estimates obtained through this approach. Given that PCR biases are widely recognized, and further unknown biases may arise from the sequencing process itself, a priori assumptions about the neutrality of the data generation process are at best unvalidated. Furthermore, post-sequencing quality control algorithms have not been explicitly evaluated for the accuracy of recovered representative sequences and its impact on downstream analyses, reducing useful discussion on pyrosequencing reads to their diversity and abundances. Here we report on community structures and sequences recovered for in vitro-simulated communities consisting of twenty 16S rRNA gene clones tiered at known proportions. PCR amplicon libraries of the V3-V4 and V6 hypervariable regions from the in vitro-simulated communities were sequenced using the Roche 454 GS FLX Titanium platform. Commonly used quality control protocols resulted in the formation of OTUs with >1% abundance composed entirely of erroneous sequences, while over-aggressive clustering approaches obfuscated real, expected OTUs. The pyrosequencing process itself did not appear to impose significant biases on overall community structure estimates, although the detection limit for rare taxa may be affected by PCR amplicon size and quality control approach employed. Meanwhile, PCR biases associated with the initial amplicon generation may impose greater distortions in the observed community structure

    A vertebrate case study of the quality of assemblies derived from next-generation sequences

    Get PDF
    The unparalleled efficiency of next-generation sequencing (NGS) has prompted widespread adoption, but significant problems remain in the use of NGS data for whole genome assembly. We explore the advantages and disadvantages of chicken genome assemblies generated using a variety of sequencing and assembly methodologies. NGS assemblies are equivalent in some ways to a Sanger-based assembly yet deficient in others. Nonetheless, these assemblies are sufficient for the identification of the majority of genes and can reveal novel sequences when compared to existing assembly references

    DUDE-Seq: Fast, Flexible, and Robust Denoising for Targeted Amplicon Sequencing

    Full text link
    We consider the correction of errors from nucleotide sequences produced by next-generation targeted amplicon sequencing. The next-generation sequencing (NGS) platforms can provide a great deal of sequencing data thanks to their high throughput, but the associated error rates often tend to be high. Denoising in high-throughput sequencing has thus become a crucial process for boosting the reliability of downstream analyses. Our methodology, named DUDE-Seq, is derived from a general setting of reconstructing finite-valued source data corrupted by a discrete memoryless channel and effectively corrects substitution and homopolymer indel errors, the two major types of sequencing errors in most high-throughput targeted amplicon sequencing platforms. Our experimental studies with real and simulated datasets suggest that the proposed DUDE-Seq not only outperforms existing alternatives in terms of error-correction capability and time efficiency, but also boosts the reliability of downstream analyses. Further, the flexibility of DUDE-Seq enables its robust application to different sequencing platforms and analysis pipelines by simple updates of the noise model. DUDE-Seq is available at http://data.snu.ac.kr/pub/dude-seq

    FAAST: Flow-space Assisted Alignment Search Tool

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>High throughput pyrosequencing (454 sequencing) is the major sequencing platform for producing long read high throughput data. While most other sequencing techniques produce reading errors mainly comparable with substitutions, pyrosequencing produce errors mainly comparable with gaps. These errors are less efficiently detected by most conventional alignment programs and may produce inaccurate alignments.</p> <p>Results</p> <p>We suggest a novel algorithm for calculating the optimal local alignment which utilises flowpeak information in order to improve alignment accuracy. Flowpeak information can be retained from a 454 sequencing run through interpretation of the binary SFF-file format. This novel algorithm has been implemented in a program named FAAST (Flow-space Assisted Alignment Search Tool).</p> <p>Conclusions</p> <p>We present and discuss the results of simulations that show that FAAST, through the use of the novel algorithm, can gain several percentage points of accuracy compared to Smith-Waterman-Gotoh alignments, depending on the 454 data quality. Furthermore, through an efficient multi-thread aware implementation, FAAST is able to perform these high quality alignments at high speed.</p> <p>The tool is available at <url>http://www.ifm.liu.se/bioinfo/</url></p

    FAAST: Flow-space Assisted Alignment Search Tool

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>High throughput pyrosequencing (454 sequencing) is the major sequencing platform for producing long read high throughput data. While most other sequencing techniques produce reading errors mainly comparable with substitutions, pyrosequencing produce errors mainly comparable with gaps. These errors are less efficiently detected by most conventional alignment programs and may produce inaccurate alignments.</p> <p>Results</p> <p>We suggest a novel algorithm for calculating the optimal local alignment which utilises flowpeak information in order to improve alignment accuracy. Flowpeak information can be retained from a 454 sequencing run through interpretation of the binary SFF-file format. This novel algorithm has been implemented in a program named FAAST (Flow-space Assisted Alignment Search Tool).</p> <p>Conclusions</p> <p>We present and discuss the results of simulations that show that FAAST, through the use of the novel algorithm, can gain several percentage points of accuracy compared to Smith-Waterman-Gotoh alignments, depending on the 454 data quality. Furthermore, through an efficient multi-thread aware implementation, FAAST is able to perform these high quality alignments at high speed.</p> <p>The tool is available at <url>http://www.ifm.liu.se/bioinfo/</url></p
    corecore