Fundamental Bounds and Approaches to Sequence Reconstruction from Nanopore Sequencers
Nanopore sequencers are emerging as promising new platforms for
high-throughput sequencing. As with other technologies, sequencer errors pose a
major challenge for their effective use. In this paper, we present a novel
information theoretic analysis of the impact of insertion-deletion (indel)
errors in nanopore sequencers. In particular, we consider the following
problems: (i) for given indel error characteristics and rate, what is the
probability of accurate reconstruction as a function of sequence length; (ii)
what is the number of `typical' sequences within the distortion bound induced
by indel errors; (iii) using replicated extrusion (the process of passing a DNA
strand through the nanopore), what is the number of replicas needed to reduce
the distortion bound so that only one typical sequence exists within the
distortion bound.
Our results provide a number of important insights: (i) the maximum length of
a sequence that can be accurately reconstructed in the presence of indel and
substitution errors is relatively small; (ii) the number of typical sequences
within the distortion bound is large; and (iii) replicated extrusion is an
effective technique for unique reconstruction. In particular, we show that the
number of replicas is a slow function (logarithmic) of sequence length --
implying that through replicated extrusion, we can sequence large reads using
nanopore sequencers. Our model considers indel and substitution errors
separately. In this sense, it can be viewed as providing (tight) bounds on
reconstruction lengths and repetitions for accurate reconstruction when the two
error modes are considered in a single model.
Comment: 12 pages, 5 figures
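The claim that the number of typical sequences within the distortion bound is large can be illustrated with a small brute-force experiment (a toy sketch, not the paper's typical-set analysis): even a distortion bound of a single edit admits dozens of candidate DNA sequences, and the count grows rapidly with the allowed edit distance.

```python
def edit_neighbors(s, alphabet="ACGT"):
    """All distinct sequences at edit distance exactly 1 from s."""
    out = set()
    for i in range(len(s)):                      # substitutions
        for c in alphabet:
            if c != s[i]:
                out.add(s[:i] + c + s[i+1:])
    for i in range(len(s)):                      # deletions
        out.add(s[:i] + s[i+1:])
    for i in range(len(s) + 1):                  # insertions
        for c in alphabet:
            out.add(s[:i] + c + s[i:])
    return out

def ball(s, radius):
    """All sequences within edit distance <= radius of s."""
    frontier = {s}
    seen = {s}
    for _ in range(radius):
        frontier = {n for t in frontier for n in edit_neighbors(t)} - seen
        seen |= frontier
    return seen
```

Already for the length-4 sequence "ACGT" the radius-1 ball contains 33 sequences (itself plus 32 neighbors), which hints at why unique reconstruction from a single noisy read quickly becomes impossible as length grows.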
Models and information-theoretic bounds for nanopore sequencing
Nanopore sequencing is an emerging new technology for sequencing DNA, which
can read long fragments of DNA (~50,000 bases) in contrast to most current
short-read sequencing technologies which can only read hundreds of bases. While
nanopore sequencers can acquire long reads, the high error rates (20%-30%) pose
a technical challenge. In a nanopore sequencer, a DNA molecule is migrated through a
nanopore and current variations are measured. The DNA sequence is inferred from
this observed current pattern using an algorithm called a base-caller. In this
paper, we propose a mathematical model for the "channel" from the input DNA
sequence to the observed current, and calculate bounds on the information
extraction capacity of the nanopore sequencer. This model incorporates
impairments like (non-linear) inter-symbol interference, deletions, as well as
random response. These information bounds have two-fold application: (1) The
decoding rate with a uniform input distribution can be used to calculate the
average size of the plausible list of DNA sequences given an observed current
trace. This bound can be used to benchmark existing base-calling algorithms, as
well as to serve as a performance objective for designing better nanopores. (2) When
the nanopore sequencer is used as a reader in a DNA storage system, the storage
capacity is quantified by our bounds.
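The structure of such a channel model can be caricatured in a few lines (a minimal sketch with made-up parameters, not the model proposed in the paper): each k-mer in the pore produces a current level that depends on all k bases (inter-symbol interference), samples are randomly deleted, and additive noise models the random response.

```python
import random

BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_level(kmer):
    # Toy inter-symbol interference: the current level depends on all
    # k bases in the pore, with geometrically decaying weights.
    return sum(BASE[b] * 0.5 ** j for j, b in enumerate(kmer))

def nanopore_channel(seq, k=3, p_del=0.1, sigma=0.05, seed=0):
    """Map a DNA sequence to noisy current samples: one sample per
    k-mer, each sample deleted with probability p_del, plus additive
    Gaussian 'random response' noise with standard deviation sigma."""
    rng = random.Random(seed)
    samples = []
    for i in range(len(seq) - k + 1):
        if rng.random() < p_del:   # deletion: this k-mer leaves no sample
            continue
        samples.append(kmer_level(seq[i:i + k]) + rng.gauss(0.0, sigma))
    return samples
```

A base-caller must invert this many-to-one, noisy mapping, which is why the size of the plausible list of input sequences given a current trace is a natural quantity to bound.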
DNA-based data storage system
Despite the many advances in traditional data recording techniques, the surge of Big Data platforms and energy conservation issues has imposed new challenges to the storage community in terms of identifying extremely high volume, non-volatile and durable recording media. The potential for using macromolecules for ultra-dense storage was recognized as early as 1959, when Richard Feynman outlined his vision for nanotechnology in the lecture "There is plenty of room at the bottom". Among known macromolecules, DNA is unique insofar as it lends itself to implementations of non-volatile recording media of outstanding integrity and extremely high storage capacity.
The basic system implementation steps for DNA-based data storage systems include synthesizing DNA strings that contain user information and subsequently retrieving them via high-throughput sequencing technologies. Existing architectures enable reading and writing but do not offer random-access and error-free data recovery from low-cost, portable devices, which is crucial for making the storage technology competitive with classical recorders.
In this work we advance the field of macromolecular data storage in three directions. First, we introduce the notion of weakly mutually uncorrelated (WMU) sequences. WMU sequences are characterized by the property that no sufficiently long suffix of one sequence is the prefix of the same or another sequence. For primer design in DNA-based data storage systems, WMU sequences are additionally required to be at large mutual Hamming distance from each other, have balanced compositions of symbols, and avoid primer-dimer byproducts. We derive bounds on the size of WMU and various constrained WMU codes and present a number of constructions for balanced, error-correcting, primer-dimer free WMU codes using Dyck paths, prefix-synchronized and cyclic codes.
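The defining WMU property can be checked directly for a small candidate set. Below is a brute-force sketch (the length threshold `ell` is a hypothetical parameterization; the paper's precise definition and code constructions are more refined):

```python
def is_wmu(codewords, ell):
    """True iff no suffix of length >= ell of any codeword is a
    proper prefix of the same or another codeword."""
    for u in codewords:
        for v in codewords:
            for L in range(ell, min(len(u), len(v))):
                if u[-L:] == v[:L]:
                    return False
    return True
```

For example, `["ACGTT", "GGACC"]` passes with `ell=2`, while the single word `"ACGAC"` fails because its length-2 suffix `"AC"` is also its prefix, which would make codeword boundaries ambiguous in a concatenated stream.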
Second, we describe the first DNA-based storage architecture that enables random access to data blocks and rewriting of information stored at arbitrary locations within the blocks. The newly developed architecture overcomes drawbacks of existing read-only methods that require decoding the whole file in order to read one data fragment. Our system is based on the newly developed WMU coding techniques and accompanying DNA editing methods that ensure data reliability, specificity and sensitivity of access, and at the same time provide exceptionally high data storage capacity. As a proof of concept, we encoded parts of the Wikipedia pages of six universities in the USA, and selected and edited parts of the text written in DNA corresponding to three of these schools. The results suggest that DNA is a versatile media suitable for both ultrahigh density archival and rewritable storage applications.
Third, we demonstrate for the first time that a portable, random-access platform may be implemented in practice using nanopore sequencers. Every solution for DNA-based data storage systems so far has focused exclusively on Illumina sequencing devices, but such sequencers are expensive and designed for laboratory use only. Instead, we propose using a new technology: MinION, Oxford Nanopore's handheld sequencer. Nanopore sequencing is fast and cheap, but it results in reads with high error rates. To deal with this issue, we designed an integrated processing pipeline that encodes data to avoid costly synthesis and sequencing errors, enables random access through addressing, and leverages efficient portable sequencing via new iterative alignment and deletion error-correcting codes. As a proof of concept, we stored and sequenced around 3.6 kB of binary data, including two compressed images (a Citizen Kane poster and a smiley face emoji), using a portable data storage system, and obtained error-free read-outs.
Models, Algorithms, and Downstream Applications of Nanopore Sequencing
The advent of nanopore sequencing technology represents a significant leap forward in the ability to read long fragments of DNA, up to 4M bases, surpassing the capabilities of traditional short-read sequencing methods that can read a few hundred bases. Despite its potential, nanopore sequencing is challenged by high error rates (5%-15%). In this dissertation, we present a comprehensive examination of various computational approaches to address these challenges and enhance the utility of nanopore sequencing technology in genomic analysis, using an underlying physics-based model of nanopore sequencers to guide our methods.
First, we describe a mathematical model of the "nanopore channel", which takes a DNA sequence as input and outputs the observed current variations in a nanopore sequencer. This model accounts for impairments such as inter-symbol interference, insertions and deletions, channel fading, and random responses. The model also provides insights into the error profiles of the nanopore sequencer that can be utilized to develop algorithms for downstream applications. We further study bounds on the information extraction capacity of nanopore sequencers, which provide benchmarks for existing base-calling algorithms and guidelines for designing improved nanopores.
Our first main algorithmic work introduces QAlign, a preprocessing tool that improves the accuracy and efficiency of long-read aligners by converting nucleotide reads into discretized current levels. This transformation captures the error characteristics of nanopore sequencers studied in the previous work, enhancing alignment rates of nanopore reads to a reference from around 80% to 90%, and significantly improving overlap quality for read-to-read alignments and read-to-transcriptome alignment rates across multiple datasets.
Our second main algorithmic work focuses on the detection of structural variants (SVs) using nanopore-sequenced reads. We present HQAlign, an aligner designed to leverage the physics of nanopore sequencing together with SV-specific modifications to enhance alignment accuracy. HQAlign demonstrates a 4%-6% improvement in detecting complementary SVs compared to the minimap2 aligner, along with substantial improvements in breakpoint accuracy and overall alignment rates for read-to-reference alignments as compared to QAlign and minimap2.
The final algorithmic work addresses the challenge of identifying heterozygous variants from highly erroneous nanopore read data in order to develop algorithms for diploid genome assembly. We propose an algorithm that identifies heterozygous variants with a recall of 90% and a precision of 70%, facilitating the reconstruction of diploid genomes without additional reference information or preliminary draft assemblies.
Collectively, these studies advance the understanding and application of nanopore sequencing technology, offering novel computational methods to mitigate high error rates and improve genomic analyses, including alignment, structural variant detection, and diploid genome assembly.
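The core idea behind QAlign's preprocessing, converting reads into discretized current levels, can be caricatured as a quantization step (a toy sketch with uniform binning; QAlign's actual level model is derived from pore data, not from per-read min/max):

```python
def discretize(currents, n_levels=4):
    """Quantize raw current samples into n_levels integer symbols
    using uniform bins between the observed min and max."""
    lo, hi = min(currents), max(currents)
    width = (hi - lo) / n_levels or 1.0   # avoid div-by-zero on flat traces
    return [min(int((c - lo) / width), n_levels - 1) for c in currents]
```

Aligning sequences over this small discrete alphabet, rather than over nucleotides, lets standard aligners score reads in the same space where nanopore errors actually occur.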
Novel computational techniques for mapping and classifying Next-Generation Sequencing data
Since their emergence around 2006, Next-Generation Sequencing (NGS) technologies have been revolutionizing biological and medical research. The ability to quickly obtain vast numbers of short or long DNA reads from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, and understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computing capacities, which creates new computational challenges in NGS data processing.
In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm, and little attention has been paid to non-standard mapping approaches. Here, we propose so-called dynamic mapping, which we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping exploits the information from previously computed alignments to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing.
An important component of a dynamic mapper is an online consensus caller, i.e., a program that collects alignment statistics and guides updates of the reference in an online fashion. We provide Ococo, the first online consensus caller, which maintains memory-efficient statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments to disk.
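The compact-counter idea can be sketched as follows (a toy layout; Ococo's actual bit packing and update rules differ): keep only a few bits per nucleotide per position, and halve all counts on overflow so that relative frequencies, and hence the consensus call, are preserved.

```python
class PositionCounter:
    """Per-position nucleotide statistics in small capped counters."""
    def __init__(self, bits=4):
        self.cap = (1 << bits) - 1          # max value per counter
        self.counts = [0, 0, 0, 0]          # A, C, G, T

    def update(self, base):
        i = "ACGT".index(base)
        if self.counts[i] == self.cap:      # overflow: halve all counters,
            self.counts = [c >> 1 for c in self.counts]  # preserving order
        self.counts[i] += 1

    def consensus(self):
        return "ACGT"[self.counts.index(max(self.counts))]
```

With 4 bits per nucleotide, a whole position fits in two bytes, which is what makes genome-wide online statistics affordable in a streaming setting.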
Metagenomic classification of NGS reads is another major topic studied in this thesis. Given a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge number of NGS reads to tree nodes, and possibly to estimate the relative abundance of the involved species. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced-seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index that yields a much smaller and more informative index than Kraken's. We also provide a modified version of BWA that improves the BWT-index for quick k-mer look-up.
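The spaced-seed idea can be illustrated in a few lines (the mask below is a hypothetical example; Seed-Kraken uses its own masks and index): instead of contiguous k-mers, only the '1' positions of a fixed mask are sampled, which makes seeds tolerant to mismatches at the '0' (don't-care) positions.

```python
def spaced_seeds(read, mask="101"):
    """Extract spaced seeds: in each sliding window of len(mask),
    keep only the characters at positions where mask is '1'."""
    w = len(mask)
    return ["".join(b for b, m in zip(read[i:i + w], mask) if m == "1")
            for i in range(len(read) - w + 1)]
```

For instance, reads "ACGTA" and "AGGTA" differ at a don't-care position of the mask "101", so they still share their first seed, which is exactly how spaced seeds buy mismatch tolerance.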
Computational pan-genomics: status, promises and challenges
Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies, and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques, and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.