Models and information-theoretic bounds for nanopore sequencing
Nanopore sequencing is an emerging technology for sequencing DNA that can
read long fragments (~50,000 bases), in contrast to most current short-read
sequencing technologies, which can read only hundreds of bases. While
nanopore sequencers can acquire long reads, the high error rates (20%-30%) pose
a technical challenge. In a nanopore sequencer, a DNA molecule is migrated
through a nanopore and the resulting variations in current are measured. The
DNA sequence is inferred from
this observed current pattern using an algorithm called a base-caller. In this
paper, we propose a mathematical model for the "channel" from the input DNA
sequence to the observed current, and calculate bounds on the information
extraction capacity of the nanopore sequencer. This model incorporates
impairments such as (non-linear) inter-symbol interference, deletions, and
random response. These information bounds have a two-fold application: (1) The
decoding rate with a uniform input distribution can be used to calculate the
average size of the plausible list of DNA sequences given an observed current
trace. This bound can be used to benchmark existing base-calling algorithms
and can serve as a performance objective for designing better nanopores. (2) When
the nanopore sequencer is used as a reader in a DNA storage system, the storage
capacity is quantified by our bounds.
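
To make the three impairments named in the abstract concrete, the sketch below simulates a toy sequence-to-current channel. It is not the model from the paper: the k-mer length K, the randomly generated level table, the independent per-k-mer deletion probability p_del, and the Gaussian noise with standard deviation noise_sd are all illustrative assumptions, as are the function names make_level_table and simulate_trace.

```python
import random
from itertools import product

BASES = "ACGT"


def make_level_table(k, rng):
    """Hypothetical map from each k-mer to a mean current level (arbitrary units)."""
    return {"".join(kmer): rng.uniform(0.0, 1.0) for kmer in product(BASES, repeat=k)}


def simulate_trace(seq, table, k, p_del, noise_sd, rng):
    """Return one noisy current sample for every k-mer that is not deleted."""
    trace = []
    for i in range(len(seq) - k + 1):
        if rng.random() < p_del:
            continue  # deletion (assumed i.i.d.): this k-mer produces no sample
        # Inter-symbol interference: the level depends on k neighbouring bases,
        # through an arbitrary (hence non-linear) lookup table.
        mean = table[seq[i:i + k]]
        # Random response, assumed Gaussian here: noise around the mean level.
        trace.append(rng.gauss(mean, noise_sd))
    return trace


rng = random.Random(0)  # fixed seed so the sketch is reproducible
K = 4                   # assumed k-mer length, chosen only for illustration
table = make_level_table(K, rng)
seq = "".join(rng.choice(BASES) for _ in range(50))
trace = simulate_trace(seq, table, K, p_del=0.1, noise_sd=0.05, rng=rng)
print(len(seq) - K + 1, "k-mers in,", len(trace), "current samples out")
```

Because deleted k-mers leave no sample, the trace is generally shorter than the number of k-mers in the input, so a base-caller working on such a trace has to resolve both the noisy levels and the unknown alignment between samples and bases.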