Nanopore sequencers are emerging as promising new platforms for
high-throughput sequencing. As with other technologies, sequencer errors pose a
major challenge for their effective use. In this paper, we present a novel
information-theoretic analysis of the impact of insertion-deletion (indel)
errors in nanopore sequencers. In particular, we consider the following
problems: (i) for given indel error characteristics and rate, what is the
probability of accurate reconstruction as a function of sequence length; (ii)
what is the number of 'typical' sequences within the distortion bound induced
by indel errors; and (iii) using replicated extrusion (the process of passing a DNA
strand through the nanopore), what is the number of replicas needed to reduce
the distortion bound so that only one typical sequence remains within it?
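As a rough illustration of questions (i) and (iii), the following Python sketch uses a deliberately simplified toy model that is not the model analyzed in this paper. It assumes each base is corrupted independently with probability p, and that replicas are already aligned and combined by a per-position majority vote, which ignores the alignment problem that indels actually create. The function names and parameters are hypothetical. The replica count comes from a Hoeffding bound with a union bound over positions, so it is only a sufficient number under these assumptions.

```python
import math


def prob_no_error(n: int, p: float) -> float:
    """Probability that a length-n read is error free.

    Toy assumption: each base is corrupted independently with probability p.
    """
    return (1.0 - p) ** n


def max_length(p: float, target: float) -> int:
    """Largest n such that (1 - p)**n >= target, under the same toy assumption."""
    return math.floor(math.log(target) / math.log(1.0 - p))


def replicas_needed(n: int, p: float, delta: float) -> int:
    """Replicas sufficient for a per-position majority vote to be correct at
    all n positions with probability at least 1 - delta.

    Toy assumptions: replicas are perfectly aligned, errors are independent,
    and p < 0.5. Uses Hoeffding's inequality per position and a union bound
    over the n positions, giving r >= ln(n / delta) / (2 * (0.5 - p)**2).
    """
    gap = 0.5 - p
    return math.ceil(math.log(n / delta) / (2.0 * gap * gap))


if __name__ == "__main__":
    # Error-free probability decays exponentially with read length.
    for n in (10, 100, 1000):
        print(n, prob_no_error(n, 0.1))
    # The sufficient replica count grows only logarithmically with n.
    for n in (10**3, 10**6, 10**9):
        print(n, replicas_needed(n, 0.1, 0.01))
```

Under these assumptions the error-free probability falls off exponentially in the read length, while the sufficient number of replicas grows with the logarithm of the read length, which is the qualitative behavior the questions above concern.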
Our results provide a number of important insights: (i) the maximum length of
a sequence that can be accurately reconstructed in the presence of indel and
substitution errors is relatively small; (ii) the number of typical sequences
within the distortion bound is large; and (iii) replicated extrusion is an
effective technique for unique reconstruction. In particular, we show that the
number of replicas grows slowly (logarithmically) with sequence length,
implying that replicated extrusion makes it possible to sequence long reads using
nanopore sequencers. Our model considers indel and substitution errors
separately. In this sense, it can be viewed as providing (tight) bounds on
reconstruction lengths and repetitions for accurate reconstruction when the two
error modes are considered in a single model.
Comment: 12 pages, 5 figures