While most current high-throughput DNA sequencing technologies generate short
reads with low error rates, emerging sequencing technologies generate long
reads with high error rates. A basic question of interest is the tradeoff
between read length and error rate in terms of the information needed for the
perfect assembly of the genome. Using an adversarial erasure error model, we
make progress on this problem by establishing a critical read length, as a
function of the genome and the error rate, above which perfect assembly is
guaranteed. For several real genomes, including those from the GAGE dataset, we
verify that this critical read length is not significantly greater than the
read length required for perfect assembly from reads without errors.

Comment: Submitted to ISIT 201