Optimal Assembly for High Throughput Shotgun Sequencing
We present a framework for the design of optimal assembly algorithms for
shotgun sequencing under the criterion of complete reconstruction. We derive a
lower bound on the read length and the coverage depth required for
reconstruction in terms of the repeat statistics of the genome. Building on
earlier work, we design a de Bruijn graph-based assembly algorithm that can
achieve very close to the lower bound for repeat statistics of a wide range of
sequenced genomes, including the GAGE datasets. The results are based on a set
of necessary and sufficient conditions on the DNA sequence and the reads for
reconstruction. The conditions can be viewed as the shotgun sequencing analogue
of Ukkonen-Pevzner's necessary and sufficient conditions for Sequencing by
Hybridization. Comment: 26 pages, 18 figures
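The de Bruijn graph underlying such assembly algorithms can be sketched in a few lines. The following is an illustrative construction only, with toy reads and k-mer size chosen for the example, not taken from the paper: nodes are (k-1)-mers and each k-mer in a read contributes a directed edge.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds an edge."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix to its suffix
            graph[kmer[:-1]].append(kmer[1:])
    return graph

# Toy reads covering the sequence "ATGGCGTGCA", with k = 3
reads = ["ATGGC", "GGCGT", "CGTGC", "TGCA"]
g = de_bruijn_graph(reads, 3)
```

An assembly then corresponds to a walk through this graph that uses the observed k-mers; the repeat statistics of the genome determine whether that walk is unique.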
Do Read Errors Matter for Genome Assembly?
While most current high-throughput DNA sequencing technologies generate short
reads with low error rates, emerging sequencing technologies generate long
reads with high error rates. A basic question of interest is the tradeoff
between read length and error rate in terms of the information needed for the
perfect assembly of the genome. Using an adversarial erasure error model, we
make progress on this problem by establishing a critical read length, as a
function of the genome and the error rate, above which perfect assembly is
guaranteed. For several real genomes, including those from the GAGE dataset, we
verify that this critical read length is not significantly greater than the
read length required for perfect assembly from reads without errors. Comment: Submitted to ISIT 201
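The critical read length is a function of the genome's repeat statistics. As an illustration of one such statistic, here is a brute-force sketch (a hypothetical helper, not the paper's method) that computes the length of the longest exactly repeated substring of a genome:

```python
def longest_repeat_length(genome):
    """Length of the longest substring occurring at least twice.
    Brute force, O(n^3) space/time; fine for short toy sequences."""
    n = len(genome)
    for length in range(n - 1, 0, -1):  # try longest candidates first
        seen = set()
        for i in range(n - length + 1):
            sub = genome[i:i + length]
            if sub in seen:
                return length
            seen.add(sub)
    return 0

# "ATG" occurs twice, and no longer substring repeats
assert longest_repeat_length("ATGCATG") == 3
```

For real genomes one would use a suffix array or suffix tree instead; the brute-force version only serves to make the notion of a repeat statistic concrete.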
Partial DNA Assembly: A Rate-Distortion Perspective
Earlier formulations of the DNA assembly problem were all in the context of
perfect assembly; i.e., given a set of reads from a long genome sequence, is it
possible to perfectly reconstruct the original sequence? In practice, however,
it is very often the case that the read data is not sufficiently rich to permit
unambiguous reconstruction of the original sequence. While a natural
generalization of the perfect assembly formulation to these cases would be to
consider a rate-distortion framework, partial assemblies are usually
represented in terms of an assembly graph, making the definition of a
distortion measure challenging. In this work, we introduce a distortion
function for assembly graphs that can be understood as the logarithm of the
number of Eulerian cycles in the assembly graph, each of which corresponds to a
candidate assembly that could have generated the observed reads. We also
introduce an algorithm for the construction of an assembly graph and analyze
its performance on real genomes. Comment: To be published at ISIT 2016. 11 pages, 10 figures
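The proposed distortion, the logarithm of the number of Eulerian cycles, can be made concrete on a toy multigraph by brute-force enumeration. This is a sketch under stated assumptions, not the paper's construction: parallel edges are treated as distinguishable, and circuits are counted from a fixed start node.

```python
import math

def count_eulerian_circuits(edges, node, remaining=None):
    """Brute-force count of Eulerian circuits from `node` (tiny graphs only).
    `edges` is a list of (u, v) pairs; parallel edges are distinguishable."""
    if remaining is None:
        remaining = list(range(len(edges)))
    if not remaining:
        return 1  # every edge used; in a balanced graph we are back at the start
    total = 0
    for idx in remaining:
        u, v = edges[idx]
        if u == node:
            rest = [j for j in remaining if j != idx]
            total += count_eulerian_circuits(edges, v, rest)
    return total

# Toy multigraph: two parallel A->B edges and two parallel B->A edges
edges = [("A", "B"), ("A", "B"), ("B", "A"), ("B", "A")]
count = count_eulerian_circuits(edges, "A")
distortion = math.log2(count)  # 4 circuits -> 2.0 bits of assembly ambiguity
```

For non-trivial graphs the BEST theorem gives this count in closed form, which is what makes a log-count distortion computable in practice.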
Near-optimal Assembly for Shotgun Sequencing with Noisy Reads
Recent work identified the fundamental limits on the information requirements
in terms of read length and coverage depth required for successful de novo
genome reconstruction from shotgun sequencing data, based on the idealized
assumption of no errors in the reads (noiseless reads). In this work, we show
that even when there is noise in the reads, one can successfully reconstruct
with information requirements close to the noiseless fundamental limit. A new
assembly algorithm, X-phased Multibridging, is designed based on a
probabilistic model of the genome. It is shown through analysis to perform well
on the model, and through simulations to perform well on real genomes.
Safe and complete contig assembly via omnitigs
Contig assembly is the first stage that most assemblers solve when
reconstructing a genome from a set of reads. Its output consists of contigs --
a set of strings that are promised to appear in any genome that could have
generated the reads. Since the introduction of contigs 20 years ago, assemblers
have aimed to produce ever longer contigs, but the following question has
remained open: given a genome graph (e.g. a de Bruijn graph or a string graph),
what are all the strings that can be safely reported as contigs? In
this paper we finally answer this question, and also give a polynomial time
algorithm to find them. Our experiments show that these strings, which we call
omnitigs, are 66% to 82% longer on average than the popular unitigs, and 29% of
dbSNP locations have more neighbors in omnitigs than in unitigs. Comment: Full version of the paper in the proceedings of RECOMB 201
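Unitigs, the baseline that omnitigs lengthen, are the maximal non-branching paths of the genome graph. A minimal sketch of their extraction follows; the adjacency-list encoding and the toy graph are assumptions for illustration, and isolated cycles are ignored for simplicity:

```python
from collections import defaultdict

def unitigs(graph):
    """Extract unitigs (maximal non-branching paths) from a directed graph.
    `graph` maps each node to a list of its successors."""
    indeg = defaultdict(int)
    for u, vs in graph.items():
        for v in vs:
            indeg[v] += 1

    def branching(v):
        # A node breaks a unitig unless it has exactly one in- and one out-edge
        return indeg[v] != 1 or len(graph.get(v, [])) != 1

    paths = []
    for u in list(graph):
        if branching(u):
            for v in graph[u]:
                path = [u, v]
                while not branching(path[-1]):
                    path.append(graph[path[-1]][0])
                paths.append(path)
    return paths

# Toy graph: a linear stretch A->B->C that then branches to D and E
uni = unitigs({"A": ["B"], "B": ["C"], "C": ["D", "E"], "D": [], "E": []})
```

On this example the branch at C splits the graph into the unitigs A-B-C, C-D, and C-E; omnitigs generalize exactly this notion by also reporting longer strings that every Eulerian-consistent reconstruction must contain.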