Do Read Errors Matter for Genome Assembly?
While most current high-throughput DNA sequencing technologies generate short
reads with low error rates, emerging sequencing technologies generate long
reads with high error rates. A basic question of interest is the tradeoff
between read length and error rate in terms of the information needed for the
perfect assembly of the genome. Using an adversarial erasure error model, we
make progress on this problem by establishing a critical read length, as a
function of the genome and the error rate, above which perfect assembly is
guaranteed. For several real genomes, including those from the GAGE dataset, we
verify that this critical read length is not significantly greater than the
read length required for perfect assembly from reads without errors.
Comment: Submitted to ISIT 201
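The error-free baseline the abstract compares against is governed by the genome's repeat statistics. The exact combinatorial condition involves interleaved repeats; the longest exact repeat is a simpler proxy, sketched below on a made-up toy genome (a quadratic scan, fine only for small inputs):

```python
def longest_repeat_len(s):
    """Length of the longest substring that occurs at least twice in s.

    Checks lengths from longest to shortest; returns 0 if all
    substrings are unique (a repeat-free sequence)."""
    n = len(s)
    for ell in range(n - 1, 0, -1):
        seen = set()
        for i in range(n - ell + 1):
            sub = s[i:i + ell]
            if sub in seen:
                return ell
            seen.add(sub)
    return 0

# Toy genome with "ATGG" repeated at two positions.
print(longest_repeat_len("ATGGCGTGCAATGGTT"))  # → 4
```

Reads shorter than (roughly) this repeat length cannot disambiguate the repeat copies, which is why the critical read length is stated as a function of the genome.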
Coverage statistics for sequence census methods
Background: We study the statistical properties of fragment coverage in
genome sequencing experiments. In an extension of the classic Lander-Waterman
model, we consider the effect of the length distribution of fragments. We also
introduce the notion of the shape of a coverage function, which can be used to
detect aberrations in coverage. The probability theory underlying these
problems is essential for constructing models of current high-throughput
sequencing experiments, where both sample preparation protocols and sequencing
technology particulars can affect fragment length distributions.
Results: We show that regardless of fragment length distribution and under
the mild assumption that fragment start sites are Poisson distributed, the
fragments produced in a sequencing experiment can be viewed as resulting from a
two-dimensional spatial Poisson process. We then study the jump skeleton of
the coverage function, and show that the induced trees are Galton-Watson trees
whose parameters can be computed.
Conclusions: Our results extend standard analyses of shotgun sequencing that
focus on coverage statistics at individual sites, and provide a null model for
detecting deviations from random coverage in high-throughput sequence census
based experiments. By focusing on fragments, we are also led to a new approach
for visualizing sequencing data that should be of independent interest.
Comment: 10 pages, 4 figures
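The site-level Poisson picture this work extends can be checked numerically: in the classic Lander-Waterman model, N fragments of length L on a genome of length G give mean depth c = NL/G, and the expected uncovered fraction is about exp(-c). A small simulation sketch (parameter values are illustrative, not from the paper):

```python
import math
import random

def simulate_coverage(genome_len, n_fragments, frag_len, seed=0):
    """Drop fragments with uniform random start sites (approximating a
    Poisson process) and return the per-base coverage function."""
    rng = random.Random(seed)
    coverage = [0] * genome_len
    for _ in range(n_fragments):
        start = rng.randrange(genome_len)
        for i in range(start, min(start + frag_len, genome_len)):
            coverage[i] += 1
    return coverage

G, N, L = 100_000, 5_000, 100
cov = simulate_coverage(G, N, L)
c = N * L / G                  # mean coverage depth c = NL/G = 5
gap_frac = cov.count(0) / G    # empirical fraction of uncovered bases
# Under the Poisson model the uncovered fraction should be near exp(-c).
print(c, gap_frac, math.exp(-c))
```

The empirical gap fraction lands close to exp(-5) ≈ 0.0067, which is the kind of null prediction the abstract proposes for detecting coverage deviations.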
A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data
Deep shotgun sequencing and analysis of genomes, transcriptomes, amplified
single-cell genomes, and metagenomes has enabled investigation of a wide range
of organisms and ecosystems. However, sampling variation in short-read data
sets and high sequencing error rates of modern sequencers present many new
computational challenges in data interpretation. These challenges have led to
the development of new classes of mapping tools and {\em de novo} assemblers.
These algorithms are challenged by the continued improvement in sequencing
throughput. We here describe digital normalization, a single-pass computational
algorithm that systematizes coverage in shotgun sequencing data sets, thereby
decreasing sampling variation, discarding redundant data, and removing the
majority of errors. Digital normalization substantially reduces the size of
shotgun data sets and decreases the memory and time requirements for {\em de
novo} sequence assembly, all without significantly impacting content of the
generated contigs. We apply digital normalization to the assembly of microbial
genomic data, amplified single-cell genomic data, and transcriptomic data. Our
implementation is freely available for use and modification.
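The single-pass rule can be sketched as follows: keep a read only if the median abundance of its k-mers, among reads kept so far, is below a coverage cutoff. This is a toy version; the published implementation counts k-mers with a probabilistic data structure rather than an exact dictionary, and the k and cutoff values here are illustrative:

```python
from collections import defaultdict

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def median(values):
    xs = sorted(values)
    return xs[len(xs) // 2]

def digital_normalize(reads, k=20, cutoff=20):
    """Single-pass digital normalization sketch: discard reads whose
    k-mers have already been seen at high (median) abundance."""
    counts = defaultdict(int)  # exact counts for clarity; the real tool
                               # uses a compact probabilistic counter
    kept = []
    for read in reads:
        km = kmers(read, k)
        if not km:
            continue
        if median(counts[x] for x in km) < cutoff:
            kept.append(read)
            for x in km:
                counts[x] += 1
    return kept

reads = ["ACGTACGTACGTACGTACGTACGT"] * 50  # 50 identical reads
kept = digital_normalize(reads, k=20, cutoff=20)
print(len(reads), "->", len(kept))
```

On redundant input the pass keeps only enough copies to reach the cutoff and discards the rest, which is how it reduces data volume without discarding novel sequence.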
Optimal Assembly for High Throughput Shotgun Sequencing
We present a framework for the design of optimal assembly algorithms for
shotgun sequencing under the criterion of complete reconstruction. We derive a
lower bound on the read length and the coverage depth required for
reconstruction in terms of the repeat statistics of the genome. Building on
earlier works, we design a de Bruijn graph based assembly algorithm which can
achieve very close to the lower bound for repeat statistics of a wide range of
sequenced genomes, including the GAGE datasets. The results are based on a set
of necessary and sufficient conditions on the DNA sequence and the reads for
reconstruction. The conditions can be viewed as the shotgun sequencing analogue
of Ukkonen-Pevzner's necessary and sufficient conditions for Sequencing by
Hybridization.
Comment: 26 pages, 18 figures
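A toy illustration of the de Bruijn construction (using a made-up repeat-free genome with error-free reads, not the GAGE data): nodes are (k-1)-mers, edges are k-mers, and when k-1 exceeds the longest repeat every node has a unique successor, so a greedy walk recovers the genome:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk(graph, start):
    """Follow edges from `start` while each node has one distinct
    successor; stops at a branch (an unresolved repeat)."""
    contig, node = start, start
    while graph[node]:
        succs = set(graph[node])
        if len(succs) > 1:
            break  # ambiguous: repeat longer than k-1
        node = succs.pop()
        contig += node[-1]
    return contig

genome = "ATGGCGTACAATCGGTT"   # no repeated 4-mers
k = 5
reads = [genome[i:i + 8] for i in range(len(genome) - 7)]  # error-free 8-mers
g = de_bruijn(reads, k)
print(walk(g, genome[:k - 1]))  # → ATGGCGTACAATCGGTT
```

When the genome does contain repeats longer than k-1, the walk stops at the branch, which is the failure mode the paper's necessary and sufficient conditions characterize.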