Searching of gapped repeats and subrepetitions in a word
A gapped repeat is a factor of the form $uvu$, where $u$ and $v$ are nonempty
words. The period of the gapped repeat is defined as $|u|+|v|$. The gapped
repeat is maximal if it cannot be extended to the left or to the right by at
least one letter while preserving its period. The gapped repeat is called
$\alpha$-gapped if its period is not greater than $\alpha|u|$. A
$\delta$-subrepetition is a factor whose exponent is less than 2 but not
less than $1+\delta$ (the exponent of a factor is the quotient of its length
and its minimal period). A $\delta$-subrepetition is maximal if
it cannot be extended to the left or to the right by at least one letter while
preserving its minimal period. We reveal a close relation between maximal
gapped repeats and maximal subrepetitions. Moreover, we show that in a word of
length $n$ the number of maximal $\alpha$-gapped repeats is bounded by
$O(\alpha^2 n)$ and the number of maximal $\delta$-subrepetitions is bounded by
$O(n/\delta^2)$. Using these upper bounds, we propose algorithms for
finding all maximal $\alpha$-gapped repeats and all maximal
$\delta$-subrepetitions in a word of length $n$. The algorithm for finding all
maximal $\alpha$-gapped repeats has $O(\alpha^2 n)$ time complexity for the case
of constant alphabet size and $O(n\log n + \alpha^2 n)$ time complexity for the
general case. For finding all maximal $\delta$-subrepetitions we propose two
algorithms. The first algorithm has $O(\frac{n\log\log n}{\delta^2})$ time
complexity for the case of constant alphabet size and
$O(n\log n + \frac{n\log\log n}{\delta^2})$ time complexity for the general
case. The second algorithm has
$O(n\log n + \frac{n}{\delta^2}\log\frac{1}{\delta})$ expected time complexity.
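To make the definitions in the abstract above concrete, here is a minimal brute-force sketch in Python. It is not the algorithm proposed in the paper; it simply checks the definition position by position and takes cubic time in the worst case. It writes a gapped repeat as u v u with period |u| + |v|, and it assumes the usual convention that "alpha-gapped" means the period is at most alpha times the arm length |u|. The function name and the `(start, arm, period)` tuple layout are illustrative choices.

```python
def maximal_gapped_repeats(word, alpha):
    """Brute-force list of maximal alpha-gapped repeats u v u in `word`.

    Each result is (start, arm, period): the left copy of u is
    word[start:start+arm], the right copy begins at start+period,
    and period = |u| + |v|.
    Assumption: alpha-gapped means period <= alpha * |u|.
    """
    n = len(word)
    found = []
    for period in range(2, n):              # period 1 leaves no room for a gap
        for start in range(n - period):
            # Left-maximality: the letters just before both copies must differ.
            if start > 0 and word[start - 1] == word[start - 1 + period]:
                continue
            # Extend the arm to the right while letters match at distance `period`;
            # stopping at the first mismatch (or the word end) gives right-maximality.
            arm = 0
            while (start + period + arm < n
                   and word[start + arm] == word[start + period + arm]):
                arm += 1
            # arm == 0: no repeat here; arm >= period: the gap v would be empty.
            if arm == 0 or arm >= period:
                continue
            if period <= alpha * arm:
                found.append((start, arm, period))
    return found


print(maximal_gapped_repeats("abcab", 2))   # u = "ab", v = "c"
```

For the word "abcab" the only maximal gapped repeat is u = "ab", v = "c", which has period 3 and is therefore alpha-gapped for every alpha of at least 1.5.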
Optimal Assembly for High Throughput Shotgun Sequencing
We present a framework for the design of optimal assembly algorithms for
shotgun sequencing under the criterion of complete reconstruction. We derive a
lower bound on the read length and the coverage depth required for
reconstruction in terms of the repeat statistics of the genome. Building on
earlier works, we design a de Bruijn graph-based assembly algorithm that comes
very close to the lower bound for the repeat statistics of a wide range of
sequenced genomes, including the GAGE datasets. The results are based on a set
of necessary and sufficient conditions on the DNA sequence and the reads for
reconstruction. The conditions can be viewed as the shotgun sequencing analogue
of Ukkonen-Pevzner's necessary and sufficient conditions for Sequencing by
Hybridization.
Comment: 26 pages, 18 figures
Inverted and mirror repeats in model nucleotide sequences
We analytically and numerically study the probabilistic properties of
inverted and mirror repeats in model sequences of nucleic acids. We consider
both perfect and non-perfect repeats, i.e. repeats with mismatches and gaps.
The sequence models considered are independent and identically distributed
(i.i.d.) sequences, Markov processes and long-range sequences. We show that the
number of repeats in correlated sequences is significantly larger than in i.i.d.
sequences and that this discrepancy increases exponentially with the repeat
length for long-range sequences.
Comment: 12 pages, 6 figures
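As a small illustration of the objects studied in the abstract above, the following Python sketch counts perfect repeats with a fixed arm length and gap. The abstract does not spell out its definitions, so the sketch assumes the common conventions: in a mirror repeat the second arm is the reverse of the first, and in an inverted repeat the second arm is the reverse complement of the first. The function names and parameters are hypothetical, and mismatches and variable gaps are not handled.

```python
# Watson-Crick complement table for the DNA alphabet.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(segment):
    """Reverse complement of a DNA string over A, C, G, T."""
    return segment.translate(COMPLEMENT)[::-1]


def count_perfect_repeats(seq, arm, gap, kind):
    """Count positions where a perfect repeat with the given arm length and gap starts.

    kind = "mirror":   right arm equals the reversed left arm (assumed definition).
    kind = "inverted": right arm equals the reverse complement of the left arm
                       (assumed definition).
    """
    if kind not in ("mirror", "inverted"):
        raise ValueError("kind must be 'mirror' or 'inverted'")
    count = 0
    span = 2 * arm + gap                     # total length of left arm + gap + right arm
    for i in range(len(seq) - span + 1):
        left = seq[i:i + arm]
        right = seq[i + arm + gap:i + span]
        target = left[::-1] if kind == "mirror" else reverse_complement(left)
        if right == target:
            count += 1
    return count


print(count_perfect_repeats("ACGAACGT", arm=3, gap=2, kind="inverted"))
print(count_perfect_repeats("ACGTTGCA", arm=3, gap=2, kind="mirror"))
```

Under a uniform i.i.d. model over four letters, each starting position yields a perfect repeat of arm length L with probability 4^(-L). That gives a baseline against which counts from correlated sequences can be compared.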
Telescoper: de novo assembly of highly repetitive regions.
Motivation: With advances in sequencing technology, it has become faster and cheaper to obtain short-read data from which to assemble genomes. Although there has been considerable progress in the field of genome assembly, producing high-quality de novo assemblies from short reads remains challenging, primarily because of the complex repeat structures found in the genomes of most higher organisms. The telomeric regions of many genomes are particularly difficult to assemble, though much could be gained from the study of these regions, as their evolution has not been fully characterized and they have been linked to aging.
Results: In this article, we tackle the problem of assembling highly repetitive regions by developing a novel algorithm that iteratively extends long paths through a series of read-overlap graphs and evaluates them based on a statistical framework. Our algorithm, Telescoper, uses short- and long-insert libraries in an integrated way throughout the assembly process. Results on real and simulated data demonstrate that our approach can effectively resolve much of the complex repeat structures found in the telomeres of yeast genomes, especially when longer long-insert libraries are used.
Availability: Telescoper is publicly available for download at sourceforge.net/p/[email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.