REPdenovo: Inferring De Novo Repeat Motifs from Short Sequence Reads.
Repeat elements are important components of eukaryotic genomes. One limitation in our understanding of repeat elements is that most analyses rely on reference genomes that are incomplete and often contain missing data in highly repetitive regions that are difficult to assemble. To overcome this problem, we develop a new method, REPdenovo, which assembles repeat sequences directly from raw shotgun sequencing data. REPdenovo can construct various types of repeats that are highly repetitive and have low sequence divergence within copies. We show that REPdenovo is substantially better than existing methods in terms of both the number and the completeness of the repeat sequences that it recovers. The key advantage of REPdenovo is that it can reconstruct long repeats from sequence reads. We apply the method to human data and discover a number of potentially new repeat sequences that have been missed by previous repeat annotations. Many of these sequences are incorporated into various parasite genomes, possibly because the filtering process for host DNA involved in the sequencing of the parasite genomes failed to exclude the host-derived repeat sequences. REPdenovo is a powerful new computational tool for annotating genomes and for addressing questions regarding the evolution of repeat families. The software tool, REPdenovo, is available for download at https://github.com/Reedwarbler/REPdenovo
Cerulean: A hybrid assembly using high throughput short and long reads
Genome assembly from high-throughput short-read data arguably
remains an unresolvable task in repetitive genomes: when the length of a
repeat exceeds the read length, it becomes difficult to unambiguously connect
the flanking regions. The emergence of third generation sequencing (Pacific
Biosciences) with long reads enables the opportunity to resolve complicated
repeats that could not be resolved by the short read data. However, these long
reads have a high error rate, and it is an uphill task to assemble the genome
without using additional high quality short reads. Recently, Koren et al. 2012
proposed an approach to use high quality short reads data to correct these long
reads and, thus, make the assembly from long reads possible. However, due to
the large size of both datasets (short and long reads), error correction of
these long reads requires excessively high computational resources, even on
small bacterial genomes. In this work, instead of error correction of long
reads, we first assemble the short reads and later map these long reads on the
assembly graph to resolve repeats.
Contribution: We present a hybrid assembly approach that is both
computationally effective and produces high quality assemblies. Our algorithm
first operates with a simplified version of the assembly graph consisting only
of long contigs and gradually improves the assembly by adding smaller contigs
in each iteration. In contrast to the state-of-the-art long reads error
correction technique, which requires high computational resources and long
running time on a supercomputer even for bacterial genome datasets, our
software can produce comparable assembly using only a standard desktop in a
short running time.
Comment: Peer-reviewed and presented as part of the 13th Workshop on
Algorithms in Bioinformatics (WABI2013)
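The idea of mapping long reads onto a short-read assembly graph to resolve repeats can be illustrated with a minimal sketch. This is not Cerulean's algorithm: the contig names, the `read_paths` input (each long read already reduced to the ordered list of contigs it aligns to), and the `repeat_support` helper are all hypothetical, chosen only to show how reads that span a repeat contig indicate which flanking contigs belong together.

```python
from collections import Counter

def repeat_support(repeat, read_paths):
    """Count (predecessor, successor) contig pairs that long reads span across `repeat`.

    Each element of read_paths is the ordered list of contigs one long read
    aligns to (a hypothetical, pre-computed mapping).
    """
    support = Counter()
    for path in read_paths:
        # Only interior positions have both a left and a right neighbour.
        for i in range(1, len(path) - 1):
            if path[i] == repeat:
                support[(path[i - 1], path[i + 1])] += 1
    return support

# Toy graph: contigs A and B both enter repeat R; C and D both leave it.
read_paths = [
    ["A", "R", "D"],
    ["A", "R", "D"],
    ["B", "R", "C"],
    ["R", "C"],  # does not span the repeat, so it adds no support
]
support = repeat_support("R", read_paths)
for (left, right), n in sorted(support.items()):
    print(f"{left} -> R -> {right}: supported by {n} long read(s)")
```

In this toy case the spanning reads pair A with D and B with C, an ambiguity that reads shorter than R could not settle on their own.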
Shotgun metagenome data of a defined mock community using Oxford Nanopore, PacBio and Illumina technologies.
Metagenomic sequence data from defined mock communities is crucial for the assessment of sequencing platform performance and downstream analyses, including assembly, binning and taxonomic assignment. We report a comparison of shotgun metagenome sequencing and assembly metrics of a defined microbial mock community using the Oxford Nanopore Technologies (ONT) MinION, PacBio and Illumina sequencing platforms. Our synthetic microbial community BMock12 consists of 12 bacterial strains with genome sizes spanning 3.2-7.2 Mbp, 40-73% GC content, and 1.5-7.3% repeats. Size selection of both PacBio and ONT sequencing libraries prior to sequencing was essential to yield comparable relative abundances of organisms among all sequencing technologies. While the Illumina-based metagenome assembly yielded good coverage with few misassemblies, contiguity was greatly improved by both the Illumina + ONT and the Illumina + PacBio hybrid assemblies, although these also increased misassemblies, most notably in genomes with high sequence similarity to each other. Our resulting datasets allow evaluation and benchmarking of bioinformatics software on Illumina, PacBio and ONT platforms in parallel.
Telescoper: de novo assembly of highly repetitive regions.
Motivation: With advances in sequencing technology, it has become faster and cheaper to obtain short-read data from which to assemble genomes. Although there has been considerable progress in the field of genome assembly, producing high-quality de novo assemblies from short reads remains challenging, primarily because of the complex repeat structures found in the genomes of most higher organisms. The telomeric regions of many genomes are particularly difficult to assemble, though much could be gained from the study of these regions, as their evolution has not been fully characterized and they have been linked to aging.
Results: In this article, we tackle the problem of assembling highly repetitive regions by developing a novel algorithm that iteratively extends long paths through a series of read-overlap graphs and evaluates them based on a statistical framework. Our algorithm, Telescoper, uses short- and long-insert libraries in an integrated way throughout the assembly process. Results on real and simulated data demonstrate that our approach can effectively resolve much of the complex repeat structures found in the telomeres of yeast genomes, especially when longer long-insert libraries are used.
Availability: Telescoper is publicly available for download at sourceforge.net/p/[email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
Diminishing Return for Increased Mappability with Longer Sequencing Reads: Implications of the k-mer Distributions in the Human Genome
The amount of non-unique sequence (non-singletons) in a genome directly
affects the difficulty of read alignment to a reference assembly for
high-throughput sequencing data. Although a greater read length increases the chance
of reads being uniquely mapped to the reference genome, a quantitative analysis of
the influence of read lengths on mappability has been lacking. To address this
question, we evaluate the k-mer distribution of the human reference genome. The
k-mer frequency is determined for k ranging from 20 to 1000 basepairs. We use
the proportion of non-singleton k-mers to evaluate the mappability of reads for
a corresponding read length. We observe that the proportion of non-singletons
decreases slowly with increasing k, and can be fitted by piecewise power-law
functions with different exponents at different k ranges. A faster decay at
smaller values for k indicates more limited gains for read lengths > 200
basepairs. The frequency distributions of k-mers exhibit long tails in a
power-law-like trend, and rank frequency plots exhibit a concave Zipf's curve.
The location of the most frequent 1000-mers comprises 172 kilobase-ranged
regions, including four large stretches on chromosomes 1 and X, containing
genes with biomedical implications. Even a read length of 1000 would be
insufficient to reliably sequence these specific regions.
Comment: 5 figures
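The mappability measure described in this abstract, the proportion of non-singleton k-mers at a given k, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' pipeline: the function name `non_singleton_fraction` is hypothetical, only the forward strand is counted, and the proportion is taken over k-mer positions rather than over distinct k-mers (the abstract does not say which denominator is used).

```python
from collections import Counter

def non_singleton_fraction(seq: str, k: int) -> float:
    """Fraction of k-mer positions in seq whose k-mer occurs more than once.

    Assumptions: forward strand only; denominator is the number of k-mer
    positions, len(seq) - k + 1.
    """
    n = len(seq) - k + 1
    if k <= 0 or n <= 0:
        raise ValueError("k must be between 1 and len(seq)")
    counts = Counter(seq[i:i + k] for i in range(n))
    # Every occurrence of a k-mer seen at least twice is a non-unique position.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / n

toy = "ACGTACGTTT"  # made-up sequence, not real genomic data
for k in (2, 4, 6):
    print(k, round(non_singleton_fraction(toy, k), 3))
```

On a real genome the same count would need a memory-efficient k-mer counter; a plain `Counter` is only practical for short sequences.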
Genome assembly in the telomere-to-telomere era
De novo assembly is the process of reconstructing the genome sequence of an
organism from sequencing reads. Genome sequences are essential to biology, and
assembly has been a central problem in bioinformatics for four decades. Until
recently, genomes were typically assembled into fragments of a few megabases at
best but technological advances in long-read sequencing now enable near
complete chromosome-level assembly, also known as telomere-to-telomere
assembly, for many organisms. Here we review recent progress on assembly
algorithms and protocols. We focus on how to derive near telomere-to-telomere
assemblies and discuss potential future developments.