314 research outputs found

Genome Assembly: Novel Applications by Harnessing Emerging Sequencing Technologies and Graph Algorithms

Author: Koren Sergey
Publication date: 01/01/2012

Genome assembly is a critical first step for biological discovery. All current sequencing technologies share the fundamental limitation that segments read from a genome are much shorter than even the smallest genomes. Traditionally, whole-genome shotgun (WGS) sequencing over-samples a single clonal (or inbred) target chromosome with segments from random positions. The amount of over-sampling is known as the coverage. Assembly software then reconstructs the target. So-called next-generation (or second-generation) sequencing has reduced the cost and increased throughput exponentially over first-generation sequencing. Unfortunately, next-generation sequences present their own challenges to genome assembly: (1) they require amplification of source DNA prior to sequencing, leading to artifacts and biased coverage of the genome; (2) they produce relatively short reads: 100 bp to 700 bp; (3) the sizeable runtime of most second-generation instruments is prohibitive for applications requiring rapid analysis, with an Illumina HiSeq 2000 instrument requiring 11 days for the sequencing reaction. Recently, successors to the second-generation instruments (third-generation) have become available. These instruments promise to alleviate many of the downsides of second-generation sequencing and can generate multi-kilobase sequences. The long sequences have the potential to dramatically improve genome and transcriptome assembly. However, the high error rate of these reads is challenging and has limited their use. To address this limitation, we introduce a novel correction algorithm and assembly strategy that utilizes shorter, high-identity sequences to correct the error in single-molecule sequences. Our approach achieves over 99% read accuracy and produces substantially better assemblies than current sequencing strategies. The availability of cheaper sequencing has made new sequencing targets, such as multiple displacement amplified (MDA) single cells and metagenomes, popular.
Current algorithms assume assembly of a single clonal target, an assumption that is violated in these sequencing projects. We developed Bambus 2, a new scaffolder that works for metagenomic and single-cell datasets. It can accurately detect repeats without assumptions about the taxonomic composition of a dataset. It can also identify biological variations present in a sample. We have developed a novel end-to-end analysis pipeline leveraging Bambus 2. Due to its modular nature, it is applicable to clonal, metagenomic, and MDA single-cell targets and allows a user to rapidly go from sequences to assembly, annotation, genes, and taxonomic information. We have incorporated a novel viewer, allowing a user to interactively explore the variation present in a genomic project on a laptop. Together, these developments make genome assembly applicable to novel targets while utilizing emerging sequencing technologies. As genome assembly is critical for all aspects of bioinformatics, these developments will enable novel biological discovery.
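
The coverage mentioned in this abstract is commonly computed as the total number of sequenced bases divided by the genome size. The snippet below is a minimal sketch of that arithmetic only; the function name, read lengths, and genome size are hypothetical values chosen for illustration and do not come from the thesis.

```python
def coverage(read_lengths, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


# Hypothetical numbers for illustration only:
# 1,000,000 reads of 100 bp each over a 5 Mbp genome.
reads = [100] * 1_000_000
print(coverage(reads, 5_000_000))  # 20.0
```

Under these assumed numbers each position of the genome is sampled about 20 times on average; real coverage is uneven, especially after amplification, as the abstract notes.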

Digital Repository at the University of Maryland

Population-level transcriptome sequencing of nonmodel organisms Erynnis propertius and Papilio zelicaon

Author: Carmichael Rory D
Dzurisin Jason DK
Emrich Scott J
Hellmann Jessica J
Lobo Neil F
O'Neil Shawn T
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Several recent studies have demonstrated the use of Roche 454 sequencing technology for de novo transcriptome analysis. Low error rates and high coverage also allow for effective SNP discovery and genetic diversity estimates. However, genetically diverse datasets, such as those sourced from natural populations, pose challenges for assembly programs and subsequent analysis. Further, estimating the effectiveness of transcript discovery using Roche 454 transcriptome data is still a difficult task. Results: Using the Roche 454 FLX Titanium platform, we sequenced and assembled larval transcriptomes for two butterfly species: the Propertius duskywing, Erynnis propertius (Lepidoptera: Hesperiidae) and the Anise swallowtail, Papilio zelicaon (Lepidoptera: Papilionidae). The Expressed Sequence Tags (ESTs) generated represent a diverse sample drawn from multiple populations, developmental stages, and stress treatments. Despite this diversity, > 95% of the ESTs assembled into long (> 714 bp on average) and highly covered (> 9.6× on average) contigs. To estimate the effectiveness of transcript discovery, we compared the number of bases in the hit region of unigenes (contigs and singletons) to the length of the best match silkworm (Bombyx mori) protein; this "ortholog hit ratio" gives a close estimate of the amount of the transcript discovered relative to a model lepidopteran genome. For each species, we tested two assembly programs and two parameter sets; although CAP3 is commonly used for such data, the assemblies produced by Celera Assembler with modified parameters were chosen over those produced by CAP3 based on contig and singleton counts as well as ortholog hit ratio analysis. In the final assemblies, 1,413 E. propertius and 1,940 P. zelicaon unigenes had a ratio > 0.8; 2,866 E. propertius and 4,015 P. zelicaon unigenes had a ratio > 0.5.
Conclusions: Ultimately, these assemblies and SNP data will be used to generate microarrays for ecoinformatics examining climate change tolerance of different natural populations. These studies will benefit from high quality assemblies with few singletons (less than 26% of bases for each assembled transcriptome are present in unassembled singleton ESTs) and effective transcript discovery (over 6,500 of our putative orthologs cover at least 50% of the corresponding model silkworm gene).
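
The "ortholog hit ratio" described above is the length of a unigene's hit region divided by the length of its best-match protein. The sketch below illustrates that ratio and the 0.5 and 0.8 thresholds quoted in the abstract. It assumes both lengths are expressed in the same units (for example, amino acids from a translated alignment); the function name and the three example unigenes are hypothetical and are not data from the paper.

```python
def ortholog_hit_ratio(hit_region_length, ortholog_length):
    """Fraction of the best-match ortholog covered by the unigene's hit region.

    Assumes both lengths use the same units (e.g. amino acids).
    """
    if ortholog_length <= 0:
        raise ValueError("ortholog_length must be positive")
    return hit_region_length / ortholog_length


# Hypothetical (hit_region_length, ortholog_length) pairs, illustration only.
hits = {
    "unigene_1": (450, 500),
    "unigene_2": (120, 400),
    "unigene_3": (300, 500),
}

ratios = {name: ortholog_hit_ratio(h, p) for name, (h, p) in hits.items()}
above_half = sorted(name for name, r in ratios.items() if r > 0.5)
above_eight_tenths = sorted(name for name, r in ratios.items() if r > 0.8)

print(above_half)          # ['unigene_1', 'unigene_3']
print(above_eight_tenths)  # ['unigene_1']
```

A ratio near 1.0 suggests that most of the transcript was recovered; small ratios suggest a fragment.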

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Assembly algorithms for next-generation sequencing data

Author: Koren Sergey
Miller Jason R.
Sutton Granger
Publication venue: Elsevier Inc.
Publication date: 01/06/2010

The emergence of next-generation sequencing platforms led to a resurgence of research in whole-genome shotgun assembly algorithms and software. DNA sequencing data from the Roche 454, Illumina/Solexa, and ABI SOLiD platforms typically present shorter read lengths, higher coverage, and different error profiles compared with Sanger sequencing data. Since 2005, several assembly software packages have been created or revised specifically for de novo assembly of next-generation sequencing data. This review summarizes and compares the published descriptions of packages named SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. More generally, it compares the two standard methods known as the de Bruijn graph approach and the overlap/layout/consensus approach to assembly.
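
As a rough illustration of the de Bruijn graph approach named in this abstract, the sketch below builds a graph whose nodes are (k-1)-mers and whose edges are the k-mers observed in the reads. It is a toy under stated assumptions, not the method of any package listed above: it ignores reverse complements, sequencing errors, and k-mer multiplicity, and the function name and example reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k over the read; each k-mer is one edge.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two hypothetical overlapping reads, illustration only.
reads = ["ACGTC", "CGTCA"]
graph = de_bruijn_graph(reads, 3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

In this toy case the graph is a single path (AC, CG, GT, TC, CA), so walking it spells ACGTCA. The overlap/layout/consensus approach instead compares reads pairwise and builds a graph over whole reads.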

Elsevier - Publisher Connector

PubMed Central

Long walk to genomics : history and current approaches to genome sequencing and assembly

Author: A.M. Giani
G. Formenti
G.R. Gallo
L. Gianfranceschi
Publication venue: 'Elsevier BV'
Publication date: 01/01/2020

Genomes represent the starting point of genetic studies. Since the discovery of the structure of DNA, scientists have devoted great effort to determining genome sequences exactly. In this review we provide a comprehensive historical background of the improvements in DNA sequencing technologies that have accompanied the major milestones in genome sequencing and assembly, ranging from early sequencing methods to Next-Generation Sequencing platforms. We then focus on the advantages and challenges of the current technologies and approaches, collectively known as Third Generation Sequencing. As these technical advancements have been accompanied by progress in analytical methods, we also review the bioinformatic tools currently employed in de novo genome assembly, as well as some applications of Third Generation Sequencing technologies and high-quality reference genomes.

AIR Universita degli studi di Milano

The Diploid Genome Sequence of an Individual Human

Presented here is a genome sequence of an individual human. It was produced from ∼32 million random DNA fragments, sequenced by Sanger dideoxy technology and assembled into 4,528 scaffolds, comprising 2,810 million bases (Mb) of contiguous sequence with approximately 7.5-fold coverage for any given region. We developed a modified version of the Celera assembler to facilitate the identification and comparison of alternate alleles within this individual diploid genome. Comparison of this genome and the National Center for Biotechnology Information human reference assembly revealed more than 4.1 million DNA variants, encompassing 12.3 Mb. These variants (of which 1,288,319 were novel) included 3,213,401 single nucleotide polymorphisms (SNPs), 53,823 block substitutions (2–206 bp), 292,102 heterozygous insertion/deletion events (indels) (1–571 bp), 559,473 homozygous indels (1–82,711 bp), 90 inversions, as well as numerous segmental duplications and copy number variation regions. Non-SNP DNA variation accounts for 22% of all events identified in the donor; however, these events involve 74% of all variant bases. This suggests an important role for non-SNP genetic alterations in defining the diploid genome structure. Moreover, 44% of genes were heterozygous for one or more variants. Using a novel haplotype assembly strategy, we were able to span 1.5 Gb of genome sequence in segments >200 kb, providing further precision to the diploid nature of the genome. These data depict a definitive molecular portrait of a diploid human genome that provides a starting point for future genome comparisons and enables an era of individualized genomic information.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Diposit Digital de la Universitat de Barcelona

ScholarBank@NUS

Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly

Author: Depristo
Durbin
Gingeras
H. Li
Homer
Idury
Iqbal
Lam
Levy
Myers
Myers
Myers
Peltola
Pevzner
Staden
Zerbino
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2012

Motivation: Eugene Myers in his string graph paper (Myers, 2005) suggested that in a string graph or equivalently a unitig graph, any path spells a valid assembly. As a string/unitig graph also encodes every valid assembly of reads, such a graph, provided that it can be constructed correctly, is in fact a lossless representation of reads. In principle, every analysis based on whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion (INDEL) calling, can also be achieved with unitigs. Results: To explore the feasibility of using de novo assembly in the context of resequencing, we developed a de novo assembler, fermi, that assembles Illumina short reads into unitigs while preserving most of the information in the input reads. SNPs and INDELs can be called by mapping the unitigs against a reference genome. By applying the method to 35-fold human resequencing data, we showed that in comparison to the standard pipeline, our approach yields similar accuracy for SNP calling and better results for INDEL calling. It has higher sensitivity than other de novo assembly based methods for variant calling. Our work suggests that variant calling with de novo assembly can be a beneficial complement to the standard variant calling pipeline for whole-genome resequencing. On the methodological side, we proposed the FMD-index for forward-backward extension of DNA sequences, a fast algorithm for finding all super-maximal exact matches, and one-pass construction of unitigs from an FMD-index. Availability: http://github.com/lh3/fermi Contact: [email protected]. Rev2: submitted version with minor improvements; 7 pages.

arXiv.org e-Print Archive

CiteSeerX

Crossref

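Development of computational pipelines for assembly, analysis, and evaluation of genome sequences (Genomu hairetsu no asenburi, kaiseki oyobi hyoka no tame no keisan paipurain no kaihatsu)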

Author: Jayakumar Vasanthan
ジャヤクマルワサンタン
Publication venue: Graduate School of Science and Technology, Keio University (慶應義塾大学大学院理工学研究科)

KeiO Academic Resource Archive

Detecting and Correcting Errors in Genome Assemblies

Author: Subramanian Poorani
Publication date: 01/01/2010

Genome assemblies have various types of deficiencies or misassemblies. This work is aimed at detecting and correcting a type of misassembly that we call Compression/Expansion or CE misassemblies, whereby a section of sequence has been erroneously omitted or inserted in the assembly. Other types of deficiencies include gaps in the genome sequence. We developed a statistic for identifying Compression/Expansion misassemblies called the CE statistic. It is based on examining the placement of mate pairs of reads in the assembly. In addition to this, we developed an algorithm that is aimed at closing gaps and validating and/or correcting CE misassemblies detected by the CE statistic. This algorithm is similar to a shooting algorithm used in solving two-point boundary value problems in partial differential equations. We call this algorithm the Shooting Method. The Shooting Method finds all possible ways to assemble a local region of the genome contained between two target reads. We use a combination of the CE statistic and Shooting Method to detect and correct some CE misassemblies and close gaps in genome assemblies. We tested our techniques on both faux and real data. Applying this technique to 22 bacterial draft assemblies for which the finished genome sequence is known, we were able to identify 5 out of 8 real CE misassemblies. We applied the Shooting Method to a de novo assembly of the Bos taurus genome made from Sanger data. We were able to close 9,863 gaps out of 58,386. This added 8.34 Mbp of sequence to the assembly and resulted in a 7% increase of N50 contig size.

Digital Repository at the University of Maryland

Whole-Genome Assembly: An Experimental Study of Computational Costs and Architectural Opportunities

Author: Espinosa Elena
Larrosa-Jiménez Rafael
López-Fernández Iván
Plata-Gonzalez Oscar Guillermo
Publication date: 01/01/2022

Whole-genome sequencing (WGS) provides a huge number of reads from which a complete genome can be assembled. The recent advent of long-read sequencing technologies, such as PacBio and Oxford Nanopore, and the subsequent appearance of high-quality long reads (single-molecule high-fidelity, or HiFi) have improved the scaffolding of the genome. However, both the biology and computing communities still face great challenges in terms of computational cost. Thus, a high-precision characterization of the methods is essential for correctly identifying the main computing bottlenecks. This study will allow us to design new methods that mitigate computational costs without losing accuracy and to adapt such methods to fully exploit new architectures that support handling large amounts of data. In this paper, we experimentally study and characterize the most widely used whole-genome assemblers in order to design new approaches in this field. Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.

Repositorio Institucional Universidad de Málaga

Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome

Author: Deshpande P.
Ethe-Sayers S.
Goodwin S.
Gurtowski J.
McCombie W. R.
Schatz M. C.
Publication venue: 'Cold Spring Harbor Laboratory'
Publication date: 15/07/2015

Monitoring the progress of DNA molecules through a membrane pore has been postulated as a method for sequencing DNA for several decades. Recently, a nanopore-based sequencing instrument, the Oxford Nanopore MinION, has become available, and we used this for sequencing the Saccharomyces cerevisiae genome. To make use of these data, we developed a novel open-source hybrid error correction algorithm, Nanocorr, specifically for Oxford Nanopore reads, because existing packages were incapable of assembling the long read lengths (5-50 kbp) at such high error rates (between approximately 5% and 40% error). With this new method, we were able to perform a hybrid error correction of the nanopore reads using complementary MiSeq data and produce a de novo assembly that is highly contiguous and accurate: the contig N50 length is more than ten times greater than that of an Illumina-only assembly (678 kbp versus 59.9 kbp) and has >99.88% consensus identity when compared to the reference. Furthermore, the assembly with the long nanopore reads presents a much more complete representation of the features of the genome and correctly assembles gene cassettes, rRNAs, transposable elements, and other genomic features that were almost entirely absent in the Illumina-only assembly.
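
The contig N50 quoted in this abstract is a standard contiguity measure: the largest length L such that contigs of length L or more contain at least half of all assembled bases. The snippet below is a minimal sketch of that definition; the function name and contig lengths are hypothetical and are unrelated to the yeast assembly described here.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    # Accumulate from the longest contig down until half the bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must contain at least one contig")


# Hypothetical contig lengths in bp, illustration only.
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))  # 70
```

Here the total is 290 bp; the two longest contigs together hold 150 bp, which is at least half, so the N50 is 70.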

Cold Spring Harbor Laboratory Institutional Repository

PubMed Central