Cerulean: A hybrid assembly using high throughput short and long reads
Genome assembly from high-throughput short read data arguably remains an
unresolvable task in repetitive genomes: when the length of a repeat exceeds
the read length, the flanking regions cannot be connected unambiguously. The
emergence of third-generation sequencing (Pacific Biosciences) with long reads
offers the opportunity to resolve complicated repeats that short read data
cannot. However, these long reads have a high error rate, and assembling the
genome without additional high-quality short reads is an uphill task.
Recently, Koren et al. (2012) proposed using high-quality short read data to
correct these long reads and thus make assembly from long reads possible.
However, due to the large size of both datasets (short and long reads),
error correction of the long reads requires excessively high computational
resources, even on small bacterial genomes. In this work, instead of
error-correcting the long reads, we first assemble the short reads and then
map the long reads onto the assembly graph to resolve repeats.
Contribution: We present a hybrid assembly approach that is both
computationally efficient and produces high-quality assemblies. Our algorithm
first operates on a simplified version of the assembly graph consisting only
of long contigs and gradually improves the assembly by adding smaller contigs
in each iteration. In contrast to the state-of-the-art long read error
correction technique, which requires high computational resources and long
running times on a supercomputer even for bacterial genome datasets, our
software can produce a comparable assembly on a standard desktop in a short
running time.
Comment: Peer-reviewed and presented as part of the 13th Workshop on
Algorithms in Bioinformatics (WABI 2013).
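The repeat-resolution step lends itself to a compact illustration. Below is a minimal Python sketch, not Cerulean's actual implementation: assuming long reads have already been aligned to the contig graph as paths of contig IDs, reads that span a repeat contig vote for which predecessor/successor contigs belong together.

```python
# Hypothetical sketch of repeat resolution with long reads on an assembly
# graph; names (resolve_repeat, long_read_paths) are illustrative.
from collections import Counter

def resolve_repeat(repeat, long_read_paths):
    """Count (predecessor, successor) contig pairs that long reads
    observe around a repeat contig."""
    votes = Counter()
    for path in long_read_paths:  # each path is a list of contig IDs
        for i, contig in enumerate(path):
            if contig == repeat and 0 < i < len(path) - 1:
                votes[(path[i - 1], path[i + 1])] += 1
    return votes.most_common()  # best-supported flank pairings first

# Repeat "R" is entered from A or C and left toward B or D; two reads
# support the A-B pairing and one supports C-D.
paths = [["A", "R", "B"], ["C", "R", "D"], ["A", "R", "B"]]
print(resolve_repeat("R", paths))  # [(('A', 'B'), 2), (('C', 'D'), 1)]
```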
BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches
Generating the hash values of short subsequences, called seeds, enables
quickly identifying similarities between genomic sequences by matching seeds
with a single lookup of their hash values. However, these hash values can be
used only for finding exact-matching seeds, as conventional hashing methods
assign distinct hash values to different seeds, including highly similar
ones. Finding only exact-matching seeds causes either 1) increased use of
costly sequence alignment or 2) limited sensitivity.
We introduce BLEND, the first efficient and accurate mechanism that can
identify both exact-matching and highly similar seeds, called fuzzy seed
matches, with a single lookup of their hash values. BLEND 1) utilizes a
technique called SimHash, which can generate the same hash value for similar
sets, and 2) provides the proper mechanisms for using seeds as sets with the
SimHash technique to find fuzzy seed matches efficiently.
We show the benefits of BLEND when used in read overlapping and read mapping.
For read overlapping, BLEND is faster by 2.6x-63.5x (on average 19.5x), has a
lower memory footprint by 0.9x-9.7x (on average 3.6x), and finds
higher-quality overlaps, leading to more accurate de novo assemblies, than
the state-of-the-art tool minimap2. For read mapping, BLEND is faster by
0.7x-3.7x (on average 1.7x) than minimap2. Source code is available at
https://github.com/CMU-SAFARI/BLEND
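The SimHash idea at the heart of BLEND can be sketched in a few lines. The following is a simplified illustration, not BLEND's implementation (BLEND uses its own seed-to-set conversion and hash functions): each seed is treated as a set of overlapping k-mers, per-bit counts are accumulated over the k-mers' hashes, and the sign of each count yields the output bit, so seeds sharing most k-mers tend to receive identical hash values.

```python
# Simplified SimHash over a seed's k-mer set; parameters and the choice
# of blake2b as the per-item hash are illustrative assumptions.
import hashlib

def simhash(seed: str, k: int = 4, bits: int = 32) -> int:
    counts = [0] * bits
    for i in range(len(seed) - k + 1):
        h = int.from_bytes(
            hashlib.blake2b(seed[i:i + k].encode(), digest_size=8).digest(),
            "big")
        for b in range(bits):  # vote +1/-1 per bit position
            counts[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if counts[b] > 0)

# Two seeds differing by one base share most k-mers, so their hash
# values tend to agree on most (often all) bits, enabling fuzzy
# matching with a single hash-table lookup.
print(hex(simhash("ACGTACGTACGTACGT")), hex(simhash("ACGTACGTACGTAGGT")))
```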
De novo assembly of the cattle reference genome with single-molecule sequencing.
Background: Major advances in selection progress for cattle have been made following the introduction of genomic tools over the past 10-12 years. These tools depend upon the Bos taurus reference genome (UMD3.1.1), which was created using now-outdated technologies and is hindered by a variety of deficiencies and inaccuracies. Results: We present the new reference genome for cattle, ARS-UCD1.2, based on the same animal as the original to facilitate transfer and interpretation of results obtained from the earlier version, but applying a combination of modern technologies in a de novo assembly to increase continuity, accuracy, and completeness. The assembly includes 2.7 Gb and is >250× more continuous than the original assembly, with contig N50 >25 Mb and L50 of 32. We also greatly expanded supporting RNA-based data for annotation that identifies 30,396 total genes (21,039 protein coding). The new reference assembly is accessible in annotated form for public use. Conclusions: We demonstrate that the improved continuity of assembled sequence warrants the adoption of ARS-UCD1.2 as the new cattle reference genome and that increased assembly accuracy will benefit future research on this species.
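For readers unfamiliar with the contiguity statistics quoted above (contig N50 >25 Mb, L50 of 32), here is a small, self-contained Python sketch of how they are computed; the toy contig lengths are invented for illustration.

```python
# N50: length of the contig at which the running total (over contigs
# sorted longest-first) first reaches half the assembly size.
# L50: how many contigs it takes to get there.
def n50_l50(contig_lengths):
    total = sum(contig_lengths)
    running = 0
    for rank, length in enumerate(sorted(contig_lengths, reverse=True), 1):
        running += length
        if 2 * running >= total:
            return length, rank

# Toy assembly of 150 bp: half the total is reached at the second
# contig, so N50 = 50 and L50 = 2.
print(n50_l50([60, 50, 30, 10]))  # (50, 2)
```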
SVIM: Structural Variant Identification using Mapped Long Reads
Motivation: Structural variants are defined as genomic variants larger than 50 bp. They have been shown to affect more bases in any given genome than SNPs or small indels. Additionally, they have a great impact on human phenotype and diversity and have been linked to numerous diseases. Due to their size and association with repeats, they are difficult to detect by shotgun sequencing, especially when based on short reads. Long-read, single-molecule sequencing technologies like those offered by Pacific Biosciences or Oxford Nanopore Technologies produce reads several thousand base pairs in length. Despite the higher error rate and sequencing cost, long read sequencing offers many advantages for the detection of structural variants. Yet, available software tools still do not fully exploit these possibilities. Results: We present SVIM, a tool for the sensitive detection and precise characterization of structural variants from long read data. SVIM consists of three components for the collection, clustering, and combination of structural variant signatures from read alignments. It discriminates five different variant classes, including similar types such as tandem and interspersed duplications and novel element insertions. SVIM is unique in its capability of extracting both the genomic origin and destination of duplications. It compares favorably with existing tools in evaluations on simulated data and real datasets from PacBio and Nanopore sequencing machines. Availability and implementation: The source code and executables of SVIM are available on GitHub: github.com/eldariont/svim. SVIM is implemented in Python 3 and published on Bioconda and the Python Package Index. Supplementary information: Supplementary data are available at Bioinformatics online.
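As a rough illustration of the clustering component, the sketch below groups variant signatures of the same type that lie close on the genome; the signature layout, distance threshold, and greedy single-linkage strategy are assumptions for illustration, not SVIM's actual algorithm or parameters.

```python
# Group (type, position, length) signatures that likely describe the
# same structural variant event; purely illustrative.
def cluster_signatures(signatures, max_dist=1000):
    clusters = []
    for sig in sorted(signatures, key=lambda s: (s[0], s[1])):
        last = clusters[-1][-1] if clusters else None
        if last and last[0] == sig[0] and sig[1] - last[1] <= max_dist:
            clusters[-1].append(sig)  # same type, nearby: same cluster
        else:
            clusters.append([sig])
    return clusters

sigs = [("DEL", 10_500, 320), ("DEL", 10_540, 300), ("INS", 90_000, 75)]
print(cluster_signatures(sigs))
# [[('DEL', 10500, 320), ('DEL', 10540, 300)], [('INS', 90000, 75)]]
```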
Jabba: hybrid error correction for long sequencing reads using maximal exact matches
Third-generation sequencing platforms produce longer reads with higher error rates than second-generation sequencing technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads are an attractive way to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second-generation data and suffer from large runtimes. Recently, a new hybrid error correction method has been proposed, where the second-generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. In this context we present Jabba, a hybrid method to correct long third-generation reads by mapping them on a corrected de Bruijn graph that was constructed from second-generation data. Unique to our method is that this mapping is constructed with a seed-and-extend methodology, using maximal exact matches as seeds. In addition to benchmark results, we present theoretical results concerning the possibilities and limitations of using maximal exact matches in the context of third-generation reads.
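To make the seeding notion concrete, the brute-force sketch below finds maximal exact matches between two plain strings: exact matches that cannot be extended to the left or right. It only conveys the definition; Jabba finds MEMs against a de Bruijn graph using efficient index structures.

```python
# Brute-force MEM finder between a read and a reference string;
# O(|read| * |ref|), for illustration only.
def maximal_exact_matches(read, ref, min_len=4):
    mems = []
    for i in range(len(read)):
        for j in range(len(ref)):
            # skip mismatches and matches extendable to the left
            if read[i] != ref[j] or (i > 0 and j > 0 and read[i - 1] == ref[j - 1]):
                continue
            length = 0
            while (i + length < len(read) and j + length < len(ref)
                   and read[i + length] == ref[j + length]):
                length += 1  # extend to the right as far as possible
            if length >= min_len:
                mems.append((i, j, length))  # (read pos, ref pos, length)
    return mems

print(maximal_exact_matches("GATTACA", "CGATTAGACA"))  # [(0, 1, 5)] -> "GATTA"
```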
Read alignment using deep neural networks
Read alignment is the process of mapping short DNA sequences onto a reference genome. With the advent of rapidly evolving "next-generation" sequencing technologies came the need for sequence alignment tools. Many scientific communities and the companies marketing the sequencing technologies developed a whole spectrum of read aligners/mappers for different error profiles and read length characteristics. Among the most recently and successfully marketed sequencing technologies are Oxford Nanopore and PacBio SMRT sequencing, which are considered top players because of their extremely long reads and low cost. However, the reads may contain error rates of up to 20% that are generally not uniformly distributed. To deal with that level of error rate and read length, proximity-preserving hashing techniques, such as MinHash and minimizers, are utilized to quickly map a read to the target region of the reference sequence. A variant of global or local dynamic programming alignment is then used to produce the final alignment. In this research work, we train a Deep Neural Network (DNN) to yield a hashing scheme for the highly erroneous long reads, which is deemed superior to MinHash for mapping the reads. We implemented this idea to build a read alignment tool, DNNAligner. We evaluated the performance of our aligner against popular read aligners in the bioinformatics community: minimap2, bwa-mem, and graphmap. Our results show that the performance of DNNAligner is comparable to other tools without any code optimization or integration of other advanced features. Moreover, the DNN exhibits superior performance in comparison with MinHash on neighborhood classification.
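For contrast with the learned hashing scheme, here is a minimal sketch of the MinHash baseline mentioned above: the fraction of agreeing per-function minimum hash values between two reads' k-mer sets estimates their Jaccard similarity. The parameters and the salted-blake2b construction are illustrative assumptions.

```python
# MinHash similarity estimate between two sequences' k-mer sets.
import hashlib

def kmers(seq, k=5):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(seq, k=5, num_hashes=64):
    # one minimum per salted hash function
    return [min(int.from_bytes(
                hashlib.blake2b(km.encode(), digest_size=8,
                                salt=h.to_bytes(8, "big")).digest(), "big")
                for km in kmers(seq, k))
            for h in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

s1 = minhash_signature("ACGTACGTTGCAACGTTT")
s2 = minhash_signature("ACGTACGTTGCAACGATT")  # one substitution
print(estimated_jaccard(s1, s2))  # close to the true k-mer Jaccard
```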