    Error Correction for DNA Sequencing via Disk Based Index and Box Queries

    The vast increase in DNA sequencing capacity over the last decade has quickly turned biology into a data-intensive science. Nevertheless, current sequencers such as the Illumina HiSeq have high random per-base error rates, which makes sequencing error correction an indispensable requirement for many sequence analysis applications. Most existing methods for error correction demand large, expensive memory space, which limits their scalability for handling large datasets. In this thesis, we introduce a new disk-based method, called DiskBQcor, for sequencing error correction. DiskBQcor stores k-mers from sequenced genome data, along with their associated metadata, in a disk-based index tree called the BoND-tree, and uses the index to efficiently process specially designed box queries to obtain relevant k-mers and their occurrence frequencies. It takes an input read and locates the potential errors in the sequence. It then applies a comprehensive voting mechanism and, where applicable, an efficient binary-encoding-based assembly technique to verify and correct an erroneous base in a genome sequence under various conditions. To overcome the drawback of an offline approach such as DiskBQcor, which wastes computing resources while DNA sequencing is in progress, we suggest an online approach to correcting sequencing errors. The online processing strategies and accuracy measures are discussed. An algorithm for deleting indexed k-mers from the BoND-tree, a stepping stone toward online sequencing error correction, is also introduced. Our experiments demonstrate that the proposed methods are quite promising for correcting errors in genome sequencing data on disk. The resulting BoND-tree with correct k-mers can also be used for sequence analysis applications such as variant detection.
    Master of Science, Computer and Information Science, College of Engineering and Computer Science, University of Michigan-Dearborn
    https://deepblue.lib.umich.edu/bitstream/2027.42/136615/3/Thesis_YarongGu_519_corrected.pd
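
    The voting mechanism is only summarized in the abstract; the following is a minimal sketch of how k-mer-frequency voting on a suspect base commonly works, assuming an in-memory dictionary of k-mer counts in place of the BoND-tree and its box queries. The constants K and MIN_COUNT and the function names are illustrative, not taken from the thesis.

        # Minimal sketch: k-spectrum-style voting to correct one suspect base.
        # kmer_counts maps each k-mer string to its observed frequency; in the
        # thesis this lookup would be answered by box queries on the BoND-tree.

        K = 21          # k-mer length (illustrative)
        MIN_COUNT = 3   # a k-mer is "solid" if it occurs at least this often

        def covering_kmers(read, pos, k=K):
            """Yield every k-mer of the read that covers position pos."""
            for start in range(max(0, pos - k + 1), min(pos, len(read) - k) + 1):
                yield read[start:start + k]

        def vote_base(read, pos, kmer_counts, k=K):
            """Return the base at pos with the most support from solid k-mers."""
            votes = {}
            for base in "ACGT":
                candidate = read[:pos] + base + read[pos + 1:]
                votes[base] = sum(
                    kmer_counts.get(kmer, 0)
                    for kmer in covering_kmers(candidate, pos, k)
                    if kmer_counts.get(kmer, 0) >= MIN_COUNT
                )
            return max(votes, key=votes.get)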

    Computing Platforms for Big Biological Data Analytics: Perspectives and Challenges.

    The last decade has witnessed an explosion in the amount of available biological sequence data, due to the rapid progress of high-throughput sequencing projects. However, the amount of biological data is growing so quickly that traditional data analysis platforms and methods can no longer meet the need to rapidly perform data analysis tasks in the life sciences. As a result, both biologists and computer scientists face the challenge of gaining profound insight into biological function from big biological data, which in turn requires massive computational resources. Therefore, high performance computing (HPC) platforms are urgently needed, as are efficient and scalable algorithms that can take advantage of them. In this paper, we survey the state-of-the-art HPC platforms for big biological data analytics. We first list the characteristics of big biological data and of popular computing platforms. Then we provide a taxonomy of different biological data analysis applications and a survey of how they have been mapped onto various computing platforms. After that, we present a case study comparing the efficiency of different computing platforms for handling the classical biological sequence alignment problem. Finally, we discuss the open issues in big biological data analytics.
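
    The platform-comparison case study itself is not reproduced in the abstract; as a minimal illustration of the simplest mapping it covers, the sketch below distributes an embarrassingly parallel batch of pairwise alignments across CPU cores using Python's standard multiprocessing module and a toy edit-distance scorer. The sequences and parameters are illustrative, not the paper's benchmark.

        # Illustrative sketch: fanning pairwise sequence comparisons out
        # across CPU cores, the simplest of the platform mappings surveyed.
        from multiprocessing import Pool

        def edit_distance(pair):
            """Classic dynamic-programming edit distance between two strings."""
            a, b = pair
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                curr = [i]
                for j, cb in enumerate(b, 1):
                    curr.append(min(prev[j] + 1,                # deletion
                                    curr[j - 1] + 1,            # insertion
                                    prev[j - 1] + (ca != cb)))  # substitution
                prev = curr
            return prev[-1]

        if __name__ == "__main__":
            reference = "ACGTACGA"
            reads = ["ACGTACGT", "ACGTTCGT", "TTGCACGA"]
            with Pool() as pool:   # one worker per available core by default
                scores = pool.map(edit_distance, [(r, reference) for r in reads])
            print(scores)          # one distance per read; lower = closer match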

    High Performance Computing for DNA Sequence Alignment and Assembly

    Recent advances in DNA sequencing technology have dramatically increased the scale and scope of DNA sequencing. These data enable a wide variety of important biological analyses, including genome sequencing, comparative genomics, transcriptome analysis, and personalized medicine, but these analyses are complicated by the volume and complexity of the data involved. Given the massive size of these datasets, computational biology must draw on the advances in high performance computing. Two fundamental computations in computational biology are read alignment and genome assembly. Read alignment maps short DNA sequences to a reference genome to discover conserved and polymorphic regions of the genome. Genome assembly computes the sequence of a genome from many short DNA sequences. Both computations benefit from recent advances in high performance computing to efficiently process the huge datasets involved, including using highly parallel graphics processing units (GPUs) as high performance desktop processors, and using the MapReduce framework coupled with cloud computing to parallelize computation across large compute grids. This dissertation demonstrates how these technologies can be used to accelerate these computations by orders of magnitude, and how they have the potential to make otherwise infeasible computations practical.
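
    As a concrete illustration of the read alignment computation described above, a minimal seed-and-extend sketch follows: index the reference by k-mer, look up seeds from each read, and verify candidate positions by counting mismatches. This is a toy sequential version under assumed parameters; the work summarized here parallelizes this kind of computation on GPUs and MapReduce clusters.

        # Minimal seed-and-extend read alignment (toy sequential version).
        from collections import defaultdict

        K = 11  # seed length (illustrative)

        def index_reference(ref, k=K):
            """Map every k-mer of the reference to its start positions."""
            index = defaultdict(list)
            for i in range(len(ref) - k + 1):
                index[ref[i:i + k]].append(i)
            return index

        def align_read(read, ref, index, k=K, max_mismatches=2):
            """Return (position, mismatches) pairs where the read aligns."""
            hits = set()
            for offset in range(len(read) - k + 1):
                for pos in index.get(read[offset:offset + k], []):
                    start = pos - offset          # implied alignment start
                    if start < 0 or start + len(read) > len(ref):
                        continue
                    mism = sum(a != b for a, b in
                               zip(read, ref[start:start + len(read)]))
                    if mism <= max_mismatches:
                        hits.add((start, mism))
            return sorted(hits)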

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long DNA sequence reads from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, and understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and compute capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm, and little attention has been paid to non-standard mapping approaches. Here, we propose so-called dynamic mapping, which we show significantly improves the resulting alignments compared to traditional mapping approaches. Dynamic mapping is based on exploiting the information from previously computed alignments to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program that collects alignment statistics and guides updates of the reference in an online fashion. We provide Ococo, the first online consensus caller, which maintains statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk. Metagenomic classification of NGS reads is another major topic studied in the thesis. Given a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge number of NGS reads to tree nodes, and possibly to estimate the relative abundance of the species involved. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced-seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, which yields a much smaller and more informative index than Kraken's. We also provide a modified version of BWA that improves the BWT-index for quick k-mer look-up.
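
    As a brief illustration of the spaced seeds mentioned above: instead of hashing contiguous k-mers, the classifier keeps only the positions marked '1' in a binary mask, so mismatches falling on '0' positions do not break the match. The mask below is an arbitrary example, not the one tuned for Seed-Kraken.

        # Minimal sketch of spaced-seed extraction from a read.
        # A '1' in the mask means the base is kept; '0' positions are ignored,
        # so mismatches falling on '0' positions do not change the seed.

        MASK = "1101101101101"  # illustrative mask; real tools tune this carefully
        KEEP = [i for i, c in enumerate(MASK) if c == "1"]

        def spaced_seeds(read, mask=MASK, keep=KEEP):
            """Yield the spaced seed extracted at every window of the read."""
            w = len(mask)
            for start in range(len(read) - w + 1):
                window = read[start:start + w]
                yield "".join(window[i] for i in keep)

        # Two reads differing only at a '0' position yield identical seeds:
        assert list(spaced_seeds("ACGTACGTACGTA")) == list(spaced_seeds("ACATACGTACGTA"))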

    Objective review of de novo stand-alone error correction methods for NGS data

    The sequencing market has grown steadily over the last few years, with different approaches to reading DNA information prone to different types of errors. Multiple studies have demonstrated the impact of sequencing errors on different applications of next-generation sequencing (NGS), making error correction a fundamental initial step. Different methods in the literature use different approaches and fit different types of problems. We analyzed 50 methods divided into five main approaches (k-spectrum, suffix arrays, multiple-sequence alignment, read clustering, and probabilistic models). They are not published as part of a suite (stand-alone), and they target raw, unprocessed data without an existing reference genome (de novo). These correctors handle one or more sequencing technologies using the same or different approaches. They face general challenges (sometimes with traits specific to particular technologies) such as repetitive regions, uncalled bases, and ploidy. Even assessing their performance is a challenge in itself, because of the differing approaches taken by various authors, the unknown factor (de novo), and the behavior of the third-party tools employed in the benchmarks. This study aims to help the researcher in the field to advance the field of error correction, the educator to have a brief but comprehensive companion, and the bioinformatician to choose the right tool for the right job. © 2016 John Wiley & Sons, Ltd.
    We want to thank our colleague Eloy Romero Alcale, who provided valuable advice regarding the structure of the document. This work was supported by Generalitat Valenciana [GRISOLIA/2013/013 to A.A.].
    Alic, A. S.; Ruzafa, D.; Dopazo, J.; Blanquer Espert, I. (2016). Objective review of de novo stand-alone error correction methods for NGS data. Wiley Interdisciplinary Reviews: Computational Molecular Science, 6(2):111-146. https://doi.org/10.1002/wcms.1239
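
    Because the abstract stresses that assessing corrector performance is itself a challenge, a small worked example of one common figure of merit may help: the gain, (TP - FP) / (TP + FN), rewards fixed errors (TP) and penalizes newly introduced errors (FP) and errors left uncorrected (FN). The sketch assumes simulated reads whose true sequences are known and equal-length strings; it illustrates the metric family discussed in such reviews, not code from this one.

        # Sketch: per-read "gain" of a corrector, assuming the true sequence
        # is known (e.g., simulated reads) and all three strings align 1:1.
        def correction_gain(raw, corrected, truth):
            tp = fp = fn = 0
            for r, c, t in zip(raw, corrected, truth):
                if r != t and c == t:
                    tp += 1   # an error that the corrector fixed
                elif r == t and c != t:
                    fp += 1   # a new error introduced by the corrector
                elif r != t and c != t:
                    fn += 1   # an error that remains wrong
            return (tp - fp) / (tp + fn) if tp + fn else 1.0

        # Example: one error fixed, one introduced -> gain = 0.0
        print(correction_gain("ACAAA", "AAAAC", "AAAAA"))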

    Efficient Algorithms for Prokaryotic Whole Genome Assembly and Finishing

    De novo genome assembly from DNA fragments is primarily based on sequence overlap information. In addition, mate-pair or paired-end reads provide linking information for joining gaps and bridging repeat regions. Genome assemblers in general assemble long contiguous sequences (contigs) using both overlapping reads and linked reads until the assembly runs into an ambiguous repeat region. These contigs are further bridged into scaffolds using linked-read information. However, errors can be made in both phases of assembly, due to a high error threshold for overlap acceptance and to linking based on too few mate reads. Identical as well as similar repeat regions can often cause errors in overlap and mate-pair evidence. In addition, the problem of setting the correct threshold to minimize errors and optimize the assembly of reads is not trivial, and it often requires a time-consuming trial-and-error process to obtain optimal results. This trial-and-error process with multiple assemblers is computationally intensive and very inefficient, especially when users must learn how to use a wide variety of assemblers, many of which are serial, require long execution times, and may not return usable or accurate results. Further, we show that comparing assembly results may not provide users with a clear winner under all circumstances. Therefore, we propose a novel scaffolding tool, Correlative Algorithm for Repeat Placement (CARP), capable of joining short, low-error contigs using mate-pair reads, computationally resolved repeat structures, and synteny with one or more reference organisms. The CARP tool requires a set of repeat sequences, such as insertion sequences (IS), that can be found computationally without assembling the genome. The development of methods to identify such repeat regions directly from raw sequence reads or draft genomes led to the ISQuest software package. ISQuest identifies bacterial ISs and their sequence elements (inverted and direct repeats) in raw read data or contigs using flexible search parameters. ISQuest is capable of finding ISs in hundreds of partially assembled genomes within hours, making it a valuable high-throughput tool for a global search of IS and repeat elements. The CARP tool matches very low-error contigs with strong overlap using the ambiguous partial repeat sequences at the ends of the contigs, annotated using the repeat sequences discovered by ISQuest. These matches are verified by synteny with the genomes of one or more reference organisms. We show that the CARP tool can be used to verify regions with low mate-pair evidence, independently find new joins, and significantly reduce the number of scaffolds. Finally, we demonstrate a novel viewer that presents to the user the computationally derived joins along with the evidence used to make them. The viewer allows users to independently assess their confidence in the joins made by the finishing tools and to make an informed decision about whether to invest the resources necessary to confirm a particular portion of the assembly. Further, we allow users to manually record join evidence, re-order contigs, and track the assembly finishing process.
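
    The mate-pair linking step that scaffolders such as CARP build on can be illustrated with a minimal sketch: count read pairs whose two ends map to different contigs, and propose a join wherever the support exceeds a threshold. The data layout (one (contig, contig) tuple per read pair) and the threshold are illustrative, not CARP's actual input format.

        # Minimal sketch of mate-pair contig linking for scaffolding.
        from collections import Counter

        MIN_LINKS = 5  # illustrative support threshold; too low invites misjoins

        def propose_joins(mate_mappings, min_links=MIN_LINKS):
            """mate_mappings: iterable of (contig_a, contig_b) tuples, one per
            read pair whose two ends mapped to different contigs. Returns the
            contig pairs supported by at least min_links read pairs."""
            links = Counter(tuple(sorted(pair)) for pair in mate_mappings)
            return [(a, b, n) for (a, b), n in links.items() if n >= min_links]

        # Example: contigs c1 and c2 are linked by 6 pairs, c2 and c3 by only 2.
        mappings = [("c1", "c2")] * 6 + [("c2", "c3")] * 2
        print(propose_joins(mappings))   # -> [('c1', 'c2', 6)]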

    Genome Assembly Techniques

    Since the publication of the human genome in 2001, the price and time of DNA sequencing have dropped dramatically. The genomes of many more species have since been sequenced, and genome sequencing is an ever more important tool for biologists. This trend will likely revolutionize biology and medicine in the near future, when the genome sequence of each individual person, instead of a model genome for the human species, becomes readily accessible. Nevertheless, genome assembly remains a challenging computational problem, even more so with second-generation sequencing technologies, which generate a greater amount of data and make the assembly process more complex. Research on quickly, cheaply, and accurately assembling the increasing amount of sequenced DNA is of great practical importance. In the first part of this thesis, we present two software tools developed to improve genome assemblies. First, Jellyfish is a fast k-mer counter capable of handling large data sets. k-mer frequencies are central to many tasks in genome assembly (e.g., error correction and finding read overlaps) and other studies of the genome (e.g., finding highly repeated sequences such as transposons). Second, Chromosome Builder is a scaffolder and contig placement software that aims at improving the accuracy of genome assembly. In the second part of this thesis, we explore several problems dealing with graphs. The theory of graphs can be used to solve many computational problems; for example, the genome assembly problem can be represented as finding an Eulerian path in a de Bruijn graph. The physical interactions between proteins (PPI networks), or between transcription factors and genes (regulatory networks), are naturally expressed as graphs. First, we introduce the concept of "exactly 3-edge-connected" graphs. These graphs have only a remote biological motivation but are interesting in their own right. Second, we study the reconstruction of ancestral networks, which aims at inferring the state of ancestral species' biological networks based on the networks of current species.
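
    The de Bruijn formulation mentioned above can be made concrete in a few lines: nodes are (k-1)-mers, each distinct k-mer of the reads contributes an edge, and an assembly corresponds to an Eulerian path, found here with Hierholzer's algorithm. The toy input is error-free; real assemblers must also handle sequencing errors, repeats, and reverse complements.

        # Toy de Bruijn graph assembly of error-free reads: each distinct
        # k-mer is an edge between its prefix and suffix (k-1)-mers, and the
        # assembled sequence spells an Eulerian path (Hierholzer's algorithm).
        from collections import defaultdict

        def build_graph(reads, k):
            kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
            graph = defaultdict(list)   # (k-1)-mer -> successor (k-1)-mers
            indeg = defaultdict(int)
            for km in kmers:
                graph[km[:-1]].append(km[1:])
                indeg[km[1:]] += 1
            return graph, indeg

        def eulerian_path(graph, indeg):
            # Start where out-degree exceeds in-degree; else anywhere (cycle).
            start = next((u for u in graph if len(graph[u]) > indeg[u]),
                         next(iter(graph)))
            stack, path = [start], []
            while stack:
                u = stack[-1]
                if graph[u]:
                    stack.append(graph[u].pop())
                else:
                    path.append(stack.pop())
            return path[::-1]

        def assemble(reads, k):
            graph, indeg = build_graph(reads, k)
            path = eulerian_path(graph, indeg)
            return path[0] + "".join(node[-1] for node in path[1:])

        print(assemble(["ACGTTG", "GTTGCA"], 4))   # -> "ACGTTGCA"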