142 research outputs found

    Assembly algorithms for next-generation sequencing data

    The emergence of next-generation sequencing platforms led to a resurgence of research in whole-genome shotgun assembly algorithms and software. DNA sequencing data from the Roche 454, Illumina/Solexa, and ABI SOLiD platforms typically present shorter read lengths, higher coverage, and different error profiles compared with Sanger sequencing data. Since 2005, several assembly software packages have been created or revised specifically for de novo assembly of next-generation sequencing data. This review summarizes and compares the published descriptions of the packages SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. More generally, it compares the two standard approaches to assembly: the de Bruijn graph approach and the overlap/layout/consensus approach.
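
    As a concrete illustration of the first of these two approaches, the sketch below builds a de Bruijn graph from a handful of toy reads, with (k-1)-mers as nodes and k-mers as edges. The reads, the value of k, and the function name are illustrative only and do not come from any of the reviewed assemblers.

        from collections import defaultdict

        def de_bruijn_graph(reads, k):
            """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
            graph = defaultdict(list)
            for read in reads:
                for i in range(len(read) - k + 1):
                    kmer = read[i:i + k]
                    graph[kmer[:-1]].append(kmer[1:])   # prefix -> suffix edge
            return graph

        # Toy example: three overlapping reads drawn from the same sequence
        reads = ["ACGTACGT", "CGTACGTA", "GTACGTAC"]
        for node, successors in sorted(de_bruijn_graph(reads, k=4).items()):
            print(node, "->", successors)

    An assembler following this approach then searches for paths (ideally an Eulerian path) through such a graph to reconstruct contigs, whereas the overlap/layout/consensus approach computes pairwise overlaps between whole reads instead.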

    Optimal reference sequence selection for genome assembly using minimum description length principle

    Reference assisted assembly requires the use of a reference sequence, as a model, to assist in the assembly of the novel genome. The standard method for identifying the best reference sequence for the assembly of a novel genome counts the number of reads that align to each candidate reference sequence and then chooses the reference sequence to which the highest number of reads aligns. This article explores the use of the minimum description length (MDL) principle and its two variants, the two-part MDL and Sophisticated MDL, in identifying the optimal reference sequence for genome assembly. The article compares the proposed MDL based scheme with the standard method and concludes that “counting the number of reads of the novel genome present in the reference sequence” is not a sufficient condition. The proposed MDL scheme therefore subsumes the standard method of “counting the number of reads that align to the reference sequence” while also examining the model, i.e., the reference sequence itself, when identifying the optimal reference sequence. The proposed MDL based scheme not only provides a sufficient criterion for identifying the optimal reference sequence for genome assembly but also improves the reference sequence so that it becomes more suitable for the assembly of the novel genome.
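
    The two-part MDL idea can be pictured with a small, hypothetical scoring function: the description length of a candidate reference is the cost of encoding the reference itself (the model) plus the cost of encoding the reads given that reference (the data). The encoding costs, the toy aligner, and all names below are illustrative assumptions rather than the article's actual formulation; they only show why a reference that merely attracts many reads is not automatically the cheapest model.

        import math

        def naive_best_hit(read, reference):
            """Toy aligner: best Hamming-distance placement of the read on the
            reference, or None if no placement is good enough."""
            best = None
            for i in range(len(reference) - len(read) + 1):
                mismatches = sum(a != b for a, b in zip(read, reference[i:i + len(read)]))
                if best is None or mismatches < best:
                    best = mismatches
            return best if best is not None and best <= len(read) // 4 else None

        def two_part_mdl_score(reference, reads):
            """Two-part MDL, roughly: bits for the model (reference) plus bits
            for the data (reads) given that model. Smaller is better."""
            model_bits = 2.0 * len(reference)                      # 2 bits per base (A/C/G/T)
            data_bits = 0.0
            for read in reads:
                hit = naive_best_hit(read, reference)
                if hit is None:
                    data_bits += 2.0 * len(read)                   # unexplained read: spell it out
                else:
                    data_bits += math.log2(len(reference))         # encode the start position
                    data_bits += hit * (math.log2(len(read)) + 2)  # encode each mismatch
            return model_bits + data_bits

        # The candidate with the smallest total description length is chosen
        candidates = {"refA": "ACGTACGTACGTACGT", "refB": "TTTTGGGGCCCCAAAA"}
        reads = ["ACGTAC", "GTACGT", "CGTACG"]
        print(min(candidates, key=lambda name: two_part_mdl_score(candidates[name], reads)))

    Under such a score, a long reference that aligns many reads poorly can still lose to a shorter reference that explains the same reads with fewer mismatch bits, which is the intuition behind moving beyond pure read counting.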

    Information Theory, Graph Theory and Bayesian Statistics based improved and robust methods in Genome Assembly

    Bioinformatics skills required for genome sequencing often represent a significant hurdle for many researchers working in computational biology. This dissertation highlights the significance of genome assembly as a research area, focuses on its need to remain accurate, provides details about the characteristics of the raw data, examines some key metrics, emphasizes some tools and outlines the whole pipeline for next-generation sequencing. Currently, a major effort is being put towards the assembly of the genomes of all living organisms. Given the importance of comparative genome assembly, in this dissertation the principle of Minimum Description Length (MDL) and its two variants, the Two-Part MDL and Sophisticated MDL, are explored in identifying the optimal reference sequence for genome assembly. Thereafter, a Modular Approach to Reference Assisted Genome Assembly Pipeline, referred to as MARAGAP, is developed. MARAGAP uses the principle of Minimum Description Length (MDL) to determine the optimal reference sequence for the assembly. The optimal reference sequence is used as a template to infer inversions, insertions, deletions and Single Nucleotide Polymorphisms (SNPs) in the target genome. MARAGAP uses an algorithmic approach to detect and correct inversions and deletions, a de Bruijn graph based approach to infer insertions, an affine-match affine-gap local alignment tool to estimate the locations of insertions and a Bayesian estimation framework for detecting SNPs (called BECA). BECA effectively capitalizes on the 'alignment-layout-consensus' paradigm and Quality (Q-) values for detecting and correcting SNPs by evaluating a number of probabilistic measures. However, this entire process is conducted only once. BECA's framework is further extended by using Gibbs Sampling for further iterations of BECA. After each assembly the reference sequence is updated and the probabilistic score for each base call renewed. The revised reference sequence and probabilities are then used to identify the alignments and consensus sequence, thereby yielding an algorithm referred to as Gibbs-BECA. Gibbs-BECA further improves performance, both by rectifying more SNPs and by offering robust performance even in the presence of a poor reference sequence. Lastly, another major effort in this dissertation was the development of two cohesive software platforms that combine many different genome assembly pipelines in two distinct environments, referred to as Baari and Genobuntu, respectively. Baari and Genobuntu support pre-assembly tools, genome assemblers as well as post-assembly tools. Additionally, a library of tools developed by the authors for Next Generation Sequencing (NGS) data and commonly used biological software have also been provided in these software platforms. Baari and Genobuntu are free, easily distributable and facilitate building laboratories and software workstations both for personal use and for college/university laboratories. Baari is a customized Ubuntu OS packed with the aforementioned tools, whereas Genobuntu is a software package containing the same tools for users who already have Ubuntu OS pre-installed on their systems.
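
    The Q-value driven SNP calling step can be pictured with a generic Bayesian column model: given the base calls piled up at one alignment position and their Phred quality scores, compute a posterior over the true base. The sketch below is a standard textbook-style formulation, not BECA's exact probabilistic measures, and all names in it are hypothetical.

        import math

        def column_posterior(observed, priors=None):
            """Posterior over the true base at one alignment column, given a
            list of (base, phred_quality) calls from the aligned reads."""
            bases = "ACGT"
            if priors is None:
                priors = {b: 0.25 for b in bases}
            log_post = {b: math.log(priors[b]) for b in bases}
            for call, q in observed:
                p_err = 10.0 ** (-q / 10.0)                  # Phred score -> error probability
                for b in bases:
                    if b == call:
                        log_post[b] += math.log(1.0 - p_err)
                    else:
                        log_post[b] += math.log(p_err / 3.0)  # error spread over the 3 other bases
            norm = math.log(sum(math.exp(v) for v in log_post.values()))
            return {b: math.exp(v - norm) for b, v in log_post.items()}

        # Column where the reference says 'A' but most high-quality reads say 'G'
        print(column_posterior([("G", 30), ("G", 25), ("A", 10), ("G", 35)]))

    A Gibbs-style extension in the spirit of Gibbs-BECA would take the argmax of each column's posterior as the new consensus, update the reference and the per-base probabilities, and repeat the alignment-layout-consensus cycle until the calls stabilize.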

    Computational approaches for metagenomic analysis of high-throughput sequencing data

    High-throughput DNA sequencing has revolutionised microbiology and is the foundation on which the nascent field of metagenomics has been built. The ability to cheaply sample billions of DNA reads directly from environments has democratised sequencing and allowed researchers to gain unprecedented insights into diverse microbial communities. These technologies, however, are not without their limitations: the short length of the reads requires the production of vast amounts of data to ensure all information is captured. This “data deluge” has been a major bottleneck and has necessitated the development of new algorithms for analysis. Sequence alignment methods provide the most information about the composition of a sample, as they allow both taxonomic and functional classification, but the algorithms are prohibitively slow. This inefficiency has led to a reliance on faster algorithms which only produce simple taxonomic classifications or abundance estimates, losing the valuable information given by full alignments against annotated genomes. This thesis describes k-SLAM, a novel ultra-fast method for the alignment and taxonomic classification of metagenomic data. Using a k-mer based method, k-SLAM achieves speeds three orders of magnitude faster than current alignment based approaches, making full taxonomic classification and gene identification tractable on modern large datasets. The alignments found by k-SLAM can also be used to find variants and identify genes, along with their nearest taxonomic origins. A novel pseudo-assembly method produces more specific taxonomic classifications for species which have high sequence identity within their genus, providing a significant (up to 40%) increase in accuracy for these species. Also described is a re-analysis of a Shiga-toxin producing E. coli O104:H4 isolate via alignment against bacterial and viral species to find antibiotic resistance and toxin producing genes. k-SLAM has been used by a range of research projects including FLORINASH and is currently being used by a number of groups.
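
    The speed of k-mer based classification comes from replacing base-by-base alignment with exact k-mer lookups against an index of reference genomes. The toy classifier below shows only that lookup-and-vote step; k-SLAM itself goes further (pseudo-assembly, alignment of the matched regions, lowest-common-ancestor resolution), so the code and names here are illustrative assumptions rather than its actual implementation.

        from collections import Counter, defaultdict

        def build_kmer_index(references, k):
            """Map each k-mer to the set of taxa whose reference contains it."""
            index = defaultdict(set)
            for taxon, seq in references.items():
                for i in range(len(seq) - k + 1):
                    index[seq[i:i + k]].add(taxon)
            return index

        def classify_read(read, index, k):
            """Vote over taxa that share k-mers with the read; a toy stand-in
            for the k-mer matching step that precedes alignment and LCA resolution."""
            votes = Counter()
            for i in range(len(read) - k + 1):
                for taxon in index.get(read[i:i + k], ()):
                    votes[taxon] += 1
            return votes.most_common(1)[0][0] if votes else "unclassified"

        references = {"E. coli": "ACGTACGTTGCA", "S. aureus": "TTGACCGGTTAA"}
        index = build_kmer_index(references, k=5)
        print(classify_read("ACGTACGT", index, k=5))

    In practice the index covers thousands of genomes, and ambiguous votes are resolved up the taxonomic tree rather than by a simple majority.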

    The A, C, G, and T of Genome Assembly

    Genome assembly in its two decades of history has produced significant research, in terms of both biotechnology and computational biology. This contribution delineates sequencing platforms and their characteristics, examines key steps involved in filtering and processing raw data, explains assembly frameworks, and discusses quality statistics for the assessment of the assembled sequence. Furthermore, the paper explores recent Ubuntu-based software environments oriented towards genome assembly as well as some avenues for future research.
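
    One quality statistic commonly used for such assessment is the N50 of the assembled contigs; a minimal version of the standard calculation is sketched below (the function name and example lengths are illustrative only).

        def n50(contig_lengths):
            """N50: the largest length L such that contigs of length >= L
            cover at least half of the total assembly length."""
            total = sum(contig_lengths)
            running = 0
            for length in sorted(contig_lengths, reverse=True):
                running += length
                if running * 2 >= total:
                    return length
            return 0

        print(n50([100, 80, 60, 40, 20]))   # total 300; 100+80 = 180 >= 150, so N50 = 80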

    Objective review of de novo stand-alone error correction methods for NGS data

    The sequencing market has increased steadily over the last few years, with different approaches to reading DNA information prone to different types of errors. Multiple studies have demonstrated the impact of sequencing errors on different applications of next-generation sequencing (NGS), making error correction a fundamental initial step. Different methods in the literature use different approaches and fit different types of problems. We analyzed 50 methods divided into five main approaches (k-spectrum, suffix arrays, multiple-sequence alignment, read clustering, and probabilistic models). They are not published as part of a suite (stand-alone) and target raw, unprocessed data without an existing reference genome (de novo). These correctors handle one or more sequencing technologies using the same or different approaches. They face general challenges (sometimes with specific traits for specific technologies) such as repetitive regions, uncalled bases, and ploidy. Even assessing their performance is a challenge in itself, because of the approaches taken by various authors, the unknown factor (de novo), and the behavior of the third-party tools employed in the benchmarks. This study aims to help the researcher in the field to advance the field of error correction, the educator to have a brief but comprehensive companion, and the bioinformatician to choose the right tool for the right job. © 2016 John Wiley & Sons, Ltd. We want to thank our colleague Eloy Romero Alcale, who provided valuable advice regarding the structure of the document. This work was supported by Generalitat Valenciana [GRISOLIA/2013/013 to A.A.]. Alic, A.S.; Ruzafa, D.; Dopazo, J.; Blanquer Espert, I. (2016). Objective review of de novo stand-alone error correction methods for NGS data. Wiley Interdisciplinary Reviews: Computational Molecular Science 6(2):111-146. https://doi.org/10.1002/wcms.1239
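
    Of the five approaches, the k-spectrum family is the simplest to sketch: count all k-mers across the reads, treat rare k-mers as probable errors, and repair them with the single-base change that yields a frequent ("solid") k-mer. The threshold, the greedy substitution search, and all names below are illustrative simplifications, not the behavior of any specific corrector surveyed in the review.

        from collections import Counter

        def kmer_spectrum(reads, k):
            """Count every k-mer occurring in the read set."""
            counts = Counter()
            for read in reads:
                for i in range(len(read) - k + 1):
                    counts[read[i:i + k]] += 1
            return counts

        def correct_read(read, counts, k, solid=2):
            """Toy k-spectrum correction: if a k-mer is weak (count < solid), try
            the single-base substitution giving the most frequent solid k-mer."""
            read = list(read)
            for i in range(len(read) - k + 1):
                kmer = "".join(read[i:i + k])
                if counts[kmer] >= solid:
                    continue
                best, best_count = None, counts[kmer]
                for j in range(k):
                    for b in "ACGT":
                        cand = kmer[:j] + b + kmer[j + 1:]
                        if counts[cand] > best_count:
                            best, best_count = (j, b), counts[cand]
                if best and best_count >= solid:
                    read[i + best[0]] = best[1]
            return "".join(read)

        reads = ["ACGTACGT", "ACGTACGT", "ACGAACGT"]   # third read has a likely error at position 3
        counts = kmer_spectrum(reads, k=4)
        print(correct_read("ACGAACGT", counts, k=4))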

    Development of bioinformatics tools for the rapid and sensitive detection of known and unknown pathogens from next generation sequencing data

    Infectious diseases still remain one of the main causes of death across the globe. Despite huge advances in clinical diagnostics, establishing a clear etiology remains impossible in a proportion of cases. Since the emergence of next generation sequencing (NGS), a multitude of new research fields based on this technology have evolved. Its application in metagenomics in particular – denoting research on genomic material taken directly from its environment – has led to the rapid development of new applications. Metagenomic NGS has proven to be a promising tool in the field of pathogen related research and diagnostics. In this thesis, I present different approaches for the detection of known and the discovery of unknown pathogens from NGS data. These contributions subdivide into three newly developed methods and one publication on a real-world use case of the methodology we developed and the data analysis based on it. First, I present LiveKraken, a real-time read classification tool based on the core algorithm of Kraken. LiveKraken uses streams of raw data from Illumina sequencers to classify reads taxonomically. This way, we are able to produce results identical to those of Kraken the moment the sequencer finishes. We are furthermore able to provide comparable results in early stages of a sequencing run, saving up to a week of sequencing time. While the number of classified reads grows over time, false classifications appear in negligible numbers and the proportions of identified taxa are only affected to a minor extent. In the second project, we designed and implemented PathoLive, a real-time diagnostics pipeline which allows the detection of pathogens from clinical samples before the sequencing procedure is finished. We adapted the core algorithm of HiLive, a real-time read mapper, and enhanced its accuracy for our use case. Furthermore, sequences that are probably irrelevant are automatically marked. The results are visualized in an interactive taxonomic tree that provides an intuitive overview and detailed metrics regarding the relevance of each identified pathogen. Testing PathoLive on the sequencing of a real plasma sample spiked with viruses, we showed that it ranked the results more accurately throughout the complete sequencing run than any other tested tool did at the end of the run. With PathoLive, we shift the focus of NGS-based diagnostics from read quantification towards a more meaningful assessment of results in unprecedented turnaround time. The third project aims at the detection of novel pathogens from NGS data. We developed RAMBO-K, a tool which allows rapid and sensitive removal of unwanted host sequences from NGS datasets. RAMBO-K is faster than any tool we tested, while showing consistently high sensitivity and specificity across different datasets. RAMBO-K rapidly and reliably separates reads from different species and is suitable as a straightforward standard solution for workflows dealing with mixed datasets. In the fourth project, we used RAMBO-K as well as several other data analyses to discover Berlin squirrelpox virus, a divergent new poxvirus establishing a new genus of the Poxviridae. Near Berlin, Germany, several juvenile red squirrels (Sciurus vulgaris) were found with moist, crusty skin lesions. Histology, electron microscopy, and cell culture isolation revealed an orthopoxvirus-like infection. After standard workflows yielded no significant results, poxviral reads were assigned using RAMBO-K, enabling the assembly of the genome of the novel virus.
    With these projects, we established three new application-oriented methods, each of which closes a different research gap. Taken together, they enhance the available repertoire of NGS-based tools for pathogen related research and facilitate and accelerate a variety of research projects.
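
    The host/pathogen read separation idea can be illustrated with a deliberately simplified two-profile scorer: build k-mer frequency profiles from a host reference set and a target reference set, score every read under both, and assign it to the better-fitting side. RAMBO-K itself is more sophisticated (it trains reference-specific models and calibrates its cut-offs), so everything below (function names, pseudocount, example sequences) is an assumed sketch, not its implementation.

        from collections import Counter
        import math

        def kmer_profile(sequences, k):
            """Relative k-mer frequencies of a reference set (host or target)."""
            counts = Counter()
            for seq in sequences:
                for i in range(len(seq) - k + 1):
                    counts[seq[i:i + k]] += 1
            total = sum(counts.values())
            return {kmer: c / total for kmer, c in counts.items()}

        def log_score(read, profile, k, pseudo=1e-6):
            """Log-likelihood-style score of a read under a k-mer frequency profile."""
            return sum(math.log(profile.get(read[i:i + k], pseudo))
                       for i in range(len(read) - k + 1))

        def assign(read, host_profile, target_profile, k):
            """Assign the read to whichever profile explains it better."""
            h = log_score(read, host_profile, k)
            t = log_score(read, target_profile, k)
            return "host" if h >= t else "target"

        host = kmer_profile(["ACGTACGTACGTACGT"], k=4)
        virus = kmer_profile(["TTGGCCAATTGGCCAA"], k=4)
        print(assign("ACGTACGT", host, virus, k=4))   # expected: host

    After such a separation, only the reads assigned to the target side need to be passed on to assembly, which is what made the novel poxvirus genome recoverable from a mixed sample.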

    Quality value based models and methods for sequencing data

    First isolated by Friedrich Miescher in 1869 and then identified by James Watson and Francis Crick in 1953, the double stranded DeoxyriboNucleic Acid (DNA) molecule of Homo sapiens took another fifty years to be completely reconstructed and finally placed at the disposal of researchers for in-depth studies and analyses. The first technologies for DNA sequencing appeared around the mid-1970s; among them the most successful has been the chain termination method, usually referred to as the Sanger method. It remained the de facto standard for sequencing until Next Generation Sequencing (NGS) technologies started to be developed at the beginning of the 2000s. These technologies produce huge amounts of data at competitive costs in terms of dollars per base, and further advances are now appearing in the form of Single Molecule Real Time (SMRT) sequencers, such as those from Pacific Biosciences, which promise to produce fragments of lengths never available before. However, none of the above technologies is able to read an entire DNA molecule; they can only produce short fragments (called reads) of the sample in a process referred to as sequencing. Although all these technologies have different characteristics, one recurrent trend in their evolution has been the constant growth of the fraction of errors injected into the final reads. While Sanger machines produce as few as 1 erroneous base in 1000, recent PacBio sequencers have an average error rate of 15%; NGS machines sit roughly in the middle, with an expected error rate of around 1%. With such heterogeneity of error profiles, and as more and more data is produced every day, algorithms able to cope with different sequencing technologies are becoming fundamental; at the same time, models describing the sequencing process, including its error profile, are gaining importance. A key feature that can make these approaches really effective is the ability of sequencers to produce quality scores, which measure the probability of observing a sequencing error. In this thesis we present a stochastic model for the sequencing process and show its application to the problems of clustering and filtering of reads. The novel idea is to use quality scores to build a probabilistic framework that models the entire process of sequencing. Although relatively straightforward, developing such a model requires the proper definition of probability spaces and of events on those spaces. To keep the model simple and tractable, several simplifying hypotheses need to be introduced; each of them, however, must be explicitly stated and extensively discussed. The final result is a model of the sequencing process that can be used to give a probabilistic interpretation of problems defined on sequencing data and to characterize the corresponding probabilistic answers (i.e., solutions). To experimentally validate the model, we apply it to two different problems: read clustering and read filtering. The first set of experiments introduces a set of novel alignment-free measures, D2, resulting from the extension of the well-known D2-type measures to incorporate quality values. More precisely, instead of adding a unit contribution to the k-mer count statistic (as in D2 statistics), each k-mer contributes an additive term corresponding to its probability of being correct, as defined by our stochastic model.
    We show that these new measures are effective when applied to the clustering of reads, by employing the clusters produced with D2 as input to the problems of metagenomic binning and de novo assembly. In the second set of experiments conducted to validate our stochastic model, we applied the same definition of a correct read to the problem of read filtering. We first define rank filtering, a lossless filtering technique that sorts reads based on a given criterion; we then use the sorted list of reads as input to algorithms for read mapping and de novo assembly. The idea is that, in the reordered set, reads ranking higher should have better quality than those at lower ranks. To test this conjecture, we use such filtering as a pre-processing step for read mapping and de novo assembly; in both cases we observe improvements when our rank filtering approach is used.
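
    The core quantity behind both applications, the probability that a k-mer is correct given its quality scores, is easy to write down under the standard Phred convention and an assumption of independent base errors; the sketch below shows that quantity and the resulting quality-weighted k-mer counts that replace the unit counts of plain D2-type statistics. It is a minimal rendering of the idea, not the thesis's full model, which states and discusses its simplifying hypotheses explicitly.

        def p_correct(qualities):
            """Probability that every base of a k-mer is called correctly,
            assuming independent errors and standard Phred scaling."""
            p = 1.0
            for q in qualities:
                p *= 1.0 - 10.0 ** (-q / 10.0)
            return p

        def quality_weighted_kmer_counts(read, quals, k):
            """Each k-mer adds its probability of being correct instead of a unit count."""
            counts = {}
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                counts[kmer] = counts.get(kmer, 0.0) + p_correct(quals[i:i + k])
            return counts

        read  = "ACGTACGT"
        quals = [40, 40, 35, 30, 12, 30, 35, 40]     # Phred scores, one per base
        print(quality_weighted_kmer_counts(read, quals, k=4))

    Two reads can then be compared through, for example, the inner product of their weighted count vectors, so low-quality k-mers contribute little to the similarity instead of adding noise.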