7 research outputs found

    Characteristics of 454 pyrosequencing data—enabling realistic simulation with flowsim

    Motivation: The commercial launch of 454 pyrosequencing in 2005 was a milestone in genome sequencing in terms of performance and cost. Throughout the three available releases, average read lengths have increased to ∼500 base pairs and are thus approaching read lengths obtained from traditional Sanger sequencing. Study design of sequencing projects would benefit from being able to simulate experiments

    Grinder: a versatile amplicon and shotgun sequence simulator

    We introduce Grinder (http://sourceforge.net/projects/biogrinder/), an open-source bioinformatic tool to simulate amplicon and shotgun (genomic, metagenomic, transcriptomic and metatranscriptomic) datasets from reference sequences. This is the first tool to simulate amplicon datasets (e.g. 16S rRNA) widely used by microbial ecologists. Grinder can create sequence libraries with a specific community structure, α and β diversities and experimental biases (e.g. chimeras, gene copy number variation) for commonly used sequencing platforms. This versatility allows the creation of simple to complex read datasets necessary for hypothesis testing when developing bioinformatic software, benchmarking existing tools or designing sequence-based experiments. Grinder is particularly useful for simulating clinical or environmental microbial communities and complements the use of in vitro mock communities.

    New hyper-heuristic algorithm for gene fragment assembly

    Gene assembly reconstructs a gene sequence from the fragments produced by a sequencing machine. The fragments are typically short and numerous, and as their number grows, so do the complexity of the problem and the size of the solution space. Solving the problem means arranging the fragments in the right order, but the complexity and the large solution space make an exact solution difficult to find. From a computational perspective, gene assembly is regarded as a nondeterministic polynomial (NP) problem, which can be approached with metaheuristic algorithms that search for near-optimal solutions. In this research, a hyper-heuristic algorithm is proposed for the gene assembly problem because of its advantages over metaheuristic algorithms. The research has three objectives: first, to analyze two metaheuristic algorithms, Chemical Reaction Optimization (CRO) and Quantum Inspired Evolutionary Algorithm (QIEA), on the problem; second, to develop a new hyper-heuristic algorithm (QCRO) based on CRO and QIEA; third, to evaluate the solutions generated by all three algorithms using statistical analysis. The performance of the algorithms is evaluated by convergence analysis, and the similarity of the draft gene sequences they generate is analyzed with the Basic Local Alignment Search Tool (BLAST). The findings show that QCRO is competent in finding the right order of the fragments and solving the gene assembly problem. In conclusion, this research presents a new hyper-heuristic algorithm, derived from two metaheuristic algorithms, for the gene fragment assembly problem.
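
The idea of "arranging the fragments in the right order" can be made concrete with a small sketch. The abstract does not state the objective function used by CRO, QIEA or QCRO, so the code below assumes one common choice from the fragment-assembly literature: score an ordering by the total suffix-prefix overlap between adjacent fragments. The names `overlap` and `ordering_score` are illustrative, not taken from the thesis.

```python
from typing import Sequence


def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    max_len = min(len(a), len(b))
    # Try the longest possible overlap first and shrink until one matches.
    for k in range(max_len, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def ordering_score(fragments: Sequence[str], order: Sequence[int]) -> int:
    """Sum of overlaps between consecutive fragments in the given order.

    Assumed objective (not confirmed by the abstract): a search algorithm
    would try to find the permutation `order` that maximizes this score.
    """
    return sum(
        overlap(fragments[i], fragments[j]) for i, j in zip(order, order[1:])
    )


if __name__ == "__main__":
    frags = ["ACGTAC", "TACGGA", "GGATTC"]
    print(ordering_score(frags, [0, 1, 2]))  # correct order
    print(ordering_score(frags, [2, 1, 0]))  # reversed order
```

A metaheuristic or hyper-heuristic would explore permutations of `order` instead of enumerating them, since the number of permutations grows factorially with the number of fragments.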

    Efficient Algorithms for Prokaryotic Whole Genome Assembly and Finishing

    De novo genome assembly from DNA fragments is primarily based on sequence overlap information. In addition, mate-pair or paired-end reads provide linking information for joining gaps and bridging repeat regions. Genome assemblers in general assemble long contiguous sequences (contigs) using both overlapping reads and linked reads until the assembly runs into an ambiguous repeat region. These contigs are further bridged into scaffolds using linked read information. However, errors can be made in both phases of assembly when the error threshold for accepting overlaps is too high or when links are based on too few mate reads. Identical as well as similar repeat regions can often cause errors in overlap and mate-pair evidence. In addition, setting the correct threshold to minimize errors and optimize the assembly of reads is not trivial and often requires a time-consuming trial-and-error process to obtain optimal results. The typical trial-and-error process with multiple assemblers can be computationally intensive and is very inefficient, especially when users must learn a wide variety of assemblers, many of which may be serial, require long execution times, and may not return usable or accurate results. Further, we show that comparing assembly results may not provide users with a clear winner under all circumstances. Therefore, we propose a novel scaffolding tool, Correlative Algorithm for Repeat Placement (CARP), capable of joining short low-error contigs using mate-pair reads, computationally resolved repeat structures and synteny with one or more reference organisms. The CARP tool requires a set of repeat sequences, such as insertion sequences (IS), that can be found computationally without assembling the genome. Development of methods to identify such repeat regions directly from raw sequence reads or draft genomes led to the ISQuest software package.
ISQuest identifies bacterial ISs and their sequence elements (inverted and direct repeats) in raw read data or contigs using flexible search parameters. ISQuest is capable of finding ISs in hundreds of partially assembled genomes within hours, making it a valuable high-throughput tool for a global search of IS and repeat elements. The CARP tool matches very low-error contigs with strong overlap, using the ambiguous partial repeat sequences at the contig ends as annotated with the repeat sequences that ISQuest discovered. These matches are verified by synteny with genomes of one or more reference organisms. We show that the CARP tool can be used to verify regions with low mate-pair evidence, independently find new joins and significantly reduce the number of scaffolds. Finally, we demonstrate a novel viewer that presents the computationally derived joins to the user along with the evidence used to make them. The viewer allows users to independently assess their confidence in the joins made by the finishing tools and make an informed decision on whether to invest the resources necessary to confirm a particular portion of the assembly. Further, we allow users to manually record join evidence, re-order contigs, and track the assembly finishing process.
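
As a rough illustration of what an "inverted repeat" at the ends of an insertion sequence means, the sketch below measures the longest exact terminal inverted repeat of a DNA string, i.e. the longest prefix that equals the reverse complement of the suffix of the same length. This is not ISQuest's algorithm, which the abstract does not describe; real IS inverted repeats are often imperfect, and the function names and the `max_len` parameter are assumptions made for the example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def terminal_inverted_repeat(seq: str, max_len: int = 50) -> int:
    """Length of the longest exact terminal inverted repeat of `seq`.

    The first n bases of reverse_complement(seq) are the reverse
    complement of the last n bases of seq, so it suffices to compare
    seq with its reverse complement from the left. Exact matching only
    (an assumption for this sketch; mismatches are not tolerated).
    """
    rc = reverse_complement(seq)
    limit = min(max_len, len(seq) // 2)  # the two arms must not overlap
    n = 0
    while n < limit and seq[n] == rc[n]:
        n += 1
    return n


if __name__ == "__main__":
    # Left arm GGCTA, spacer TTTT, right arm TAGCC (reverse complement of GGCTA).
    print(terminal_inverted_repeat("GGCTATTTTTAGCC"))
```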

    Development of an optimal bioinformatics system for the analysis of metagenome data generated by next-generation sequencing instruments

    Thesis (Ph.D.) -- Seoul National University Graduate School: Interdisciplinary Program in Bioinformatics, February 2014. Jongsik Chun. A metagenome is the total DNA directly extracted from an environment, and the purpose of metagenomics is to reveal the function of the metagenome as well as its taxonomic structure. There are two analysis approaches in metagenomics, namely the amplicon-based approach and the random shotgun-based approach. Both approaches require sequencing reads on a scale that Sanger sequencing cannot deliver. However, high-throughput sequencing at relatively low cost by Next Generation Sequencing (NGS) technologies meets this requirement. In addition, the advent of NGS technologies gave rise to the development of the bioinformatic algorithms needed to process this large and complex sequencing data. Consequently, the large amount of sequencing data obtained from NGS, together with suitable bioinformatic algorithms, has made metagenomics an essential tool for microbiology. However, limitations caused by NGS sequencing errors, short read length, and the lack of an analysis system still hinder accurate metagenome analysis. Therefore, an evaluation of currently used NGS error-handling algorithms and the development of a systematic pipeline with more efficient algorithms are required to improve the accuracy of analysis. In this study, bioinformatic pipelines were constructed for both metagenome analysis approaches. The pipelines were designed to improve the accuracy of the final result by minimizing the effect of errors and short read length. For amplicon-based metagenomics, two different analysis pipelines were developed, one for 454 pyrosequencing and one for Illumina MiSeq. During the construction of the 454 pyrosequencing pipeline, a new error-handling algorithm was developed to treat homopolymer and PCR errors. Upon completion of the pipeline, a household microbial community was analyzed using 454 pyrosequencing data as a case study.
As for Illumina MiSeq data, the most appropriate sequencing conditions and sequencing target region were determined. Paired-end merging programs were evaluated, and the correlation between sequencing errors and quality was studied to correct the errors within 3 overlap regions. A novel iterative consensus clustering method was developed to correct the errors occurring ubiquitously in a single read. For the shotgun metagenomics approach, a bioinformatic analysis system for Illumina MiSeq paired-end data was constructed. Unlike targeted amplicon sequencing reads, most shotgun sequencing reads are not merged; thus short reads are used for both functional and taxonomic profiling. However, a short read carries less information than a longer contig, so the use of short reads is likely to cause biased characterization of the metagenome. Therefore, the development of the analysis system focused on creating longer contigs by means of mapping and de novo assembly. For raw read mapping, a method for dynamically constructing the set of mapping genomes was developed. A list of mapping genomes was selected from the taxonomic profile inferred from the ribosomal RNA profiles. The sequences of the selected genomes were downloaded from Ezbiocloud. By mapping raw reads to the genome sequences, longer contigs can be obtained for relatively simple metagenomes such as fecal matter. However, for complex metagenomes such as soil samples, neither mapping nor de novo assembly performed properly, owing to a lack of sequencing coverage and the large number of uncultured microorganisms in the metagenome. In addition to the pipeline construction, visualization tools were developed to display the resulting taxonomic and functional profiles at the same time. A newly developed Java-based standalone sequence alignment editing application was named EzEditor.
As both conserved functional coding sequences and the 16S rRNA gene have been used extensively in bacterial molecular phylogenetics, codon-based sequence alignment editing functions are required for the coding genes. EzEditor provides a simultaneous DNA and protein sequence alignment editing interface, which enables robust sequence alignment for both protein and rRNA sequences. EzEditor can be applied to a variety of analyses involving molecular sequences, not only as a basic sequence editor but also for phylogenetic applications.

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long reads of DNA sequence from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computer capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm, and little attention has been paid to non-standard mapping approaches. Here, we propose so-called dynamic mapping, which we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping exploits the information from previously computed alignments to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program that collects alignment statistics and guides updates of the reference in an online fashion. We provide Ococo, the first online consensus caller, which maintains statistics for individual genomic positions using compact bit counters.
Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk. Metagenomic classification of NGS reads is another major topic studied in the thesis. Given a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge number of NGS reads to tree nodes, and possibly to estimate the relative abundance of the species involved. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, obtaining a much smaller and more informative index compared to Kraken. We provide a modified version of BWA that improves the BWT-index for a quick k-mer look-up.
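
The spaced seeds mentioned above can be illustrated with a short sketch. A spaced seed is a 0/1 mask slid along the read: positions marked `1` are kept and positions marked `0` are ignored, so a sequencing error or variant that falls on a `0` position does not change the extracted k-mer. The code below only shows this extraction step. It is not Seed-Kraken's implementation, and the function name `spaced_kmers` and the example mask are assumptions chosen for the example.

```python
from typing import Iterator


def spaced_kmers(read: str, mask: str) -> Iterator[str]:
    """Yield the spaced k-mers of `read` for a 0/1 mask ('1' = keep position)."""
    keep = [i for i, c in enumerate(mask) if c == "1"]
    span = len(mask)
    # Slide a window of the mask's full span along the read.
    for start in range(len(read) - span + 1):
        window = read[start:start + span]
        yield "".join(window[i] for i in keep)


if __name__ == "__main__":
    mask = "1101"  # hypothetical mask: weight 3, span 4
    print(list(spaced_kmers("ACGTACG", mask)))
    # A substitution at an ignored position leaves the spaced k-mer unchanged.
    print(list(spaced_kmers("ACGT", mask)) == list(spaced_kmers("ACTT", mask)))
```

With a contiguous k-mer (a mask of all `1`s), the same substitution would change every k-mer that covers it, which gives an intuition for why spaced seeds can tolerate mismatches better.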