1,449 research outputs found

    ALFALFA : fast and accurate mapping of long next generation sequencing reads

    Unipept: computational exploration of metaproteome data

    HSP-Wrap: The Design and Evaluation of Reusable Parallelism for a Subclass of Data-Intensive Applications

    There is an increasing gap between the rate at which data is generated by scientific and non-scientific fields and the rate at which data can be processed by available computing resources. In this paper, we introduce the fields of Bioinformatics and Cheminformatics, two fields where big data has become a problem due to continuing advances in the technologies that drive them, such as gene sequencing and small ligand exploration. We introduce high performance computing as a means to process this growing base of data in order to facilitate knowledge discovery. We enumerate goals of the project including reusability, efficiency, reliability, and scalability. We then describe the implementation of a software scheduler which aims to improve input and output performance of a targeted collection of informatics tools, as well as the profiling and optimization needed to tune the software. We evaluate the performance of the software with a scalability study of the Bioinformatics tools BLAST, HMMER, and MUSCLE, as well as the Cheminformatics tool DOCK6.

    Bioinformatics

    This book is divided into different research areas relevant in Bioinformatics such as biological networks, next generation sequencing, high performance computing, molecular modeling, structural bioinformatics, and intelligent data analysis. Each book section introduces the basic concepts and then explains their application to problems of great relevance, so both novice and expert readers can benefit from the information and research works presented here.

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long reads of DNA sequence from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computer capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm and only little attention has been paid to non-standard mapping approaches. Here, we propound so-called dynamic mapping, which we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping is based on exploiting the information from previously computed alignments, helping to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program collecting alignment statistics and guiding updates of the reference in an online fashion. We provide Ococo, the first online consensus caller, which implements smart statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk. Metagenomic classification of NGS reads is another major topic studied in the thesis. Given a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge amount of NGS reads to tree nodes, and possibly estimate the relative abundance of involved species. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, obtaining a much smaller and more informative index compared to Kraken. We provide a modified version of BWA that improves the BWT-index for a quick k-mer look-up.
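The spaced seeds mentioned in the abstract above can be illustrated with a minimal sketch. This is not code from Seed-Kraken; the function name `spaced_kmers` and the mask `"1101"` are hypothetical, chosen only to show the general idea that a seed mask keeps some positions of a window and ignores others, so a mismatch at an ignored position does not break the match.

```python
def spaced_kmers(seq, mask):
    """Yield spaced k-mers of seq under mask (hypothetical helper).

    mask is a string of '1' (position is kept) and '0' (position is ignored).
    Each window has the length of the mask; only the kept positions are emitted.
    """
    keep = [i for i, c in enumerate(mask) if c == "1"]
    span = len(mask)
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        yield "".join(window[i] for i in keep)


# Two sequences that differ only at the position the mask ignores
# still produce the same spaced k-mer, unlike contiguous 4-mers.
reference = "ACGT"
read = "ACCT"  # mismatch at index 2, which the mask "1101" skips
same_seed = list(spaced_kmers(reference, "1101")) == list(spaced_kmers(read, "1101"))
```

With a contiguous mask such as `"1111"` the two sequences above would share no 4-mer, which is the intuition behind the tolerance of spaced seeds to isolated mismatches.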

    Computational Methods for Gene Expression and Genomic Sequence Analysis

    Advances in technologies currently produce more and more cost-effective, high-throughput, and large-scale biological data. As a result, there is an urgent need for developing efficient computational methods for analyzing these massive data. In this dissertation, we introduce methods to address several important issues in gene expression and genomic sequence analysis, two of the most important areas in bioinformatics. Firstly, we introduce a novel approach to predicting patterns of gene response to multiple treatments in case of small sample size. Researchers are increasingly interested in experiments with many treatments such as chemical compounds or drug doses. However, due to cost, many experiments do not have large enough samples, making it difficult for conventional methods to predict patterns of gene response. Here we introduce an approach which exploited dependencies of pairwise comparison outcomes and resampling techniques to predict true patterns of gene response in case of insufficient samples. This approach deduced more and better functionally enriched gene clusters than conventional methods. Our approach is therefore useful for multiple-treatment studies which have small sample size or contain highly variably expressed genes. Secondly, we introduce a novel method for aligning short reads, which are DNA fragments extracted across genomes of individuals, to reference genomes. Results from short read alignment can be used for many studies such as measuring gene expression or detecting genetic variants. Here we introduce a method which employed an iterated randomized algorithm based on FM-index, an efficient data structure for full-text indexing, to align reads to the reference. This method improved alignment performance across a wide range of read lengths and error rates compared to several popular methods, making it a good choice for the community to perform short read alignment. Finally, we introduce a novel approach to detecting genetic variants such as SNPs (single nucleotide polymorphisms) or INDELs (insertions/deletions). This study has great significance in a wide range of areas, from bioinformatics and genetic research to the medical field. For example, one can predict how genomic changes are related to phenotype in their organism of interest, or associate genetic changes with disease risk or medical treatment efficacy. Here we introduce a method which leveraged known genetic variants existing in well-established databases to improve accuracy of detecting variants. This method had higher accuracy than several state-of-the-art methods in many cases, especially for detecting INDELs. Our method therefore has potential to be useful in research and clinical applications which rely on identifying genetic variants accurately.
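For readers unfamiliar with the FM-index named in the abstract above, the following is a minimal sketch of exact-match backward search. It is not the dissertation's iterated randomized algorithm, which the abstract does not detail; the function names `build_fm_index` and `locate` are hypothetical, and the naive suffix sorting is only suitable for toy inputs.

```python
def build_fm_index(text):
    """Build a toy FM-index (suffix array, C table, occurrence table)."""
    text += "$"  # sentinel, assumed absent from the text and smaller than any base
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is the sentinel when i == 0
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # c_table[ch]: number of characters in the text strictly smaller than ch
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i]: number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ


def locate(pattern, sa, c_table, occ):
    """Return sorted start positions of exact matches via backward search."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):  # extend the match one character to the left
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, c_table, occ = build_fm_index("ACGTACGT")
hits = locate("ACG", sa, c_table, occ)
```

Each step of the loop narrows a suffix-array interval using only the two lookup tables, which is why the search time depends on the pattern length and not on the size of the reference.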

    Structure of Alignment-Free Algorithms for Diverse Post-Genome Data

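    Degree type: Doctorate by coursework. Examination committee: (Chair) Professor Hiroshi Imai, University of Tokyo; Professor Naoki Kobayashi, University of Tokyo; Professor Takeo Igarashi, University of Tokyo; Professor Masashi Sugiyama, University of Tokyo; Lecturer Masahiro Kasahara, University of Tokyo. University of Tokyo (東京大学)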