
    GHOSTM: A GPU-Accelerated Homology Search Tool for Metagenomics

    A large number of sensitive homology searches are required for mapping DNA sequence fragments to known protein sequences in public and private databases during metagenomic analysis. BLAST is currently used for this purpose, but its calculation speed is insufficient, especially for analyzing the large quantities of sequence data produced by next-generation sequencers. Faster search tools, such as BLAT, lack sufficient search sensitivity for metagenomic analysis. Thus, a sensitive and efficient homology search tool is in high demand for this type of analysis. We developed a new, highly efficient homology search algorithm suitable for graphics processing unit (GPU) calculations and implemented it as a GPU system that we called GHOSTM. The system first searches the database for candidate alignment positions for a sequence using pre-calculated indexes, and then computes local alignments around the candidate positions before calculating alignment scores. We implemented both of these processes on GPUs. The system achieved calculation speeds 130 and 407 times faster than BLAST with 1 GPU and 4 GPUs, respectively. It also showed higher search sensitivity than BLAT, with calculation speeds 4 and 15 times faster using 1 GPU and 4 GPUs. In summary, we developed a GPU-optimized algorithm to perform sensitive sequence homology searches and implemented it as GHOSTM. Sequencing technology continues to improve, and sequencers produce ever larger quantities of data; this explosion of sequence data makes computational analysis with contemporary tools increasingly difficult. We offer GHOSTM, a cost-efficient tool, as a potential solution to this problem.
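    The two-stage search described above (index-based candidate lookup, then local alignment only around the candidates) can be sketched in Python. The seed length, scoring values, and ungapped extension below are illustrative assumptions, not GHOSTM's actual parameters or alignment routine:

```python
# Illustrative seed-and-extend sketch: index-based candidate lookup,
# then local alignment only around candidate positions.
from collections import defaultdict

K = 4  # seed length (illustrative, not GHOSTM's parameter)

def build_index(database_seq):
    """Pre-calculate an index mapping every k-mer to its positions."""
    index = defaultdict(list)
    for i in range(len(database_seq) - K + 1):
        index[database_seq[i:i + K]].append(i)
    return index

def candidates(query, index):
    """Stage 1: collect candidate alignment offsets from the index."""
    hits = set()
    for i in range(len(query) - K + 1):
        for pos in index.get(query[i:i + K], []):
            hits.add(pos - i)  # database offset of the query start
    return hits

def ungapped_score(query, db, offset, match=1, mismatch=-1):
    """Stage 2: score a local (here: ungapped) alignment at a candidate."""
    score = best = 0
    for i, q in enumerate(query):
        j = offset + i
        if 0 <= j < len(db):
            score = max(0, score + (match if q == db[j] else mismatch))
            best = max(best, score)
    return best

db = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
index = build_index(db)
query = "AKQRQISF"
best = max(ungapped_score(query, db, off) for off in candidates(query, index))
print(best)  # → 8 (a perfect 8-residue match)
```

    Restricting full alignment to the few offsets suggested by the index is what makes the approach fast; on the GPU both stages are run for many queries in parallel.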

    MR-CUDASW - GPU accelerated Smith-Waterman algorithm for medium-length (meta)genomic data

    The idea of using a graphics processing unit (GPU) for more than simply graphics output has been around for quite some time in scientific communities. However, it is only recently that its benefits for a range of compute-intensive tasks in bioinformatics and the life sciences have been recognized. This thesis investigates the possibility of improving the performance of the overlap determination stage of an Overlap Layout Consensus (OLC)-based assembler by using a GPU-based implementation of the Smith-Waterman algorithm. In this thesis, an existing GPU-accelerated sequence alignment algorithm is adapted and expanded to reduce its completion time. A number of improvements and changes are made to the original software. The workload distribution, query profile construction, and thread scheduling techniques implemented by the original program are replaced by custom methods specifically designed to handle medium-length reads. Accordingly, this algorithm is the first highly parallel solution specifically optimized to process medium-length nucleotide reads (DNA/RNA) from modern sequencing machines (e.g., Ion Torrent). Results show that the software reaches up to 82 GCUPS (Giga Cell Updates Per Second) on a single-GPU graphics card in a commodity desktop machine, making it the fastest GPU-based implementation of the Smith-Waterman algorithm tailored to medium-length nucleotide reads. Although designed for running the Smith-Waterman algorithm on medium-length nucleotide sequences, this program also holds great potential for improving heterogeneous computing with CUDA-enabled GPUs in general, and is expected to contribute to other research problems that require sensitive pairwise alignment of large numbers of reads. Our results show that it is possible to improve the performance of bioinformatics algorithms by taking full advantage of the compute resources of the underlying commodity hardware; these results are especially encouraging since GPU performance grows faster than that of multi-core CPUs.
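    For reference, the Smith-Waterman recurrence that such GPU implementations parallelize can be written as a minimal CPU sketch; the scoring parameters and the linear gap penalty below are illustrative choices, not those of MR-CUDASW:

```python
# Minimal Smith-Waterman local alignment (linear gap penalty).
# Scoring parameters are illustrative, not those used by MR-CUDASW.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # DP score matrix
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            # Local alignment: scores never drop below zero.
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best  # score of the best local alignment

print(smith_waterman("ACACACTA", "AGCACACA"))  # → 12
```

    Every cell update in the matrix above is one "cell update" in the GCUPS figure; GPU versions fill anti-diagonals of H in parallel, since cells on the same anti-diagonal are independent.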

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Since their emergence around 2006, Next-Generation Sequencing (NGS) technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long DNA sequence reads from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, and understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computing capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm, and little attention has been paid to non-standard mapping approaches. Here, we propose so-called dynamic mapping, which we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping exploits the information from previously computed alignments to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program that collects alignment statistics and guides updates of the reference in an online fashion. We provide Ococo, the first online consensus caller, which maintains statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments to disk. Metagenomic classification of NGS reads is another major topic studied in this thesis. Given a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge number of NGS reads to tree nodes, and possibly to estimate the relative abundance of the species involved. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve classification accuracy. We provide Seed-Kraken, a spaced-seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, which yields a much smaller and more informative index than Kraken's. We provide a modified version of BWA that improves the BWT-index for quick k-mer look-up.
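    The spaced-seed idea can be illustrated in a few lines: instead of contiguous k-mers, only the positions marked '1' in a binary mask contribute to the seed, so a mismatch falling on a "don't care" ('0') position does not destroy the match. The mask below is an arbitrary example, not the one used by Seed-Kraken:

```python
# Illustrative spaced-seed extraction. A '1' in the mask is a matched
# position, a '0' is a "don't care". The mask is an arbitrary example.
MASK = "1101011"

def spaced_seeds(seq, mask=MASK):
    """Extract the spaced seed from every window of len(mask)."""
    w = len(mask)
    return [
        "".join(c for c, m in zip(seq[i:i + w], mask) if m == "1")
        for i in range(len(seq) - w + 1)
    ]

# A mismatch at a don't-care position leaves that seed unchanged, so
# the two reads below still share a seed where contiguous 7-mers differ.
a = spaced_seeds("ACGTACGTAC")
b = spaced_seeds("ACGAACGTAC")  # single mismatch at position 3
shared = set(a) & set(b)
print(sorted(shared))  # → ['CGAGT']
```

    Contiguous 7-mers of the two reads share nothing, which is why spaced seeds tend to improve classification sensitivity for reads carrying sequencing errors or SNPs.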

    Homology sequence analysis using GPU acceleration

    A number of problems in the fields of bioinformatics, systems biology and computational biology require abstracting physical entities into mathematical or computational models. In such studies, the computational paradigms often involve algorithms that can be solved by the Central Processing Unit (CPU). Historically, those algorithms have benefited from advancements in the serial processing capabilities of individual CPU cores. However, this growth has slowed in recent years, as scaling out CPUs has been shown to be both cost-prohibitive and insecure. To overcome this problem, parallel computing approaches that employ the Graphics Processing Unit (GPU) have gained attention as a complement to or replacement for traditional CPU approaches. The premise of this research is to investigate the applicability of various parallel computing platforms to several problems in the detection and analysis of homology in biological sequences. I hypothesize that by exploiting the sheer amount of computational power and sequencing data available, it is possible to deduce information from raw sequences without supplying the underlying prior knowledge. I have developed tools to perform analysis at scales that are traditionally unattainable on general-purpose CPU platforms. I developed a method to accelerate sequence alignment on the GPU, and used it to investigate whether the Operational Taxonomic Unit (OTU) classification problem can be improved with such computational power. I also developed a method to accelerate pairwise k-mer comparison on the GPU, and used it to build PolyHomology, a framework that scaffolds shared sequence motifs across large numbers of genomes to illuminate the structure of the regulatory network in yeasts. The results suggest that such an approach to heterogeneous computing could help answer questions in biology and is a viable path to new discoveries in the present and the future.
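    The pairwise k-mer comparison at the core of such a framework can be sketched on the CPU as a similarity over k-mer sets; the Jaccard index and the value of k below are illustrative choices, and a GPU version would evaluate many sequence pairs in parallel:

```python
# Illustrative pairwise k-mer comparison: Jaccard similarity of k-mer
# sets. k and the similarity measure are assumptions for illustration;
# a GPU implementation would score many pairs concurrently.
from itertools import combinations

K = 3

def kmers(seq, k=K):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """|intersection| / |union| of the two k-mer sets."""
    sa, sb = kmers(a), kmers(b)
    return len(sa & sb) / len(sa | sb)

seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTTTTTTT"}
for (n1, a), (n2, b) in combinations(seqs.items(), 2):
    print(n1, n2, round(jaccard(a, b), 2))
```

    Because each pair is scored independently, the all-versus-all comparison is embarrassingly parallel, which is what makes it a natural fit for the GPU.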

    Data-independent acquisition mass spectrometry for human gut microbiota metaproteome analysis

    The human digestive tract microbiota is a diverse community of microorganisms with complex interactions among the microbes and with the human host. Observing the functions carried out by microbes is essential for understanding the role of the gut microbiota in human health and its associations with disease. New methods and tools are needed for acquiring functional information from complex microbial samples. Metagenomic approaches focus on taxonomy or gene-based functional potential but lack power for discovering the actual functions carried out by the microbes; metaproteomic methods are required to uncover these functions. Current high-throughput metaproteomics methods are based on mass spectrometry, which is capable of identifying and quantifying ionized protein fragments, called peptides. Proteins can be inferred from the peptides, and the functions associated with protein expression can be determined by using protein databases. The currently most widely used data-dependent acquisition (DDA) method records only the most intense ions in a semi-stochastic manner, which reduces reproducibility and produces incomplete records, impairing quantification. The alternative data-independent acquisition (DIA) approach systematically records all ions and has been proposed as a replacement for DDA. However, recording all ions produces highly convoluted spectra from multiple peptides, and for this reason it has not been known whether and how DIA can be applied to metaproteomics, where the number of different peptides is high. This thesis work introduced the DIA method to metaproteomic data analysis. The method was shown to achieve high reproducibility, enabling a single analysis per sample where DDA requires several. An easy-to-use open-source software package, DIAtools, was developed for the analysis. Finally, the DIA analysis method was applied to study the human gut microbiota and the carbohydrate-active enzymes expressed by its members.
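    The peptide-to-protein inference step mentioned above can be illustrated with a toy sketch; the protein database and peptide list below are made up, and real pipelines such as DIAtools work against full protein sequence databases:

```python
# Toy sketch of peptide-to-protein inference: each identified peptide
# is matched against protein sequences, and proteins are reported with
# their supporting peptides. Database and peptides are invented.
proteins = {
    "ProtA": "MKLVINGKTLKGEITVE",
    "ProtB": "MSTNPKPQRKTKRNTNR",
}
identified_peptides = ["VINGK", "KPQRK", "NOPE"]

def infer_proteins(peptides, db):
    """Map identified peptides back to the proteins containing them."""
    support = {name: [] for name in db}
    for pep in peptides:
        for name, seq in db.items():
            if pep in seq:
                support[name].append(pep)
    # Keep only proteins with at least one supporting peptide.
    return {name: peps for name, peps in support.items() if peps}

print(infer_proteins(identified_peptides, proteins))
```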

    Metagenomic characterisation of the gastrointestinal virome of neonatal pigs

    Microorganisms that colonise the gastrointestinal tract are responsible for a large portion of the genetic diversity of the body. These microorganisms are of bacterial, archaeal and viral origin. Together with their living space they form the microbiome, which harbours numerous interactions both among the microorganisms and with the host. The viral part of the microbiome, the virome, consists of a multitude of virus species. These viruses infect and modulate cells from all three domains of life. Even though viruses are acknowledged for their ability to cause disease in their hosts, knowledge about the total diversity of viruses within the virome, and the role it plays in health and disease, is so far scarce. It is thought that the virome co-evolved with the host and that its establishment in mammals occurs early in life. The virome can be studied using viral metagenomics, the study of all viral genetic material within a sample. Viral metagenomics was used in this thesis to generate datasets for comparative metagenomics. These datasets were then used for disease investigation and to compare similarities in the viromes of two mammalian species, pigs and humans. This thesis establishes a methodological framework for studying the virome in mammals using viral metagenomics: a methodology for amplifying the metagenome prior to sequencing was assessed, and software for bioinformatic analysis of viral metagenomes was developed. With the methodologies developed herein, the eukaryotic virome of neonatal piglets suffering from diarrhoea was investigated. Several known enteric viruses were detected in healthy and diarrhoeic neonatal piglets. However, no virus was present exclusively in sick or in healthy piglets, and no virological cause could be established for the neonatal diarrhoea. Comparative viral metagenomics was also used to establish whether similarities existed between neonates of porcine and human origin, as well as between adults and neonates. Similarities were detected between adults of both species, which appear to share a considerable part of their viromes. There was also a notable difference between neonatal and adult viromes, further supporting established theories about diversification of the virome over time.

    Evolutionary genomics : statistical and computational methods

    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences, genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward.

    The MGX framework for microbial community analysis

    Jaenicke S. The MGX framework for microbial community analysis. Bielefeld: Universität Bielefeld; 2020.