
    Efficient Storage of Genomic Sequences in High Performance Computing Systems

    In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequencing data stored in FASTQ format files. The amount of genomic data available for researchers to process has increased exponentially, bringing enormous challenges for its efficient storage and transmission. General-purpose compressors offer only limited performance on genomic data, hence the need for specialized compression solutions. Two trends have emerged to harness the particular properties of genomic data: non-referential and referential compression. Non-referential compressors offer higher compression ratios than general-purpose compressors, but still below what a referential compressor could theoretically achieve. However, the effectiveness of referential compression depends on selecting a good reference and on having enough computing resources available. This thesis presents one of the first referential compressors for FASTQ files. We first present a comprehensive analytical and experimental evaluation of the most relevant tools for compressing raw genomic data, which led us to identify the main needs and opportunities in this field. As a consequence, we propose a novel compression workflow that aims at improving the usability of referential compressors. Subsequently, we discuss the implementation and performance evaluation of the core of the proposed workflow: a referential compressor for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy. The compression algorithm, named UdeACompress, achieved very competitive compression ratios compared to the best compressors in the current state of the art, while showing reasonable execution times and memory use. In particular, UdeACompress outperformed all competitors when compressing long reads, typical of the newest sequencing technologies. Finally, we study the main aspects of data-level parallelism in the Intel AVX-512 architecture in order to develop a parallel version of the UdeACompress algorithms and reduce the runtime. Through SIMD programming, we managed to significantly accelerate the main bottleneck in UdeACompress: suffix array construction.
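
    As a minimal illustration of the kernel being accelerated, the sketch below builds a suffix array by naively sorting suffix start positions; it shows only what that kernel computes, not the SIMD-optimized construction developed in the thesis.

```python
def suffix_array(text: str) -> list[int]:
    """Naive suffix array construction: sort all suffix start positions
    by the suffixes they point to. O(n^2 log n) at worst -- real
    compressors use far more efficient constructions; this only
    illustrates the kernel's output."""
    return sorted(range(len(text)), key=lambda i: text[i:])

# Example: start positions of the suffixes of a short read,
# listed in lexicographic order of the suffixes.
print(suffix_array("ACGTACG$"))  # [7, 4, 0, 5, 1, 6, 2, 3]
```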

    On The Parallelization Of Integer Polynomial Multiplication

    With the advent of hardware accelerator technologies such as multi-core processors and GPUs, much effort has been made to take advantage of these architectures by designing parallel algorithms. To achieve this goal, one needs to consider both algebraic complexity and parallelism, make efficient use of memory traffic and cache, and reduce overheads in the implementations. Polynomial multiplication is at the core of many algorithms in symbolic computation, such as real root isolation, which is our main application here. In this thesis, we first investigate the multiplication of dense univariate polynomials with integer coefficients, targeting multi-core processors. Some of the proposed methods are based on well-known serial classical algorithms, whereas a novel algorithm is designed to make efficient use of the targeted hardware. Experimentation confirms our theoretical analysis. Second, we report on the first implementation of subproduct tree techniques on many-core architectures. These techniques are essentially another application of polynomial multiplication, but over a prime field, and are used in multi-point evaluation and interpolation of polynomials with coefficients over a prime field.
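
    As a baseline illustration of the classical serial algorithms mentioned above, here is a schoolbook sketch of dense univariate integer polynomial multiplication; the coefficient-list representation and function name are assumptions for illustration.

```python
def poly_mul(a: list[int], b: list[int]) -> list[int]:
    """Classical (schoolbook) multiplication of dense univariate integer
    polynomials, coefficients stored in increasing-degree order.
    Performs len(a) * len(b) coefficient multiplications; the additions
    into distinct output coefficients are independent, which is one
    natural starting point for multi-core parallelization."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

# (1 + 2x)(3 + x^2) = 3 + 6x + x^2 + 2x^3
print(poly_mul([1, 2], [3, 0, 1]))  # [3, 6, 1, 2]
```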

    Data Tiling for Sparse Computation

    Many real-world datasets contain internal relationships. Efficient analysis of such relationship data is crucial for important problems including genome alignment, network vulnerability analysis, and ranking web pages. Relationship data is frequently sparse, and analysis on it is called sparse computation. We demonstrate that the important technique of data tiling is more powerful than previously known by broadening its application space. We focus on three important sparse computation areas: graph analysis, linear algebra, and bioinformatics. We demonstrate data tiling's power by addressing key issues and providing significant improvements, to both runtime and solution quality, in each area. For graph analysis, we focus on fast data tiling techniques that can produce well-structured tiles, and we demonstrate theoretical hardness results. These tiles are suitable for graph problems as they reduce data movement and ultimately improve end-to-end runtime performance. For linear algebra, we introduce a new cache-aware tiling technique and apply it to the key kernel of sparse matrix by sparse matrix multiplication. This technique tiles the second input matrix and then uses a small summary matrix to guide access to the tiles during computation. Our approach results in the fastest known implementation across three distinct CPU architectures. In bioinformatics, we develop a tiling-based de novo genome assembly pipeline. We start with reads and build either a graph or a hypergraph that captures the internal relationships between reads. This structure is then tiled to minimize connections while maintaining balance. We then treat each resulting tile independently as the input to an existing shared-memory assembler. Our pipeline improves on existing state-of-the-art de novo genome assemblers, bringing both runtime and quality improvements on both real-world and simulated datasets.
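
    To make the summary-guided tiling idea concrete, here is a simplified Python sketch; the dict-of-dicts data layout and function name are illustrative assumptions, not the dissertation's optimized CPU implementation.

```python
from collections import defaultdict

def spgemm_tiled(A, B, n_cols, tile_width):
    """Sketch of cache-aware tiled SpGEMM (C = A * B). Matrices are
    dicts mapping row -> {col: value}. Columns of B are grouped into
    fixed-width tiles; a small summary records which tiles each row of
    B touches, so the multiply only visits tiles that can contribute
    to the output."""
    n_tiles = (n_cols + tile_width - 1) // tile_width
    tiles = [defaultdict(dict) for _ in range(n_tiles)]  # per-tile slices of B
    summary = defaultdict(set)                           # B row -> tiles it touches
    for k, row in B.items():
        for c, v in row.items():
            t = c // tile_width
            tiles[t][k][c] = v
            summary[k].add(t)
    C = defaultdict(dict)
    for i, arow in A.items():
        needed = set()
        for k in arow:               # tiles reachable from A's nonzeros
            needed |= summary[k]
        for t in sorted(needed):     # stream one tile at a time
            for k, a_ik in arow.items():
                for c, b_kc in tiles[t].get(k, {}).items():
                    C[i][c] = C[i].get(c, 0) + a_ik * b_kc
    return C

A = {0: {0: 1.0, 1: 2.0}}
B = {0: {0: 3.0}, 1: {2: 4.0}}
print(spgemm_tiled(A, B, n_cols=3, tile_width=2)[0])  # {0: 3.0, 2: 8.0}
```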

    Data distribution and task scheduling for distributed computing of all-to-all comparison problems

    This research studied distributed computing of all-to-all comparison problems with big data sets. The thesis formalised the problem and developed a high-performance, scalable computing framework with a programming model, data distribution strategies, and task scheduling policies to solve it. The study considered storage usage, data locality, and load balancing to improve performance. The research outcomes can be applied in bioinformatics, biometrics, data mining, and other domains in which all-to-all comparisons are a typical computing pattern.
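
    As a toy illustration of the all-to-all comparison pattern, the sketch below enumerates every unordered pair once and deals the comparison tasks across workers; the round-robin policy is an assumption for illustration and far simpler than the data-locality-aware scheduling the thesis develops.

```python
from itertools import combinations

def schedule_all_pairs(items, n_workers):
    """All-to-all comparison scheduling sketch: every unordered pair of
    items must be compared exactly once. Pairs are dealt round-robin
    across workers purely for load balance; a real framework would also
    place data so both items of a pair reside on the assigned worker."""
    plans = [[] for _ in range(n_workers)]
    for idx, pair in enumerate(combinations(items, 2)):
        plans[idx % n_workers].append(pair)
    return plans

for w, tasks in enumerate(schedule_all_pairs(["s1", "s2", "s3", "s4"], 2)):
    print(f"worker {w}: {tasks}")
# C(4,2) = 6 comparisons, split 3/3 across the two workers.
```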

    Benchtop sequencing on benchtop computers

    Next Generation Sequencing (NGS) is a powerful tool for gaining new insights in molecular biology. With the introduction of the first benchtop NGS sequencing machines (e.g. Ion Torrent, MiSeq), this technology became even more versatile in its applications, and the amount of data produced in a short time is ever increasing. The demand for new and more efficient sequence analysis tools increases at the same rate as the throughput of sequencing technologies. New methods and algorithms not only need to be more efficient but also need to account for higher genetic variability between the sequenced and annotated data. To obtain reliable results, information about the errors and limitations of NGS technologies must also be investigated. Furthermore, methods need to be able to cope with contamination in the data. In this thesis we present methods and algorithms for NGS analysis. First, we present a fast and precise method to align NGS reads to a reference genome. This method, called NextGenMap, was designed to work with data from Illumina, 454, and Ion Torrent technologies, and is easily extendable to upcoming technologies. We use pairwise sequence alignment in combination with an exact-match filter approach to maximize the number of correctly mapped reads. To reduce runtime (mapping a 16x-coverage human genome data set within hours), we developed an optimized banded pairwise alignment algorithm for NGS data. We implemented this algorithm using high-performance programming interfaces for central processing units using SSE (Streaming SIMD Extensions) and OpenCL, as well as for graphics processing units using OpenCL and CUDA. Thus, NextGenMap can make maximal use of all existing hardware, whether it is a high-end compute cluster, a standard desktop computer, or even a laptop. We demonstrated the advantages of NextGenMap over other mapping methods on real and simulated data, and showed that NextGenMap outperforms current methods with respect to the number of correctly mapped reads. The second part of the thesis is an analysis of the limitations and errors of Ion Torrent and MiSeq. Sequencing errors were defined as the percentage of mismatches, insertions, and deletions per position, given a semi-global alignment between read and reference sequence. We measured a mean error rate of 0.8% for MiSeq and 1.5% for Ion Torrent. Moreover, we identified for both technologies a non-uniform distribution of errors and, even more severe, of the corresponding nucleotide frequencies at alignment differences. This is an important result since it reveals that some differences (e.g. mismatches) are more likely to occur than others and thus lead to a biased analysis. When looking at the distribution of reads across the sample carrier of the sequencing machine, we discovered a clustering of reads that have a high difference (> 30%) compared to the reference sequence. This is unexpected, since reads with a high difference are believed to originate either from contamination or from errors in library preparation, and should therefore be uniformly distributed on the sample carrier. Finally, we present a method called DeFenSe (Detection of Falsely Aligned Sequences) to detect and reduce contamination in NGS data. DeFenSe computes a pairwise alignment score threshold based on the alignment of randomly sampled reads to the reference genome. This threshold is then used to filter the mapped reads.
Applied in combination with two widely used mapping programs on real data, DeFenSe reduced contamination by up to 99.8%. In contrast to previous methods, DeFenSe works independently of the number of differences between the reference and the targeted genome. Moreover, it neither relies on ad hoc decisions such as identity or mapping quality thresholds, nor requires prior knowledge of the sequenced organism. The combination of these methods may make it possible to transfer knowledge from model organisms to non-model organisms through NGS. In addition, it enables the study of biological mechanisms even in highly polymorphic regions.
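
    To make the banded-alignment idea concrete, here is a minimal scalar sketch of banded edit-distance computation; the function name and the unit-cost scoring are assumptions for illustration, not NextGenMap's actual scoring scheme or its SSE/OpenCL/CUDA kernels.

```python
def banded_edit_distance(read: str, ref: str, band: int) -> int:
    """Banded dynamic-programming alignment: only DP cells within
    `band` of the main diagonal are filled, cutting the cost from
    O(n*m) to O(n*band). Cells outside the band are treated as
    unreachable (INF)."""
    n, m = len(read), len(ref)
    INF = n + m + 1
    prev = [j if j <= band else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [INF] * (m + 1)
        if i <= band:
            curr[0] = i
        lo, hi = max(1, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            cost = 0 if read[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,  # match / mismatch
                          prev[j] + 1,         # deletion in read
                          curr[j - 1] + 1)     # insertion in read
        prev = curr
    return prev[m]

# One substitution between read and reference; the band suffices here.
print(banded_edit_distance("ACGTACGT", "ACGGACGT", band=2))  # 1
```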

    Knowledge Extraction from Textual Resources through Semantic Web Tools and Advanced Machine Learning Algorithms for Applications in Various Domains

    Nowadays there is a tremendous amount of unstructured data, often represented as text, created and stored in a variety of forms in many domains such as patients' health records, social network comments, and scientific publications. This volume of data represents an invaluable source of knowledge, but mining it is challenging for machines. At the same time, novel tools and advanced methodologies have been introduced in several domains, improving the efficacy and efficiency of data-based services. Following this trend, this thesis shows how to parse data from text with Semantic Web based tools, feed the data into Machine Learning methodologies, and produce services or resources that facilitate the execution of certain tasks. More precisely, the use of Semantic Web technologies powered by Machine Learning algorithms has been investigated in the Healthcare and E-Learning domains through previously unexplored methodologies. Furthermore, this thesis investigates the use of state-of-the-art tools to move data from text to graphs that represent the knowledge contained in scientific literature. Finally, the use of a Semantic Web ontology and novel heuristics to detect insights from biological data in the form of graphs is presented. The thesis contributes to the scientific literature in terms of both results and resources. Most of the material presented in this thesis derives from research papers published in international journals or conference proceedings.

    Scalable Computing Methods for High-Throughput Sequencing Data Analytics in Population Genomics

    High-throughput sequencing (HTS) technologies have enabled rapid DNA sequencing of whole genomes collected from various organisms and environments, including human tissues, plants, soil, water, and air. As a result, sequencing data volumes have grown by several orders of magnitude, and the number of assembled whole genomes is increasing rapidly as well. This whole-genome sequencing (WGS) data has revealed the genetic variation in humans and other species, and advanced various fields from human and microbial genomics to drug design and personalized medicine. The amount of sequencing data has almost doubled every six months, creating new possibilities but also big data challenges in genomics. Diverse methods used in modern computational biology require a vast amount of computational power, and advances in HTS technology are widening the gap between the volume of input data and the analysis outcomes. Currently, many of the existing genomic analysis tools, algorithms, and pipelines do not fully exploit the power of distributed and high-performance computing, which limits analysis throughput and, in the long run, restrains the deployment of these applications in clinical practice. Thus, the relevance of harnessing distributed and cloud computing in bioinformatics is more significant than ever before. In addition, efficient data compression and storage methods for genomic data processing and retrieval, integrated with conventional bioinformatics tools, are essential. These vast datasets have to be stored and structured in formats that can be managed, processed, searched, and analyzed efficiently in distributed systems. Genomic data contain repetitive sequences, a key property exploited when developing efficient compression algorithms to alleviate the storage burden. Moreover, indexing compressed sequences appropriately for bioinformatics tools, such as read aligners, offers direct sequence search and alignment capabilities over compressed indexes. Relative Lempel-Ziv (RLZ) has been found to be an efficient compression method for repetitive genomes that fits the data-parallel computing approach. RLZ has recently been used to build hybrid indexes compatible with read aligners, and we focus on extending it with distributed computing. Data structures found in genomic data formats have properties suitable for parallelizing routine bioinformatics methods, e.g., sequence matching, read alignment, genome assembly, genotype imputation, and variant calling. Compressed indexing fused with routine bioinformatics methods and data-parallel computing seems a promising approach to building population-scale genome analysis pipelines. Various data decomposition and transformation strategies are studied for optimizing data-parallel computing performance when such routine bioinformatics methods are executed in a complex pipeline. These novel distributed methods are studied in this dissertation and demonstrated in a generalized, scalable bioinformatics analysis pipeline design. The dissertation starts from the main concepts of genomics and DNA sequencing technologies, builds routine bioinformatics methods on the principles of distributed and parallel computing, and advances towards designing fully distributed and scalable bioinformatics pipelines, focusing on population genomics problems where the input data sets are vast and the results are hard to obtain with conventional computing.
Finally, the methods studied are applied in scalable population genomics applications using real WGS data and evaluated on a high-performance computing cluster. The experiments include mining virus sequences from human metagenomes, imputing genotypes for large-scale human populations, sequence alignment with compressed pan-genomic indexes, and assembling reference genomes for pan-genomic variant calling.
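
    As a rough illustration of the Relative Lempel-Ziv scheme discussed above, the following sketch greedily parses a target sequence against a reference; the function name and the naive longest-match scan are illustrative assumptions, since production RLZ compressors query a suffix array or FM-index over the reference instead.

```python
def rlz_parse(target: str, reference: str):
    """Greedy Relative Lempel-Ziv parsing sketch: encode `target` as
    (ref_pos, length) phrases pointing into `reference`, falling back
    to a single literal character when no match exists. Repetitive
    genomes yield few, long phrases, which is what makes RLZ effective
    on collections of similar sequences."""
    phrases, i = [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        length = 1
        # Longest prefix of target[i:] occurring in the reference.
        while i + length <= len(target):
            pos = reference.find(target[i:i + length])
            if pos < 0:
                break
            best_pos, best_len = pos, length
            length += 1
        if best_len == 0:
            phrases.append(("literal", target[i]))
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases

# Target differs from the reference by one substitution.
print(rlz_parse("ACGTTCGA", "ACGTACGA"))  # [(0, 4), (3, 1), (5, 3)]
```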

    Foundations of Multi-Paradigm Modelling for Cyber-Physical Systems

    This open access book coherently gathers well-founded information on the fundamentals of and formalisms for modelling cyber-physical systems (CPS). Highlighting the cross-disciplinary nature of CPS modelling, it also serves as a bridge for anyone entering CPS from related areas of computer science or engineering. Truly complex engineered systems that integrate physical, software, and network aspects, known as cyber-physical systems, are now on the rise. However, there is no unifying theory, nor are there systematic design methods, techniques, or tools for these systems. Individual (mechanical, electrical, network, or software) engineering disciplines offer only partial solutions. A technique known as Multi-Paradigm Modelling has recently emerged, which suggests modelling every part and aspect of a system explicitly, at the most appropriate level(s) of abstraction, using the most appropriate modelling formalism(s), and then weaving the results together to form a representation of the system. If properly applied, it enables, among other global aspects, performance analysis, exhaustive simulation, and verification. This book is the first systematic attempt to bring these formalisms together for anyone starting in the field of CPS who seeks solid modelling foundations and a comprehensive introduction to the distinct existing multi-paradigmatic techniques. Though chiefly intended for master's and postgraduate students in computer science and engineering, it can also be used as a reference text for practitioners.

    A cross-stack, network-centric architectural design for next-generation datacenters

    This thesis proposes a full-stack, cross-layer datacenter architecture based on the in-network computing and near-memory processing paradigms. The proposed architecture is built on two principles: (1) using commodity, off-the-shelf hardware (i.e., processors, DRAM, and network devices) with minimal changes to their architecture, and (2) providing programmers with a standard interface to the novel hardware. More specifically, the proposed architecture enables a smart network adapter to collectively compress and decompress the data exchanged between distributed DNN training nodes, and to assist the operating system in performing aggressive processor power management. It also deploys specialized memory modules in the servers, capable of general-purpose computation and network connectivity. This thesis unlocks the potential of hardware and operating system co-design in architecting application-transparent, near-data processing hardware that improves datacenter performance, energy efficiency, and scalability. We evaluate the proposed architecture using a combination of full-system simulation, FPGA prototyping, and real-system experiments.
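
    As a toy software stand-in for the kind of gradient compression such a smart adapter could offload, here is a top-k sparsification sketch; the scheme, function names, and use of NumPy are illustrative assumptions, since the abstract does not specify the compression algorithm used.

```python
import numpy as np

def topk_compress(grad: np.ndarray, k: int):
    """Top-k gradient sparsification: keep only the k largest-magnitude
    entries, so only (index, value) pairs cross the network between
    distributed DNN training nodes."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def topk_decompress(idx: np.ndarray, vals: np.ndarray, size: int) -> np.ndarray:
    """Rebuild a dense gradient with zeros in the dropped positions."""
    out = np.zeros(size, dtype=vals.dtype)
    out[idx] = vals
    return out

g = np.array([0.01, -2.0, 0.3, 1.5, -0.02])
idx, vals = topk_compress(g, k=2)
print(topk_decompress(idx, vals, g.size))  # [ 0.  -2.   0.   1.5  0. ]
```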