529 research outputs found

    Lossless seeds for searching short patterns with high error rates

    Get PDF
    International audienceWe address the problem of approximate pattern matching using the Levenshtein distance. Given a text T and a pattern P , find alllocations in T that differ by at most k errors from P . For that purpose, we propose a filtration algorithm that is based on a novel type of seeds,combining exact parts and parts with a fixed number of errors. Experimental tests show that the method is specifically well-suited for short patterns with a large number of error

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Get PDF
    Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long reads of DNA sequence from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computer capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred of published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm and only little attention has been paid to non-standard mapping approaches. Here, we propound the so-called dynamic mapping that we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping is based on exploiting the information from previously computed alignments, helping to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program collecting alignment statistics and guiding updates of the reference in the online fashion. We provide Ococo, the first online consensus caller that implements a smart statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk. Metagenomic classification of NGS reads is another major topic studied in the thesis. Having a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge amount of NGS reads to tree nodes, and possibly estimate the relative abundance of involved species. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve the classification accuracy. We provide Seed-Kraken, a spaced seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, obtaining a much smaller and more informative index compared to Kraken. We provide a modified version of BWA that improves the BWT-index for a quick k-mer look-up

    Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index

    Full text link
    Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics and has been extensively researched. Bidirectional indices have opened new possibilities in this regard allowing the search to start from anywhere within the pattern and extend in both directions. In particular, use of search schemes (partitioning the pattern and searching the pieces in certain orders with given bounds on errors) can yield significant speed-ups. However, finding optimal search schemes is a difficult combinatorial optimization problem. Here for the first time, we propose a mixed integer program (MIP) capable to solve this optimization problem for Hamming distance with given number of pieces. Our experiments show that the optimal search schemes found by our MIP significantly improve the performance of search in bidirectional FM-index upon previous ad-hoc solutions. For example, approximate matching of 101-bp Illumina reads (with two errors) becomes 35 times faster than standard backtracking. Moreover, despite being performed purely in the index, the running time of search using our optimal schemes (for up to two errors) is comparable to the best state-of-the-art aligners, which benefit from combining search in index with in-text verification using dynamic programming. As a result, we anticipate a full-fledged aligner that employs an intelligent combination of search in the bidirectional FM-index using our optimal search schemes and in-text verification using dynamic programming outperforms today's best aligners. The development of such an aligner, called FAMOUS (Fast Approximate string Matching using OptimUm search Schemes), is ongoing as our future work

    RazerS - Fast Read Mapping with Sensitivity Control

    Get PDF
    Second-generation sequencing technologies deliver DNA sequence data at unprecedented high throughput. Common to most biological applications is a mapping of the reads to an almost identical or highly similar reference genome. Due to the large amounts of data, eïŹƒcient algorithms and implementations are crucial for this task. We present an eïŹƒcient read mapping tool called RazerS. It allows the user to align sequencing reads of arbitrary length using either the Hamming distance or the edit distance. Our tool can work either lossless or with a user-deïŹned loss rate at higher speeds. Given the loss rate, we present an approach that guarantees not to lose more reads than speciïŹed. This enables the user to adapt to the problem at hand and provides a seamless tradeoïŹ€ between sensitivity and running time

    Read alignment using deep neural networks

    Get PDF
    2019 Spring.Includes bibliographical references.Read alignment is the process of mapping short DNA sequences into the reference genome. With the advent of consecutively evolving "next generation" sequencing technologies, the need for sequence alignment tools appeared. Many scientific communities and the companies marketing the sequencing technologies developed a whole spectrum of read aligners/mappers for different error profiles and read length characteristics. Among the most recent successfully marketed sequencing technologies are Oxford Nanopore and PacBio SMRT sequencing, which are considered top players because of their extremely long reads and low cost. However, the reads may contain error up to 20% that are not generally uniformly distributed. To deal with that level of error rate and read length, proximity preserving hashing techniques, such as Minhash and Minimizers, were utilized to quickly map a read to the target region of the reference sequence. Subsequently, a variant of global or local alignment dynamic programming is then used to give the final alignment. In this research work, we train a Deep Neural Network (DNN) to yield a hashing scheme for the highly erroneous long reads, which is deemed superior to Minhash for mapping the reads. We implemented that idea to build a read alignment tool: DNNAligner. We evaluated the performance of our aligner against the popular read aligners in the bioinformatics community currently — minimap2, bwa-mem and graphmap. Our results show that the performance of DNNAligner is comparable to other tools without any code optimization or integration of other advanced features. Moreover, DNN exhibits superior performance in comparison with Minhashon neighborhood classification

    Computational Methods for Gene Expression and Genomic Sequence Analysis

    Get PDF
    Advances in technologies currently produce more and more cost-effective, high-throughput, and large-scale biological data. As a result, there is an urgent need for developing efficient computational methods for analyzing these massive data. In this dissertation, we introduce methods to address several important issues in gene expression and genomic sequence analysis, two of the most important areas in bioinformatics.Firstly, we introduce a novel approach to predicting patterns of gene response to multiple treatments in case of small sample size. Researchers are increasingly interested in experiments with many treatments such as chemicals compounds or drug doses. However, due to cost, many experiments do not have large enough samples, making it difficult for conventional methods to predict patterns of gene response. Here we introduce an approach which exploited dependencies of pairwise comparisons outcomes and resampling techniques to predict true patterns of gene response in case of insufficient samples. This approach deduced more and better functionally enriched gene clusters than conventional methods. Our approach is therefore useful for multiple-treatment studies which have small sample size or contain highly variantly expressed genes.Secondly, we introduce a novel method for aligning short reads, which are DNA fragments extracted across genomes of individuals, to reference genomes. Results from short read alignment can be used for many studies such as measuring gene expression or detecting genetic variants. Here we introduce a method which employed an iterated randomized algorithm based on FM-index, an efficient data structure for full-text indexing, to align reads to the reference. This method improved alignment performance across a wide range of read lengths and error rates compared to several popular methods, making it a good choice for community to perform short read alignment.Finally, we introduce a novel approach to detecting genetic variants such as SNPs (single nucleotide polymorphisms) or INDELs (insertions/deletions). This study has great significance in a wide range of areas, from bioinformatics and genetic research to medical field. For example, one can predict how genomic changes are related to phenotype in their organism of interest, or associate genetic changes to disease risk or medical treatment efficacy. Here we introduce a method which leveraged known genetic variants existing in well-established databases to improve accuracy of detecting variants. This method had higher accuracy than several state-of-the-art methods in many cases, especially for detecting INDELs. Our method therefore has potential to be useful in research and clinical applications which rely on identifying genetic variants accurately

    Indices and Applications in High-Throughput Sequencing

    Get PDF
    Recent advances in sequencing technology allow to produce billions of base pairs per day in the form of reads of length 100 bp an longer and current developments promise the personal $1,000 genome in a couple of years. The analysis of these unprecedented amounts of data demands for efficient data structures and algorithms. One such data structures is the substring index, that represents all substrings or substrings up to a certain length contained in a given text. In this thesis we propose 3 substring indices, which we extend to be applicable to millions of sequences. We devise internal and external memory construction algorithms and a uniform framework for accessing the generalized suffix tree. Additionally we propose different index-based applications, e.g. exact and approximate pattern matching and different repeat search algorithms. Second, we present the read mapping tool RazerS, which aligns millions of single or paired-end reads of arbitrary lengths to their potential genomic origin using either Hamming or edit distance. Our tool can work either lossless or with a user-defined loss rate at higher speeds. Given the loss rate, we present a novel approach that guarantees not to lose more reads than specified. This enables the user to adapt to the problem at hand and provides a seamless tradeoff between sensitivity and running time. We compare RazerS with other state-of-the-art read mappers and show that it has the highest sensitivity and a comparable performance on various real-world datasets. At last, we propose a general approach for frequency based string mining, which has many applications, e.g. in contrast data mining. Our contribution is a novel and lightweight algorithm that is faster and uses less memory than the best available algorithms. We show its applicability for mining multiple databases with a variety of frequency constraints. As such, we use the notion of entropy from information theory to generalize the emerging substring mining problem to multiple databases. To demonstrate the improvement of our algorithm we compared to recent approaches on real-world experiments of various string domains, e.g. natural language, DNA, or protein sequences

    Discovering Regularity in Point Clouds of Urban Scenes

    Full text link
    Despite the apparent chaos of the urban environment, cities are actually replete with regularity. From the grid of streets laid out over the earth, to the lattice of windows thrown up into the sky, periodic regularity abounds in the urban scene. Just as salient, though less uniform, are the self-similar branching patterns of trees and vegetation that line streets and fill parks. We propose novel methods for discovering these regularities in 3D range scans acquired by a time-of-flight laser sensor. The applications of this regularity information are broad, and we present two original algorithms. The first exploits the efficiency of the Fourier transform for the real-time detection of periodicity in building facades. Periodic regularity is discovered online by doing a plane sweep across the scene and analyzing the frequency space of each column in the sweep. The simplicity and online nature of this algorithm allow it to be embedded in scanner hardware, making periodicity detection a built-in feature of future 3D cameras. We demonstrate the usefulness of periodicity in view registration, compression, segmentation, and facade reconstruction. The second algorithm leverages the hierarchical decomposition and locality in space of the wavelet transform to find stochastic parameters for procedural models that succinctly describe vegetation. These procedural models facilitate the generation of virtual worlds for architecture, gaming, and augmented reality. The self-similarity of vegetation can be inferred using multi-resolution analysis to discover the underlying branching patterns. We present a unified framework of these tools, enabling the modeling, transmission, and compression of high-resolution, accurate, and immersive 3D images

    Routing on the Channel Dependency Graph:: A New Approach to Deadlock-Free, Destination-Based, High-Performance Routing for Lossless Interconnection Networks

    Get PDF
    In the pursuit for ever-increasing compute power, and with Moore's law slowly coming to an end, high-performance computing started to scale-out to larger systems. Alongside the increasing system size, the interconnection network is growing to accommodate and connect tens of thousands of compute nodes. These networks have a large influence on total cost, application performance, energy consumption, and overall system efficiency of the supercomputer. Unfortunately, state-of-the-art routing algorithms, which define the packet paths through the network, do not utilize this important resource efficiently. Topology-aware routing algorithms become increasingly inapplicable, due to irregular topologies, which either are irregular by design, or most often a result of hardware failures. Exchanging faulty network components potentially requires whole system downtime further increasing the cost of the failure. This management approach becomes more and more impractical due to the scale of today's networks and the accompanying steady decrease of the mean time between failures. Alternative methods of operating and maintaining these high-performance interconnects, both in terms of hardware- and software-management, are necessary to mitigate negative effects experienced by scientific applications executed on the supercomputer. However, existing topology-agnostic routing algorithms either suffer from poor load balancing or are not bounded in the number of virtual channels needed to resolve deadlocks in the routing tables. Using the fail-in-place strategy, a well-established method for storage systems to repair only critical component failures, is a feasible solution for current and future HPC interconnects as well as other large-scale installations such as data center networks. Although, an appropriate combination of topology and routing algorithm is required to minimize the throughput degradation for the entire system. This thesis contributes a network simulation toolchain to facilitate the process of finding a suitable combination, either during system design or while it is in operation. On top of this foundation, a key contribution is a novel scheduling-aware routing, which reduces fault-induced throughput degradation while improving overall network utilization. The scheduling-aware routing performs frequent property preserving routing updates to optimize the path balancing for simultaneously running batch jobs. The increased deployment of lossless interconnection networks, in conjunction with fail-in-place modes of operation and topology-agnostic, scheduling-aware routing algorithms, necessitates new solutions to solve the routing-deadlock problem. Therefore, this thesis further advances the state-of-the-art by introducing a novel concept of routing on the channel dependency graph, which allows the design of an universally applicable destination-based routing capable of optimizing the path balancing without exceeding a given number of virtual channels, which are a common hardware limitation. This disruptive innovation enables implicit deadlock-avoidance during path calculation, instead of solving both problems separately as all previous solutions

    Computational biology in the 21st century

    Get PDF
    Computational biologists answer biological and biomedical questions by using computation in support of—or in place of—laboratory procedures, hoping to obtain more accurate answers at a greatly reduced cost. The past two decades have seen unprecedented technological progress with regard to generating biological data; next-generation sequencing, mass spectrometry, microarrays, cryo-electron microscopy, and other highthroughput approaches have led to an explosion of data. However, this explosion is a mixed blessing. On the one hand, the scale and scope of data should allow new insights into genetic and infectious diseases, cancer, basic biology, and even human migration patterns. On the other hand, researchers are generating datasets so massive that it has become difficult to analyze them to discover patterns that give clues to the underlying biological processes.National Institutes of Health. (U.S.) ( grant GM108348)Hertz Foundatio
    • 

    corecore