804 research outputs found

    The Parallelism Motifs of Genomic Data Analysis

    Get PDF
    Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some of these genomic data analysis problems require large scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high end parallel systems today and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or motifs that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing

    RLZAP: Relative Lempel-Ziv with Adaptive Pointers

    Full text link
    Relative Lempel-Ziv (RLZ) is a popular algorithm for compressing databases of genomes from individuals of the same species when fast random access is desired. With Kuruppu et al.'s (SPIRE 2010) original implementation, a reference genome is selected and then the other genomes are greedily parsed into phrases exactly matching substrings of the reference. Deorowicz and Grabowski (Bioinformatics, 2011) pointed out that letting each phrase end with a mismatch character usually gives better compression because many of the differences between individuals' genomes are single-nucleotide substitutions. Ferrada et al. (SPIRE 2014) then pointed out that also using relative pointers and run-length compressing them usually gives even better compression. In this paper we generalize Ferrada et al.'s idea to handle well also short insertions, deletions and multi-character substitutions. We show experimentally that our generalization achieves better compression than Ferrada et al.'s implementation with comparable random-access times

    Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools

    Get PDF
    This dissertation focuses on two fundamental sorting problems: string sorting and suffix sorting. The first part considers parallel string sorting on shared-memory multi-core machines, the second part external memory suffix sorting using the induced sorting principle, and the third part distributed external memory suffix sorting with a new distributed algorithmic big data framework named Thrill.Comment: 396 pages, dissertation, Karlsruher Instituts f\"ur Technologie (2018). arXiv admin note: text overlap with arXiv:1101.3448 by other author

    Processing and indexing large biological datasets using the Burrows-Wheeler Transform of string collections

    Get PDF
    In the last few decades, the advent of next-generation sequencing technologies (NGS) has dramatically reduced the cost of DNA sequencing. This has made it possible to sequence many genomes in very little time, paving the way for projects which aim at the creation of large and repetitive collections of genomic sequences. The abundance of biological data is driving the development of new memory-efficient algorithms and data structures that can scale for large datasets, thus tackling the high computational burden related to processing these data. This trend has a strong impact on the text algorithms area. In this thesis, we will study the Burrows-Wheeler Transform for processing, indexing, and compressing collections of strings. Data compression addresses the problem of encoding the input to reduce the space needed for storing it, while text indexing focuses on finding ways to efficiently process and extract information from the data. In bioinformatics, these two concepts have been frequently used together since they allow the design of data structures that can efficiently process biological data while keeping the input compressed. The Burrows-Wheeler Transform (BWT) is a reversible transformation on strings introduced by Michael Burrows and David J. Wheeler in 1994 that plays a central role in this area. It is the key component of several compressed data structures for text processing, like the FM-index [Ferraggina and Manzini, SODA, 2000] or the r-index [Gagie et al., SODA, 2018], and some of the most important software in bioinformatics, such as the well-known Bowtie [Langmead et al., Genome Biology, 2009] and BWA [Li and Durbin, Bioinformatics, 2010]. The BWT was originally defined for individual strings, so when the focus moved from single sequences to string collections, there was the need to extend this transform. Over the years, several different tools and algorithms for computing BWT of string collections were introduced. However, even if the transforms generated by these tools frequently differ from each other, the problem of characterizing the BWT variants was never addressed properly. In this thesis, we close this gap by presenting the first systematic study of the BWT of string collections. We identified five non-equivalent variants computed by the tools in current use and analyzed their properties to show how exactly they differ. We complete our theoretical analysis by comparing the five BWT variants on several real-life biological datasets. We show that not only the differences among the resulting transforms can be extensive, but they also lead to significant changes in the compressibility of the BWT of the underlying string collection. As a further complication, the BWT variants in use often depend on the input order of the sequences. This significantly impacts the number of runs r, which defines the size of BWT-based compressed data structures. In this thesis, we address the problem of reordering the input sequences by providing the first implementation of the algorithm of Bentley et al. [ESA 2020], which computes the order minimizing the number of runs of the BWT. This leads to the creation of the first tool for computing the optimal BWT, i.e., the BWT variant which guarantees the minimum number of runs. We show experimentally that the input order can dramatically affect the final result: on our real-life datasets, the optimal BWT had up to 31 times fewer runs than the BWT computed without reordering the input sequences. The extended BWT (eBWT) of Mantaci et al. [Theor. Comput. Sci. 2007] is one of the first BWT variants explicitly designed to process string collections. Even though this transform is mathematically sound and has useful properties, its construction has been a problem for more than a decade. In this thesis, we present two linear-time algorithms for computing the eBWT of large string collections. The first is an improvement of the Bijective BWT construction algorithm of Bannai et al. [CPM 2019], while the second uses the Prefix-free parsing (PFP) method [Boucher et al., Algorithms Mol. Biol., 2019] to specifically process large and repetitive genomic sequence collections. In the final part of the thesis, we conclude by studying, for the first time, how to index string collections using the eBWT. We present the extended r-index, an extension of the r-index to the eBWT, which maintains the same performance as the original r-index while inheriting the properties of the eBWT. We implemented this data structure using a variant of the PFP algorithm and tested it on real-life biological datasets containing circular bacterial genomes and plasmids. We show experimentally that our index has competitive query times compared to the r-index on different pattern lengths while supporting advanced pattern matching functionalities on circular sequences

    Parallel String Sample Sort

    Get PDF
    We discuss how string sorting algorithms can be parallelized on modern multi-core shared memory machines. As a synthesis of the best sequential string sorting algorithms and successful parallel sorting algorithms for atomic objects, we propose string sample sort. The algorithm makes effective use of the memory hierarchy, uses additional word level parallelism, and largely avoids branch mispredictions. Additionally, we parallelize variants of multikey quicksort and radix sort that are also useful in certain situations.Comment: 34 pages, 7 figures and 12 table

    A practical and efficient algorithm for the k-mismatch shortest unique substring finding problem

    Get PDF
    This thesis revisits the k-mismatch shortest unique substring (SUS) finding problem and demonstrates that a technique recently presented in the context of solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of O(n logk n), while maintaining a practical space complexity at O(kn), where n is the string length. When k \u3e 0, which is the hard case, the new proposal significantly improves the any-case O(n2) time complexity of the prior best method for k-mismatch SUS finding. Experimental study shows that the new algorithm is practical to implement and demonstrates significant improvements in processing time compared to the prior best solution\u27s implementation when k is small relative to n. For example, the proposed method processes a 200KB sample DNA sequence with k = 1 in just 0.18 seconds compared to 174.37 seconds with the prior best solution. Further, it is observed that significant portions of the adapted technique can be executed in parallel resulting in further significant practical performance improvement. As an example, when using 8 cores to process a 10MB sample DNA sequence with k = 2, two parallel implementations each achieved processing times less than 1=4 that of the serial implementation. In an age where instances with thousands of gigabytes of RAM are readily available for use through Cloud infrastructure providers, it is likely that trading additional memory usage for significantly improved processing times will be desirable and needed by many users. For example, the best prior solution may require years to process a 200MB DNA sample for any k \u3e 0, while this new proposal, using 24 cores, finished processing a sample of this size with k = 1 in 206:376 seconds with a peak memory usage of 46GB, which is easily available and affordable for many users. It is expected that this new practical and efficient algorithm for k-mismatch SUS finding will prove useful to those using the measure on long sequences in fields such as computational biology

    USING THE MULTI-STRING BURROW-WHEELER TRANSFORM FOR HIGH-THROUGHPUT SEQUENCE ANALYSIS

    Get PDF
    The throughput of sequencing technologies has created a bottleneck where raw sequence files are stored in an un-indexed format on disk. Alignment to a reference genome is the most common pre-processing method for indexing this data, but alignment requires a priori knowledge of a reference sequence, and often loses a significant amount of sequencing data due to biases. Sequencing data can instead be stored in a lossless, compressed, indexed format using the multi-string Burrows Wheeler Transform (BWT). This dissertation introduces three algorithms that enable faster construction of the BWT for sequencing datasets. The first two algorithms are a merge algorithm for merging two or more BWTs into a single BWT and a merge-based divide-and-conquer algorithm that will construct a BWT from any sequencing dataset. The third algorithm is an induced sorting algorithm that constructs the BWT from any string collection and is well-suited for building BWTs of long-read sequencing datasets. These algorithms are evaluated based on their efficiency and utility in constructing BWTs of different types of sequencing data. This dissertation also introduces two applications of the BWT: long-read error correction and a set of biologically motivated sequence search tools. The long-read error correction is evaluated based on accuracy and efficiency of the correction. Our analyses show that the BWT of almost all sequencing datasets can now be efficiently constructed. Once constructed, we show that the BWT offers significant utility in performing fast searches as well as fast and accurate long read corrections. Additionally, we highlight several use cases of the BWT-based web tools in answering biologically mo- tivated problems.Doctor of Philosoph
    corecore